[PATCH 4.9 03/43] net_sched: reject silly cell_log in qdisc_get_rtab()
From: Eric Dumazet

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Acked-by: Cong Wang
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee
Signed-off-by: Greg Kroah-Hartman
---
 net/sched/sch_api.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -393,7 +393,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
 	struct qdisc_rate_table *rtab;
 
-	if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+	if (tab == NULL || r->rate == 0 ||
+	    r->cell_log == 0 || r->cell_log >= 32 ||
 	    nla_len(tab) != TC_RTAB_SIZE)
 		return NULL;
[PATCH 4.4 01/38] net_sched: reject silly cell_log in qdisc_get_rtab()
From: Eric Dumazet

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Acked-by: Cong Wang
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee
Signed-off-by: Greg Kroah-Hartman
---
 net/sched/sch_api.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -391,7 +391,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
 	struct qdisc_rate_table *rtab;
 
-	if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+	if (tab == NULL || r->rate == 0 ||
+	    r->cell_log == 0 || r->cell_log >= 32 ||
 	    nla_len(tab) != TC_RTAB_SIZE)
 		return NULL;
[PATCH 4.14 04/15] net_sched: reject silly cell_log in qdisc_get_rtab()
From: Eric Dumazet

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Acked-by: Cong Wang
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee
Signed-off-by: Greg Kroah-Hartman
---
 net/sched/sch_api.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -397,7 +397,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
 	struct qdisc_rate_table *rtab;
 
-	if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+	if (tab == NULL || r->rate == 0 ||
+	    r->cell_log == 0 || r->cell_log >= 32 ||
 	    nla_len(tab) != TC_RTAB_SIZE)
 		return NULL;
[PATCH 5.10 174/199] net_sched: reject silly cell_log in qdisc_get_rtab()
From: Eric Dumazet

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Acked-by: Cong Wang
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski
Signed-off-by: Greg Kroah-Hartman
---
 net/sched/sch_api.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -412,7 +412,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
 	struct qdisc_rate_table *rtab;
 
-	if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+	if (tab == NULL || r->rate == 0 ||
+	    r->cell_log == 0 || r->cell_log >= 32 ||
 	    nla_len(tab) != TC_RTAB_SIZE) {
 		NL_SET_ERR_MSG(extack, "Invalid rate table parameters for searching");
 		return NULL;
[PATCH 4.19 54/58] net_sched: reject silly cell_log in qdisc_get_rtab()
From: Eric Dumazet

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Acked-by: Cong Wang
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski
Signed-off-by: Greg Kroah-Hartman
---
 net/sched/sch_api.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -398,7 +398,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
 	struct qdisc_rate_table *rtab;
 
-	if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+	if (tab == NULL || r->rate == 0 ||
+	    r->cell_log == 0 || r->cell_log >= 32 ||
 	    nla_len(tab) != TC_RTAB_SIZE) {
 		NL_SET_ERR_MSG(extack, "Invalid rate table parameters for searching");
 		return NULL;
[PATCH 5.4 80/86] net_sched: reject silly cell_log in qdisc_get_rtab()
From: Eric Dumazet

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Acked-by: Cong Wang
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski
Signed-off-by: Greg Kroah-Hartman
---
 net/sched/sch_api.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -409,7 +409,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
 	struct qdisc_rate_table *rtab;
 
-	if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+	if (tab == NULL || r->rate == 0 ||
+	    r->cell_log == 0 || r->cell_log >= 32 ||
 	    nla_len(tab) != TC_RTAB_SIZE) {
 		NL_SET_ERR_MSG(extack, "Invalid rate table parameters for searching");
 		return NULL;
[PATCH 4.9 36/86] tcp: md5: do not send silly options in SYNCOOKIES
From: Eric Dumazet

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.

Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.

Before the fix a tcpdump would exhibit this wrong exchange :

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

After this patch the exchange looks saner :

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should
break. Technically, we could remove the 4 nops included in MD5+TS
options, but again some stacks could break seeing not conventional
alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet
Cc: Florian Westphal
Cc: Mathieu Desnoyers
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/ipv4/tcp_output.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -615,7 +615,8 @@ static unsigned int tcp_synack_options(s
 				       unsigned int mss, struct sk_buff *skb,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
-				       struct tcp_fastopen_cookie *foc)
+				       struct tcp_fastopen_cookie *foc,
+				       enum tcp_synack_type synack_type)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -630,7 +631,8 @@ static unsigned int tcp_synack_options(s
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		ireq->tstamp_ok &= !ireq->sack_ok;
+		if (synack_type != TCP_SYNACK_COOKIE)
+			ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #endif
 
@@ -3165,8 +3167,8 @@ struct sk_buff *tcp_make_synack(const st
 	md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
-	tcp_header_size = tcp_synack_options(req, mss, skb, &opts, md5, foc) +
-			  sizeof(*th);
+	tcp_header_size = tcp_synack_options(req, mss, skb, &opts, md5,
+					     foc, synack_type) + sizeof(*th);
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
[PATCH 5.7 017/244] tcp: md5: do not send silly options in SYNCOOKIES
From: Eric Dumazet

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.

Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.

Before the fix a tcpdump would exhibit this wrong exchange :

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

After this patch the exchange looks saner :

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should
break. Technically, we could remove the 4 nops included in MD5+TS
options, but again some stacks could break seeing not conventional
alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet
Cc: Florian Westphal
Cc: Mathieu Desnoyers
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/ipv4/tcp_output.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -700,7 +700,8 @@ static unsigned int tcp_synack_options(c
 				       unsigned int mss, struct sk_buff *skb,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
-				       struct tcp_fastopen_cookie *foc)
+				       struct tcp_fastopen_cookie *foc,
+				       enum tcp_synack_type synack_type)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -715,7 +716,8 @@ static unsigned int tcp_synack_options(c
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		ireq->tstamp_ok &= !ireq->sack_ok;
+		if (synack_type != TCP_SYNACK_COOKIE)
+			ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #endif
 
@@ -3388,7 +3390,7 @@ struct sk_buff *tcp_make_synack(const st
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
 	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
-					     foc) + sizeof(*th);
+					     foc, synack_type) + sizeof(*th);
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
[PATCH 5.4 020/215] tcp: md5: do not send silly options in SYNCOOKIES
From: Eric Dumazet

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.

Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.

Before the fix a tcpdump would exhibit this wrong exchange :

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

After this patch the exchange looks saner :

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should
break. Technically, we could remove the 4 nops included in MD5+TS
options, but again some stacks could break seeing not conventional
alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet
Cc: Florian Westphal
Cc: Mathieu Desnoyers
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/ipv4/tcp_output.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -662,7 +662,8 @@ static unsigned int tcp_synack_options(c
 				       unsigned int mss, struct sk_buff *skb,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
-				       struct tcp_fastopen_cookie *foc)
+				       struct tcp_fastopen_cookie *foc,
+				       enum tcp_synack_type synack_type)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -677,7 +678,8 @@ static unsigned int tcp_synack_options(c
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		ireq->tstamp_ok &= !ireq->sack_ok;
+		if (synack_type != TCP_SYNACK_COOKIE)
+			ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #endif
 
@@ -3326,7 +3328,7 @@ struct sk_buff *tcp_make_synack(const st
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
 	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
-					     foc) + sizeof(*th);
+					     foc, synack_type) + sizeof(*th);
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
[PATCH 4.19 013/133] tcp: md5: do not send silly options in SYNCOOKIES
From: Eric Dumazet

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.

Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.

Before the fix a tcpdump would exhibit this wrong exchange :

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

After this patch the exchange looks saner :

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should
break. Technically, we could remove the 4 nops included in MD5+TS
options, but again some stacks could break seeing not conventional
alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet
Cc: Florian Westphal
Cc: Mathieu Desnoyers
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/ipv4/tcp_output.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -648,7 +648,8 @@ static unsigned int tcp_synack_options(c
 				       unsigned int mss, struct sk_buff *skb,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
-				       struct tcp_fastopen_cookie *foc)
+				       struct tcp_fastopen_cookie *foc,
+				       enum tcp_synack_type synack_type)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -663,7 +664,8 @@ static unsigned int tcp_synack_options(c
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		ireq->tstamp_ok &= !ireq->sack_ok;
+		if (synack_type != TCP_SYNACK_COOKIE)
+			ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #endif
 
@@ -3246,7 +3248,7 @@ struct sk_buff *tcp_make_synack(const st
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
 	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
-					     foc) + sizeof(*th);
+					     foc, synack_type) + sizeof(*th);
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
[PATCH 4.14 052/125] tcp: md5: do not send silly options in SYNCOOKIES
From: Eric Dumazet

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.

Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.

Before the fix a tcpdump would exhibit this wrong exchange :

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

After this patch the exchange looks saner :

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should
break. Technically, we could remove the 4 nops included in MD5+TS
options, but again some stacks could break seeing not conventional
alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet
Cc: Florian Westphal
Cc: Mathieu Desnoyers
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 net/ipv4/tcp_output.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -616,7 +616,8 @@ static unsigned int tcp_synack_options(s
 				       unsigned int mss, struct sk_buff *skb,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
-				       struct tcp_fastopen_cookie *foc)
+				       struct tcp_fastopen_cookie *foc,
+				       enum tcp_synack_type synack_type)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -631,7 +632,8 @@ static unsigned int tcp_synack_options(s
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		ireq->tstamp_ok &= !ireq->sack_ok;
+		if (synack_type != TCP_SYNACK_COOKIE)
+			ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #endif
 
@@ -3252,8 +3254,8 @@ struct sk_buff *tcp_make_synack(const st
 	md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
-	tcp_header_size = tcp_synack_options(req, mss, skb, &opts, md5, foc) +
-			  sizeof(*th);
+	tcp_header_size = tcp_synack_options(req, mss, skb, &opts, md5,
+					     foc, synack_type) + sizeof(*th);
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
[PATCH 4.14 02/78] net: be more gentle about silly gso requests coming from user
From: Eric Dumazet commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 upstream. Recent change in virtio_net_hdr_to_skb() broke some packetdrill tests. When --mss=XXX option is set, packetdrill always provide gso_type & gso_size for its inbound packets, regardless of packet size. if (packet->tcp && packet->mss) { if (packet->ipv4) gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6; gso.gso_size = packet->mss; } Since many other programs could do the same, relax virtio_net_hdr_to_skb() to no longer return an error, but instead ignore gso settings. This keeps Willem intent to make sure no malicious packet could reach gso stack. Note that TCP stack has a special logic in tcp_set_skb_tso_segs() to clear gso_size for small packets. Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry") Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Acked-by: Willem de Bruijn Signed-off-by: David S. Miller Cc: Guenter Roeck Signed-off-by: Greg Kroah-Hartman --- include/linux/virtio_net.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index a16e0bdf77511..d19bfdcf77498 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -107,16 +107,17 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size); + struct skb_shared_info *shinfo = skb_shinfo(skb); - if (skb->len - p_off <= gso_size) - return -EINVAL; - - skb_shinfo(skb)->gso_size = gso_size; - skb_shinfo(skb)->gso_type = gso_type; + /* Too small packets are not really GSO ones. */ + if (skb->len - p_off > gso_size) { + shinfo->gso_size = gso_size; + shinfo->gso_type = gso_type; - /* Header must be checked, and gso_segs computed. */ - skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; - skb_shinfo(skb)->gso_segs = 0; + /* Header must be checked, and gso_segs computed. 
*/ + shinfo->gso_type |= SKB_GSO_DODGY; + shinfo->gso_segs = 0; + } } return 0; -- 2.25.1
[PATCH 4.19 001/131] net: be more gentle about silly gso requests coming from user
From: Eric Dumazet commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 upstream. Recent change in virtio_net_hdr_to_skb() broke some packetdrill tests. When --mss=XXX option is set, packetdrill always provide gso_type & gso_size for its inbound packets, regardless of packet size. if (packet->tcp && packet->mss) { if (packet->ipv4) gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6; gso.gso_size = packet->mss; } Since many other programs could do the same, relax virtio_net_hdr_to_skb() to no longer return an error, but instead ignore gso settings. This keeps Willem intent to make sure no malicious packet could reach gso stack. Note that TCP stack has a special logic in tcp_set_skb_tso_segs() to clear gso_size for small packets. Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry") Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Acked-by: Willem de Bruijn Signed-off-by: David S. Miller Cc: Guenter Roeck Signed-off-by: Greg Kroah-Hartman --- include/linux/virtio_net.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 1c296f370e461..f32fe7080d2ec 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -109,16 +109,17 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size); + struct skb_shared_info *shinfo = skb_shinfo(skb); - if (skb->len - p_off <= gso_size) - return -EINVAL; - - skb_shinfo(skb)->gso_size = gso_size; - skb_shinfo(skb)->gso_type = gso_type; + /* Too small packets are not really GSO ones. */ + if (skb->len - p_off > gso_size) { + shinfo->gso_size = gso_size; + shinfo->gso_type = gso_type; - /* Header must be checked, and gso_segs computed. */ - skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; - skb_shinfo(skb)->gso_segs = 0; + /* Header must be checked, and gso_segs computed. 
*/ + shinfo->gso_type |= SKB_GSO_DODGY; + shinfo->gso_segs = 0; + } } return 0; -- 2.25.1
[PATCH 5.4 11/34] net: be more gentle about silly gso requests coming from user
From: Eric Dumazet [ Upstream commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 ] Recent change in virtio_net_hdr_to_skb() broke some packetdrill tests. When --mss=XXX option is set, packetdrill always provide gso_type & gso_size for its inbound packets, regardless of packet size. if (packet->tcp && packet->mss) { if (packet->ipv4) gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6; gso.gso_size = packet->mss; } Since many other programs could do the same, relax virtio_net_hdr_to_skb() to no longer return an error, but instead ignore gso settings. This keeps Willem intent to make sure no malicious packet could reach gso stack. Note that TCP stack has a special logic in tcp_set_skb_tso_segs() to clear gso_size for small packets. Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry") Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Acked-by: Willem de Bruijn Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- include/linux/virtio_net.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -109,16 +109,17 @@ retry: if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size); + struct skb_shared_info *shinfo = skb_shinfo(skb); - if (skb->len - p_off <= gso_size) - return -EINVAL; + /* Too small packets are not really GSO ones. */ + if (skb->len - p_off > gso_size) { + shinfo->gso_size = gso_size; + shinfo->gso_type = gso_type; - skb_shinfo(skb)->gso_size = gso_size; - skb_shinfo(skb)->gso_type = gso_type; - - /* Header must be checked, and gso_segs computed. */ - skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; - skb_shinfo(skb)->gso_segs = 0; + /* Header must be checked, and gso_segs computed. */ + shinfo->gso_type |= SKB_GSO_DODGY; + shinfo->gso_segs = 0; + } } return 0;
[PATCH 5.6 13/41] net: be more gentle about silly gso requests coming from user
From: Eric Dumazet [ Upstream commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 ] Recent change in virtio_net_hdr_to_skb() broke some packetdrill tests. When --mss=XXX option is set, packetdrill always provide gso_type & gso_size for its inbound packets, regardless of packet size. if (packet->tcp && packet->mss) { if (packet->ipv4) gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6; gso.gso_size = packet->mss; } Since many other programs could do the same, relax virtio_net_hdr_to_skb() to no longer return an error, but instead ignore gso settings. This keeps Willem intent to make sure no malicious packet could reach gso stack. Note that TCP stack has a special logic in tcp_set_skb_tso_segs() to clear gso_size for small packets. Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry") Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Acked-by: Willem de Bruijn Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- include/linux/virtio_net.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -109,16 +109,17 @@ retry: if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size); + struct skb_shared_info *shinfo = skb_shinfo(skb); - if (skb->len - p_off <= gso_size) - return -EINVAL; + /* Too small packets are not really GSO ones. */ + if (skb->len - p_off > gso_size) { + shinfo->gso_size = gso_size; + shinfo->gso_type = gso_type; - skb_shinfo(skb)->gso_size = gso_size; - skb_shinfo(skb)->gso_type = gso_type; - - /* Header must be checked, and gso_segs computed. */ - skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; - skb_shinfo(skb)->gso_segs = 0; + /* Header must be checked, and gso_segs computed. */ + shinfo->gso_type |= SKB_GSO_DODGY; + shinfo->gso_segs = 0; + } } return 0;
[PATCH 4.9 08/90] sch_sfq: validate silly quantum values
From: Eric Dumazet [ Upstream commit df4953e4e997e273501339f607b77953772e3559 ] syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/sched/sch_sfq.c |9 + 1 file changed, 9 insertions(+) --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -635,6 +635,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
[PATCH 4.14 008/114] sch_sfq: validate silly quantum values
From: Eric Dumazet [ Upstream commit df4953e4e997e273501339f607b77953772e3559 ] syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/sched/sch_sfq.c |9 + 1 file changed, 9 insertions(+) --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -639,6 +639,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
[PATCH 4.4 05/86] sch_sfq: validate silly quantum values
From: Eric Dumazet [ Upstream commit df4953e4e997e273501339f607b77953772e3559 ] syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/sched/sch_sfq.c |9 + 1 file changed, 9 insertions(+) --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -635,6 +635,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
[PATCH 5.6 036/118] sch_sfq: validate silly quantum values
From: Eric Dumazet [ Upstream commit df4953e4e997e273501339f607b77953772e3559 ] syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/sched/sch_sfq.c |9 + 1 file changed, 9 insertions(+) --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -637,6 +637,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
[PATCH 5.4 28/90] sch_sfq: validate silly quantum values
From: Eric Dumazet [ Upstream commit df4953e4e997e273501339f607b77953772e3559 ] syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/sched/sch_sfq.c |9 + 1 file changed, 9 insertions(+) --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -637,6 +637,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
[PATCH 4.19 11/48] sch_sfq: validate silly quantum values
From: Eric Dumazet [ Upstream commit df4953e4e997e273501339f607b77953772e3559 ] syzbot managed to set up sfq so that q->scaled_quantum was zero, triggering an infinite loop in sfq_dequeue() More generally, we must only accept quantum between 1 and 2^18 - 7, meaning scaled_quantum must be in [1, 0x7FFF] range. Otherwise, we also could have a loop in sfq_dequeue() if scaled_quantum happens to be 0x8000, since slot->allot could indefinitely switch between 0 and 0x8000. Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair") Signed-off-by: Eric Dumazet Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com Cc: Jason A. Donenfeld Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/sched/sch_sfq.c |9 + 1 file changed, 9 insertions(+) --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -641,6 +641,15 @@ static int sfq_change(struct Qdisc *sch, if (ctl->divisor && (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536)) return -EINVAL; + + /* slot->allot is a short, make sure quantum is not too big. */ + if (ctl->quantum) { + unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum); + + if (scaled <= 0 || scaled > SHRT_MAX) + return -EINVAL; + } + if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max, ctl_v1->Wlog)) return -EINVAL;
[PATCH 3.16 18/86] locking/static_keys: Fix a silly typo
3.16.68-rc1 review patch. If anyone has any objections, please let me know. -- From: Jonathan Corbet commit edcd591c77a48da753456f92daf8bb50fe9bac93 upstream. Commit: 412758cb2670 ("jump label, locking/static_keys: Update docs") introduced a typo that might as well get fixed. Signed-off-by: Jonathan Corbet Cc: Andrew Morton Cc: Jason Baron Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20150907131803.54c02...@lwn.net Signed-off-by: Ingo Molnar Signed-off-by: Ben Hutchings --- Documentation/static-keys.txt | 2 +- include/linux/jump_label.h| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) --- a/Documentation/static-keys.txt +++ b/Documentation/static-keys.txt @@ -16,7 +16,7 @@ The updated API replacements are: DEFINE_STATIC_KEY_TRUE(key); DEFINE_STATIC_KEY_FALSE(key); static_key_likely() -statick_key_unlikely() +static_key_unlikely() 0) Abstract --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -22,7 +22,7 @@ * DEFINE_STATIC_KEY_TRUE(key); * DEFINE_STATIC_KEY_FALSE(key); * static_key_likely() - * statick_key_unlikely() + * static_key_unlikely() * * Jump labels provide an interface to generate dynamic branches using * self-modifying code. Assuming toolchain and architecture support, if we
Re: [PATCH] drop silly "static inline asmlinkage" from dump_stack()
On Sun, Dec 02, 2018 at 12:02:54PM +0300, Alexey Dobriyan wrote: > Empty function will be inlined so asmlinkage doesn't do anything. Yes, that is an example of a perfect explanation to have in the commit message :) Ack from me after that addition. Acked-by: Joey Pabalinas -- Cheers, Joey Pabalinas
Re: [PATCH] drop silly "static inline asmlinkage" from dump_stack()
On Sat, Dec 01, 2018 at 09:33:38PM -1000, Joey Pabalinas wrote: > On Sat, Nov 24, 2018 at 12:35:30PM +0300, Alexey Dobriyan wrote: > > -static inline asmlinkage void dump_stack(void) > > +static inline void dump_stack(void) > > Why is it "silly"? An explanation in the commit message would be useful. Empty function will be inlined so asmlinkage doesn't do anything.
Re: [PATCH] drop silly "static inline asmlinkage" from dump_stack()
On Sat, Nov 24, 2018 at 12:35:30PM +0300, Alexey Dobriyan wrote: > -static inline asmlinkage void dump_stack(void) > +static inline void dump_stack(void) Why is it "silly"? An explanation in the commit message would be useful. > Signed-off-by: Alexey Dobriyan > --- > > include/linux/printk.h |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/include/linux/printk.h > +++ b/include/linux/printk.h > @@ -269,7 +269,7 @@ static inline void show_regs_print_info(const char > *log_lvl) > { > } > > -static inline asmlinkage void dump_stack(void) > +static inline void dump_stack(void) > { > } > -- Cheers, Joey Pabalinas signature.asc Description: PGP signature
[PATCH] drop silly "static inline asmlinkage" from dump_stack()
Signed-off-by: Alexey Dobriyan --- include/linux/printk.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -269,7 +269,7 @@ static inline void show_regs_print_info(const char *log_lvl) { } -static inline asmlinkage void dump_stack(void) +static inline void dump_stack(void) { }
[PATCH 4.14 060/146] ipv6: mcast: better catch silly mtu values
4.14-stable review patch. If anyone has any objections, please let me know. -- From: Eric Dumazet <eduma...@google.com> [ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ] syzkaller reported crashes in IPv6 stack [1] Xin Long found that lo MTU was set to silly values. IPv6 stack reacts to changes to small MTU, by disabling itself under RTNL. But there is a window where threads not using RTNL can see a wrong device mtu. This can lead to surprises, in mld code where it is assumed the mtu is suitable. Fix this by reading device mtu once and checking IPv6 minimal MTU. [1] skbuff: skb_over_panic: text:10b86b8d len:196 put:20 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo [ cut here ] kernel BUG at net/core/skbuff.c:104! invalid opcode: [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100 RSP: 0018:8801db307508 EFLAGS: 00010286 RAX: 0082 RBX: 8801c517e840 RCX: RDX: 0082 RSI: 11003b660e61 RDI: ed003b660e95 RBP: 8801db307570 R08: 11003b660e23 R09: R10: R11: R12: 85bd4020 R13: 84754ed2 R14: 0014 R15: 8801c4e26540 FS: () GS:8801db30() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: skb_over_panic net/core/skbuff.c:109 [inline] skb_put+0x181/0x1c0 net/core/skbuff.c:1694 add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695 add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817 mld_send_cr net/ipv6/mcast.c:1903 [inline] mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448 call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320 expire_timers kernel/time/timer.c:1357 [inline] __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686 __do_softirq+0x29d/0xbb2 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 
[inline] irq_exit+0x1d3/0x210 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:540 [inline] smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920 Signed-off-by: Eric Dumazet <eduma...@google.com> Reported-by: syzbot <syzkal...@googlegroups.com> Tested-by: Xin Long <lucien@gmail.com> Signed-off-by: David S. Miller <da...@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- net/ipv6/mcast.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -1682,16 +1682,16 @@ static int grec_size(struct ifmcaddr6 *p } static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc, - int type, struct mld2_grec **ppgr) + int type, struct mld2_grec **ppgr, unsigned int mtu) { - struct net_device *dev = pmc->idev->dev; struct mld2_report *pmr; struct mld2_grec *pgr; - if (!skb) - skb = mld_newpack(pmc->idev, dev->mtu); - if (!skb) - return NULL; + if (!skb) { + skb = mld_newpack(pmc->idev, mtu); + if (!skb) + return NULL; + } pgr = skb_put(skb, sizeof(struct mld2_grec)); pgr->grec_type = type; pgr->grec_auxwords = 0; @@ -1714,10 +1714,15 @@ static struct sk_buff *add_grec(struct s struct mld2_grec *pgr = NULL; struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list; int scount, stotal, first, isquery, truncate; + unsigned int mtu; if (pmc->mca_flags & MAF_NOREPORT) return skb; + mtu = READ_ONCE(dev->mtu); + if (mtu < IPV6_MIN_MTU) + return skb; + isquery = type == MLD2_MODE_IS_INCLUDE || type == MLD2_MODE_IS_EXCLUDE; truncate = type == MLD2_MODE_IS_EXCLUDE || @@ -1738,7 +1743,7 @@ static struct sk_buff *add_grec(struct s AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) { if (skb) mld_sendpack(skb); - skb = mld_newpack(idev, dev->mtu); + skb = mld_newpack(idev, mtu); } } first = 1; @@ -1774,12 +1779,12 @@ static struct sk_buff *add_grec(struct s
[PATCH 4.9 25/75] ipv6: mcast: better catch silly mtu values
4.9-stable review patch. If anyone has any objections, please let me know. -- From: Eric Dumazet <eduma...@google.com> [ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ] syzkaller reported crashes in IPv6 stack [1] Xin Long found that lo MTU was set to silly values. IPv6 stack reacts to changes to small MTU, by disabling itself under RTNL. But there is a window where threads not using RTNL can see a wrong device mtu. This can lead to surprises, in mld code where it is assumed the mtu is suitable. Fix this by reading device mtu once and checking IPv6 minimal MTU. [1] skbuff: skb_over_panic: text:10b86b8d len:196 put:20 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo [ cut here ] kernel BUG at net/core/skbuff.c:104! invalid opcode: [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100 RSP: 0018:8801db307508 EFLAGS: 00010286 RAX: 0082 RBX: 8801c517e840 RCX: RDX: 0082 RSI: 11003b660e61 RDI: ed003b660e95 RBP: 8801db307570 R08: 11003b660e23 R09: R10: R11: R12: 85bd4020 R13: 84754ed2 R14: 0014 R15: 8801c4e26540 FS: () GS:8801db30() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: skb_over_panic net/core/skbuff.c:109 [inline] skb_put+0x181/0x1c0 net/core/skbuff.c:1694 add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695 add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817 mld_send_cr net/ipv6/mcast.c:1903 [inline] mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448 call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320 expire_timers kernel/time/timer.c:1357 [inline] __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686 __do_softirq+0x29d/0xbb2 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 
[inline] irq_exit+0x1d3/0x210 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:540 [inline] smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920 Signed-off-by: Eric Dumazet <eduma...@google.com> Reported-by: syzbot <syzkal...@googlegroups.com> Tested-by: Xin Long <lucien@gmail.com> Signed-off-by: David S. Miller <da...@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- net/ipv6/mcast.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -1682,16 +1682,16 @@ static int grec_size(struct ifmcaddr6 *p } static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc, - int type, struct mld2_grec **ppgr) + int type, struct mld2_grec **ppgr, unsigned int mtu) { - struct net_device *dev = pmc->idev->dev; struct mld2_report *pmr; struct mld2_grec *pgr; - if (!skb) - skb = mld_newpack(pmc->idev, dev->mtu); - if (!skb) - return NULL; + if (!skb) { + skb = mld_newpack(pmc->idev, mtu); + if (!skb) + return NULL; + } pgr = (struct mld2_grec *)skb_put(skb, sizeof(struct mld2_grec)); pgr->grec_type = type; pgr->grec_auxwords = 0; @@ -1714,10 +1714,15 @@ static struct sk_buff *add_grec(struct s struct mld2_grec *pgr = NULL; struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list; int scount, stotal, first, isquery, truncate; + unsigned int mtu; if (pmc->mca_flags & MAF_NOREPORT) return skb; + mtu = READ_ONCE(dev->mtu); + if (mtu < IPV6_MIN_MTU) + return skb; + isquery = type == MLD2_MODE_IS_INCLUDE || type == MLD2_MODE_IS_EXCLUDE; truncate = type == MLD2_MODE_IS_EXCLUDE || @@ -1738,7 +1743,7 @@ static struct sk_buff *add_grec(struct s AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) { if (skb) mld_sendpack(skb); - skb = mld_newpack(idev, dev->mtu); + skb = mld_newpack(idev, mtu); } } first = 1; @@ -1774,12 +1779,12 @@ static struct sk_buff *add_
[PATCH 4.9 25/75] ipv6: mcast: better catch silly mtu values
4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <eduma...@google.com>

[ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

syzkaller reported crashes in the IPv6 stack [1].

Xin Long found that the lo MTU had been set to silly values.

The IPv6 stack reacts to a change to a small MTU by disabling itself under RTNL. But there is a window where threads not holding RTNL can see a wrong device MTU. This can lead to surprises in the mld code, where it is assumed the MTU is suitable.

Fix this by reading the device MTU once and checking it against the IPv6 minimal MTU.

[1]
skbuff: skb_over_panic: text:10b86b8d len:196 put:20 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
RSP: 0018:8801db307508 EFLAGS: 00010286
RAX: 0082 RBX: 8801c517e840 RCX: RDX: 0082
RSI: 11003b660e61 RDI: ed003b660e95 RBP: 8801db307570
R08: 11003b660e23 R09: R10: R11: R12: 85bd4020
R13: 84754ed2 R14: 0014 R15: 8801c4e26540
FS: () GS:8801db30() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0
DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 skb_over_panic net/core/skbuff.c:109 [inline]
 skb_put+0x181/0x1c0 net/core/skbuff.c:1694
 add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
 add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
 mld_send_cr net/ipv6/mcast.c:1903 [inline]
 mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
 call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1d3/0x210 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
Tested-by: Xin Long <lucien@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 net/ipv6/mcast.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1682,16 +1682,16 @@ static int grec_size(struct ifmcaddr6 *p
 }
 
 static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-	int type, struct mld2_grec **ppgr)
+	int type, struct mld2_grec **ppgr, unsigned int mtu)
 {
-	struct net_device *dev = pmc->idev->dev;
 	struct mld2_report *pmr;
 	struct mld2_grec *pgr;
 
-	if (!skb)
-		skb = mld_newpack(pmc->idev, dev->mtu);
-	if (!skb)
-		return NULL;
+	if (!skb) {
+		skb = mld_newpack(pmc->idev, mtu);
+		if (!skb)
+			return NULL;
+	}
 	pgr = (struct mld2_grec *)skb_put(skb, sizeof(struct mld2_grec));
 	pgr->grec_type = type;
 	pgr->grec_auxwords = 0;
@@ -1714,10 +1714,15 @@ static struct sk_buff *add_grec(struct s
 	struct mld2_grec *pgr = NULL;
 	struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list;
 	int scount, stotal, first, isquery, truncate;
+	unsigned int mtu;
 
 	if (pmc->mca_flags & MAF_NOREPORT)
 		return skb;
 
+	mtu = READ_ONCE(dev->mtu);
+	if (mtu < IPV6_MIN_MTU)
+		return skb;
+
 	isquery = type == MLD2_MODE_IS_INCLUDE ||
 		  type == MLD2_MODE_IS_EXCLUDE;
 	truncate = type == MLD2_MODE_IS_EXCLUDE ||
@@ -1738,7 +1743,7 @@ static struct sk_buff *add_grec(struct s
 		    AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) {
 			if (skb)
 				mld_sendpack(skb);
-			skb = mld_newpack(idev, mtu);
+			skb = mld_newpack(idev, mtu);
 		}
 	}
 	first = 1;
@@ -1774,12 +1779,12 @@ static struct sk_buff *add_grec(struct s
 		pgr->grec_nsrcs = htons(scount);
 	if (skb)
 		mld_sendpack(skb);
-
[PATCH 4.4 35/63] ipv6: mcast: better catch silly mtu values
4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <eduma...@google.com>

[ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

syzkaller reported crashes in the IPv6 stack [1].

Xin Long found that the lo MTU had been set to silly values.

The IPv6 stack reacts to a change to a small MTU by disabling itself under RTNL. But there is a window where threads not holding RTNL can see a wrong device MTU. This can lead to surprises in the mld code, where it is assumed the MTU is suitable.

Fix this by reading the device MTU once and checking it against the IPv6 minimal MTU.

[1]
skbuff: skb_over_panic: text:10b86b8d len:196 put:20 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
RSP: 0018:8801db307508 EFLAGS: 00010286
RAX: 0082 RBX: 8801c517e840 RCX: RDX: 0082
RSI: 11003b660e61 RDI: ed003b660e95 RBP: 8801db307570
R08: 11003b660e23 R09: R10: R11: R12: 85bd4020
R13: 84754ed2 R14: 0014 R15: 8801c4e26540
FS: () GS:8801db30() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0
DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 skb_over_panic net/core/skbuff.c:109 [inline]
 skb_put+0x181/0x1c0 net/core/skbuff.c:1694
 add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
 add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
 mld_send_cr net/ipv6/mcast.c:1903 [inline]
 mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
 call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1d3/0x210 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
Tested-by: Xin Long <lucien@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 net/ipv6/mcast.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1668,16 +1668,16 @@ static int grec_size(struct ifmcaddr6 *p
 }
 
 static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-	int type, struct mld2_grec **ppgr)
+	int type, struct mld2_grec **ppgr, unsigned int mtu)
 {
-	struct net_device *dev = pmc->idev->dev;
 	struct mld2_report *pmr;
 	struct mld2_grec *pgr;
 
-	if (!skb)
-		skb = mld_newpack(pmc->idev, dev->mtu);
-	if (!skb)
-		return NULL;
+	if (!skb) {
+		skb = mld_newpack(pmc->idev, mtu);
+		if (!skb)
+			return NULL;
+	}
 	pgr = (struct mld2_grec *)skb_put(skb, sizeof(struct mld2_grec));
 	pgr->grec_type = type;
 	pgr->grec_auxwords = 0;
@@ -1700,10 +1700,15 @@ static struct sk_buff *add_grec(struct s
 	struct mld2_grec *pgr = NULL;
 	struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list;
 	int scount, stotal, first, isquery, truncate;
+	unsigned int mtu;
 
 	if (pmc->mca_flags & MAF_NOREPORT)
 		return skb;
 
+	mtu = READ_ONCE(dev->mtu);
+	if (mtu < IPV6_MIN_MTU)
+		return skb;
+
 	isquery = type == MLD2_MODE_IS_INCLUDE ||
 		  type == MLD2_MODE_IS_EXCLUDE;
 	truncate = type == MLD2_MODE_IS_EXCLUDE ||
@@ -1724,7 +1729,7 @@ static struct sk_buff *add_grec(struct s
 		    AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) {
 			if (skb)
 				mld_sendpack(skb);
-			skb = mld_newpack(idev, dev->mtu);
+			skb = mld_newpack(idev, mtu);
 		}
 	}
 	first = 1;
@@ -1751,12 +1756,12 @@ static struct sk_buff *add_grec(struct s
 		pgr->grec_nsrcs = htons(scount);
 	if (skb)
 		mld_sendpack(skb);
-
[PATCH 3.18 19/32] ipv6: mcast: better catch silly mtu values
3.18-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <eduma...@google.com>

[ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

syzkaller reported crashes in the IPv6 stack [1].

Xin Long found that the lo MTU had been set to silly values.

The IPv6 stack reacts to a change to a small MTU by disabling itself under RTNL. But there is a window where threads not holding RTNL can see a wrong device MTU. This can lead to surprises in the mld code, where it is assumed the MTU is suitable.

Fix this by reading the device MTU once and checking it against the IPv6 minimal MTU.

[1]
skbuff: skb_over_panic: text:10b86b8d len:196 put:20 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
RSP: 0018:8801db307508 EFLAGS: 00010286
RAX: 0082 RBX: 8801c517e840 RCX: RDX: 0082
RSI: 11003b660e61 RDI: ed003b660e95 RBP: 8801db307570
R08: 11003b660e23 R09: R10: R11: R12: 85bd4020
R13: 84754ed2 R14: 0014 R15: 8801c4e26540
FS: () GS:8801db30() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0
DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 skb_over_panic net/core/skbuff.c:109 [inline]
 skb_put+0x181/0x1c0 net/core/skbuff.c:1694
 add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
 add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
 mld_send_cr net/ipv6/mcast.c:1903 [inline]
 mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
 call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1d3/0x210 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
Tested-by: Xin Long <lucien@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 net/ipv6/mcast.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1670,16 +1670,16 @@ static int grec_size(struct ifmcaddr6 *p
 }
 
 static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-	int type, struct mld2_grec **ppgr)
+	int type, struct mld2_grec **ppgr, unsigned int mtu)
 {
-	struct net_device *dev = pmc->idev->dev;
 	struct mld2_report *pmr;
 	struct mld2_grec *pgr;
 
-	if (!skb)
-		skb = mld_newpack(pmc->idev, dev->mtu);
-	if (!skb)
-		return NULL;
+	if (!skb) {
+		skb = mld_newpack(pmc->idev, mtu);
+		if (!skb)
+			return NULL;
+	}
 	pgr = (struct mld2_grec *)skb_put(skb, sizeof(struct mld2_grec));
 	pgr->grec_type = type;
 	pgr->grec_auxwords = 0;
@@ -1702,10 +1702,15 @@ static struct sk_buff *add_grec(struct s
 	struct mld2_grec *pgr = NULL;
 	struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list;
 	int scount, stotal, first, isquery, truncate;
+	unsigned int mtu;
 
 	if (pmc->mca_flags & MAF_NOREPORT)
 		return skb;
 
+	mtu = READ_ONCE(dev->mtu);
+	if (mtu < IPV6_MIN_MTU)
+		return skb;
+
 	isquery = type == MLD2_MODE_IS_INCLUDE ||
 		  type == MLD2_MODE_IS_EXCLUDE;
 	truncate = type == MLD2_MODE_IS_EXCLUDE ||
@@ -1726,7 +1731,7 @@ static struct sk_buff *add_grec(struct s
 		    AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) {
 			if (skb)
 				mld_sendpack(skb);
-			skb = mld_newpack(idev, dev->mtu);
+			skb = mld_newpack(idev, mtu);
 		}
 	}
 	first = 1;
@@ -1753,12 +1758,12 @@ static struct sk_buff *add_grec(struct s
 		pgr->grec_nsrcs = htons(scount);
 	if (skb)
 		mld_sendpack(skb);
-
[PATCH 4.12 37/43] ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme
4.12-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Christoph Hellwig

commit f723fa4e69920f6a5dd5fa0d10ce90e2f14d189c upstream.

Intel AHCI controllers that also hide NVMe devices in their bar can't use MSI interrupts, so disable them.

Reported-by: John Loy
Tested-by: John Loy
Signed-off-by: Christoph Hellwig
Fixes: d684a90d38e2 ("ahci: per-port msix support")
Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman
---
 drivers/ata/ahci.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1467,7 +1467,14 @@ static void ahci_remap_check(struct pci_
 		return;
 
 	dev_warn(&pdev->dev, "Found %d remapped NVMe devices.\n", count);
-	dev_warn(&pdev->dev, "Switch your BIOS from RAID to AHCI mode to use them.\n");
+	dev_warn(&pdev->dev,
+		 "Switch your BIOS from RAID to AHCI mode to use them.\n");
+
+	/*
+	 * Don't rely on the msi-x capability in the remap case,
+	 * share the legacy interrupt across ahci and remapped devices.
+	 */
+	hpriv->flags |= AHCI_HFLAG_NO_MSI;
 }
 
 static int ahci_get_irq_vector(struct ata_host *host, int port)
[PATCH 4.13 41/47] ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme
4.13-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Christoph Hellwig

commit f723fa4e69920f6a5dd5fa0d10ce90e2f14d189c upstream.

Intel AHCI controllers that also hide NVMe devices in their bar can't use MSI interrupts, so disable them.

Reported-by: John Loy
Tested-by: John Loy
Signed-off-by: Christoph Hellwig
Fixes: d684a90d38e2 ("ahci: per-port msix support")
Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman
---
 drivers/ata/ahci.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1469,7 +1469,14 @@ static void ahci_remap_check(struct pci_
 		return;
 
 	dev_warn(&pdev->dev, "Found %d remapped NVMe devices.\n", count);
-	dev_warn(&pdev->dev, "Switch your BIOS from RAID to AHCI mode to use them.\n");
+	dev_warn(&pdev->dev,
+		 "Switch your BIOS from RAID to AHCI mode to use them.\n");
+
+	/*
+	 * Don't rely on the msi-x capability in the remap case,
+	 * share the legacy interrupt across ahci and remapped devices.
+	 */
+	hpriv->flags |= AHCI_HFLAG_NO_MSI;
 }
 
 static int ahci_get_irq_vector(struct ata_host *host, int port)
Moved Low-Jitter Doom 3 video and other silly videos.
For the interested, I moved the video I did of my low-jitter inspired configurations of Doom 3, which was a very jitter-sensitive game.

A quick summary: by removing and reducing unnecessary code and codepaths, such as reducing the HZ to 90, choosing low-jitter components in the configuration, and renicing X to remove its bottleneck aspect, I was able to run the 3-pass layer engine of Doom 3 at 72.7x3 FPS, without any hiccup (supposedly thought to be disk reads etc.)

The video is now here: https://www.youtube.com/watch?v=xDct6vVvFxA
Now also with an upscale filter for pseudo 4K.

Also on here is an Unreal video, with an undocumented antialiasing mode, with maximal pixel information for 4K: https://www.youtube.com/watch?v=MKqfAbPqleo

Also a fun filter for taking 160x200 to a whole 4K :o)
https://www.youtube.com/watch?v=E0tkH7kWB6A&t=33s

Best Regards,
Ove Kent Karlsen
Re: [PATCH] serial: core: remove silly test for uart_state
On Fri, Jan 6, 2017 at 4:45 AM, Thadeu Lima de Souza Cascardo wrote:
> The polling functions were checking for a NULL uart_state, which is
> indexed from uart_driver->state. It should be always allocated and
> non-NULL when the tty_driver is registered, and line should not be
> larger than the tty_driver->num anyways.

I'm not sure this is guaranteed. Let Peter comment on it.

Peter, what is your opinion?

P.S. If Peter is okay with it, I don't see any problems with the code itself either.

>
> Signed-off-by: Thadeu Lima de Souza Cascardo
> ---
>  drivers/tty/serial/serial_core.c | 16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/tty/serial/serial_core.c
> b/drivers/tty/serial/serial_core.c
> index 9939c3d9912b..6f7247114ef8 100644
> --- a/drivers/tty/serial/serial_core.c
> +++ b/drivers/tty/serial/serial_core.c
> @@ -2330,9 +2330,6 @@ static int uart_poll_init(struct tty_driver *driver,
> int line, char *options)
>  	int flow = 'n';
>  	int ret = 0;
>
> -	if (!state)
> -		return -1;
> -
>  	tport = &state->port;
>  	mutex_lock(&tport->mutex);
>
> @@ -2367,12 +2364,10 @@ static int uart_poll_get_char(struct tty_driver
> *driver, int line)
>  	struct uart_port *port;
>  	int ret = -1;
>
> -	if (state) {
> -		port = uart_port_ref(state);
> -		if (port) {
> -			ret = port->ops->poll_get_char(port);
> -			uart_port_deref(port);
> -		}
> +	port = uart_port_ref(state);
> +	if (port) {
> +		ret = port->ops->poll_get_char(port);
> +		uart_port_deref(port);
>  	}
>  	return ret;
>  }
> @@ -2383,9 +2378,6 @@ static void uart_poll_put_char(struct tty_driver
> *driver, int line, char ch)
>  	struct uart_state *state = drv->state + line;
>  	struct uart_port *port;
>
> -	if (!state)
> -		return;
> -
>  	port = uart_port_ref(state);
>  	if (!port)
>  		return;
> --
> 2.11.0
>

--
With Best Regards,
Andy Shevchenko
[PATCH] serial: core: remove silly test for uart_state
The polling functions were checking for a NULL uart_state, which is indexed from uart_driver->state. It should be always allocated and non-NULL when the tty_driver is registered, and line should not be larger than the tty_driver->num anyways.

Signed-off-by: Thadeu Lima de Souza Cascardo
---
 drivers/tty/serial/serial_core.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index 9939c3d9912b..6f7247114ef8 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -2330,9 +2330,6 @@ static int uart_poll_init(struct tty_driver *driver, int line, char *options)
 	int flow = 'n';
 	int ret = 0;
 
-	if (!state)
-		return -1;
-
 	tport = &state->port;
 	mutex_lock(&tport->mutex);
 
@@ -2367,12 +2364,10 @@ static int uart_poll_get_char(struct tty_driver *driver, int line)
 	struct uart_port *port;
 	int ret = -1;
 
-	if (state) {
-		port = uart_port_ref(state);
-		if (port) {
-			ret = port->ops->poll_get_char(port);
-			uart_port_deref(port);
-		}
+	port = uart_port_ref(state);
+	if (port) {
+		ret = port->ops->poll_get_char(port);
+		uart_port_deref(port);
 	}
 	return ret;
 }
@@ -2383,9 +2378,6 @@ static void uart_poll_put_char(struct tty_driver *driver, int line, char ch)
 	struct uart_state *state = drv->state + line;
 	struct uart_port *port;
 
-	if (!state)
-		return;
-
 	port = uart_port_ref(state);
 	if (!port)
 		return;
--
2.11.0
[tip:perf/core] perf/x86/intel/bts: Kill a silly warning
Commit-ID:  ef9ef3befa0d76008e988a9ed9fe439e803351b9
Gitweb:     http://git.kernel.org/tip/ef9ef3befa0d76008e988a9ed9fe439e803351b9
Author:     Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Tue, 6 Sep 2016 16:23:53 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Sat, 10 Sep 2016 11:15:38 +0200

perf/x86/intel/bts: Kill a silly warning

At the moment, intel_bts will WARN() out if there is more than one event
writing to the same ring buffer, via SET_OUTPUT, and will only send data
from one event to a buffer. There is no reason to have this warning in,
so kill it.

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: vi...@deater.net
Link: http://lkml.kernel.org/r/20160906132353.19887-6-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 9233edf..bdcd651 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -378,8 +378,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
 		return 0;
 
 	head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-	if (WARN_ON_ONCE(head != local_read(&buf->head)))
-		return -EINVAL;
 	phys = &buf->buf[buf->cur_buf];
 	space = phys->offset + phys->displacement + phys->size - head;
[PATCH v2 5/5] perf/x86/intel/bts: Kill a silly warning
At the moment, intel_bts will WARN() out if there is more than one event
writing to the same ring buffer, via SET_OUTPUT, and will only send data
from one event to a buffer. There is no reason to have this warning in,
so kill it.

Signed-off-by: Alexander Shishkin
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 9233edf993..bdcd651099 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -378,8 +378,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
 		return 0;
 
 	head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-	if (WARN_ON_ONCE(head != local_read(&buf->head)))
-		return -EINVAL;
 	phys = &buf->buf[buf->cur_buf];
 	space = phys->offset + phys->displacement + phys->size - head;
-- 
2.9.3
[PATCH 2/3] perf/x86/intel/bts: Kill a silly warning
At the moment, intel_bts will WARN() out if there is more than one event
writing to the same ring buffer, via SET_OUTPUT, and will only send data
from one event to a buffer. There is no reason to have this warning in,
so kill it.

Signed-off-by: Alexander Shishkin
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 0b8ccff564..41975c6c2d 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -363,8 +363,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
 		return 0;
 
 	head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-	if (WARN_ON_ONCE(head != local_read(&buf->head)))
-		return -EINVAL;
 	phys = &buf->buf[buf->cur_buf];
 	space = phys->offset + phys->displacement + phys->size - head;
-- 
2.8.1
Re: [PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs
On Sun, Aug 21, 2016 at 12:53:15PM +0300, Jan Varho wrote:
> On Mon, Jun 13, 2016 at 6:48 PM, Theodore Ts'o wrote:
> > +static inline void maybe_reseed_primary_crng(void)
> > +{
> > +	if (crng_init > 2 &&
> > +	    time_after(jiffies, primary_crng.init_time +
> > +		       CRNG_RESEED_INTERVAL))
> > +		crng_reseed(&primary_crng, &input_pool);
> > +}
> 
> Is the above function (which is now in 4.8-rc2) supposed to do
> something? It seems to have no callers and the maximum value of
> crng_init is 2.

It's dead code. Its function got moved into _extract_crng(), and you're
right, these days crng_init never gets above 2. Thanks for pointing that
out. I'll take it out as a cleanup patch.

					- Ted
Re: [PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs
On Mon, Jun 13, 2016 at 6:48 PM, Theodore Ts'o wrote:
> +static inline void maybe_reseed_primary_crng(void)
> +{
> +	if (crng_init > 2 &&
> +	    time_after(jiffies, primary_crng.init_time +
> +		       CRNG_RESEED_INTERVAL))
> +		crng_reseed(&primary_crng, &input_pool);
> +}

Hi,

Is the above function (which is now in 4.8-rc2) supposed to do
something? It seems to have no callers and the maximum value of
crng_init is 2.

-- 
Jan Varho
Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash
On Thu, Jul 28, 2016 at 6:41 AM, Theodore Ts'o wrote:
> On Thu, Jul 28, 2016 at 09:24:08AM +0200, Heiko Carstens wrote:
>>
>> Oh, I just realized that Linus pulled your changes. Actually I was hoping
>> we could get this fixed before the broken code would be merged.
>> Could you please make sure the bug fix gets included as soon as possible?
>
> Yes, I'll send the pull request to ASAP.

Also broke ia64. Same fix works for me.

Tested-by: Tony Luck
Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash
On Wed, 2016-07-27 at 23:46 -0400, Theodore Ts'o wrote:
> On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> >
> > it looks like your patch "random: make /dev/urandom scalable for silly
> > userspace programs" within linux-next seems to be a bit broken:
> >
> > It causes this allocation failure and subsequent crash on s390 with fake
> > NUMA enabled
>
> Thanks for reporting this.  This patch fixes things for you, yes?

trivia:

> diff --git a/drivers/char/random.c b/drivers/char/random.c
[]
> @@ -1668,13 +1668,12 @@ static int rand_initialize(void)
>  #ifdef CONFIG_NUMA
>  	pool = kmalloc(num_nodes * sizeof(void *),
>  		       GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);

The __GFP_ZERO is unusual and this could use kcalloc instead.

> -	for (i=0; i < num_nodes; i++) {
> +	for_each_online_node(i) {
>  		crng = kmalloc_node(sizeof(struct crng_state),
>  				    GFP_KERNEL | __GFP_NOFAIL, i);
>  		spin_lock_init(&crng->lock);
>  		crng_initialize(crng);
>  		pool[i] = crng;
> -
>  	}
>  	mb();
>  	crng_node_pool = pool;
Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash
On Thu, Jul 28, 2016 at 09:24:08AM +0200, Heiko Carstens wrote:
>
> Oh, I just realized that Linus pulled your changes. Actually I was hoping
> we could get this fixed before the broken code would be merged.
> Could you please make sure the bug fix gets included as soon as possible?

Yes, I'll send the pull request to ASAP.

					- Ted
Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash
On Thu, Jul 28, 2016 at 07:55:48AM +0200, Heiko Carstens wrote:
> On Wed, Jul 27, 2016 at 11:46:01PM -0400, Theodore Ts'o wrote:
> > On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> > > it looks like your patch "random: make /dev/urandom scalable for silly
> > > userspace programs" within linux-next seems to be a bit broken:
> > >
> > > It causes this allocation failure and subsequent crash on s390 with fake
> > > NUMA enabled
> >
> > Thanks for reporting this. This patch fixes things for you, yes?
> >
> > 	- Ted
>
> Yes, it does. It's actually the same what I did to fix this ;)

Oh, I just realized that Linus pulled your changes. Actually I was hoping
we could get this fixed before the broken code would be merged.
Could you please make sure the bug fix gets included as soon as possible?
Right now booting a kernel with any defconfig on s390 will crash because
of this.

I will also change the fake NUMA code on s390, since it doesn't make sense
to have possible but not online nodes (in this case).

> > commit 59b8d4f1f5d26e4ca92172ff6dcd1492cdb39613
> > Author: Theodore Ts'o <ty...@mit.edu>
> > Date:   Wed Jul 27 23:30:25 2016 -0400
> >
> >     random: use for_each_online_node() to iterate over NUMA nodes
> >
> >     This fixes a crash on s390 with fake NUMA enabled.
> >
> >     Reported-by: Heiko Carstens <heiko.carst...@de.ibm.com>
> >     Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
> >     Signed-off-by: Theodore Ts'o <ty...@mit.edu>
> >
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index 8d0af74..7f06224 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1668,13 +1668,12 @@ static int rand_initialize(void)
> >  #ifdef CONFIG_NUMA
> >  	pool = kmalloc(num_nodes * sizeof(void *),
> >  		       GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
> > -	for (i=0; i < num_nodes; i++) {
> > +	for_each_online_node(i) {
> >  		crng = kmalloc_node(sizeof(struct crng_state),
> >  				    GFP_KERNEL | __GFP_NOFAIL, i);
> >  		spin_lock_init(&crng->lock);
> >  		crng_initialize(crng);
> >  		pool[i] = crng;
> > -
> >  	}
> >  	mb();
> >  	crng_node_pool = pool;
Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash
On Wed, Jul 27, 2016 at 11:46:01PM -0400, Theodore Ts'o wrote:
> On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> > it looks like your patch "random: make /dev/urandom scalable for silly
> > userspace programs" within linux-next seems to be a bit broken:
> >
> > It causes this allocation failure and subsequent crash on s390 with fake
> > NUMA enabled
>
> Thanks for reporting this. This patch fixes things for you, yes?
>
> 	- Ted

Yes, it does. It's actually the same what I did to fix this ;)

> commit 59b8d4f1f5d26e4ca92172ff6dcd1492cdb39613
> Author: Theodore Ts'o <ty...@mit.edu>
> Date:   Wed Jul 27 23:30:25 2016 -0400
>
>     random: use for_each_online_node() to iterate over NUMA nodes
>
>     This fixes a crash on s390 with fake NUMA enabled.
>
>     Reported-by: Heiko Carstens <heiko.carst...@de.ibm.com>
>     Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
>     Signed-off-by: Theodore Ts'o <ty...@mit.edu>
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 8d0af74..7f06224 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1668,13 +1668,12 @@ static int rand_initialize(void)
>  #ifdef CONFIG_NUMA
>  	pool = kmalloc(num_nodes * sizeof(void *),
>  		       GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
> -	for (i=0; i < num_nodes; i++) {
> +	for_each_online_node(i) {
>  		crng = kmalloc_node(sizeof(struct crng_state),
>  				    GFP_KERNEL | __GFP_NOFAIL, i);
>  		spin_lock_init(&crng->lock);
>  		crng_initialize(crng);
>  		pool[i] = crng;
> -
>  	}
>  	mb();
>  	crng_node_pool = pool;
Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash
On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> it looks like your patch "random: make /dev/urandom scalable for silly
> userspace programs" within linux-next seems to be a bit broken:
>
> It causes this allocation failure and subsequent crash on s390 with fake
> NUMA enabled

Thanks for reporting this. This patch fixes things for you, yes?

					- Ted

commit 59b8d4f1f5d26e4ca92172ff6dcd1492cdb39613
Author: Theodore Ts'o <ty...@mit.edu>
Date:   Wed Jul 27 23:30:25 2016 -0400

    random: use for_each_online_node() to iterate over NUMA nodes

    This fixes a crash on s390 with fake NUMA enabled.

    Reported-by: Heiko Carstens <heiko.carst...@de.ibm.com>
    Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
    Signed-off-by: Theodore Ts'o <ty...@mit.edu>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8d0af74..7f06224 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1668,13 +1668,12 @@ static int rand_initialize(void)
 #ifdef CONFIG_NUMA
 	pool = kmalloc(num_nodes * sizeof(void *),
 		       GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
-	for (i=0; i < num_nodes; i++) {
+	for_each_online_node(i) {
 		crng = kmalloc_node(sizeof(struct crng_state),
 				    GFP_KERNEL | __GFP_NOFAIL, i);
 		spin_lock_init(&crng->lock);
 		crng_initialize(crng);
 		pool[i] = crng;
-
 	}
 	mb();
 	crng_node_pool = pool;
[BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash
Hi Ted,

it looks like your patch "random: make /dev/urandom scalable for silly
userspace programs" within linux-next seems to be a bit broken:

It causes this allocation failure and subsequent crash on s390 with fake
NUMA enabled:

[0.533195] SLUB: Unable to allocate memory on node 1, gfp=0x24008c0(GFP_KERNEL|__GFP_NOFAIL)
[0.533198] cache: kmalloc-192, object size: 192, buffer size: 528, default order: 3, min order: 0
[0.533202] node 0: slabs: 2, objs: 124, free: 17
[0.533208] Unable to handle kernel pointer dereference in virtual kernel address space
[0.533211] Failing address: TEID: 0483
...
[0.533276] Krnl PSW : 0704e0018000 001a853e (lockdep_init_map+0x1e/0x220)
[0.533281]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
           Krnl GPRS: 00a23400 370c8008 0060 00bedc90
[0.533285]            02070800 0001
[0.533287]            3743d3f8 3743d408 02070800 00bedc90
[0.533289]            0048 009c2030 370cfd00 370cfcc0
[0.533295] Krnl Code: 001a852e: a7840001		brc	8,1a8530
                      001a8532: e3f0ffc0ff71	lay	%r15,-64(%r15)
                     #001a8538: e3e0f0980024	stg	%r14,152(%r15)
                     >001a853e: e5482008		mvghi	8(%r2),0
                      001a8544: e5482010		mvghi	16(%r2),0
                      001a854a: 58100370		l	%r1,880
                      001a854e: 50102020		st	%r1,32(%r2)
                      001a8552: b90400c2		lgr	%r12,%r2
[0.533313] Call Trace:
[0.533315] ([<0001>] 0x1)
[0.533318] ([<001b4220>] __raw_spin_lock_init+0x50/0x80)
[0.533320] ([<00759e7a>] rand_initialize+0xc2/0xf0)
[0.533322] ([<001002cc>] do_one_initcall+0xb4/0x140)
[0.533325] ([<00ef2cc0>] kernel_init_freeable+0x140/0x2d8)
[0.533328] ([<009b07ea>] kernel_init+0x2a/0x150)
[0.50] ([<009bd782>] kernel_thread_starter+0x6/0xc)
[0.52] ([<009bd77c>] kernel_thread_starter+0x0/0xc)

To me it looks rand_initialize is broken with CONFIG_NUMA:

static int rand_initialize(void)
{
#ifdef CONFIG_NUMA
	int i;
	int num_nodes = num_possible_nodes();
	struct crng_state *crng;
	struct crng_state **pool;
#endif

	init_std_data(&input_pool);
	init_std_data(&blocking_pool);
	crng_initialize(&primary_crng);

#ifdef CONFIG_NUMA
	pool = kmalloc(num_nodes * sizeof(void *),
		       GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
	for (i=0; i < num_nodes; i++) {
		crng = kmalloc_node(sizeof(struct crng_state),
				    GFP_KERNEL | __GFP_NOFAIL, i);
		spin_lock_init(&crng->lock);
		crng_initialize(crng);
		pool[i] = crng;
	}
	mb();
	crng_node_pool = pool;
#endif
	return 0;
}
early_initcall(rand_initialize);

First the for loop should use for_each_node() to skip not possible nodes,
no? However that wouldn't be enough, since in this case it crashed because
node 1 is in the possible map, but it isn't online and doesn't have any
memory, which explains why the allocation fails and the subsequent crash
when calling spin_lock_init().

I think the proper fix would be to simply use for_each_online_node(); at
least that fixes the crash on s390.
[PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs
On a system with a 4 socket (NUMA) system where a large number of
application threads were all trying to read from /dev/urandom, this can
result in the system spending 80% of its time contending on the global
urandom spinlock. The application should have used its own PRNG, but
let's try to help it from running, lemming-like, straight over the
locking cliff.

Reported-by: Andi Kleen
Signed-off-by: Theodore Ts'o
---
 drivers/char/random.c | 62 +++
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 841f9a8..d640865 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -434,6 +434,8 @@ struct crng_state primary_crng = {
  */
 static int crng_init = 0;
 #define crng_ready() (likely(crng_init > 0))
+static void _extract_crng(struct crng_state *crng,
+			  __u8 out[CHACHA20_BLOCK_SIZE]);
 static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
 static void process_random_ready_list(void);
 
@@ -754,6 +756,16 @@ static void credit_entropy_bits_safe(struct entropy_store *r, int nbits)
 
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace progams when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
 static void crng_initialize(struct crng_state *crng)
 {
 	int i;
@@ -815,7 +827,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
 		if (num == 0)
 			return;
 	} else
-		extract_crng(buf.block);
+		_extract_crng(&primary_crng, buf.block);
 	spin_lock_irqsave(&primary_crng.lock, flags);
 	for (i = 0; i < 8; i++) {
 		unsigned long	rv;
@@ -835,19 +847,26 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
 	spin_unlock_irqrestore(&primary_crng.lock, flags);
 }
 
+static inline void maybe_reseed_primary_crng(void)
+{
+	if (crng_init > 2 &&
+	    time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+		crng_reseed(&primary_crng, &input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
 	wait_event_interruptible(crng_init_wait, crng_ready());
 }
 
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+			  __u8 out[CHACHA20_BLOCK_SIZE])
 {
 	unsigned long v, flags;
-	struct crng_state *crng = &primary_crng;
 
 	if (crng_init > 1 &&
 	    time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-		crng_reseed(crng, &input_pool);
+		crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
 	spin_lock_irqsave(&crng->lock, flags);
 	if (arch_get_random_long(&v))
 		crng->state[14] ^= v;
@@ -857,6 +876,19 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
 	spin_unlock_irqrestore(&crng->lock, flags);
 }
 
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+	struct crng_state *crng = NULL;
+
+#ifdef CONFIG_NUMA
+	if (crng_node_pool)
+		crng = crng_node_pool[numa_node_id()];
+	if (crng == NULL)
+#endif
+		crng = &primary_crng;
+	_extract_crng(crng, out);
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
 	ssize_t ret = 0, i;
@@ -1575,9 +1607,31 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct crng_state *crng;
+	struct crng_state **pool;
+#endif
+
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	crng_initialize(&primary_crng);
+
+#ifdef CONFIG_NUMA
+	pool = kmalloc(num_nodes * sizeof(void *),
+		       GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
+	for (i=0; i < num_nodes; i++) {
+		crng = kmalloc_node(sizeof(struct crng_state),
+				    GFP_KERNEL | __GFP_NOFAIL, i);
+		spin_lock_init(&crng->lock);
+		crng_initialize(crng);
+		pool[i] = crng;
+
+	}
+	mb();
+	crng_node_pool = pool;
+#endif
 	return 0;
 }
 early_initcall(rand_initialize);
-- 
2.5.0
[PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs
On a 4-socket NUMA system where a large number of application threads were all trying to read from /dev/urandom, the system could spend 80% of its time contending on the global urandom spinlock. The application should have used its own PRNG, but let's try to keep it from running, lemming-like, straight over the locking cliff.

Reported-by: Andi Kleen
Signed-off-by: Theodore Ts'o
---
 drivers/char/random.c | 62 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 841f9a8..d640865 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -434,6 +434,8 @@ struct crng_state primary_crng = {
  */
 static int crng_init = 0;
 #define crng_ready() (likely(crng_init > 0))
+static void _extract_crng(struct crng_state *crng,
+                          __u8 out[CHACHA20_BLOCK_SIZE]);
 static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
 static void process_random_ready_list(void);

@@ -754,6 +756,16 @@ static void credit_entropy_bits_safe(struct entropy_store *r, int nbits)

 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);

+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
 static void crng_initialize(struct crng_state *crng)
 {
         int i;
@@ -815,7 +827,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
                 if (num == 0)
                         return;
         } else
-                extract_crng(buf.block);
+                _extract_crng(&primary_crng, buf.block);
         spin_lock_irqsave(&primary_crng.lock, flags);
         for (i = 0; i < 8; i++) {
                 unsigned long rv;
@@ -835,19 +847,26 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
         spin_unlock_irqrestore(&primary_crng.lock, flags);
 }

+static inline void maybe_reseed_primary_crng(void)
+{
+        if (crng_init > 2 &&
+            time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+                crng_reseed(&primary_crng, &input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
         wait_event_interruptible(crng_init_wait, crng_ready());
 }

-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+                          __u8 out[CHACHA20_BLOCK_SIZE])
 {
         unsigned long v, flags;
-        struct crng_state *crng = &primary_crng;

         if (crng_init > 1 &&
             time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-                crng_reseed(crng, &input_pool);
+                crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
         spin_lock_irqsave(&crng->lock, flags);
         if (arch_get_random_long(&v))
                 crng->state[14] ^= v;
@@ -857,6 +876,19 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
         spin_unlock_irqrestore(&crng->lock, flags);
 }

+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+        struct crng_state *crng = NULL;
+
+#ifdef CONFIG_NUMA
+        if (crng_node_pool)
+                crng = crng_node_pool[numa_node_id()];
+        if (crng == NULL)
+#endif
+                crng = &primary_crng;
+        _extract_crng(crng, out);
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
         ssize_t ret = 0, i;
@@ -1575,9 +1607,31 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+        int i;
+        int num_nodes = num_possible_nodes();
+        struct crng_state *crng;
+        struct crng_state **pool;
+#endif
+
         init_std_data(&input_pool);
         init_std_data(&blocking_pool);
         crng_initialize(&primary_crng);
+
+#ifdef CONFIG_NUMA
+        pool = kmalloc(num_nodes * sizeof(void *),
+                       GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
+        for (i = 0; i < num_nodes; i++) {
+                crng = kmalloc_node(sizeof(struct crng_state),
+                                    GFP_KERNEL | __GFP_NOFAIL, i);
+                spin_lock_init(&crng->lock);
+                crng_initialize(crng);
+                pool[i] = crng;
+        }
+        mb();
+        crng_node_pool = pool;
+#endif
        return 0;
 }
 early_initcall(rand_initialize);
--
2.5.0
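For readers following along outside the kernel tree, the node-lookup logic this patch adds to extract_crng() can be modeled in plain userspace C. Everything below is an illustrative assumption rather than kernel code: the toy mixing step stands in for ChaCha20, numa_node_id() becomes an explicit parameter, and the `uses` counter exists only so the selection can be observed. Only the shape mirrors the patch: use the per-node state when the pool exists, otherwise fall back to the primary.

```c
/*
 * Userspace model of the per-node CRNG selection added in patch 6/7.
 * The real generator (ChaCha20) is replaced by a toy mixing step, and
 * numa_node_id() is replaced by an explicit parameter, so this only
 * demonstrates the lookup-with-fallback shape, not the cryptography.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_NODES   4
#define BLOCK_SIZE 64

struct crng_state {
        uint32_t state[16];     /* generator state, ChaCha20-shaped */
        unsigned long uses;     /* how many blocks this state produced */
};

static struct crng_state primary_crng;
static struct crng_state *crng_node_pool[NR_NODES];

static void _extract_crng(struct crng_state *crng, uint8_t out[BLOCK_SIZE])
{
        int i;

        /* Toy output derivation standing in for the ChaCha20 block function. */
        for (i = 0; i < BLOCK_SIZE; i++)
                out[i] = (uint8_t)(crng->state[i % 16] >> (i % 24));
        crng->state[12]++;      /* advance the block counter */
        crng->uses++;
}

/* Same shape as the patched extract_crng(): prefer the caller's node
 * state, fall back to the primary while the pool is not yet set up. */
static void extract_crng(int node_id, uint8_t out[BLOCK_SIZE])
{
        struct crng_state *crng = NULL;

        if (crng_node_pool[0])
                crng = crng_node_pool[node_id];
        if (crng == NULL)
                crng = &primary_crng;
        _extract_crng(crng, out);
}

/* Mirrors the rand_initialize() NUMA block: one state per node. */
static void init_node_pool(void)
{
        int i;

        for (i = 0; i < NR_NODES; i++) {
                struct crng_state *c = calloc(1, sizeof(*c));
                c->state[0] = 0x61707865u + (uint32_t)i;  /* distinct seed */
                crng_node_pool[i] = c;
        }
}
```

The fallback branch matters because readers can arrive before rand_initialize() has published the pool; after that, each node's readers only ever touch their own lock.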
Re: [PATCH 2/5] random: make /dev/urandom scalable for silly userspace programs
On Mon, May 30, 2016 at 08:03:59AM +0200, Stephan Mueller wrote:
> > static int rand_initialize(void)
> > {
> > +#ifdef CONFIG_NUMA
> > +        int i;
> > +        int num_nodes = num_possible_nodes();
> > +        struct crng_state *crng;
> > +
> > +        crng_node_pool = kmalloc(num_nodes * sizeof(void *),
> > +                                 GFP_KERNEL|__GFP_NOFAIL);
> > +
> > +        for (i=0; i < num_nodes; i++) {
> > +                crng = kmalloc(sizeof(struct crng_state),
> > +                               GFP_KERNEL | __GFP_NOFAIL);
> > +                initialize_crng(crng);
>
> Could you please help me understand the logic flow here: the NUMA
> secondary DRNGs are initialized before the input/blocking pools and the
> primary DRNG.
>
> The initialization call uses get_random_bytes() for the secondary DRNGs.
> But since the primary DRNG is not yet initialized, where does
> get_random_bytes() get its randomness from?

Yeah, I screwed up. The hunk of code starting with "crng_node_pool =
kmalloc(..." and the for loop afterwards should be moved to after
_initialize_crng(). Serves me right for not testing CONFIG_NUMA before
sending out the patches.

This is *not* about adding entropy; as you've noted, this is done very
early in the boot process, before there has been any chance for any kind
of entropy to be collected in any of the input pools. It's more of an
insurance policy: just in case it turns out on some platform that
assuming a bit's worth of entropy per interrupt was hopelessly
over-optimistic, at least the starting point will be different across
different kernels (and maybe across different boot times; but on the
sorts of platforms I'm most concerned about, there may not be a
real-time clock, and almost certainly not an architectural hwrng in the
CPU).

- Ted
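The fix Ted describes is purely an ordering change, and the failure mode Stephan spotted is easy to reproduce in miniature. The sketch below is a hypothetical model, not kernel code: the LCG, the `primary_ready` flag, and all names are illustrative stand-ins. It shows why seeding the per-node states before the primary generator is initialized leaves every node with the same all-zero starting point.

```c
/*
 * Toy model of the init-ordering bug: secondary (per-node) states are
 * seeded from the primary generator, so creating them before the
 * primary is initialized gives every node an identical, all-zero seed.
 * The LCG and the primary_ready flag are illustrative stand-ins.
 */
#include <assert.h>
#include <stdint.h>

#define NR_NODES 4

static int primary_ready;
static uint32_t primary_state = 0x6a09e667u;

/* Stand-in for get_random_bytes(): useless until the primary is up. */
static uint32_t get_random_word(void)
{
        if (!primary_ready)
                return 0;
        primary_state = primary_state * 1664525u + 1013904223u;
        return primary_state;
}

static uint32_t node_seed[NR_NODES];

static void seed_nodes(void)
{
        int i;

        for (i = 0; i < NR_NODES; i++)
                node_seed[i] = get_random_word();
}
```

With the buggy ordering every node_seed entry is zero; initializing the primary first gives each node a distinct starting point, which is why the kmalloc/loop block has to move below _initialize_crng().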
Re: [PATCH 2/5] random: make /dev/urandom scalable for silly userspace programs
On Monday, 30 May 2016 at 01:39:22, Theodore Ts'o wrote:

Hi Theodore,

> On a 4-socket NUMA system where a large number of application threads
> were all trying to read from /dev/urandom, the system could spend 80%
> of its time contending on the global urandom spinlock. The application
> should have used its own PRNG, but let's try to keep it from running,
> lemming-like, straight over the locking cliff.
>
> Reported-by: Andi Kleen
> Signed-off-by: Theodore Ts'o
> ---
>  drivers/char/random.c | 57 ++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 53 insertions(+), 4 deletions(-)

[...]

> @@ -1573,6 +1606,22 @@ static void init_std_data(struct entropy_store *r)
>   */
>  static int rand_initialize(void)
>  {
> +#ifdef CONFIG_NUMA
> +        int i;
> +        int num_nodes = num_possible_nodes();
> +        struct crng_state *crng;
> +
> +        crng_node_pool = kmalloc(num_nodes * sizeof(void *),
> +                                 GFP_KERNEL|__GFP_NOFAIL);
> +
> +        for (i=0; i < num_nodes; i++) {
> +                crng = kmalloc(sizeof(struct crng_state),
> +                               GFP_KERNEL | __GFP_NOFAIL);
> +                initialize_crng(crng);

Could you please help me understand the logic flow here: the NUMA
secondary DRNGs are initialized before the input/blocking pools and the
primary DRNG.

The initialization call uses get_random_bytes() for the secondary DRNGs.
But since the primary DRNG is not yet initialized, where does
get_random_bytes() get its randomness from?
[PATCH 2/5] random: make /dev/urandom scalable for silly userspace programs
On a 4-socket NUMA system where a large number of application threads were all trying to read from /dev/urandom, the system could spend 80% of its time contending on the global urandom spinlock. The application should have used its own PRNG, but let's try to keep it from running, lemming-like, straight over the locking cliff.

Reported-by: Andi Kleen
Signed-off-by: Theodore Ts'o
---
 drivers/char/random.c | 57 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3a4d865..1778c84 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -434,6 +434,8 @@ struct crng_state primary_crng = {
  */
 static int crng_init = 0;
 #define crng_ready() (likely(crng_init >= 2))
+static void _extract_crng(struct crng_state *crng,
+                          __u8 out[CHACHA20_BLOCK_SIZE]);
 static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
 static void process_random_ready_list(void);

@@ -754,6 +756,17 @@ static void credit_entropy_bits_safe(struct entropy_store *r, int nbits)

 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);

+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
+
 static void _initialize_crng(struct crng_state *crng)
 {
         int i;
@@ -774,11 +787,13 @@ static void _initialize_crng(struct crng_state *crng)
         crng->init_time = jiffies - CRNG_RESEED_INTERVAL - 1;
 }

+#ifdef CONFIG_NUMA
 static void initialize_crng(struct crng_state *crng)
 {
         _initialize_crng(crng);
         spin_lock_init(&crng->lock);
 }
+#endif

 static int crng_fast_load(__u32 pool[4])
 {
@@ -818,7 +833,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
                 if (num == 0)
                         return;
         } else
-                extract_crng(buf.block);
+                _extract_crng(&primary_crng, buf.block);
         spin_lock_irqsave(&primary_crng.lock, flags);
         for (i = 0; i < 8; i++) {
                 unsigned long rv;
@@ -838,19 +853,26 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
         spin_unlock_irqrestore(&primary_crng.lock, flags);
 }

+static inline void maybe_reseed_primary_crng(void)
+{
+        if (crng_init > 2 &&
+            time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+                crng_reseed(&primary_crng, &input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
         wait_event_interruptible(crng_init_wait, crng_ready());
 }

-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+                          __u8 out[CHACHA20_BLOCK_SIZE])
 {
         unsigned long v, flags;
-        struct crng_state *crng = &primary_crng;

         if (crng_init > 2 &&
             time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-                crng_reseed(crng, &input_pool);
+                crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
         spin_lock_irqsave(&crng->lock, flags);
         if (arch_get_random_long(&v))
                 crng->state[14] ^= v;
@@ -860,6 +882,17 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
         spin_unlock_irqrestore(&crng->lock, flags);
 }

+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+#ifndef CONFIG_NUMA
+        struct crng_state *crng = &primary_crng;
+#else
+        struct crng_state *crng = crng_node_pool[numa_node_id()];
+#endif
+
+        _extract_crng(crng, out);
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
         ssize_t ret = 0, i;
@@ -1573,6 +1606,22 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+        int i;
+        int num_nodes = num_possible_nodes();
+        struct crng_state *crng;
+
+        crng_node_pool = kmalloc(num_nodes * sizeof(void *),
+                                 GFP_KERNEL|__GFP_NOFAIL);
+
+        for (i=0; i < num_nodes; i++) {
+                crng = kmalloc(sizeof(struct crng_state),
+                               GFP_KERNEL | __GFP_NOFAIL);
+                initialize_crng(crng);
+                crng_node_pool[i] = crng;
+        }
+#endif
         init_std_data(&input_pool);
         init_std_data(&blocking_pool);
         _initialize_crng(&primary_crng);
--
2.5.0
[PATCH 2/4] random: make /dev/urandom scalable for silly userspace programs
On a 4-socket NUMA system where a large number of application threads were all trying to read from /dev/urandom, the system could spend 80% of its time contending on the global urandom spinlock. The application should have used its own PRNG, but let's try to keep it from running, lemming-like, straight over the locking cliff.

Reported-by: Andi Kleen
Signed-off-by: Theodore Ts'o
---
 drivers/char/random.c | 67 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 91d5c2a..b970db6 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -749,6 +749,17 @@ struct crng_state primary_crng = {
 };
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);

+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
+
 static void _initialize_crng(struct crng_state *crng)
 {
         int i;
@@ -764,11 +775,13 @@ static void _initialize_crng(struct crng_state *crng)
         crng->init_time = jiffies - CRNG_RESEED_INTERVAL;
 }

+#ifdef CONFIG_NUMA
 static void initialize_crng(struct crng_state *crng)
 {
         _initialize_crng(crng);
         spin_lock_init(&crng->lock);
 }
+#endif

 static int crng_fast_load(__u32 pool[4])
 {
@@ -823,19 +836,23 @@ out:
         return ret;
 }

+static inline void maybe_reseed_primary_crng(void)
+{
+        if (crng_init > 2 &&
+            time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+                crng_reseed(&input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
         wait_event_interruptible(crng_init_wait, crng_ready());
 }

-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+                          __u8 out[CHACHA20_BLOCK_SIZE])
 {
         unsigned long v, flags;
-        struct crng_state *crng = &primary_crng;

-        if (crng_init > 2 &&
-            time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-                crng_reseed(&input_pool);
         spin_lock_irqsave(&crng->lock, flags);
         if (arch_get_random_long(&v))
                 crng->state[14] ^= v;
@@ -845,6 +862,30 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
         spin_unlock_irqrestore(&crng->lock, flags);
 }

+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+#ifndef CONFIG_NUMA
+        maybe_reseed_primary_crng();
+        _extract_crng(&primary_crng, out);
+#else
+        int node_id = numa_node_id();
+        struct crng_state *crng = crng_node_pool[node_id];
+
+        if (time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)) {
+                unsigned long flags;
+
+                maybe_reseed_primary_crng();
+                _extract_crng(&primary_crng, out);
+                spin_lock_irqsave(&crng->lock, flags);
+                memcpy(&crng->state[4], out, CHACHA20_KEY_SIZE);
+                crng->state[15] = node_id;
+                crng->init_time = jiffies;
+                spin_unlock_irqrestore(&crng->lock, flags);
+        }
+        _extract_crng(crng, out);
+#endif
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
         ssize_t ret = 0, i;
@@ -1549,6 +1590,22 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+        int i;
+        int num_nodes = num_possible_nodes();
+        struct crng_state *crng;
+
+        crng_node_pool = kmalloc(num_nodes * sizeof(void *),
+                                 GFP_KERNEL|__GFP_NOFAIL);
+
+        for (i=0; i < num_nodes; i++) {
+                crng = kmalloc(sizeof(struct crng_state),
+                               GFP_KERNEL | __GFP_NOFAIL);
+                initialize_crng(crng);
+                crng_node_pool[i] = crng;
+        }
+#endif
         init_std_data(&input_pool);
         init_std_data(&blocking_pool);
         _initialize_crng(&primary_crng);
--
2.5.0
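This 2/4 version reseeds a node's generator differently from the later postings: it rekeys the node state from a fresh primary output block and folds the node id into the state. The standalone sketch below isolates just that rekeying step. The state layout is assumed to follow ChaCha20's 16-word block (the 256-bit key in words 4..11, as the memcpy of CHACHA20_KEY_SIZE suggests); the function name is illustrative.

```c
/*
 * Sketch of the rekeying step from the 2/4 patch: copy a 256-bit key
 * out of a fresh primary output block into state words 4..11, then
 * stamp the node id into word 15 so two nodes rekeyed from the same
 * block still end up with distinct states.  Layout assumptions follow
 * ChaCha20's 16 x 32-bit word block; names are illustrative.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KEY_BYTES 32            /* CHACHA20_KEY_SIZE */

struct crng_state {
        uint32_t state[16];
};

static void reseed_node_from_primary(struct crng_state *crng,
                                     const uint8_t primary_out[64],
                                     uint32_t node_id)
{
        /* Key words 4..11 come from the primary's output block... */
        memcpy(&crng->state[4], primary_out, KEY_BYTES);
        /* ...and word 15 is stamped with the node id. */
        crng->state[15] = node_id;
}
```

The node-id stamp is the interesting design choice: even if every node happened to be rekeyed from the same primary block, no two nodes would ever run with an identical state.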
Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs
On Monday, 2 May 2016, 09:48:57, Theodore Ts'o wrote:

Hi Theodore,

> On Mon, May 02, 2016 at 08:50:14AM -0400, Theodore Ts'o wrote:
> > > - entropy pool draining: when having a timer-based reseeding on a quiet
> > > system, the entropy pool can be drained during the expiry of the timer.
> > > So, I tried to handle that by increasing the timer by, say, 100 seconds
> > > for each new NUMA node. Note, even the baseline of 300 seconds with
> > > CRNG_RESEED_INTERVAL is low. When I experimented with that on a KVM
> > > test system and left it quiet, entropy pool draining was prevented at
> > > around 500 seconds.
>
> One other thought.  If your KVM test system was completely quiet, then
> all of the entropy was coming from timer interrupts.  It is an open
> question whether an adversary could predict the bit of "entropy" you
> are generating with better than 50% probability if both the host and
> the guest system are quiescent.  And if they can, then maybe assuming
> one bit of entropy per interrupt might be too optimistic.

It is not entirely quiet -- systemd likes to dump data on the disk once
in a while, so it is not a timer interrupt that I see. Note, my test
system runs as a tickless kernel.

> This is especially true on bare metal where very often, especially on
> smaller machines, there is a single oscillator from which all of the
> clocks on the SOC or motherboard are derived.  There is a reason why I
> was being ultra conservative in sampling 64 interrupts into a 32-bit
> fast-mix pool before mixing it into the input pool, and only crediting
> the pool with a single bit of entropy each time I did this.

As I do not think that we see any timer interrupts, this argument may
not apply here. Besides, I have not seen any timer interrupts lately
(with or without a tickless kernel). Thus, which interrupt do you think
is a timer interrupt?

Ciao
Stephan
Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs
On Mon, May 02, 2016 at 08:50:14AM -0400, Theodore Ts'o wrote:
> > - entropy pool draining: when having a timer-based reseeding on a quiet
> > system, the entropy pool can be drained during the expiry of the timer.
> > So, I tried to handle that by increasing the timer by, say, 100 seconds
> > for each new NUMA node. Note, even the baseline of 300 seconds with
> > CRNG_RESEED_INTERVAL is low. When I experimented with that on a KVM
> > test system and left it quiet, entropy pool draining was prevented at
> > around 500 seconds.

One other thought.  If your KVM test system was completely quiet, then
all of the entropy was coming from timer interrupts.  It is an open
question whether an adversary could predict the bit of "entropy" you
are generating with better than 50% probability if both the host and
the guest system are quiescent.  And if they can, then maybe assuming
one bit of entropy per interrupt might be too optimistic.

This is especially true on bare metal where very often, especially on
smaller machines, there is a single oscillator from which all of the
clocks on the SOC or motherboard are derived.  There is a reason why I
was being ultra conservative in sampling 64 interrupts into a 32-bit
fast-mix pool before mixing it into the input pool, and only crediting
the pool with a single bit of entropy each time I did this.

(It's also because of this conservatism that I was comfortable with
having add_disk_randomness give some extra credit for interrupts that
are probably more likely to be hard to predict by an adversary,
especially if the interrupts are coming from a device with spinning
rust platters.)

- Ted
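The crediting policy described above -- sample 64 interrupts into a small 32-bit fast-mix pool, then credit the input pool with just one bit of entropy -- can be sketched in a userspace toy model. All names here (fast_pool, fast_mix_sample, credited_bits) and the rotate-mix step are illustrative assumptions, not the kernel's actual fast_mix() implementation:

```c
#include <stdint.h>

/* Toy model of the conservative crediting policy: 64 interrupt samples
 * are mixed into a pool of four 32-bit words, and only then is the
 * input pool credited with a SINGLE bit of entropy. */
struct fast_pool {
    uint32_t pool[4];
    unsigned count;
};

static unsigned credited_bits;  /* entropy bits credited so far */

static void fast_mix_sample(struct fast_pool *f, uint32_t sample)
{
    uint32_t w = f->pool[f->count % 4] ^ sample;

    f->pool[f->count % 4] = (w << 7) | (w >> 25);  /* toy rotate-mix */
    if (++f->count == 64) {     /* 64 interrupts sampled ...          */
        credited_bits += 1;     /* ... credit exactly one bit         */
        f->count = 0;
    }
}
```

However unpredictable the individual samples look, 128 samples credit exactly two bits under this policy.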
Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs
On Mon, May 02, 2016 at 09:00:22AM +0200, Stephan Mueller wrote:
> - reseed avalanche: I see that you added a time-based reseed code too
> (I am glad about that one). What I fear is that there is a reseed
> avalanche when the various RNGs are seeded initially closely after
> each other (and thus the reseed timers will expire at the same time).
> That would mean that they can be reseeded all at the same time again
> when the timer-based threshold expires and drain the input_pool such
> that, if you have many nodes, the input pool will not have sufficient
> capacity (I am not speaking about entropy, but the potential to store
> entropy) to satisfy all RNGs at the same time. Hence, we would then
> have the potential to have entropy-starved RNGs.

The crng is a CRNG, not an entropy pool, so we don't pretend to track
entropy on the CRNGs at all.  The current rule is that when you draw
from a crng, if it has been over 5 minutes, it will reseed from its
"parent" source.  In the case of the primary_crng, that means drawing
between 128 and 256 bits of entropy from the input pool.  In the
per-NUMA-node case, the secondary CRNGs draw from the primary_crng.

So if there are many secondary (per-NUMA-node) CRNGs that are seeded
within five minutes of each other, the input pool only gets drawn down
once to seed the primary_crng.  The per-NUMA-node crngs feed from the
primary crng, and absent some catastrophic security breach where the
adversary can read kernel memory (at which point you're toast anyway),
the output of the primary_crng is never exposed directly outside of
the system.  So even if you have some crazy SGI system with 1024 NUMA
nodes, the primary_crng will only be generating at most 32k worth of
data to seed the secondary crngs before it gets reseeded --- and the
input pool is only going to be debited at most 128-256 bits of entropy
each time.

I thought about using the primary_crng to serve double duty as the
CRNG for NUMA node 0, but I decided that on a NUMA system you have TBs
and TBs of memory, so blowing another 80 bytes or so on a separate
primary_crng state makes the security analysis much simpler, and the
code much simpler.  I also thought about only dynamically initializing
a node_id's CRNG if a spin_trylock on node 0's CRNG failed, but again
decided against it in the interests of keeping things simple, and
because NUMA people can afford to be profligate with memory --- they
are blowing way more than 80 bytes per NUMA node anyway.  Besides,
manufacturers of crazy-expensive NUMA systems have to feed their
children, too.  :-)

> - entropy pool draining: when having a timer-based reseeding on a quiet
> system, the entropy pool can be drained during the expiry of the timer.
> So, I tried to handle that by increasing the timer by, say, 100 seconds
> for each new NUMA node. Note, even the baseline of 300 seconds with
> CRNG_RESEED_INTERVAL is low. When I experimented with that on a KVM
> test system and left it quiet, entropy pool draining was prevented at
> around 500 seconds.

Sure, but if no one is actually *using* the system, who cares whether
the input pool's entropy is getting drawn down?  The usual reason to
worry about reseeding frequently is that the system is generating a
huge amount of randomness for some reason.  This might be a good
reason (you're running an IPsec server and generating lots of IKE
session keys) or it might be a really stupid reason (dd
if=/dev/urandom of=/dev/sdX bs=4k), but either way, there will be lots
of disk or networking interrupts to feed the input pool.

I have thought about adding something a bit more sophisticated to
control the reseed logic (tracking the amount of data used, or making
the reseed interval adjustable, or dynamically adjustable), but this
was the simplest thing to do as a starting point.

Besides, for the people who believe it's realistic to write academic
papers about recovering from catastrophic security exposures where the
bad guy can read arbitrary kernel memory and somehow has _not_ managed
to bootstrap that into a full privilege escalation attack and
installed a backdoor into your BIOS so that you are permanently pwned,
they might be happy that we will be trying to recover within 5
minutes.  :-)

- Ted
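The two-level reseed hierarchy described above -- per-node CRNGs rekey from the primary, and only the primary debits the input pool -- can be sketched with a toy model. All names, the fake clock, and the generation counter are assumptions for the example, not the patch's actual types:

```c
#include <stdint.h>

/* Illustrative sketch of the reseed hierarchy: a node crng that finds
 * its interval expired rekeys from the primary, and the primary in
 * turn reseeds from the input pool only if ITS interval expired.
 * Many node reseeds inside one interval cost a single pool draw. */
#define RESEED_INTERVAL 300  /* seconds; stand-in for CRNG_RESEED_INTERVAL */

struct toy_crng {
    uint64_t init_time;
    unsigned generation;     /* which primary seeding this key came from */
};

static struct toy_crng primary;
static unsigned pool_draws;  /* times the input pool was debited */

static void reseed_from_pool(struct toy_crng *c, uint64_t now)
{
    pool_draws++;            /* 128-256 bits debited from the input pool */
    c->init_time = now;
    c->generation++;
}

static void node_extract(struct toy_crng *node, uint64_t now)
{
    if (now - node->init_time >= RESEED_INTERVAL) {
        if (now - primary.init_time >= RESEED_INTERVAL)
            reseed_from_pool(&primary, now);
        node->init_time = now;               /* rekeyed from primary */
        node->generation = primary.generation;
    }
}
```

Two nodes reseeding a second apart still debit the input pool only once: the second node finds the primary freshly reseeded and draws from it directly.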
Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs
On Monday, 2 May 2016, 02:26:52, Theodore Ts'o wrote:

Hi Theodore,

I have not digested the patch set yet, but I have the following
questions about your patch set.

> On a 4-socket NUMA system where a large number of application
> processes were all trying to read from /dev/urandom, this can result
> in the system spending 80% of its time contending on the global
> urandom spinlock. The application should have used its own PRNG, but
> let's try to help it from running, lemming-like, straight over the
> locking cliff.

- initialization: In my DRBG-based patch set I tried to serialize the
initialization of the per-NUMA-node RNGs as follows: first the node 0
pool is seeded completely, followed by the other nodes in a completely
serial fashion. If during that initialization time, say, node 3 wants
some random number but the RNG for node 3 is not yet fully seeded, it
goes back to the "default" RNG of node 0. This way it is ensured that
we try to have properly seeded RNGs even during heavy load at boot
time. Would that make sense here?

- reseed avalanche: I see that you added a time-based reseed code too
(I am glad about that one). What I fear is that there is a reseed
avalanche when the various RNGs are seeded initially closely after
each other (and thus the reseed timers will expire at the same time).
That would mean that they can be reseeded all at the same time again
when the timer-based threshold expires and drain the input_pool such
that, if you have many nodes, the input pool will not have sufficient
capacity (I am not speaking about entropy, but the potential to store
entropy) to satisfy all RNGs at the same time. Hence, we would then
have the potential to have entropy-starved RNGs.

- entropy pool draining: when having a timer-based reseeding on a
quiet system, the entropy pool can be drained during the expiry of the
timer. So, I tried to handle that by increasing the timer by, say, 100
seconds for each new NUMA node. Note, even the baseline of 300 seconds
with CRNG_RESEED_INTERVAL is low. When I experimented with that on a
KVM test system and left it quiet, entropy pool draining was prevented
at around 500 seconds.

Ciao
Stephan
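The fallback policy Stephan proposes -- serve requests for a not-yet-seeded node from node 0's fully seeded RNG -- amounts to a tiny selection routine. The names here (node_rng, select_rng) are hypothetical, for illustration only:

```c
#include <stddef.h>

/* Sketch of serialized per-node initialization with a node-0 fallback:
 * node 0 is seeded first; until a node's own RNG is fully seeded, any
 * request on that node is served by node 0's RNG instead. */
#define MAX_NODES 4

struct node_rng {
    int seeded;   /* set once this node's RNG is fully seeded */
};

static struct node_rng rngs[MAX_NODES];

static struct node_rng *select_rng(int node)
{
    if (rngs[node].seeded)
        return &rngs[node];
    return &rngs[0];   /* fall back to the "default" RNG of node 0 */
}
```

A request on node 3 before its RNG is seeded transparently lands on node 0; once node 3 finishes seeding, it gets its own RNG.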
[PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs
On a 4-socket NUMA system where a large number of application
processes were all trying to read from /dev/urandom, this can result
in the system spending 80% of its time contending on the global
urandom spinlock.  The application should have used its own PRNG, but
let's try to help it from running, lemming-like, straight over the
locking cliff.

Reported-by: Andi Kleen
Signed-off-by: Theodore Ts'o
---
 drivers/char/random.c | 67 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 95f4451..d5bb3b3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -746,6 +746,17 @@ struct crng_state primary_crng = {
 };
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace progams when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
+
 static void _initialize_crng(struct crng_state *crng)
 {
 	int i;
@@ -761,11 +772,13 @@ static void _initialize_crng(struct crng_state *crng)
 	crng->init_time = jiffies - CRNG_RESEED_INTERVAL;
 }
 
+#ifdef CONFIG_NUMA
 static void initialize_crng(struct crng_state *crng)
 {
 	_initialize_crng(crng);
 	spin_lock_init(&crng->lock);
 }
+#endif
 
 static int crng_fast_load(__u32 pool[4])
 {
@@ -822,19 +835,23 @@ out:
 	return ret;
 }
 
+static inline void maybe_reseed_primary_crng(void)
+{
+	if (crng_init > 2 &&
+	    time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+		crng_reseed(&input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
 	wait_event_interruptible(crng_init_wait, crng_ready());
 }
 
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+			  __u8 out[CHACHA20_BLOCK_SIZE])
 {
 	unsigned long v, flags;
-	struct crng_state *crng = &primary_crng;
 
-	if (crng_init > 2 &&
-	    time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-		crng_reseed(&input_pool);
 	spin_lock_irqsave(&crng->lock, flags);
 	if (arch_get_random_long(&v))
 		crng->state[14] ^= v;
@@ -844,6 +861,30 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
 	spin_unlock_irqrestore(&crng->lock, flags);
 }
 
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+#ifndef CONFIG_NUMA
+	maybe_reseed_primary_crng();
+	_extract_crng(&primary_crng, out);
+#else
+	int node_id = numa_node_id();
+	struct crng_state *crng = crng_node_pool[node_id];
+
+	if (time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)) {
+		unsigned long flags;
+
+		maybe_reseed_primary_crng();
+		_extract_crng(&primary_crng, out);
+		spin_lock_irqsave(&crng->lock, flags);
+		memcpy(&crng->state[4], out, CHACHA20_KEY_SIZE);
+		crng->state[15] = numa_node_id();
+		crng->init_time = jiffies;
+		spin_unlock_irqrestore(&crng->lock, flags);
+	}
+	_extract_crng(crng, out);
+#endif
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
 	ssize_t ret = 0, i;
@@ -1548,6 +1589,22 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct crng_state *crng;
+
+	crng_node_pool = kmalloc(num_nodes * sizeof(void *),
+				 GFP_KERNEL|__GFP_NOFAIL);
+
+	for (i=0; i < num_nodes; i++) {
+		crng = kmalloc(sizeof(struct crng_state),
+			       GFP_KERNEL | __GFP_NOFAIL);
+		initialize_crng(crng);
+		crng_node_pool[i] = crng;
+
+	}
+#endif
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	_initialize_crng(&primary_crng);
-- 
2.5.0
[tip:perf/core] perf/x86/intel/cqm: Get rid of the silly for_each_cpu() lookups
Commit-ID:  827db839cd051cfde5a618fb4427e86dc91bc9aa
Gitweb:     http://git.kernel.org/tip/827db839cd051cfde5a618fb4427e86dc91bc9aa
Author:     Thomas Gleixner <t...@linutronix.de>
AuthorDate: Mon, 22 Feb 2016 22:19:20 +
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:35:21 +0100

perf/x86/intel/cqm: Get rid of the silly for_each_cpu() lookups

CQM is a strict per package facility. Use the proper cpumasks to
lookup the readers.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Andi Kleen <andi.kl...@intel.com>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Harish Chegondi <harish.chego...@intel.com>
Cc: Jacob Pan <jacob.jun@linux.intel.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Kan Liang <kan.li...@intel.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/2016021012.054916...@linutronix.de
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/cqm.c | 32 +++++++++++++-------------------
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 1b064c4..93cb412 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -1244,15 +1244,12 @@ static struct pmu intel_cqm_pmu = {
 
 static inline void cqm_pick_event_reader(int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
-	int i;
+	int reader;
 
-	for_each_cpu(i, &cqm_cpumask) {
-		if (phys_id == topology_physical_package_id(i))
-			return; /* already got reader for this socket */
-	}
-
-	cpumask_set_cpu(cpu, &cqm_cpumask);
+	/* First online cpu in package becomes the reader */
+	reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
+	if (reader >= nr_cpu_ids)
+		cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
 static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1270,24 +1267,17 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
 
 static void intel_cqm_cpu_exit(unsigned int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
-	int i;
+	int target;
 
-	/*
-	 * Is @cpu a designated cqm reader?
-	 */
+	/* Is @cpu the current cqm reader for this package ? */
 	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
 		return;
 
-	for_each_online_cpu(i) {
-		if (i == cpu)
-			continue;
+	/* Find another online reader in this package */
+	target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
 
-		if (phys_id == topology_physical_package_id(i)) {
-			cpumask_set_cpu(i, &cqm_cpumask);
-			break;
-		}
-	}
+	if (target < nr_cpu_ids)
+		cpumask_set_cpu(target, &cqm_cpumask);
 }
 
 static int intel_cqm_cpu_notifier(struct notifier_block *nb,
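The reader-selection rule the patch implements -- the first CPU to come online in a package becomes that package's reader, later CPUs see an existing reader in their package mask and do nothing -- can be modeled in userspace with plain bitmasks. Here pkg_mask[] stands in for topology_core_cpumask() and reader_mask for cqm_cpumask; these names are illustrative, not the kernel API:

```c
#include <stdint.h>

/* Userspace sketch of cqm_pick_event_reader(): cpumasks are modeled
 * as 32-bit masks; pkg_mask[cpu] is the set of CPUs sharing cpu's
 * package.  A CPU becomes a reader only if its package has none. */
#define NCPUS 8

static uint32_t reader_mask;        /* models cqm_cpumask */
static uint32_t pkg_mask[NCPUS];    /* models topology_core_cpumask(cpu) */

static void pick_event_reader(int cpu)
{
    /* Any reader already in this CPU's package? (cpumask_any_and) */
    if (!(reader_mask & pkg_mask[cpu]))
        reader_mask |= 1u << cpu;   /* no: this CPU becomes the reader */
}
```

With two 4-CPU packages, bringing up CPUs 0, 1, and 4 leaves exactly one reader per package (CPUs 0 and 4): CPU 1's attempt is a no-op because CPU 0 already covers its package.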
[patch V3 17/28] x86/perf/cqm: Get rid of the silly for_each_cpu lookups
CQM is a strict per package facility. Use the proper cpumasks to lookup
the readers.

Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++----------------------
 1 file changed, 12 insertions(+), 22 deletions(-)

Index: b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
===================================================================
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1244,15 +1244,12 @@ static struct pmu intel_cqm_pmu = {
 
 static inline void cqm_pick_event_reader(int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
-	int i;
+	int reader;
 
-	for_each_cpu(i, &cqm_cpumask) {
-		if (phys_id == topology_physical_package_id(i))
-			return;	/* already got reader for this socket */
-	}
-
-	cpumask_set_cpu(cpu, &cqm_cpumask);
+	/* First online cpu in package becomes the reader */
+	reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
+	if (reader >= nr_cpu_ids)
+		cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
 static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1270,24 +1267,17 @@ static void intel_cqm_cpu_starting(unsig
 
 static void intel_cqm_cpu_exit(unsigned int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
-	int i;
+	int target;
 
-	/*
-	 * Is @cpu a designated cqm reader?
-	 */
+	/* Is @cpu the current cqm reader for this package ? */
 	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
 		return;
 
-	for_each_online_cpu(i) {
-		if (i == cpu)
-			continue;
-
-		if (phys_id == topology_physical_package_id(i)) {
-			cpumask_set_cpu(i, &cqm_cpumask);
-			break;
-		}
-	}
+	/* Find another online reader in this package */
+	target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+
+	if (target < nr_cpu_ids)
+		cpumask_set_cpu(target, &cqm_cpumask);
 }
 
 static int intel_cqm_cpu_notifier(struct notifier_block *nb,
[patch V2 17/28] x86/perf/cqm: Get rid of the silly for_each_cpu lookups
CQM is a strict per package facility. Use the proper cpumasks to lookup
the readers.

Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++----------------------
 1 file changed, 12 insertions(+), 22 deletions(-)

Index: b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
===================================================================
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1244,15 +1244,12 @@ static struct pmu intel_cqm_pmu = {
 
 static inline void cqm_pick_event_reader(int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
-	int i;
+	int reader;
 
-	for_each_cpu(i, &cqm_cpumask) {
-		if (phys_id == topology_physical_package_id(i))
-			return;	/* already got reader for this socket */
-	}
-
-	cpumask_set_cpu(cpu, &cqm_cpumask);
+	/* First online cpu in package becomes the reader */
+	reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
+	if (reader >= nr_cpu_ids)
+		cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
 static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1270,24 +1267,17 @@ static void intel_cqm_cpu_starting(unsig
 
 static void intel_cqm_cpu_exit(unsigned int cpu)
 {
-	int phys_id = topology_physical_package_id(cpu);
-	int i;
+	int target;
 
-	/*
-	 * Is @cpu a designated cqm reader?
-	 */
+	/* Is @cpu the current cqm reader for this package ? */
 	if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
 		return;
 
-	for_each_online_cpu(i) {
-		if (i == cpu)
-			continue;
-
-		if (phys_id == topology_physical_package_id(i)) {
-			cpumask_set_cpu(i, &cqm_cpumask);
-			break;
-		}
-	}
+	/* Find another online reader in this package */
+	target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+
+	if (target < nr_cpu_ids)
+		cpumask_set_cpu(target, &cqm_cpumask);
 }
 
 static int intel_cqm_cpu_notifier(struct notifier_block *nb,
Re: [PATCH] x86/perf/intel/cqm: Get rid of the silly for_each_cpu lookups
On Thu, 18 Feb 2016, Thomas Gleixner wrote:

> On Wed, 17 Feb 2016, Thomas Gleixner wrote:
> > On Wed, 17 Feb 2016, Vikas Shivappa wrote:
> >
> > Please stop top posting, finally!
> >
> > > But we have an extra static - static to avoid having it in the stack..
> >
> > It's not about the cpu mask on the stack. The reason was that with a
> > cpumask off stack cpumask_and_mask() requires an allocation, which then
> > can't be used in the starting/dying callbacks.
> >
> > Darn, you are right to remind me. Now, the proper solution for this
> > stuff is to provide a library function as we need that for several
> > drivers. No point to duplicate that functionality. I'll cook something
> > up and repost the uncore/cqm set tomorrow.
>
> Second thoughts on that. cpumask_any_but() is fine as is, if we feed it
> topology_core_cpumask(cpu). The worst case search is two
> bitmap_find_next() if the first search returned cpu.
>
> Now cpumask_any_and() does a search as well, but the number of
> bitmap_find_next() invocations is limited to the number of sockets if we
> feed the cqm_cpu_mask as first argument. So for 4 or 8 sockets that's
> still a reasonable limit. If the people with insane large machines care,
> we can revisit that topic. It's still faster than for_each_online_cpu() :)

Agree. If we don't care about a large number of sockets, this would still
be far better than scanning each cpu. There could be some branches we avoid
if we are too aggressive and remove 'all' loops (the 2nd search is always a
success if the 1st one fails in cpumask_any_but) by using cpumask_and, but
they should not be of much importance/use in this case.

Will send the rapl patch separately.

Thanks,
Vikas

> Thanks,
>
> 	tglx