[PATCH 4.9 03/43] net_sched: reject silly cell_log in qdisc_get_rtab()

2021-02-08 Thread Greg Kroah-Hartman
From: Eric Dumazet 

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but we still cap it at the maximum valid shift exponent for a signed
32-bit integer.
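
A minimal user-space sketch (illustrative names, not kernel code) of why
the new bound matters: shifting a 32-bit int by 32 or more is undefined
behaviour in C, which is exactly what UBSAN flagged below in
__detect_linklayer().

#include <stdint.h>
#include <stdio.h>

static int cell_log_valid(uint8_t cell_log)
{
	/* Mirrors the fix: reject 0 and anything >= 32. */
	return cell_log != 0 && cell_log < 32;
}

int main(void)
{
	uint8_t cell_log = 130;	/* the exponent syzkaller supplied */

	if (!cell_log_valid(cell_log)) {
		fprintf(stderr, "rejecting cell_log %d\n", cell_log);
		return 1;
	}
	printf("one cell spans %d bytes\n", 1 << cell_log);
	return 0;
}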

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Acked-by: Cong Wang 
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski 
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_api.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -393,7 +393,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
struct qdisc_rate_table *rtab;
 
-   if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+   if (tab == NULL || r->rate == 0 ||
+   r->cell_log == 0 || r->cell_log >= 32 ||
nla_len(tab) != TC_RTAB_SIZE)
return NULL;
 




[PATCH 4.4 01/38] net_sched: reject silly cell_log in qdisc_get_rtab()

2021-02-08 Thread Greg Kroah-Hartman
From: Eric Dumazet 

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but we still cap it at the maximum valid shift exponent for a signed
32-bit integer.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Acked-by: Cong Wang 
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski 
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_api.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -391,7 +391,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
struct qdisc_rate_table *rtab;
 
-   if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+   if (tab == NULL || r->rate == 0 ||
+   r->cell_log == 0 || r->cell_log >= 32 ||
nla_len(tab) != TC_RTAB_SIZE)
return NULL;
 




[PATCH 4.14 04/15] net_sched: reject silly cell_log in qdisc_get_rtab()

2021-02-05 Thread Greg Kroah-Hartman
From: Eric Dumazet 

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but we still cap it at the maximum valid shift exponent for a signed
32-bit integer.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Acked-by: Cong Wang 
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski 
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_api.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -397,7 +397,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
struct qdisc_rate_table *rtab;
 
-   if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+   if (tab == NULL || r->rate == 0 ||
+   r->cell_log == 0 || r->cell_log >= 32 ||
nla_len(tab) != TC_RTAB_SIZE)
return NULL;
 




[PATCH 5.10 174/199] net_sched: reject silly cell_log in qdisc_get_rtab()

2021-01-26 Thread Greg Kroah-Hartman
From: Eric Dumazet 

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but we still cap it at the maximum valid shift exponent for a signed
32-bit integer.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Acked-by: Cong Wang 
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski 
Signed-off-by: Greg Kroah-Hartman 

---
 net/sched/sch_api.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -412,7 +412,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
struct qdisc_rate_table *rtab;
 
-   if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+   if (tab == NULL || r->rate == 0 ||
+   r->cell_log == 0 || r->cell_log >= 32 ||
nla_len(tab) != TC_RTAB_SIZE) {
NL_SET_ERR_MSG(extack, "Invalid rate table parameters for 
searching");
return NULL;




[PATCH 4.19 54/58] net_sched: reject silly cell_log in qdisc_get_rtab()

2021-01-26 Thread Greg Kroah-Hartman
From: Eric Dumazet 

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but we still cap it at the maximum valid shift exponent for a signed
32-bit integer.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Acked-by: Cong Wang 
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski 
Signed-off-by: Greg Kroah-Hartman 

---
 net/sched/sch_api.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -398,7 +398,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
struct qdisc_rate_table *rtab;
 
-   if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+   if (tab == NULL || r->rate == 0 ||
+   r->cell_log == 0 || r->cell_log >= 32 ||
nla_len(tab) != TC_RTAB_SIZE) {
NL_SET_ERR_MSG(extack, "Invalid rate table parameters for 
searching");
return NULL;




[PATCH 5.4 80/86] net_sched: reject silly cell_log in qdisc_get_rtab()

2021-01-25 Thread Greg Kroah-Hartman
From: Eric Dumazet 

commit e4bedf48aaa5552bc1f49703abd17606e7e6e82a upstream.

iproute2 probably never goes beyond 8 for the cell exponent,
but we still cap it at the maximum valid shift exponent for a signed
32-bit integer.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Acked-by: Cong Wang 
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.duma...@gmail.com
Signed-off-by: Jakub Kicinski 
Signed-off-by: Greg Kroah-Hartman 

---
 net/sched/sch_api.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -409,7 +409,8 @@ struct qdisc_rate_table *qdisc_get_rtab(
 {
struct qdisc_rate_table *rtab;
 
-   if (tab == NULL || r->rate == 0 || r->cell_log == 0 ||
+   if (tab == NULL || r->rate == 0 ||
+   r->cell_log == 0 || r->cell_log >= 32 ||
nla_len(tab) != TC_RTAB_SIZE) {
NL_SET_ERR_MSG(extack, "Invalid rate table parameters for 
searching");
return NULL;




[PATCH 4.9 36/86] tcp: md5: do not send silly options in SYNCOOKIES

2020-07-20 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN, SACK, and WSCALE options, we cannot remove the TS option in the SYNACK.

Otherwise, tcp_synack_options() will still advertise options like WSCALE
that we cannot deduce later when receiving the packet from the client
to complete the 3WHS.

Note that modern Linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
but we cannot know for sure that all TCP stacks have the same logic.

Before the fix, a tcpdump capture would exhibit this wrong exchange:

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, 
options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 
valid], length 0

After this patch, the exchange looks saner:

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], 
length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, 
options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 
1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in the MD5+TS options,
but again, some stacks could break on seeing unconventional alignment.
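
As a toy model of the rule this patch changes (the names are
illustrative, not the kernel implementation):

#include <stdbool.h>
#include <stdio.h>

enum synack_type { SYNACK_NORMAL, SYNACK_COOKIE };

struct req_opts {
	bool md5;		/* MD5 option present in the SYN */
	bool tstamp_ok;		/* TS option still allowed */
	bool sack_ok;		/* SACK option requested */
};

static void negotiate(struct req_opts *o, enum synack_type type)
{
	if (!o->md5)
		return;
	/*
	 * Old behaviour: always drop TS when SACK rides along with MD5.
	 * New behaviour: keep TS for syncookie SYNACKs, because the cookie
	 * timestamp already encodes SACK/WSCALE and dropping TS would make
	 * them unrecoverable on the final ACK.
	 */
	if (type != SYNACK_COOKIE)
		o->tstamp_ok &= !o->sack_ok;
}

int main(void)
{
	struct req_opts o = { .md5 = true, .tstamp_ok = true, .sack_ok = true };

	negotiate(&o, SYNACK_COOKIE);
	printf("cookie SYNACK keeps TS: %d\n", o.tstamp_ok);	/* prints 1 */

	o.tstamp_ok = true;
	negotiate(&o, SYNACK_NORMAL);
	printf("normal SYNACK keeps TS: %d\n", o.tstamp_ok);	/* prints 0 */
	return 0;
}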

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Cc: Mathieu Desnoyers 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/ipv4/tcp_output.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -615,7 +615,8 @@ static unsigned int tcp_synack_options(s
   unsigned int mss, struct sk_buff *skb,
   struct tcp_out_options *opts,
   const struct tcp_md5sig_key *md5,
-  struct tcp_fastopen_cookie *foc)
+  struct tcp_fastopen_cookie *foc,
+  enum tcp_synack_type synack_type)
 {
struct inet_request_sock *ireq = inet_rsk(req);
unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -630,7 +631,8 @@ static unsigned int tcp_synack_options(s
 * rather than TS in order to fit in better with old,
 * buggy kernels, but that was deemed to be unnecessary.
 */
-   ireq->tstamp_ok &= !ireq->sack_ok;
+   if (synack_type != TCP_SYNACK_COOKIE)
+   ireq->tstamp_ok &= !ireq->sack_ok;
}
 #endif
 
@@ -3165,8 +3167,8 @@ struct sk_buff *tcp_make_synack(const st
md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
-   tcp_header_size = tcp_synack_options(req, mss, skb, , md5, foc) +
- sizeof(*th);
+   tcp_header_size = tcp_synack_options(req, mss, skb, , md5,
+foc, synack_type) + sizeof(*th);
 
skb_push(skb, tcp_header_size);
skb_reset_transport_header(skb);




[PATCH 5.7 017/244] tcp: md5: do not send silly options in SYNCOOKIES

2020-07-20 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN, SACK, and WSCALE options, we cannot remove the TS option in the SYNACK.

Otherwise, tcp_synack_options() will still advertise options like WSCALE
that we cannot deduce later when receiving the packet from the client
to complete the 3WHS.

Note that modern Linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
but we cannot know for sure that all TCP stacks have the same logic.

Before the fix, a tcpdump capture would exhibit this wrong exchange:

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, 
options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 
valid], length 0

After this patch, the exchange looks saner:

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], 
length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, 
options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 
1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in the MD5+TS options,
but again, some stacks could break on seeing unconventional alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Cc: Mathieu Desnoyers 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/ipv4/tcp_output.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -700,7 +700,8 @@ static unsigned int tcp_synack_options(c
   unsigned int mss, struct sk_buff *skb,
   struct tcp_out_options *opts,
   const struct tcp_md5sig_key *md5,
-  struct tcp_fastopen_cookie *foc)
+  struct tcp_fastopen_cookie *foc,
+  enum tcp_synack_type synack_type)
 {
struct inet_request_sock *ireq = inet_rsk(req);
unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -715,7 +716,8 @@ static unsigned int tcp_synack_options(c
 * rather than TS in order to fit in better with old,
 * buggy kernels, but that was deemed to be unnecessary.
 */
-   ireq->tstamp_ok &= !ireq->sack_ok;
+   if (synack_type != TCP_SYNACK_COOKIE)
+   ireq->tstamp_ok &= !ireq->sack_ok;
}
 #endif
 
@@ -3388,7 +3390,7 @@ struct sk_buff *tcp_make_synack(const st
 #endif
skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
tcp_header_size = tcp_synack_options(sk, req, mss, skb, , md5,
-foc) + sizeof(*th);
+foc, synack_type) + sizeof(*th);
 
skb_push(skb, tcp_header_size);
skb_reset_transport_header(skb);




[PATCH 5.4 020/215] tcp: md5: do not send silly options in SYNCOOKIES

2020-07-20 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN, SACK, and WSCALE options, we cannot remove the TS option in the SYNACK.

Otherwise, tcp_synack_options() will still advertise options like WSCALE
that we cannot deduce later when receiving the packet from the client
to complete the 3WHS.

Note that modern Linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
but we cannot know for sure that all TCP stacks have the same logic.

Before the fix, a tcpdump capture would exhibit this wrong exchange:

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, 
options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 
valid], length 0

After this patch, the exchange looks saner:

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], 
length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, 
options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 
1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in the MD5+TS options,
but again, some stacks could break on seeing unconventional alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Cc: Mathieu Desnoyers 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/ipv4/tcp_output.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -662,7 +662,8 @@ static unsigned int tcp_synack_options(c
   unsigned int mss, struct sk_buff *skb,
   struct tcp_out_options *opts,
   const struct tcp_md5sig_key *md5,
-  struct tcp_fastopen_cookie *foc)
+  struct tcp_fastopen_cookie *foc,
+  enum tcp_synack_type synack_type)
 {
struct inet_request_sock *ireq = inet_rsk(req);
unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -677,7 +678,8 @@ static unsigned int tcp_synack_options(c
 * rather than TS in order to fit in better with old,
 * buggy kernels, but that was deemed to be unnecessary.
 */
-   ireq->tstamp_ok &= !ireq->sack_ok;
+   if (synack_type != TCP_SYNACK_COOKIE)
+   ireq->tstamp_ok &= !ireq->sack_ok;
}
 #endif
 
@@ -3326,7 +3328,7 @@ struct sk_buff *tcp_make_synack(const st
 #endif
skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
tcp_header_size = tcp_synack_options(sk, req, mss, skb, , md5,
-foc) + sizeof(*th);
+foc, synack_type) + sizeof(*th);
 
skb_push(skb, tcp_header_size);
skb_reset_transport_header(skb);




[PATCH 4.19 013/133] tcp: md5: do not send silly options in SYNCOOKIES

2020-07-20 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN, SACK, and WSCALE options, we cannot remove the TS option in the SYNACK.

Otherwise, tcp_synack_options() will still advertise options like WSCALE
that we cannot deduce later when receiving the packet from the client
to complete the 3WHS.

Note that modern Linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
but we cannot know for sure that all TCP stacks have the same logic.

Before the fix, a tcpdump capture would exhibit this wrong exchange:

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, 
options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 
valid], length 0

After this patch, the exchange looks saner:

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], 
length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, 
options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 
1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in the MD5+TS options,
but again, some stacks could break on seeing unconventional alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Cc: Mathieu Desnoyers 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/ipv4/tcp_output.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -648,7 +648,8 @@ static unsigned int tcp_synack_options(c
   unsigned int mss, struct sk_buff *skb,
   struct tcp_out_options *opts,
   const struct tcp_md5sig_key *md5,
-  struct tcp_fastopen_cookie *foc)
+  struct tcp_fastopen_cookie *foc,
+  enum tcp_synack_type synack_type)
 {
struct inet_request_sock *ireq = inet_rsk(req);
unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -663,7 +664,8 @@ static unsigned int tcp_synack_options(c
 * rather than TS in order to fit in better with old,
 * buggy kernels, but that was deemed to be unnecessary.
 */
-   ireq->tstamp_ok &= !ireq->sack_ok;
+   if (synack_type != TCP_SYNACK_COOKIE)
+   ireq->tstamp_ok &= !ireq->sack_ok;
}
 #endif
 
@@ -3246,7 +3248,7 @@ struct sk_buff *tcp_make_synack(const st
 #endif
skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
tcp_header_size = tcp_synack_options(sk, req, mss, skb, , md5,
-foc) + sizeof(*th);
+foc, synack_type) + sizeof(*th);
 
skb_push(skb, tcp_header_size);
skb_reset_transport_header(skb);




[PATCH 4.14 052/125] tcp: md5: do not send silly options in SYNCOOKIES

2020-07-20 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

Whenever cookie_init_timestamp() has been used to encode
ECN, SACK, and WSCALE options, we cannot remove the TS option in the SYNACK.

Otherwise, tcp_synack_options() will still advertise options like WSCALE
that we cannot deduce later when receiving the packet from the client
to complete the 3WHS.

Note that modern Linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
but we cannot know for sure that all TCP stacks have the same logic.

Before the fix, a tcpdump capture would exhibit this wrong exchange:

10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, 
options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 
valid], length 0

After this patch, the exchange looks saner:

11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options 
[nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], 
length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, 
options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 
1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options 
[nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 
valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in the MD5+TS options,
but again, some stacks could break on seeing unconventional alignment.

Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Cc: Mathieu Desnoyers 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/ipv4/tcp_output.c |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -616,7 +616,8 @@ static unsigned int tcp_synack_options(s
   unsigned int mss, struct sk_buff *skb,
   struct tcp_out_options *opts,
   const struct tcp_md5sig_key *md5,
-  struct tcp_fastopen_cookie *foc)
+  struct tcp_fastopen_cookie *foc,
+  enum tcp_synack_type synack_type)
 {
struct inet_request_sock *ireq = inet_rsk(req);
unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -631,7 +632,8 @@ static unsigned int tcp_synack_options(s
 * rather than TS in order to fit in better with old,
 * buggy kernels, but that was deemed to be unnecessary.
 */
-   ireq->tstamp_ok &= !ireq->sack_ok;
+   if (synack_type != TCP_SYNACK_COOKIE)
+   ireq->tstamp_ok &= !ireq->sack_ok;
}
 #endif
 
@@ -3252,8 +3254,8 @@ struct sk_buff *tcp_make_synack(const st
md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
-   tcp_header_size = tcp_synack_options(req, mss, skb, , md5, foc) +
- sizeof(*th);
+   tcp_header_size = tcp_synack_options(req, mss, skb, , md5,
+foc, synack_type) + sizeof(*th);
 
skb_push(skb, tcp_header_size);
skb_reset_transport_header(skb);




[PATCH 4.14 02/78] net: be more gentle about silly gso requests coming from user

2020-06-29 Thread Sasha Levin
From: Eric Dumazet 

commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 upstream.

A recent change in virtio_net_hdr_to_skb() broke some packetdrill tests.

When the --mss=XXX option is set, packetdrill always provides gso_type &
gso_size for its inbound packets, regardless of packet size.

if (packet->tcp && packet->mss) {
if (packet->ipv4)
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
gso.gso_size = packet->mss;
}

Since many other programs could do the same, relax virtio_net_hdr_to_skb()
to no longer return an error, but instead ignore the gso settings.

This keeps Willem's intent to make sure no malicious packet can
reach the GSO stack.

Note that the TCP stack has special logic in tcp_set_skb_tso_segs()
to clear gso_size for small packets.
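
A rough user-space sketch of the relaxed policy (the struct and flag
below are made-up stand-ins for skb_shared_info and SKB_GSO_DODGY, not
the kernel API):

#include <stdint.h>
#include <stdio.h>

struct fake_shinfo {
	uint16_t gso_size;
	uint32_t gso_type;
	uint16_t gso_segs;
};
#define FAKE_GSO_DODGY 0x8000u

/* Old code returned an error for small packets; now the hint is ignored. */
static int apply_gso_hint(struct fake_shinfo *shinfo, uint32_t pkt_len,
			  uint32_t payload_off, uint16_t gso_size,
			  uint32_t gso_type)
{
	/* Too small packets are not really GSO ones. */
	if (pkt_len - payload_off > gso_size) {
		shinfo->gso_size = gso_size;
		/* Header must be checked, and gso_segs computed later. */
		shinfo->gso_type = gso_type | FAKE_GSO_DODGY;
		shinfo->gso_segs = 0;
	}
	return 0;	/* success either way */
}

int main(void)
{
	struct fake_shinfo si = { 0 };

	/* 60-byte payload with mss 1400: hint ignored, no error. */
	apply_gso_hint(&si, 100, 40, 1400, 1);
	printf("gso_size after small packet: %d\n", si.gso_size);
	return 0;
}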

Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Acked-by: Willem de Bruijn 
Signed-off-by: David S. Miller 
Cc: Guenter Roeck 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/virtio_net.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index a16e0bdf77511..d19bfdcf77498 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -107,16 +107,17 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
+   struct skb_shared_info *shinfo = skb_shinfo(skb);
 
-   if (skb->len - p_off <= gso_size)
-   return -EINVAL;
-
-   skb_shinfo(skb)->gso_size = gso_size;
-   skb_shinfo(skb)->gso_type = gso_type;
+   /* Too small packets are not really GSO ones. */
+   if (skb->len - p_off > gso_size) {
+   shinfo->gso_size = gso_size;
+   shinfo->gso_type = gso_type;
 
-   /* Header must be checked, and gso_segs computed. */
-   skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
-   skb_shinfo(skb)->gso_segs = 0;
+   /* Header must be checked, and gso_segs computed. */
+   shinfo->gso_type |= SKB_GSO_DODGY;
+   shinfo->gso_segs = 0;
+   }
}
 
return 0;
-- 
2.25.1



[PATCH 4.19 001/131] net: be more gentle about silly gso requests coming from user

2020-06-29 Thread Sasha Levin
From: Eric Dumazet 

commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 upstream.

A recent change in virtio_net_hdr_to_skb() broke some packetdrill tests.

When the --mss=XXX option is set, packetdrill always provides gso_type &
gso_size for its inbound packets, regardless of packet size.

if (packet->tcp && packet->mss) {
if (packet->ipv4)
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
gso.gso_size = packet->mss;
}

Since many other programs could do the same, relax virtio_net_hdr_to_skb()
to no longer return an error, but instead ignore the gso settings.

This keeps Willem's intent to make sure no malicious packet can
reach the GSO stack.

Note that the TCP stack has special logic in tcp_set_skb_tso_segs()
to clear gso_size for small packets.

Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Acked-by: Willem de Bruijn 
Signed-off-by: David S. Miller 
Cc: Guenter Roeck 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/virtio_net.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1c296f370e461..f32fe7080d2ec 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -109,16 +109,17 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
+   struct skb_shared_info *shinfo = skb_shinfo(skb);
 
-   if (skb->len - p_off <= gso_size)
-   return -EINVAL;
-
-   skb_shinfo(skb)->gso_size = gso_size;
-   skb_shinfo(skb)->gso_type = gso_type;
+   /* Too small packets are not really GSO ones. */
+   if (skb->len - p_off > gso_size) {
+   shinfo->gso_size = gso_size;
+   shinfo->gso_type = gso_type;
 
-   /* Header must be checked, and gso_segs computed. */
-   skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
-   skb_shinfo(skb)->gso_segs = 0;
+   /* Header must be checked, and gso_segs computed. */
+   shinfo->gso_type |= SKB_GSO_DODGY;
+   shinfo->gso_segs = 0;
+   }
}
 
return 0;
-- 
2.25.1



[PATCH 5.4 11/34] net: be more gentle about silly gso requests coming from user

2020-06-09 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 ]

A recent change in virtio_net_hdr_to_skb() broke some packetdrill tests.

When the --mss=XXX option is set, packetdrill always provides gso_type &
gso_size for its inbound packets, regardless of packet size.

if (packet->tcp && packet->mss) {
if (packet->ipv4)
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
gso.gso_size = packet->mss;
}

Since many other programs could do the same, relax virtio_net_hdr_to_skb()
to no longer return an error, but instead ignore the gso settings.

This keeps Willem's intent to make sure no malicious packet can
reach the GSO stack.

Note that the TCP stack has special logic in tcp_set_skb_tso_segs()
to clear gso_size for small packets.

Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Acked-by: Willem de Bruijn 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/virtio_net.h |   17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -109,16 +109,17 @@ retry:
 
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
+   struct skb_shared_info *shinfo = skb_shinfo(skb);
 
-   if (skb->len - p_off <= gso_size)
-   return -EINVAL;
+   /* Too small packets are not really GSO ones. */
+   if (skb->len - p_off > gso_size) {
+   shinfo->gso_size = gso_size;
+   shinfo->gso_type = gso_type;
 
-   skb_shinfo(skb)->gso_size = gso_size;
-   skb_shinfo(skb)->gso_type = gso_type;
-
-   /* Header must be checked, and gso_segs computed. */
-   skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
-   skb_shinfo(skb)->gso_segs = 0;
+   /* Header must be checked, and gso_segs computed. */
+   shinfo->gso_type |= SKB_GSO_DODGY;
+   shinfo->gso_segs = 0;
+   }
}
 
return 0;




[PATCH 5.6 13/41] net: be more gentle about silly gso requests coming from user

2020-06-09 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit 7c6d2ecbda83150b2036a2b36b21381ad4667762 ]

A recent change in virtio_net_hdr_to_skb() broke some packetdrill tests.

When the --mss=XXX option is set, packetdrill always provides gso_type &
gso_size for its inbound packets, regardless of packet size.

if (packet->tcp && packet->mss) {
if (packet->ipv4)
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else
gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
gso.gso_size = packet->mss;
}

Since many other programs could do the same, relax virtio_net_hdr_to_skb()
to no longer return an error, but instead ignore the gso settings.

This keeps Willem's intent to make sure no malicious packet can
reach the GSO stack.

Note that the TCP stack has special logic in tcp_set_skb_tso_segs()
to clear gso_size for small packets.

Fixes: 6dd912f82680 ("net: check untrusted gso_size at kernel entry")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Acked-by: Willem de Bruijn 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/virtio_net.h |   17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -109,16 +109,17 @@ retry:
 
if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
+   struct skb_shared_info *shinfo = skb_shinfo(skb);
 
-   if (skb->len - p_off <= gso_size)
-   return -EINVAL;
+   /* Too small packets are not really GSO ones. */
+   if (skb->len - p_off > gso_size) {
+   shinfo->gso_size = gso_size;
+   shinfo->gso_type = gso_type;
 
-   skb_shinfo(skb)->gso_size = gso_size;
-   skb_shinfo(skb)->gso_type = gso_type;
-
-   /* Header must be checked, and gso_segs computed. */
-   skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
-   skb_shinfo(skb)->gso_segs = 0;
+   /* Header must be checked, and gso_segs computed. */
+   shinfo->gso_type |= SKB_GSO_DODGY;
+   shinfo->gso_segs = 0;
+   }
}
 
return 0;




[PATCH 4.9 08/90] sch_sfq: validate silly quantum values

2020-05-18 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]

syzbot managed to set up sfq so that q->scaled_quantum was zero,
triggering an infinite loop in sfq_dequeue().

More generally, we must only accept a quantum between 1 and 2^18 - 7,
meaning scaled_quantum must be in the [1, 0x7FFF] range.

Otherwise, we could also have a loop in sfq_dequeue()
if scaled_quantum happens to be 0x8000, since slot->allot
could switch indefinitely between 0 and 0x8000.
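
A stand-alone sketch of the new bound (the SFQ_ALLOT_* constants are
assumptions copied from net/sched/sch_sfq.c; everything else is
illustrative):

#include <limits.h>
#include <stdio.h>

#define SFQ_ALLOT_SHIFT		3
#define SFQ_ALLOT_SIZE(x)	(((x) + (1 << SFQ_ALLOT_SHIFT) - 1) >> SFQ_ALLOT_SHIFT)

static int quantum_valid(unsigned int quantum)
{
	unsigned int scaled = SFQ_ALLOT_SIZE(quantum);

	/* slot->allot is a short: scaled must stay in [1, SHRT_MAX]. */
	return scaled >= 1 && scaled <= SHRT_MAX;
}

int main(void)
{
	printf("quantum 1514:    %s\n", quantum_valid(1514) ? "ok" : "rejected");
	printf("quantum 0:       %s\n", quantum_valid(0) ? "ok" : "rejected");
	printf("quantum 1 << 20: %s\n", quantum_valid(1u << 20) ? "ok" : "rejected");
	return 0;
}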

Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com
Cc: Jason A. Donenfeld 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_sfq.c |9 +
 1 file changed, 9 insertions(+)

--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -635,6 +635,15 @@ static int sfq_change(struct Qdisc *sch,
if (ctl->divisor &&
(!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
return -EINVAL;
+
+   /* slot->allot is a short, make sure quantum is not too big. */
+   if (ctl->quantum) {
+   unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);
+
+   if (scaled <= 0 || scaled > SHRT_MAX)
+   return -EINVAL;
+   }
+
if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max,
ctl_v1->Wlog))
return -EINVAL;




[PATCH 4.14 008/114] sch_sfq: validate silly quantum values

2020-05-18 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]

syzbot managed to set up sfq so that q->scaled_quantum was zero,
triggering an infinite loop in sfq_dequeue().

More generally, we must only accept a quantum between 1 and 2^18 - 7,
meaning scaled_quantum must be in the [1, 0x7FFF] range.

Otherwise, we could also have a loop in sfq_dequeue()
if scaled_quantum happens to be 0x8000, since slot->allot
could switch indefinitely between 0 and 0x8000.

Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com
Cc: Jason A. Donenfeld 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_sfq.c |9 +
 1 file changed, 9 insertions(+)

--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -639,6 +639,15 @@ static int sfq_change(struct Qdisc *sch,
if (ctl->divisor &&
(!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
return -EINVAL;
+
+   /* slot->allot is a short, make sure quantum is not too big. */
+   if (ctl->quantum) {
+   unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);
+
+   if (scaled <= 0 || scaled > SHRT_MAX)
+   return -EINVAL;
+   }
+
if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max,
ctl_v1->Wlog))
return -EINVAL;




[PATCH 4.4 05/86] sch_sfq: validate silly quantum values

2020-05-18 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]

syzbot managed to set up sfq so that q->scaled_quantum was zero,
triggering an infinite loop in sfq_dequeue().

More generally, we must only accept a quantum between 1 and 2^18 - 7,
meaning scaled_quantum must be in the [1, 0x7FFF] range.

Otherwise, we could also have a loop in sfq_dequeue()
if scaled_quantum happens to be 0x8000, since slot->allot
could switch indefinitely between 0 and 0x8000.

Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com
Cc: Jason A. Donenfeld 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_sfq.c |9 +
 1 file changed, 9 insertions(+)

--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -635,6 +635,15 @@ static int sfq_change(struct Qdisc *sch,
if (ctl->divisor &&
(!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
return -EINVAL;
+
+   /* slot->allot is a short, make sure quantum is not too big. */
+   if (ctl->quantum) {
+   unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);
+
+   if (scaled <= 0 || scaled > SHRT_MAX)
+   return -EINVAL;
+   }
+
if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max,
ctl_v1->Wlog))
return -EINVAL;




[PATCH 5.6 036/118] sch_sfq: validate silly quantum values

2020-05-13 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]

syzbot managed to set up sfq so that q->scaled_quantum was zero,
triggering an infinite loop in sfq_dequeue().

More generally, we must only accept a quantum between 1 and 2^18 - 7,
meaning scaled_quantum must be in the [1, 0x7FFF] range.

Otherwise, we could also have a loop in sfq_dequeue()
if scaled_quantum happens to be 0x8000, since slot->allot
could switch indefinitely between 0 and 0x8000.

Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com
Cc: Jason A. Donenfeld 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_sfq.c |9 +
 1 file changed, 9 insertions(+)

--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -637,6 +637,15 @@ static int sfq_change(struct Qdisc *sch,
if (ctl->divisor &&
(!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
return -EINVAL;
+
+   /* slot->allot is a short, make sure quantum is not too big. */
+   if (ctl->quantum) {
+   unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);
+
+   if (scaled <= 0 || scaled > SHRT_MAX)
+   return -EINVAL;
+   }
+
if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max,
ctl_v1->Wlog))
return -EINVAL;




[PATCH 5.4 28/90] sch_sfq: validate silly quantum values

2020-05-13 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]

syzbot managed to set up sfq so that q->scaled_quantum was zero,
triggering an infinite loop in sfq_dequeue().

More generally, we must only accept a quantum between 1 and 2^18 - 7,
meaning scaled_quantum must be in the [1, 0x7FFF] range.

Otherwise, we could also have a loop in sfq_dequeue()
if scaled_quantum happens to be 0x8000, since slot->allot
could switch indefinitely between 0 and 0x8000.

Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com
Cc: Jason A. Donenfeld 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_sfq.c |9 +
 1 file changed, 9 insertions(+)

--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -637,6 +637,15 @@ static int sfq_change(struct Qdisc *sch,
if (ctl->divisor &&
(!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
return -EINVAL;
+
+   /* slot->allot is a short, make sure quantum is not too big. */
+   if (ctl->quantum) {
+   unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);
+
+   if (scaled <= 0 || scaled > SHRT_MAX)
+   return -EINVAL;
+   }
+
if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max,
ctl_v1->Wlog))
return -EINVAL;




[PATCH 4.19 11/48] sch_sfq: validate silly quantum values

2020-05-13 Thread Greg Kroah-Hartman
From: Eric Dumazet 

[ Upstream commit df4953e4e997e273501339f607b77953772e3559 ]

syzbot managed to set up sfq so that q->scaled_quantum was zero,
triggering an infinite loop in sfq_dequeue().

More generally, we must only accept a quantum between 1 and 2^18 - 7,
meaning scaled_quantum must be in the [1, 0x7FFF] range.

Otherwise, we could also have a loop in sfq_dequeue()
if scaled_quantum happens to be 0x8000, since slot->allot
could switch indefinitely between 0 and 0x8000.

Fixes: eeaeb068f139 ("sch_sfq: allow big packets and be fair")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot+0251e883fe39e7a0c...@syzkaller.appspotmail.com
Cc: Jason A. Donenfeld 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 net/sched/sch_sfq.c |9 +
 1 file changed, 9 insertions(+)

--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -641,6 +641,15 @@ static int sfq_change(struct Qdisc *sch,
if (ctl->divisor &&
(!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
return -EINVAL;
+
+   /* slot->allot is a short, make sure quantum is not too big. */
+   if (ctl->quantum) {
+   unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);
+
+   if (scaled <= 0 || scaled > SHRT_MAX)
+   return -EINVAL;
+   }
+
if (ctl_v1 && !red_check_params(ctl_v1->qth_min, ctl_v1->qth_max,
ctl_v1->Wlog))
return -EINVAL;




[PATCH 3.16 18/86] locking/static_keys: Fix a silly typo

2019-05-16 Thread Ben Hutchings
3.16.68-rc1 review patch.  If anyone has any objections, please let me know.

--

From: Jonathan Corbet 

commit edcd591c77a48da753456f92daf8bb50fe9bac93 upstream.

Commit:

  412758cb2670 ("jump label, locking/static_keys: Update docs")

introduced a typo that might as well get fixed.

Signed-off-by: Jonathan Corbet 
Cc: Andrew Morton 
Cc: Jason Baron 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20150907131803.54c02...@lwn.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings 
---
 Documentation/static-keys.txt | 2 +-
 include/linux/jump_label.h| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/Documentation/static-keys.txt
+++ b/Documentation/static-keys.txt
@@ -16,7 +16,7 @@ The updated API replacements are:
 DEFINE_STATIC_KEY_TRUE(key);
 DEFINE_STATIC_KEY_FALSE(key);
 static_key_likely()
-statick_key_unlikely()
+static_key_unlikely()
 
 0) Abstract
 
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -22,7 +22,7 @@
  * DEFINE_STATIC_KEY_TRUE(key);
  * DEFINE_STATIC_KEY_FALSE(key);
  * static_key_likely()
- * statick_key_unlikely()
+ * static_key_unlikely()
  *
  * Jump labels provide an interface to generate dynamic branches using
  * self-modifying code. Assuming toolchain and architecture support, if we



Re: [PATCH] drop silly "static inline asmlinkage" from dump_stack()

2018-12-02 Thread Joey Pabalinas
On Sun, Dec 02, 2018 at 12:02:54PM +0300, Alexey Dobriyan wrote:
> Empty function will be inlined so asmlinkage doesn't do anything.

Yes, that is an example of a perfect explanation to have in the
commit message :) Ack from me after that addition.

Acked-by: Joey Pabalinas 

-- 
Cheers,
Joey Pabalinas





Re: [PATCH] drop silly "static inline asmlinkage" from dump_stack()

2018-12-02 Thread Alexey Dobriyan
On Sat, Dec 01, 2018 at 09:33:38PM -1000, Joey Pabalinas wrote:
> On Sat, Nov 24, 2018 at 12:35:30PM +0300, Alexey Dobriyan wrote:
> > -static inline asmlinkage void dump_stack(void)
> > +static inline void dump_stack(void)
> 
> Why is it "silly"? An explanation in the commit message would be useful.

Empty function will be inlined so asmlinkage doesn't do anything.



Re: [PATCH] drop silly "static inline asmlinkage" from dump_stack()

2018-12-01 Thread Joey Pabalinas
On Sat, Nov 24, 2018 at 12:35:30PM +0300, Alexey Dobriyan wrote:
> -static inline asmlinkage void dump_stack(void)
> +static inline void dump_stack(void)

Why is it "silly"? An explanation in the commit message would be useful.

> Signed-off-by: Alexey Dobriyan 
> ---
> 
>  include/linux/printk.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -269,7 +269,7 @@ static inline void show_regs_print_info(const char *log_lvl)
>  {
>  }
>  
> -static inline asmlinkage void dump_stack(void)
> +static inline void dump_stack(void)
>  {
>  }
>  

-- 
Cheers,
Joey Pabalinas






[PATCH] drop silly "static inline asmlinkage" from dump_stack()

2018-11-24 Thread Alexey Dobriyan
Signed-off-by: Alexey Dobriyan 
---

 include/linux/printk.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -269,7 +269,7 @@ static inline void show_regs_print_info(const char *log_lvl)
 {
 }
 
-static inline asmlinkage void dump_stack(void)
+static inline void dump_stack(void)
 {
 }
 


[PATCH 4.14 060/146] ipv6: mcast: better catch silly mtu values

2018-01-01 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Eric Dumazet <eduma...@google.com>


[ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

syzkaller reported crashes in IPv6 stack [1]

Xin Long found that lo MTU was set to silly values.

IPv6 stack reacts to changes to small MTU, by disabling itself under
RTNL.

But there is a window where threads not using RTNL can see a wrong
device mtu. This can lead to surprises, in mld code where it is assumed
the mtu is suitable.

Fix this by reading device mtu once and checking IPv6 minimal MTU.
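
(Illustrative sketch of the pattern the patch applies, not part of the
patch itself: take one READ_ONCE() snapshot of the racy field, validate it
once, and let every later user see only the validated snapshot.)

    unsigned int mtu = READ_ONCE(dev->mtu); /* single racy read */
    if (mtu < IPV6_MIN_MTU)                 /* reject silly values up front */
            return skb;
    /* all later users of mtu see the same validated value */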

[1]
 skbuff: skb_over_panic: text:10b86b8d len:196 put:20
 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo
 [ cut here ]
 kernel BUG at net/core/skbuff.c:104!
 invalid opcode:  [#1] SMP KASAN
 Dumping ftrace buffer:
(ftrace buffer empty)
 Modules linked in:
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
 Google 01/01/2011
 RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
 RSP: 0018:8801db307508 EFLAGS: 00010286
 RAX: 0082 RBX: 8801c517e840 RCX: 
 RDX: 0082 RSI: 11003b660e61 RDI: ed003b660e95
 RBP: 8801db307570 R08: 11003b660e23 R09: 
 R10:  R11:  R12: 85bd4020
 R13: 84754ed2 R14: 0014 R15: 8801c4e26540
 FS:  () GS:8801db30() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  
  skb_over_panic net/core/skbuff.c:109 [inline]
  skb_put+0x181/0x1c0 net/core/skbuff.c:1694
  add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
  add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
  mld_send_cr net/ipv6/mcast.c:1903 [inline]
  mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
  call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
  expire_timers kernel/time/timer.c:1357 [inline]
  __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
  __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x1d3/0x210 kernel/softirq.c:405
  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
Tested-by: Xin Long <lucien@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 net/ipv6/mcast.c |   25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1682,16 +1682,16 @@ static int grec_size(struct ifmcaddr6 *p
 }
 
 static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-   int type, struct mld2_grec **ppgr)
+   int type, struct mld2_grec **ppgr, unsigned int mtu)
 {
-   struct net_device *dev = pmc->idev->dev;
struct mld2_report *pmr;
struct mld2_grec *pgr;
 
-   if (!skb)
-   skb = mld_newpack(pmc->idev, dev->mtu);
-   if (!skb)
-   return NULL;
+   if (!skb) {
+   skb = mld_newpack(pmc->idev, mtu);
+   if (!skb)
+   return NULL;
+   }
pgr = skb_put(skb, sizeof(struct mld2_grec));
pgr->grec_type = type;
pgr->grec_auxwords = 0;
@@ -1714,10 +1714,15 @@ static struct sk_buff *add_grec(struct s
struct mld2_grec *pgr = NULL;
struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list;
int scount, stotal, first, isquery, truncate;
+   unsigned int mtu;
 
if (pmc->mca_flags & MAF_NOREPORT)
return skb;
 
+   mtu = READ_ONCE(dev->mtu);
+   if (mtu < IPV6_MIN_MTU)
+   return skb;
+
isquery = type == MLD2_MODE_IS_INCLUDE ||
  type == MLD2_MODE_IS_EXCLUDE;
truncate = type == MLD2_MODE_IS_EXCLUDE ||
@@ -1738,7 +1743,7 @@ static struct sk_buff *add_grec(struct s
AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) {
if (skb)
mld_sendpack(skb);
-   skb = mld_newpack(idev, dev->mtu);
+   skb = mld_newpack(idev, mtu);
}
}
first = 1;
@@ -1774,12 +1779,12 @@ static struct sk_buff *add_grec(struct s


[PATCH 4.9 25/75] ipv6: mcast: better catch silly mtu values

2018-01-01 Thread Greg Kroah-Hartman
4.9-stable review patch.  If anyone has any objections, please let me know.

--

From: Eric Dumazet <eduma...@google.com>


[ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

syzkaller reported crashes in IPv6 stack [1]

Xin Long found that lo MTU was set to silly values.

IPv6 stack reacts to changes to small MTU, by disabling itself under
RTNL.

But there is a window where threads not using RTNL can see a wrong
device mtu. This can lead to surprises, in mld code where it is assumed
the mtu is suitable.

Fix this by reading device mtu once and checking IPv6 minimal MTU.

[1]
 skbuff: skb_over_panic: text:10b86b8d len:196 put:20
 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo
 [ cut here ]
 kernel BUG at net/core/skbuff.c:104!
 invalid opcode:  [#1] SMP KASAN
 Dumping ftrace buffer:
(ftrace buffer empty)
 Modules linked in:
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
 Google 01/01/2011
 RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
 RSP: 0018:8801db307508 EFLAGS: 00010286
 RAX: 0082 RBX: 8801c517e840 RCX: 
 RDX: 0082 RSI: 11003b660e61 RDI: ed003b660e95
 RBP: 8801db307570 R08: 11003b660e23 R09: 
 R10:  R11:  R12: 85bd4020
 R13: 84754ed2 R14: 0014 R15: 8801c4e26540
 FS:  () GS:8801db30() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  
  skb_over_panic net/core/skbuff.c:109 [inline]
  skb_put+0x181/0x1c0 net/core/skbuff.c:1694
  add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
  add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
  mld_send_cr net/ipv6/mcast.c:1903 [inline]
  mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
  call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
  expire_timers kernel/time/timer.c:1357 [inline]
  __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
  __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x1d3/0x210 kernel/softirq.c:405
  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
Tested-by: Xin Long <lucien@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 net/ipv6/mcast.c |   25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1682,16 +1682,16 @@ static int grec_size(struct ifmcaddr6 *p
 }
 
 static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-   int type, struct mld2_grec **ppgr)
+   int type, struct mld2_grec **ppgr, unsigned int mtu)
 {
-   struct net_device *dev = pmc->idev->dev;
struct mld2_report *pmr;
struct mld2_grec *pgr;
 
-   if (!skb)
-   skb = mld_newpack(pmc->idev, dev->mtu);
-   if (!skb)
-   return NULL;
+   if (!skb) {
+   skb = mld_newpack(pmc->idev, mtu);
+   if (!skb)
+   return NULL;
+   }
pgr = (struct mld2_grec *)skb_put(skb, sizeof(struct mld2_grec));
pgr->grec_type = type;
pgr->grec_auxwords = 0;
@@ -1714,10 +1714,15 @@ static struct sk_buff *add_grec(struct s
struct mld2_grec *pgr = NULL;
struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list;
int scount, stotal, first, isquery, truncate;
+   unsigned int mtu;
 
if (pmc->mca_flags & MAF_NOREPORT)
return skb;
 
+   mtu = READ_ONCE(dev->mtu);
+   if (mtu < IPV6_MIN_MTU)
+   return skb;
+
isquery = type == MLD2_MODE_IS_INCLUDE ||
  type == MLD2_MODE_IS_EXCLUDE;
truncate = type == MLD2_MODE_IS_EXCLUDE ||
@@ -1738,7 +1743,7 @@ static struct sk_buff *add_grec(struct s
AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) {
if (skb)
mld_sendpack(skb);
-   skb = mld_newpack(idev, dev->mtu);
+   skb = mld_newpack(idev, mtu);
}
}
first = 1;
@@ -1774,12 +1779,12 @@ static struct sk_buff *add_


[PATCH 4.4 35/63] ipv6: mcast: better catch silly mtu values

2018-01-01 Thread Greg Kroah-Hartman
4.4-stable review patch.  If anyone has any objections, please let me know.

--

From: Eric Dumazet <eduma...@google.com>


[ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

syzkaller reported crashes in IPv6 stack [1]

Xin Long found that lo MTU was set to silly values.

IPv6 stack reacts to changes to small MTU, by disabling itself under
RTNL.

But there is a window where threads not using RTNL can see a wrong
device mtu. This can lead to surprises, in mld code where it is assumed
the mtu is suitable.

Fix this by reading device mtu once and checking IPv6 minimal MTU.

[1]
 skbuff: skb_over_panic: text:10b86b8d len:196 put:20
 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo
 [ cut here ]
 kernel BUG at net/core/skbuff.c:104!
 invalid opcode:  [#1] SMP KASAN
 Dumping ftrace buffer:
(ftrace buffer empty)
 Modules linked in:
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
 Google 01/01/2011
 RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
 RSP: 0018:8801db307508 EFLAGS: 00010286
 RAX: 0082 RBX: 8801c517e840 RCX: 
 RDX: 0082 RSI: 11003b660e61 RDI: ed003b660e95
 RBP: 8801db307570 R08: 11003b660e23 R09: 
 R10:  R11:  R12: 85bd4020
 R13: 84754ed2 R14: 0014 R15: 8801c4e26540
 FS:  () GS:8801db30() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  
  skb_over_panic net/core/skbuff.c:109 [inline]
  skb_put+0x181/0x1c0 net/core/skbuff.c:1694
  add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
  add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
  mld_send_cr net/ipv6/mcast.c:1903 [inline]
  mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
  call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
  expire_timers kernel/time/timer.c:1357 [inline]
  __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
  __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x1d3/0x210 kernel/softirq.c:405
  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
Tested-by: Xin Long <lucien@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 net/ipv6/mcast.c |   25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1668,16 +1668,16 @@ static int grec_size(struct ifmcaddr6 *p
 }
 
 static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-   int type, struct mld2_grec **ppgr)
+   int type, struct mld2_grec **ppgr, unsigned int mtu)
 {
-   struct net_device *dev = pmc->idev->dev;
struct mld2_report *pmr;
struct mld2_grec *pgr;
 
-   if (!skb)
-   skb = mld_newpack(pmc->idev, dev->mtu);
-   if (!skb)
-   return NULL;
+   if (!skb) {
+   skb = mld_newpack(pmc->idev, mtu);
+   if (!skb)
+   return NULL;
+   }
pgr = (struct mld2_grec *)skb_put(skb, sizeof(struct mld2_grec));
pgr->grec_type = type;
pgr->grec_auxwords = 0;
@@ -1700,10 +1700,15 @@ static struct sk_buff *add_grec(struct s
struct mld2_grec *pgr = NULL;
struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list;
int scount, stotal, first, isquery, truncate;
+   unsigned int mtu;
 
if (pmc->mca_flags & MAF_NOREPORT)
return skb;
 
+   mtu = READ_ONCE(dev->mtu);
+   if (mtu < IPV6_MIN_MTU)
+   return skb;
+
isquery = type == MLD2_MODE_IS_INCLUDE ||
  type == MLD2_MODE_IS_EXCLUDE;
truncate = type == MLD2_MODE_IS_EXCLUDE ||
@@ -1724,7 +1729,7 @@ static struct sk_buff *add_grec(struct s
AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) {
if (skb)
mld_sendpack(skb);
-   skb = mld_newpack(idev, dev->mtu);
+   skb = mld_newpack(idev, mtu);
}
}
first = 1;
@@ -1751,12 +1756,12 @@ static struct sk_buff *add_


[PATCH 3.18 19/32] ipv6: mcast: better catch silly mtu values

2018-01-01 Thread Greg Kroah-Hartman
3.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Eric Dumazet <eduma...@google.com>


[ Upstream commit b9b312a7a451e9c098921856e7cfbc201120e1a7 ]

syzkaller reported crashes in IPv6 stack [1]

Xin Long found that lo MTU was set to silly values.

IPv6 stack reacts to changes to small MTU, by disabling itself under
RTNL.

But there is a window where threads not using RTNL can see a wrong
device mtu. This can lead to surprises, in mld code where it is assumed
the mtu is suitable.

Fix this by reading device mtu once and checking IPv6 minimal MTU.

[1]
 skbuff: skb_over_panic: text:10b86b8d len:196 put:20
 head:3b477e60 data:0e85441e tail:0xd4 end:0xc0 dev:lo
 [ cut here ]
 kernel BUG at net/core/skbuff.c:104!
 invalid opcode:  [#1] SMP KASAN
 Dumping ftrace buffer:
(ftrace buffer empty)
 Modules linked in:
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc2-mm1+ #39
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
 Google 01/01/2011
 RIP: 0010:skb_panic+0x15c/0x1f0 net/core/skbuff.c:100
 RSP: 0018:8801db307508 EFLAGS: 00010286
 RAX: 0082 RBX: 8801c517e840 RCX: 
 RDX: 0082 RSI: 11003b660e61 RDI: ed003b660e95
 RBP: 8801db307570 R08: 11003b660e23 R09: 
 R10:  R11:  R12: 85bd4020
 R13: 84754ed2 R14: 0014 R15: 8801c4e26540
 FS:  () GS:8801db30() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 00463610 CR3: 0001c6698000 CR4: 001406e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  
  skb_over_panic net/core/skbuff.c:109 [inline]
  skb_put+0x181/0x1c0 net/core/skbuff.c:1694
  add_grhead.isra.24+0x42/0x3b0 net/ipv6/mcast.c:1695
  add_grec+0xa55/0x1060 net/ipv6/mcast.c:1817
  mld_send_cr net/ipv6/mcast.c:1903 [inline]
  mld_ifc_timer_expire+0x4d2/0x770 net/ipv6/mcast.c:2448
  call_timer_fn+0x23b/0x840 kernel/time/timer.c:1320
  expire_timers kernel/time/timer.c:1357 [inline]
  __run_timers+0x7e1/0xb60 kernel/time/timer.c:1660
  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
  __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x1d3/0x210 kernel/softirq.c:405
  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:920

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
Tested-by: Xin Long <lucien@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 net/ipv6/mcast.c |   25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1670,16 +1670,16 @@ static int grec_size(struct ifmcaddr6 *p
 }
 
 static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-   int type, struct mld2_grec **ppgr)
+   int type, struct mld2_grec **ppgr, unsigned int mtu)
 {
-   struct net_device *dev = pmc->idev->dev;
struct mld2_report *pmr;
struct mld2_grec *pgr;
 
-   if (!skb)
-   skb = mld_newpack(pmc->idev, dev->mtu);
-   if (!skb)
-   return NULL;
+   if (!skb) {
+   skb = mld_newpack(pmc->idev, mtu);
+   if (!skb)
+   return NULL;
+   }
pgr = (struct mld2_grec *)skb_put(skb, sizeof(struct mld2_grec));
pgr->grec_type = type;
pgr->grec_auxwords = 0;
@@ -1702,10 +1702,15 @@ static struct sk_buff *add_grec(struct s
struct mld2_grec *pgr = NULL;
struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list;
int scount, stotal, first, isquery, truncate;
+   unsigned int mtu;
 
if (pmc->mca_flags & MAF_NOREPORT)
return skb;
 
+   mtu = READ_ONCE(dev->mtu);
+   if (mtu < IPV6_MIN_MTU)
+   return skb;
+
isquery = type == MLD2_MODE_IS_INCLUDE ||
  type == MLD2_MODE_IS_EXCLUDE;
truncate = type == MLD2_MODE_IS_EXCLUDE ||
@@ -1726,7 +1731,7 @@ static struct sk_buff *add_grec(struct s
AVAILABLE(skb) < grec_size(pmc, type, gdeleted, sdeleted)) {
if (skb)
mld_sendpack(skb);
-   skb = mld_newpack(idev, dev->mtu);
+   skb = mld_newpack(idev, mtu);
}
}
first = 1;
@@ -1753,12 +1758,12 @@ static struct sk_buff *add_


[PATCH 4.12 37/43] ahci: dont use MSI for devices with the silly Intel NVMe remapping scheme

2017-09-08 Thread Greg Kroah-Hartman
4.12-stable review patch.  If anyone has any objections, please let me know.

--

From: Christoph Hellwig 

commit f723fa4e69920f6a5dd5fa0d10ce90e2f14d189c upstream.

Intel AHCI controllers that also hide NVMe devices in their bar
can't use MSI interrupts, so disable them.

Reported-by: John Loy 
Tested-by: John Loy 
Signed-off-by: Christoph Hellwig 
Fixes: d684a90d38e2 ("ahci: per-port msix support")
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman 

---
 drivers/ata/ahci.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1467,7 +1467,14 @@ static void ahci_remap_check(struct pci_
return;
 
 	dev_warn(&pdev->dev, "Found %d remapped NVMe devices.\n", count);
-	dev_warn(&pdev->dev, "Switch your BIOS from RAID to AHCI mode to use them.\n");
+	dev_warn(&pdev->dev,
+		 "Switch your BIOS from RAID to AHCI mode to use them.\n");
+
+   /*
+* Don't rely on the msi-x capability in the remap case,
+* share the legacy interrupt across ahci and remapped devices.
+*/
+   hpriv->flags |= AHCI_HFLAG_NO_MSI;
 }
 
 static int ahci_get_irq_vector(struct ata_host *host, int port)
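
(For context, a hedged sketch of how an hflag like this is typically
honored when interrupt vectors are allocated; illustrative only, not the
actual ahci init path, with nvec and n_ports assumed for the example.)

    if (hpriv->flags & AHCI_HFLAG_NO_MSI)
            /* fall back to the shared legacy interrupt */
            nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_LEGACY);
    else
            nvec = pci_alloc_irq_vectors(pdev, 1, n_ports,
                                         PCI_IRQ_MSIX | PCI_IRQ_MSI);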




[PATCH 4.13 41/47] ahci: dont use MSI for devices with the silly Intel NVMe remapping scheme

2017-09-08 Thread Greg Kroah-Hartman
4.13-stable review patch.  If anyone has any objections, please let me know.

--

From: Christoph Hellwig 

commit f723fa4e69920f6a5dd5fa0d10ce90e2f14d189c upstream.

Intel AHCI controllers that also hide NVMe devices in their bar
can't use MSI interrupts, so disable them.

Reported-by: John Loy 
Tested-by: John Loy 
Signed-off-by: Christoph Hellwig 
Fixes: d684a90d38e2 ("ahci: per-port msix support")
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman 

---
 drivers/ata/ahci.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1469,7 +1469,14 @@ static void ahci_remap_check(struct pci_
return;
 
 	dev_warn(&pdev->dev, "Found %d remapped NVMe devices.\n", count);
-	dev_warn(&pdev->dev, "Switch your BIOS from RAID to AHCI mode to use them.\n");
+	dev_warn(&pdev->dev,
+		 "Switch your BIOS from RAID to AHCI mode to use them.\n");
+
+   /*
+* Don't rely on the msi-x capability in the remap case,
+* share the legacy interrupt across ahci and remapped devices.
+*/
+   hpriv->flags |= AHCI_HFLAG_NO_MSI;
 }
 
 static int ahci_get_irq_vector(struct ata_host *host, int port)




Moved Low-Jitter Doom 3 video and other silly videos.

2017-02-20 Thread Ove Kent Karlsen
For the interested, I moved the video I did of my low-jitter inspired
configurations of Doom 3, which was a very jitter-sensitive game.


A quick summary: by removing and reducing unnecessary code and codepaths,
such as reducing HZ to 90, choosing low-jitter components in the
configuration, and renicing X to remove its bottleneck aspect, I was able
to run the 3-pass layer engine of Doom 3 at 72.7x3 FPS, without any
hiccup (previously thought to be caused by disk reads etc.).


The video is now here: https://www.youtube.com/watch?v=xDct6vVvFxA

Now also with an upscale filter for pseudo 4K.

Also on here is an Unreal video, with an undocumented antialiasing mode,
with maximal pixel information for 4K:
https://www.youtube.com/watch?v=MKqfAbPqleo


Also a fun filter for taking 160x200 to a whole 4K :o)
https://www.youtube.com/watch?v=E0tkH7kWB6A&t=33s


Best Regards,
Ove Kent Karlsen




Re: [PATCH] serial: core: remove silly test for uart_state

2017-01-06 Thread Andy Shevchenko
On Fri, Jan 6, 2017 at 4:45 AM, Thadeu Lima de Souza Cascardo
 wrote:
> The polling functions were checking for a NULL uart_state, which is
> indexed from uart_driver->state. It should be always allocated and
> non-NULL when the tty_driver is registered, and line should not be
> larger than the tty_driver->num anyways.

I'm not sure this is guaranteed. Let Peter comment on it.

Peter, what is your opinion?

P.S. If Peter is okay with it, I don't see any problems with the code itself either.

>
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> ---
>  drivers/tty/serial/serial_core.c | 16 
>  1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
> index 9939c3d9912b..6f7247114ef8 100644
> --- a/drivers/tty/serial/serial_core.c
> +++ b/drivers/tty/serial/serial_core.c
> @@ -2330,9 +2330,6 @@ static int uart_poll_init(struct tty_driver *driver, int line, char *options)
> int flow = 'n';
> int ret = 0;
>
> -   if (!state)
> -   return -1;
> -
> 	tport = &state->port;
> 	mutex_lock(&tport->mutex);
>
> @@ -2367,12 +2364,10 @@ static int uart_poll_get_char(struct tty_driver *driver, int line)
> struct uart_port *port;
> int ret = -1;
>
> -   if (state) {
> -   port = uart_port_ref(state);
> -   if (port) {
> -   ret = port->ops->poll_get_char(port);
> -   uart_port_deref(port);
> -   }
> +   port = uart_port_ref(state);
> +   if (port) {
> +   ret = port->ops->poll_get_char(port);
> +   uart_port_deref(port);
> }
> return ret;
>  }
> @@ -2383,9 +2378,6 @@ static void uart_poll_put_char(struct tty_driver *driver, int line, char ch)
> struct uart_state *state = drv->state + line;
> struct uart_port *port;
>
> -   if (!state)
> -   return;
> -
> port = uart_port_ref(state);
> if (!port)
> return;
> --
> 2.11.0
>



-- 
With Best Regards,
Andy Shevchenko



[PATCH] serial: core: remove silly test for uart_state

2017-01-05 Thread Thadeu Lima de Souza Cascardo
The polling functions were checking for a NULL uart_state, which is
indexed from uart_driver->state. It should be always allocated and
non-NULL when the tty_driver is registered, and line should not be
larger than the tty_driver->num anyways.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 drivers/tty/serial/serial_core.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index 9939c3d9912b..6f7247114ef8 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -2330,9 +2330,6 @@ static int uart_poll_init(struct tty_driver *driver, int line, char *options)
int flow = 'n';
int ret = 0;
 
-   if (!state)
-   return -1;
-
	tport = &state->port;
	mutex_lock(&tport->mutex);
 
@@ -2367,12 +2364,10 @@ static int uart_poll_get_char(struct tty_driver *driver, int line)
struct uart_port *port;
int ret = -1;
 
-   if (state) {
-   port = uart_port_ref(state);
-   if (port) {
-   ret = port->ops->poll_get_char(port);
-   uart_port_deref(port);
-   }
+   port = uart_port_ref(state);
+   if (port) {
+   ret = port->ops->poll_get_char(port);
+   uart_port_deref(port);
}
return ret;
 }
@@ -2383,9 +2378,6 @@ static void uart_poll_put_char(struct tty_driver *driver, int line, char ch)
struct uart_state *state = drv->state + line;
struct uart_port *port;
 
-   if (!state)
-   return;
-
port = uart_port_ref(state);
if (!port)
return;
-- 
2.11.0
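
(A hedged sketch of why the removed NULL test was dead code: registration
allocates the whole state array up front and fails if it cannot, so the
poll callbacks can only ever run with a non-NULL drv->state. Sketched from
the serial core's behavior at the time, not quoted from it.)

    /* inside uart_register_driver(), before any tty can exist */
    drv->state = kzalloc(sizeof(struct uart_state) * drv->nr, GFP_KERNEL);
    if (!drv->state)
            goto out;   /* registration fails; polling never runs */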




[tip:perf/core] perf/x86/intel/bts: Kill a silly warning

2016-09-10 Thread tip-bot for Alexander Shishkin
Commit-ID:  ef9ef3befa0d76008e988a9ed9fe439e803351b9
Gitweb: http://git.kernel.org/tip/ef9ef3befa0d76008e988a9ed9fe439e803351b9
Author: Alexander Shishkin <alexander.shish...@linux.intel.com>
AuthorDate: Tue, 6 Sep 2016 16:23:53 +0300
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Sat, 10 Sep 2016 11:15:38 +0200

perf/x86/intel/bts: Kill a silly warning

At the moment, intel_bts will WARN() out if there is more than one
event writing to the same ring buffer, via SET_OUTPUT, and will only
send data from one event to a buffer.

There is no reason to have this warning in, so kill it.
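
(Background sketch, not from the patch: SET_OUTPUT is the perf ioctl that
redirects one event's records into another event's ring buffer, which is
exactly the multi-writer case the warning used to trip on. This assumes a
perf_event_open() wrapper around the raw syscall and a configured attr.)

    int fd_a = perf_event_open(&attr, 0, -1, -1, 0);
    int fd_b = perf_event_open(&attr, 0, -1, -1, 0);
    /* fd_b's records now land in fd_a's mmap'd buffer */
    ioctl(fd_b, PERF_EVENT_IOC_SET_OUTPUT, fd_a);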

Signed-off-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: vi...@deater.net
Link: http://lkml.kernel.org/r/20160906132353.19887-6-alexander.shish...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 9233edf..bdcd651 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -378,8 +378,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
 	return 0;
 
 	head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-	if (WARN_ON_ONCE(head != local_read(&buf->head)))
-		return -EINVAL;
 
 	phys = &buf->buf[buf->cur_buf];
space = phys->offset + phys->displacement + phys->size - head;



[PATCH v2 5/5] perf/x86/intel/bts: Kill a silly warning

2016-09-06 Thread Alexander Shishkin
At the moment, intel_bts will WARN() out if there is more than one
event writing to the same ring buffer, via SET_OUTPUT, and will only
send data from one event to a buffer.

There is no reason to have this warning in, so kill it.

Signed-off-by: Alexander Shishkin 
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 9233edf993..bdcd651099 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -378,8 +378,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
 	return 0;
 
 	head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-	if (WARN_ON_ONCE(head != local_read(&buf->head)))
-		return -EINVAL;
 
 	phys = &buf->buf[buf->cur_buf];
space = phys->offset + phys->displacement + phys->size - head;
-- 
2.9.3




[PATCH 2/3] perf/x86/intel/bts: Kill a silly warning

2016-08-23 Thread Alexander Shishkin
At the moment, intel_bts will WARN() out if there is more than one
event writing to the same ring buffer, via SET_OUTPUT, and will only
send data from one event to a buffer.

There is no reason to have this warning in, so kill it.

Signed-off-by: Alexander Shishkin 
---
 arch/x86/events/intel/bts.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 0b8ccff564..41975c6c2d 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -363,8 +363,6 @@ bts_buffer_reset(struct bts_buffer *buf, struct perf_output_handle *handle)
 	return 0;
 
 	head = handle->head & ((buf->nr_pages << PAGE_SHIFT) - 1);
-	if (WARN_ON_ONCE(head != local_read(&buf->head)))
-		return -EINVAL;
 
 	phys = &buf->buf[buf->cur_buf];
space = phys->offset + phys->displacement + phys->size - head;
-- 
2.8.1




Re: [PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs

2016-08-21 Thread Theodore Ts'o
On Sun, Aug 21, 2016 at 12:53:15PM +0300, Jan Varho wrote:
> On Mon, Jun 13, 2016 at 6:48 PM, Theodore Ts'o  wrote:
> > +static inline void maybe_reseed_primary_crng(void)
> > +{
> > +   if (crng_init > 2 &&
> > +	    time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
> > +		crng_reseed(&primary_crng, &input_pool);
> > +}
> 
> Is the above function (which is now in 4.8-rc2) supposed to do
> something? It seems to have no callers and the maximum value of
> crng_init is 2.

It's dead code.  Its function got moved into _extract_crng(), and you're
right, these days crng_init never gets above 2.  Thanks for pointing
that out.  I'll take it out as a cleanup patch.

 - Ted



Re: [PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs

2016-08-21 Thread Jan Varho
On Mon, Jun 13, 2016 at 6:48 PM, Theodore Ts'o  wrote:
> +static inline void maybe_reseed_primary_crng(void)
> +{
> +   if (crng_init > 2 &&
> +   time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
> +   crng_reseed(&primary_crng, &input_pool);
> +}

Hi,

Is the above function (which is now in 4.8-rc2) supposed to do
something? It seems to have no callers and the maximum value of
crng_init is 2.
--
Jan Varho


Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash

2016-07-28 Thread Tony Luck
On Thu, Jul 28, 2016 at 6:41 AM, Theodore Ts'o  wrote:
> On Thu, Jul 28, 2016 at 09:24:08AM +0200, Heiko Carstens wrote:
>>
>> Oh, I just realized that Linus pulled your changes. Actually I was hoping
>> we could get this fixed before the broken code would be merged.
>> Could you please make sure the bug fix gets included as soon as possible?
>
> Yes, I'll send the pull request ASAP.

Also broke ia64.  Same fix works for me.

Tested-by: Tony Luck 


Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash

2016-07-28 Thread Joe Perches
On Wed, 2016-07-27 at 23:46 -0400, Theodore Ts'o wrote:
> On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> > 
> > it looks like your patch "random: make /dev/urandom scalable for silly
> > userspace programs" within linux-next seems to be a bit broken:
> > 
> > It causes this allocation failure and subsequent crash on s390 with fake
> > NUMA enabled
> Thanks for reporting this.  This patch fixes things for you, yes?

trivia:

> diff --git a/drivers/char/random.c b/drivers/char/random.c
[]
> @@ -1668,13 +1668,12 @@ static int rand_initialize(void)
>  #ifdef CONFIG_NUMA
>   pool = kmalloc(num_nodes * sizeof(void *),
>      GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);

The __GFP_ZERO is unusual and this could use kcalloc instead.
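
A minimal sketch of the kcalloc() form being suggested (illustrative only;
kcalloc() zeroes the array and checks the num_nodes * sizeof(void *)
multiplication for overflow, so the explicit __GFP_ZERO goes away):

	pool = kcalloc(num_nodes, sizeof(void *),
		       GFP_KERNEL | __GFP_NOFAIL);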

> - for (i=0; i < num_nodes; i++) {
> + for_each_online_node(i) {
>   crng = kmalloc_node(sizeof(struct crng_state),
>   GFP_KERNEL | __GFP_NOFAIL, i);
>   spin_lock_init(&crng->lock);
>   crng_initialize(crng);
>   pool[i] = crng;
> -
>   }
>   mb();
>   crng_node_pool = pool;



Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash

2016-07-28 Thread Theodore Ts'o
On Thu, Jul 28, 2016 at 09:24:08AM +0200, Heiko Carstens wrote:
> 
> Oh, I just realized that Linus pulled your changes. Actually I was hoping
> we could get this fixed before the broken code would be merged.
> Could you please make sure the bug fix gets included as soon as possible?

Yes, I'll send the pull request ASAP.

- Ted


Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash

2016-07-28 Thread Heiko Carstens
On Thu, Jul 28, 2016 at 07:55:48AM +0200, Heiko Carstens wrote:
> On Wed, Jul 27, 2016 at 11:46:01PM -0400, Theodore Ts'o wrote:
> > On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> > > it looks like your patch "random: make /dev/urandom scalable for silly
> > > userspace programs" within linux-next seems to be a bit broken:
> > > 
> > > It causes this allocation failure and subsequent crash on s390 with fake
> > > NUMA enabled
> > 
> > Thanks for reporting this.  This patch fixes things for you, yes?
> > 
> > - Ted
> 
> Yes, it does. It's actually the same as what I did to fix this ;)

Oh, I just realized that Linus pulled your changes. Actually I was hoping
we could get this fixed before the broken code would be merged.
Could you please make sure the bug fix gets included as soon as possible?

Right now booting a kernel with any defconfig on s390 will crash because of
this.

I will also change the fake NUMA code on s390, since it doesn't make sense
to have possible but not online nodes (in this case).

> > commit 59b8d4f1f5d26e4ca92172ff6dcd1492cdb39613
> > Author: Theodore Ts'o <ty...@mit.edu>
> > Date:   Wed Jul 27 23:30:25 2016 -0400
> > 
> > random: use for_each_online_node() to iterate over NUMA nodes
> > 
> > This fixes a crash on s390 with fake NUMA enabled.
> > 
> > Reported-by: Heiko Carstens <heiko.carst...@de.ibm.com>
> > Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
> > Signed-off-by: Theodore Ts'o <ty...@mit.edu>
> > 
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index 8d0af74..7f06224 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1668,13 +1668,12 @@ static int rand_initialize(void)
> >  #ifdef CONFIG_NUMA
> > pool = kmalloc(num_nodes * sizeof(void *),
> >GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
> > -   for (i=0; i < num_nodes; i++) {
> > +   for_each_online_node(i) {
> > crng = kmalloc_node(sizeof(struct crng_state),
> > GFP_KERNEL | __GFP_NOFAIL, i);
> > spin_lock_init(&crng->lock);
> > crng_initialize(crng);
> > pool[i] = crng;
> > -
> > }
> > mb();
> > crng_node_pool = pool;
> > 
> 



Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash

2016-07-27 Thread Heiko Carstens
On Wed, Jul 27, 2016 at 11:46:01PM -0400, Theodore Ts'o wrote:
> On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> > it looks like your patch "random: make /dev/urandom scalable for silly
> > userspace programs" within linux-next seems to be a bit broken:
> > 
> > It causes this allocation failure and subsequent crash on s390 with fake
> > NUMA enabled
> 
> Thanks for reporting this.  This patch fixes things for you, yes?
> 
>   - Ted

Yes, it does. It's actually the same as what I did to fix this ;)

> commit 59b8d4f1f5d26e4ca92172ff6dcd1492cdb39613
> Author: Theodore Ts'o <ty...@mit.edu>
> Date:   Wed Jul 27 23:30:25 2016 -0400
> 
> random: use for_each_online_node() to iterate over NUMA nodes
> 
> This fixes a crash on s390 with fake NUMA enabled.
> 
> Reported-by: Heiko Carstens <heiko.carst...@de.ibm.com>
> Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
> Signed-off-by: Theodore Ts'o <ty...@mit.edu>
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 8d0af74..7f06224 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1668,13 +1668,12 @@ static int rand_initialize(void)
>  #ifdef CONFIG_NUMA
>   pool = kmalloc(num_nodes * sizeof(void *),
>  GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
> - for (i=0; i < num_nodes; i++) {
> + for_each_online_node(i) {
>   crng = kmalloc_node(sizeof(struct crng_state),
>   GFP_KERNEL | __GFP_NOFAIL, i);
>   spin_lock_init(&crng->lock);
>   crng_initialize(crng);
>   pool[i] = crng;
> -
>   }
>   mb();
>   crng_node_pool = pool;
> 



Re: [BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash

2016-07-27 Thread Theodore Ts'o
On Wed, Jul 27, 2016 at 09:14:00AM +0200, Heiko Carstens wrote:
> it looks like your patch "random: make /dev/urandom scalable for silly
> userspace programs" within linux-next seems to be a bit broken:
> 
> It causes this allocation failure and subsequent crash on s390 with fake
> NUMA enabled

Thanks for reporting this.  This patch fixes things for you, yes?

- Ted

commit 59b8d4f1f5d26e4ca92172ff6dcd1492cdb39613
Author: Theodore Ts'o <ty...@mit.edu>
Date:   Wed Jul 27 23:30:25 2016 -0400

random: use for_each_online_node() to iterate over NUMA nodes

This fixes a crash on s390 with fake NUMA enabled.

Reported-by: Heiko Carstens <heiko.carst...@de.ibm.com>
Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
Signed-off-by: Theodore Ts'o <ty...@mit.edu>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8d0af74..7f06224 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1668,13 +1668,12 @@ static int rand_initialize(void)
 #ifdef CONFIG_NUMA
pool = kmalloc(num_nodes * sizeof(void *),
   GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
-   for (i=0; i < num_nodes; i++) {
+   for_each_online_node(i) {
crng = kmalloc_node(sizeof(struct crng_state),
GFP_KERNEL | __GFP_NOFAIL, i);
spin_lock_init(&crng->lock);
crng_initialize(crng);
pool[i] = crng;
-
}
mb();
crng_node_pool = pool;


[BUG -next] "random: make /dev/urandom scalable for silly userspace programs" causes crash

2016-07-27 Thread Heiko Carstens
Hi Ted,

it looks like your patch "random: make /dev/urandom scalable for silly
userspace programs" within linux-next seems to be a bit broken:

It causes this allocation failure and subsequent crash on s390 with fake
NUMA enabled:

[0.533195] SLUB: Unable to allocate memory on node 1, 
gfp=0x24008c0(GFP_KERNEL|__GFP_NOFAIL)
[0.533198]   cache: kmalloc-192, object size: 192, buffer size: 528, default order: 3, min order: 0
[0.533202]   node 0: slabs: 2, objs: 124, free: 17
[0.533208] Unable to handle kernel pointer dereference in virtual kernel 
address space
[0.533211] Failing address:  TEID: 0483
...
[0.533276] Krnl PSW : 0704e0018000 001a853e 
(lockdep_init_map+0x1e/0x220)
[0.533281]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 
RI:0 EA:3
   Krnl GPRS: 00a23400 370c8008 0060 
00bedc90
[0.533285]02070800  0001 

[0.533287]3743d3f8 3743d408 02070800 
00bedc90
[0.533289]0048 009c2030 370cfd00 
370cfcc0
[0.533295] Krnl Code: 001a852e: a7840001brc 8,1a8530
   001a8532: e3f0ffc0ff71   lay %r15,-64(%r15)
  #001a8538: e3e0f0980024   stg %r14,152(%r15)
  >001a853e: e5482008   mvghi   8(%r2),0
   001a8544: e5482010   mvghi   16(%r2),0
   001a854a: 58100370   l   %r1,880
   001a854e: 50102020   st  %r1,32(%r2)
   001a8552: b90400c2   lgr %r12,%r2
[0.533313] Call Trace:
[0.533315] ([<0001>] 0x1)
[0.533318] ([<001b4220>] __raw_spin_lock_init+0x50/0x80)
[0.533320] ([<00759e7a>] rand_initialize+0xc2/0xf0)
[0.533322] ([<001002cc>] do_one_initcall+0xb4/0x140)
[0.533325] ([<00ef2cc0>] kernel_init_freeable+0x140/0x2d8)
[0.533328] ([<009b07ea>] kernel_init+0x2a/0x150)
[0.50] ([<009bd782>] kernel_thread_starter+0x6/0xc)
[0.52] ([<009bd77c>] kernel_thread_starter+0x0/0xc)

To me it looks like rand_initialize() is broken with CONFIG_NUMA:

static int rand_initialize(void)
{
#ifdef CONFIG_NUMA
int i;
int num_nodes = num_possible_nodes();
struct crng_state *crng;
struct crng_state **pool;
#endif

init_std_data(&input_pool);
init_std_data(&blocking_pool);
crng_initialize(&primary_crng);

#ifdef CONFIG_NUMA
pool = kmalloc(num_nodes * sizeof(void *),
   GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
for (i=0; i < num_nodes; i++) {
crng = kmalloc_node(sizeof(struct crng_state),
GFP_KERNEL | __GFP_NOFAIL, i);
spin_lock_init(&crng->lock);
crng_initialize(crng);
pool[i] = crng;

}
mb();
crng_node_pool = pool;
#endif
return 0;
}
early_initcall(rand_initialize);

First, the for loop should use for_each_node() to skip nodes that are not
possible, no?

However that wouldn't be enough, since in this case it crashed because node
1 is in the possible map, but it isn't online and doesn't have any memory,
which explains the allocation failure and the subsequent crash when
calling spin_lock_init().

I think the proper fix would be to simply use for_each_online_node(); at
least that fixes the crash on s390.
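
A minimal sketch of the difference (setup_node_crng() here is hypothetical
shorthand for the kmalloc_node + spin_lock_init + crng_initialize sequence
above; node 1 is possible but not online and has no memory):

	/* Index loop: visits ids 0..num_possible_nodes()-1, so it also
	 * hits possible-but-offline node 1, where kmalloc_node() fails
	 * and the subsequent spin_lock_init() crashes. */
	for (i = 0; i < num_nodes; i++)
		setup_node_crng(i);

	/* Online-only iteration: offline, memoryless nodes are skipped,
	 * which is the fix that works on s390. */
	for_each_online_node(i)
		setup_node_crng(i);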



[PATCH 6/7] random: make /dev/urandom scalable for silly userspace programs

2016-06-13 Thread Theodore Ts'o
On a 4 socket (NUMA) system where a large number of
application threads were all trying to read from /dev/urandom, this
can result in the system spending 80% of its time contending on the
global urandom spinlock.  The application should have used its own
PRNG, but let's try to keep it from running, lemming-like, straight
over the locking cliff.

Reported-by: Andi Kleen 
Signed-off-by: Theodore Ts'o 
---
 drivers/char/random.c | 62 +++
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 841f9a8..d640865 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -434,6 +434,8 @@ struct crng_state primary_crng = {
  */
 static int crng_init = 0;
 #define crng_ready() (likely(crng_init > 0))
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE]);
 static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
 static void process_random_ready_list(void);
 
@@ -754,6 +756,16 @@ static void credit_entropy_bits_safe(struct entropy_store *r, int nbits)
 
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
 static void crng_initialize(struct crng_state *crng)
 {
int i;
@@ -815,7 +827,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
if (num == 0)
return;
} else
-   extract_crng(buf.block);
+   _extract_crng(&primary_crng, buf.block);
spin_lock_irqsave(&primary_crng.lock, flags);
for (i = 0; i < 8; i++) {
unsigned long   rv;
@@ -835,19 +847,26 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
spin_unlock_irqrestore(&primary_crng.lock, flags);
 }
 
+static inline void maybe_reseed_primary_crng(void)
+{
+   if (crng_init > 2 &&
+   time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+   crng_reseed(&primary_crng, &input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
wait_event_interruptible(crng_init_wait, crng_ready());
 }
 
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE])
 {
unsigned long v, flags;
-   struct crng_state *crng = &primary_crng;
 
if (crng_init > 1 &&
time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-   crng_reseed(crng, &input_pool);
+   crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
spin_lock_irqsave(&crng->lock, flags);
if (arch_get_random_long(&v))
crng->state[14] ^= v;
@@ -857,6 +876,19 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
spin_unlock_irqrestore(&crng->lock, flags);
 }
 
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+   struct crng_state *crng = NULL;
+
+#ifdef CONFIG_NUMA
+   if (crng_node_pool)
+   crng = crng_node_pool[numa_node_id()];
+   if (crng == NULL)
+#endif
crng = &primary_crng;
+   _extract_crng(crng, out);
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
ssize_t ret = 0, i;
@@ -1575,9 +1607,31 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+   int i;
+   int num_nodes = num_possible_nodes();
+   struct crng_state *crng;
+   struct crng_state **pool;
+#endif
+
init_std_data(&input_pool);
init_std_data(&blocking_pool);
crng_initialize(&primary_crng);
+
+#ifdef CONFIG_NUMA
+   pool = kmalloc(num_nodes * sizeof(void *),
+  GFP_KERNEL|__GFP_NOFAIL|__GFP_ZERO);
+   for (i=0; i < num_nodes; i++) {
+   crng = kmalloc_node(sizeof(struct crng_state),
+   GFP_KERNEL | __GFP_NOFAIL, i);
+   spin_lock_init(>lock);
+   crng_initialize(crng);
+   pool[i] = crng;
+
+   }
+   mb();
+   crng_node_pool = pool;
+#endif
return 0;
 }
 early_initcall(rand_initialize);
-- 
2.5.0



Re: [PATCH 2/5] random: make /dev/urandom scalable for silly userspace programs

2016-05-30 Thread Theodore Ts'o
On Mon, May 30, 2016 at 08:03:59AM +0200, Stephan Mueller wrote:
> >  static int rand_initialize(void)
> >  {
> > +#ifdef CONFIG_NUMA
> > +   int i;
> > +   int num_nodes = num_possible_nodes();
> > +   struct crng_state *crng;
> > +
> > +   crng_node_pool = kmalloc(num_nodes * sizeof(void *),
> > +GFP_KERNEL|__GFP_NOFAIL);
> > +
> > +   for (i=0; i < num_nodes; i++) {
> > +   crng = kmalloc(sizeof(struct crng_state),
> > +  GFP_KERNEL | __GFP_NOFAIL);
> > +   initialize_crng(crng);
> 
> Could you please help me understand the logic flow here: The NUMA secondary 
> DRNGs are initialized before the input/blocking pools and the primary DRNG.
> 
> The initialization call uses get_random_bytes() for the secondary DRNGs. But
> since the primary DRNG is not yet initialized, where does get_random_bytes()
> get its randomness from?

Yeah, I screwed up.  The hunk of code starting with "crng_node_pool =
kmalloc(..." and the for loop afterwards should be moved to after
_initialize_crng().  Serves me right for not testing CONFIG_NUMA
before sending out the patches.
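
In other words, a sketch of the intended ordering (a hypothetical
rearrangement of the rand_initialize() from the patch, not a tested
replacement):

static int rand_initialize(void)
{
#ifdef CONFIG_NUMA
	int i;
	int num_nodes = num_possible_nodes();
	struct crng_state *crng;
#endif

	init_std_data(&input_pool);
	init_std_data(&blocking_pool);
	_initialize_crng(&primary_crng);	/* primary first... */
#ifdef CONFIG_NUMA
	/* ...then the per-node pool, so the secondary DRNGs are set up
	 * only after the primary exists. */
	crng_node_pool = kmalloc(num_nodes * sizeof(void *),
				 GFP_KERNEL|__GFP_NOFAIL);
	for (i = 0; i < num_nodes; i++) {
		crng = kmalloc(sizeof(struct crng_state),
			       GFP_KERNEL | __GFP_NOFAIL);
		initialize_crng(crng);
		crng_node_pool[i] = crng;
	}
#endif
	return 0;
}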

This is *not* about adding entropy; as you've noted, this is done very
early in boot up, before there has been any chance for any kind of
entropy to be collected in any of the input pools.  It's more of an
insurance policy, just in case on some platform it turns out that
assuming a bit's worth of entropy per interrupt was hopelessly
over-optimistic, at least the starting point will be different across
different kernels (and maybe different boot times, but on the sorts of
platforms where I'm most concerned, there may not be a real-time clock
and almost certainly not an architectural hwrng in the CPU).

   - Ted


Re: [PATCH 2/5] random: make /dev/urandom scalable for silly userspace programs

2016-05-30 Thread Stephan Mueller
On Monday, 30 May 2016, at 01:39:22, Theodore Ts'o wrote:

Hi Theodore,

> On a 4 socket (NUMA) system where a large number of
> application threads were all trying to read from /dev/urandom, this
> can result in the system spending 80% of its time contending on the
> global urandom spinlock.  The application should have used its own
> PRNG, but let's try to keep it from running, lemming-like, straight
> over the locking cliff.
> 
> Reported-by: Andi Kleen 
> Signed-off-by: Theodore Ts'o 
> ---
>  drivers/char/random.c | 57
> +++ 1 file changed, 53
> insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 3a4d865..1778c84 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -434,6 +434,8 @@ struct crng_state primary_crng = {
>   */
>  static int crng_init = 0;
>  #define crng_ready() (likely(crng_init >= 2))
> +static void _extract_crng(struct crng_state *crng,
> +   __u8 out[CHACHA20_BLOCK_SIZE]);
>  static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
>  static void process_random_ready_list(void);
> 
> @@ -754,6 +756,17 @@ static void credit_entropy_bits_safe(struct
> entropy_store *r, int nbits)
> 
>  static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
> 
> +#ifdef CONFIG_NUMA
> +/*
> + * Hack to deal with crazy userspace programs when they are all trying
> + * to access /dev/urandom in parallel.  The programs are almost
> + * certainly doing something terribly wrong, but we'll work around
> + * their brain damage.
> + */
> +static struct crng_state **crng_node_pool __read_mostly;
> +#endif
> +
> +
>  static void _initialize_crng(struct crng_state *crng)
>  {
>   int i;
> @@ -774,11 +787,13 @@ static void _initialize_crng(struct crng_state *crng)
>   crng->init_time = jiffies - CRNG_RESEED_INTERVAL - 1;
>  }
> 
> +#ifdef CONFIG_NUMA
>  static void initialize_crng(struct crng_state *crng)
>  {
>   _initialize_crng(crng);
>   spin_lock_init(&crng->lock);
>  }
> +#endif
> 
>  static int crng_fast_load(__u32 pool[4])
>  {
> @@ -818,7 +833,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
>   if (num == 0)
>   return;
>   } else
> - extract_crng(buf.block);
> + _extract_crng(&primary_crng, buf.block);
>   spin_lock_irqsave(&primary_crng.lock, flags);
>   for (i = 0; i < 8; i++) {
>   unsigned long   rv;
> @@ -838,19 +853,26 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
>   spin_unlock_irqrestore(&primary_crng.lock, flags);
>  }
> 
> +static inline void maybe_reseed_primary_crng(void)
> +{
> + if (crng_init > 2 &&
> + time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
> + crng_reseed(&primary_crng, &input_pool);
> +}
> +
>  static inline void crng_wait_ready(void)
>  {
>   wait_event_interruptible(crng_init_wait, crng_ready());
>  }
> 
> -static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> +static void _extract_crng(struct crng_state *crng,
> +   __u8 out[CHACHA20_BLOCK_SIZE])
>  {
>   unsigned long v, flags;
> - struct crng_state *crng = &primary_crng;
> 
>   if (crng_init > 2 &&
>   time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
> - crng_reseed(crng, &input_pool);
> + crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
>   spin_lock_irqsave(&crng->lock, flags);
>   if (arch_get_random_long(&v))
>   crng->state[14] ^= v;
> @@ -860,6 +882,17 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> spin_unlock_irqrestore(&crng->lock, flags);
>  }
> 
> +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> +{
> +#ifndef CONFIG_NUMA
> + struct crng_state *crng = &primary_crng;
> +#else
> + struct crng_state *crng = crng_node_pool[numa_node_id()];
> +#endif
> +
> + _extract_crng(crng, out);
> +}
> +
>  static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
>  {
>   ssize_t ret = 0, i;
> @@ -1573,6 +1606,22 @@ static void init_std_data(struct entropy_store *r)
>   */
>  static int rand_initialize(void)
>  {
> +#ifdef CONFIG_NUMA
> + int i;
> + int num_nodes = num_possible_nodes();
> + struct crng_state *crng;
> +
> + crng_node_pool = kmalloc(num_nodes * sizeof(void *),
> +  GFP_KERNEL|__GFP_NOFAIL);
> +
> + for (i=0; i < num_nodes; i++) {
> + crng = kmalloc(sizeof(struct crng_state),
> +GFP_KERNEL | __GFP_NOFAIL);
> + initialize_crng(crng);

Could you please help me understand the logic flow here: The NUMA secondary 
DRNGs are initialized before the input/blocking pools and the primary DRNG.

The initialization call uses get_random_bytes() for the secondary DRNGs. But
since the primary DRNG is not yet initialized, where does get_random_bytes()
get its randomness from?

> +  


[PATCH 2/5] random: make /dev/urandom scalable for silly userspace programs

2016-05-29 Thread Theodore Ts'o
On a 4 socket (NUMA) system where a large number of
application threads were all trying to read from /dev/urandom, this
can result in the system spending 80% of its time contending on the
global urandom spinlock.  The application should have used its own
PRNG, but let's try to keep it from running, lemming-like, straight
over the locking cliff.

Reported-by: Andi Kleen 
Signed-off-by: Theodore Ts'o 
---
 drivers/char/random.c | 57 +++
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3a4d865..1778c84 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -434,6 +434,8 @@ struct crng_state primary_crng = {
  */
 static int crng_init = 0;
 #define crng_ready() (likely(crng_init >= 2))
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE]);
 static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE]);
 static void process_random_ready_list(void);
 
@@ -754,6 +756,17 @@ static void credit_entropy_bits_safe(struct entropy_store *r, int nbits)
 
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
+
 static void _initialize_crng(struct crng_state *crng)
 {
int i;
@@ -774,11 +787,13 @@ static void _initialize_crng(struct crng_state *crng)
crng->init_time = jiffies - CRNG_RESEED_INTERVAL - 1;
 }
 
+#ifdef CONFIG_NUMA
 static void initialize_crng(struct crng_state *crng)
 {
_initialize_crng(crng);
spin_lock_init(&crng->lock);
 }
+#endif
 
 static int crng_fast_load(__u32 pool[4])
 {
@@ -818,7 +833,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
if (num == 0)
return;
} else
-   extract_crng(buf.block);
+   _extract_crng(&primary_crng, buf.block);
spin_lock_irqsave(&primary_crng.lock, flags);
for (i = 0; i < 8; i++) {
unsigned long   rv;
@@ -838,19 +853,26 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
spin_unlock_irqrestore(&primary_crng.lock, flags);
 }
 
+static inline void maybe_reseed_primary_crng(void)
+{
+   if (crng_init > 2 &&
+   time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+   crng_reseed(&primary_crng, &input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
wait_event_interruptible(crng_init_wait, crng_ready());
 }
 
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE])
 {
unsigned long v, flags;
-   struct crng_state *crng = &primary_crng;
 
if (crng_init > 2 &&
time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-   crng_reseed(crng, &input_pool);
+   crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
spin_lock_irqsave(&crng->lock, flags);
if (arch_get_random_long(&v))
crng->state[14] ^= v;
@@ -860,6 +882,17 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
spin_unlock_irqrestore(&crng->lock, flags);
 }
 
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+#ifndef CONFIG_NUMA
+   struct crng_state *crng = &primary_crng;
+#else
+   struct crng_state *crng = crng_node_pool[numa_node_id()];
+#endif
+
+   _extract_crng(crng, out);
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
ssize_t ret = 0, i;
@@ -1573,6 +1606,22 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+   int i;
+   int num_nodes = num_possible_nodes();
+   struct crng_state *crng;
+
+   crng_node_pool = kmalloc(num_nodes * sizeof(void *),
+GFP_KERNEL|__GFP_NOFAIL);
+
+   for (i=0; i < num_nodes; i++) {
+   crng = kmalloc(sizeof(struct crng_state),
+  GFP_KERNEL | __GFP_NOFAIL);
+   initialize_crng(crng);
+   crng_node_pool[i] = crng;
+
+   }
+#endif
init_std_data(&input_pool);
init_std_data(&blocking_pool);
_initialize_crng(&primary_crng);
-- 
2.5.0



[PATCH 2/4] random: make /dev/urandom scalable for silly userspace programs

2016-05-04 Thread Theodore Ts'o
On a 4 socket (NUMA) system where a large number of
application threads were all trying to read from /dev/urandom, this
can result in the system spending 80% of its time contending on the
global urandom spinlock.  The application should have used its own
PRNG, but let's try to keep it from running, lemming-like, straight
over the locking cliff.

Reported-by: Andi Kleen 
Signed-off-by: Theodore Ts'o 
---
 drivers/char/random.c | 67 +++
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 91d5c2a..b970db6 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -749,6 +749,17 @@ struct crng_state primary_crng = {
 };
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
+
 static void _initialize_crng(struct crng_state *crng)
 {
int i;
@@ -764,11 +775,13 @@ static void _initialize_crng(struct crng_state *crng)
crng->init_time = jiffies - CRNG_RESEED_INTERVAL;
 }
 
+#ifdef CONFIG_NUMA
 static void initialize_crng(struct crng_state *crng)
 {
_initialize_crng(crng);
spin_lock_init(&crng->lock);
 }
+#endif
 
 static int crng_fast_load(__u32 pool[4])
 {
@@ -823,19 +836,23 @@ out:
return ret;
 }
 
+static inline void maybe_reseed_primary_crng(void)
+{
+   if (crng_init > 2 &&
+   time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+   crng_reseed(&input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
wait_event_interruptible(crng_init_wait, crng_ready());
 }
 
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE])
 {
unsigned long v, flags;
-   struct crng_state *crng = &primary_crng;
 
-   if (crng_init > 2 &&
-   time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-   crng_reseed(&input_pool);
spin_lock_irqsave(&crng->lock, flags);
if (arch_get_random_long(&v))
crng->state[14] ^= v;
@@ -845,6 +862,30 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
spin_unlock_irqrestore(&crng->lock, flags);
 }
 
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+#ifndef CONFIG_NUMA
+   maybe_reseed_primary_crng();
+   _extract_crng(&primary_crng, out);
+#else
+   int node_id = numa_node_id();
+   struct crng_state *crng = crng_node_pool[node_id];
+
+   if (time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)) {
+   unsigned long flags;
+
+   maybe_reseed_primary_crng();
+   _extract_crng(&primary_crng, out);
+   spin_lock_irqsave(&crng->lock, flags);
+   memcpy(&crng->state[4], out, CHACHA20_KEY_SIZE);
+   crng->state[15] = node_id;
+   crng->init_time = jiffies;
+   spin_unlock_irqrestore(&crng->lock, flags);
+   }
+   _extract_crng(crng, out);
+#endif
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
ssize_t ret = 0, i;
@@ -1549,6 +1590,22 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+   int i;
+   int num_nodes = num_possible_nodes();
+   struct crng_state *crng;
+
+   crng_node_pool = kmalloc(num_nodes * sizeof(void *),
+GFP_KERNEL|__GFP_NOFAIL);
+
+   for (i=0; i < num_nodes; i++) {
+   crng = kmalloc(sizeof(struct crng_state),
+  GFP_KERNEL | __GFP_NOFAIL);
+   initialize_crng(crng);
+   crng_node_pool[i] = crng;
+
+   }
+#endif
init_std_data(&input_pool);
init_std_data(&blocking_pool);
_initialize_crng(&primary_crng);
-- 
2.5.0



Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs

2016-05-02 Thread Stephan Mueller
On Monday, 2 May 2016 at 09:48:57, Theodore Ts'o wrote:

Hi Theodore,

> On Mon, May 02, 2016 at 08:50:14AM -0400, Theodore Ts'o wrote:
> > > - entropy pool draining: when having a timer-based reseeding on a quiet
> > > system, the entropy pool can be drained during the expiry of the timer.
> > > So, I tried to handle that by increasing the timer by, say, 100 seconds
> > > for each new NUMA node. Note, even the baseline of 300 seconds with
> > > CRNG_RESEED_INTERVAL is low. When I experimented with that on a KVM
> > > test system and left it quiet, entropy pool draining was prevented at
> > > around 500 seconds.
> 
> One other thought.  If your KVM test system was completely quiet, then
> all of the entropy was coming from timer interrupts.  It is an open
> question whether an adversary could predict the bit of "entropy" you
> are generating with better than 50% probability if both the host and
> the guest system are quiescent.  And if they can, then maybe assuming
> one bit of entropy per interrupt might be a too optimistic.

It is not entirely quiet -- systemd likes to dump data on the disk once in a 
while. So, it is not a timer interrupt that I see.

Note, my test system runs a tickless kernel.
> 
> This is especially true on bare metal where very often, especially on
> smaller machines, where there is a single oscillator from which all of
> the clocks on the SOC or motherboard are derived.  There is a reason
> why I was being ultra conservative in sampling 64 interrupts into a
> 32-bit fast-mix pool before mixing it into the input pool, and only
> crediting the pool with a single bit of entropy each time I did this.

As I do not think that we see any timer interrupts, I think this argument may 
not apply.

Besides, I have not seen any timer interrupts lately (with or without a 
tickless kernel). Thus, which interrupt do you think is a timer interrupt?


Ciao
Stephan



Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs

2016-05-02 Thread Theodore Ts'o
On Mon, May 02, 2016 at 08:50:14AM -0400, Theodore Ts'o wrote:
> > - entropy pool draining: when having a timer-based reseeding on a quiet 
> > system, the entropy pool can be drained during the expiry of the timer. So, 
> > I 
> > tried to handle that by increasing the timer by, say, 100 seconds for each 
> > new 
> > NUMA node. Note, even the baseline of 300 seconds with CRNG_RESEED_INTERVAL 
> > is 
> > low. When I experimented with that on a KVM test system and left it quiet, 
> > entropy pool draining was prevented at around 500 seconds.

One other thought.  If your KVM test system was completely quiet, then
all of the entropy was coming from timer interrupts.  It is an open
question whether an adversary could predict the bit of "entropy" you
are generating with better than 50% probability if both the host and
the guest system are quiescent.  And if they can, then assuming
one bit of entropy per interrupt might be too optimistic.

This is especially true on bare metal where very often, especially on
smaller machines, there is a single oscillator from which all of
the clocks on the SOC or motherboard are derived.  There is a reason
why I was being ultra conservative in sampling 64 interrupts into a
32-bit fast-mix pool before mixing it into the input pool, and only
crediting the pool with a single bit of entropy each time I did this.

(It's also because of this conservatism that I was comfortable with
having add_disk_randomness giving some extra credit for interrupts
that are probably more likely to be hard-to-predict by an adversary.
Especially if the interrupts are coming from a device with spinning
rust platters.)
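
For illustration, a self-contained model of that accounting -- the 64:1
fold-before-credit ratio -- with simplified stand-in types (the real logic
is add_interrupt_randomness() in drivers/char/random.c, which mixes in
more than just a cycle counter):

#include <stdint.h>
#include <stdio.h>

struct fast_pool {
        uint32_t pool[4];       /* 4 x 32-bit fast-mix words */
        unsigned int count;
};

/* Returns how many bits to credit the input pool: 1 per 64 interrupts. */
static int fold_interrupt(struct fast_pool *f, uint32_t cycles)
{
        f->pool[f->count % 4] ^= cycles;        /* fold the timestamp in */
        if (++f->count < 64)
                return 0;
        f->count = 0;
        return 1;       /* mix f->pool into the input pool, credit 1 bit */
}

int main(void)
{
        struct fast_pool f = { {0}, 0 };
        int credited = 0;
        uint32_t t;

        for (t = 0; t < 256; t++)       /* 256 simulated interrupts */
                credited += fold_interrupt(&f, t * 2654435761u);
        printf("bits credited for 256 interrupts: %d\n", credited); /* 4 */
        return 0;
}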

- Ted



Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs

2016-05-02 Thread Theodore Ts'o
On Mon, May 02, 2016 at 09:00:22AM +0200, Stephan Mueller wrote:
> - reseed avalanche: I see that you added a time-based reseed code too (I am 
> glad about that one). What I fear is that there is a reseed avalanche when 
> the 
> various RNGs are seeded initially closely after each other (and thus the 
> reseed timer will expire at the same time). That would mean that they can be 
> reseeded all at the same time again when the timer based threshold expires 
> and 
> drain the input_pool such that if you have many nodes, the input pool will 
> not 
> have sufficient capacity (I am not speaking about entropy, but the potential 
> to store entropy) to satisfy all RNGs at the same time. Hence, we would then 
> have the potential to have entropy-starved RNGs.

The crng is a CRNG, not an entropy pool.  So we don't pretend to track
entropy on the CRNG's at all.  The current rule is that when you draw
from a crng, if it has been over 5 minutes, it will reseed from its
"parent" source.  In the case of the primary_crng, it will draw between
128 and 256 bits of entropy from the input pool.  In the per-NUMA node
case, they draw from the primary_crng.

So if there are many secondary (per-NUMA node) CRNG's that are seeded
within five minutes of each other, the input pool only gets drawn down
once to seed the primary_crng.  The per-NUMA node crng's feed from the
primary crng, and absent some catastrophic security breach where the
adversary can read kernel memory (at which point you're toast anyway)
the output of the primary_crng is never exposed directly outside of
the system.  So even if you have some crazy SGI system with 1024 NUMA
nodes, the primary_crng will only be generating at most 32k worth of
data to seed the secondary crng's before it gets reseeded --- and the
input pool is only going to be debited at most 128-256 bits of entropy
each time.
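
To make the arithmetic concrete, here is a self-contained model of that
two-level rule (the names and the "entropy" are stand-ins, not the patch's
code): even when all 1024 node CRNGs go stale at once, the input pool is
debited a single time, while the primary hands out 1024 * 32 = 32768
bytes of key material.

#include <stdio.h>
#include <string.h>

#define RESEED_INTERVAL 300     /* seconds, like CRNG_RESEED_INTERVAL */
#define NR_NODES 1024

struct crng {
        unsigned char key[32];  /* 256-bit ChaCha20 key */
        long init_time;         /* time of last reseed, in seconds */
};

static struct crng primary = { {0}, -RESEED_INTERVAL };
static struct crng nodes[NR_NODES];
static int input_pool_debits;   /* times the input pool was drawn down */

static void reseed_primary(long now)
{
        if (now - primary.init_time < RESEED_INTERVAL)
                return;
        input_pool_debits++;    /* one 128-256 bit draw from the input pool */
        memset(primary.key, 0xAA, sizeof(primary.key)); /* entropy stand-in */
        primary.init_time = now;
}

static void reseed_node(struct crng *node, long now)
{
        if (now - node->init_time < RESEED_INTERVAL)
                return;
        reseed_primary(now);    /* parent first, if it is stale */
        memcpy(node->key, primary.key, sizeof(node->key));
        node->init_time = now;
}

int main(void)
{
        long now = 1;
        int i;

        for (i = 0; i < NR_NODES; i++) {
                nodes[i].init_time = -RESEED_INTERVAL;
                reseed_node(&nodes[i], now);
        }
        printf("pool debits: %d, key bytes from primary: %d\n",
               input_pool_debits, NR_NODES * 32);       /* 1, 32768 */
        return 0;
}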

I thought about using the primary_crng to serve double duty as the
CRNG for NUMA node 0, but I decided that on a NUMA system you have
TB's and TB's of memory, and so blowing another 80 bytes or so on a
separate primary_crng state makes the security analysis much simpler,
and the code much simpler.  I also thought about only dynamically
initializing a node_id's CRNG if a spin_trylock on node 0's CRNG
failed, but again, decided against it in the interests of keeping
things simple and that NUMA people can afford to be profligate with
memory --- and they're blowing way more than 80 bytes per NUMA node
anyway.  Besides, manufacturers of crazy-expensive NUMA systems have to
feed their children, too.  :-)

> - entropy pool draining: when having a timer-based reseeding on a quiet 
> system, the entropy pool can be drained during the expiry of the timer. So, I 
> tried to handle that by increasing the timer by, say, 100 seconds for each 
> new 
> NUMA node. Note, even the baseline of 300 seconds with CRNG_RESEED_INTERVAL 
> is 
> low. When I experimented with that on a KVM test system and left it quiet, 
> entropy pool draining was prevented at around 500 seconds.

Sure, but if no one is actually *using* the system, who cares about
whether the input pool's entropy is getting drawn down?  The usual
reason why we might want to worry about reseeding frequently is if the
system is generating a huge amount of randomness for some reason.
This might be a good reason (you're running a IPSEC server and
generating lots of IKE session keys) or it might be for a really
stupid reason (dd if=/dev/urandom of=/dev/sdX bs=4k), but either way,
there will be lots of disk or networking interrupts to feed the input
pool.

I have thought about adding something a bit more sophisticated to
control the reseed logic (either tracking amount of data used, or
making the reseed interval adjustable, or dynamically adjustable), but
this was the simplest thing to do as a starting point.  Besides, for
the people who believe that it's realistic to write academic papers
about recovering from catastrophic security exposures where the bad
guy can read arbitrary kernel memory, and has somehow _not_ managed to
bootstrap that into a full privilege escalation attack and install a
backdoor into your BIOS so that you are permanently pwned, they might
be happy that we will be trying to recover within 5 minutes.  :-)

- Ted



Re: [PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs

2016-05-02 Thread Stephan Mueller
On Monday, 2 May 2016 at 02:26:52, Theodore Ts'o wrote:

Hi Theodore,

I have not digested the patch set yet, but I have the following questions 
about it.

> On a system with a 4 socket (NUMA) system where a large number of
> application processes were all trying to read from /dev/urandom, this
> can result in the system spending 80% of its time contending on the
> global urandom spinlock.  The application have used its own PRNG, but
> let's try to help it from running, lemming-like, straight over the
> locking cliff.

- initialization: In my DRBG based patch-set I tried to serialize the 
initialization of the per-NUMA node RNGs as follows: first the node 0 pool is 
seeded completely, followed by the other nodes in a completely serial fashion. 
If during that initialization time, say, node 3 wants some random number, but 
the RNG for node 3 is not yet fully seeded, it goes back to the "default" RNG 
of node 0. This way, it is ensured that we try to have properly seeded RNGs 
even during heavy load at boot time (see the first sketch below this list). 
Would that make sense here?

- reseed avalanche: I see that you added a time-based reseed code too (I am 
glad about that one). What I fear is that there is a reseed avalanche when the 
various RNGs are seeded initially closely after each other (and thus the 
reseed timer will expire at the same time). That would mean that they can be 
reseeded all at the same time again when the timer-based threshold expires, 
draining the input_pool such that, if you have many nodes, the input pool will not 
have sufficient capacity (I am not speaking about entropy, but the potential 
to store entropy) to satisfy all RNGs at the same time. Hence, we would then 
have the potential to have entropy-starved RNGs.

- entropy pool draining: when using timer-based reseeding on a quiet 
system, the entropy pool can be drained when the timer expires. So, I 
tried to handle that by increasing the timer by, say, 100 seconds for each new 
NUMA node. Note, even the baseline of 300 seconds with CRNG_RESEED_INTERVAL is 
low. When I experimented with that on a KVM test system and left it quiet, 
entropy pool draining was prevented at around 500 seconds (see the second 
sketch below).
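
A minimal sketch of the node-0 fallback from the initialization point
above (hypothetical names; this is from neither patch set):

#define MAX_NODES 4     /* stand-in for MAX_NUMNODES */

struct node_rng {
        int fully_seeded;
        /* ...generator state... */
};

static struct node_rng *node_rngs[MAX_NODES];

/* While a node's RNG is still being seeded serially, fall back to the
 * fully seeded "default" RNG of node 0. */
static struct node_rng *rng_for_node(int node)
{
        struct node_rng *rng = node_rngs[node];

        if (rng && rng->fully_seeded)
                return rng;
        return node_rngs[0];    /* node 0 is seeded first, completely */
}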
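
And the staggered timer from the entropy-pool-draining point, as code
(a hypothetical helper with stand-in constants):

#define HZ 100  /* stand-in for the kernel constant */
#define CRNG_RESEED_INTERVAL (300 * HZ)

/* Node 0 keeps the 300 second baseline; every further node waits 100
 * seconds longer, so the per-node timers cannot all expire -- and
 * drain the input pool -- at the same moment. */
static unsigned long node_reseed_interval(int node)
{
        return CRNG_RESEED_INTERVAL + (unsigned long)node * 100 * HZ;
}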

Ciao
Stephan



[PATCH 2/3] random: make /dev/urandom scalable for silly userspace programs

2016-05-02 Thread Theodore Ts'o
On a 4 socket (NUMA) system where a large number of application
processes were all trying to read from /dev/urandom, this can result
in the system spending 80% of its time contending on the global
urandom spinlock.  The application should have used its own PRNG, but
let's try to keep it from running, lemming-like, straight over the
locking cliff.

Reported-by: Andi Kleen 
Signed-off-by: Theodore Ts'o 
---
 drivers/char/random.c | 67 +++
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 95f4451..d5bb3b3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -746,6 +746,17 @@ struct crng_state primary_crng = {
 };
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
+
 static void _initialize_crng(struct crng_state *crng)
 {
int i;
@@ -761,11 +772,13 @@ static void _initialize_crng(struct crng_state *crng)
crng->init_time = jiffies - CRNG_RESEED_INTERVAL;
 }
 
+#ifdef CONFIG_NUMA
 static void initialize_crng(struct crng_state *crng)
 {
_initialize_crng(crng);
spin_lock_init(&crng->lock);
 }
+#endif
 
 static int crng_fast_load(__u32 pool[4])
 {
@@ -822,19 +835,23 @@ out:
return ret;
 }
 
+static inline void maybe_reseed_primary_crng(void)
+{
+   if (crng_init > 2 &&
+   time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+   crng_reseed(&input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
wait_event_interruptible(crng_init_wait, crng_ready());
 }
 
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+ __u8 out[CHACHA20_BLOCK_SIZE])
 {
unsigned long v, flags;
-   struct crng_state *crng = &primary_crng;
 
-   if (crng_init > 2 &&
-   time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-   crng_reseed(&input_pool);
spin_lock_irqsave(&crng->lock, flags);
if (arch_get_random_long(&v))
crng->state[14] ^= v;
@@ -844,6 +861,30 @@ static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
spin_unlock_irqrestore(&crng->lock, flags);
 }
 
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+#ifndef CONFIG_NUMA
+   maybe_reseed_primary_crng();
+   _extract_crng(&primary_crng, out);
+#else
+   int node_id = numa_node_id();
+   struct crng_state *crng = crng_node_pool[node_id];
+
+   if (time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)) {
+   unsigned long flags;
+
+   maybe_reseed_primary_crng();
+   _extract_crng(&primary_crng, out);
+   spin_lock_irqsave(&crng->lock, flags);
+   memcpy(&crng->state[4], out, CHACHA20_KEY_SIZE);
+   crng->state[15] = numa_node_id();
+   crng->init_time = jiffies;
+   spin_unlock_irqrestore(&crng->lock, flags);
+   }
+   _extract_crng(crng, out);
+#endif
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
ssize_t ret = 0, i;
@@ -1548,6 +1589,22 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+   int i;
+   int num_nodes = num_possible_nodes();
+   struct crng_state *crng;
+
+   crng_node_pool = kmalloc(num_nodes * sizeof(void *),
+GFP_KERNEL|__GFP_NOFAIL);
+
+   for (i=0; i < num_nodes; i++) {
+   crng = kmalloc(sizeof(struct crng_state),
+  GFP_KERNEL | __GFP_NOFAIL);
+   initialize_crng(crng);
+   crng_node_pool[i] = crng;
+
+   }
+#endif
init_std_data(&input_pool);
init_std_data(&blocking_pool);
_initialize_crng(&primary_crng);
-- 
2.5.0




[tip:perf/core] perf/x86/intel/cqm: Get rid of the silly for_each_cpu() lookups

2016-02-29 Thread tip-bot for Thomas Gleixner
Commit-ID:  827db839cd051cfde5a618fb4427e86dc91bc9aa
Gitweb: http://git.kernel.org/tip/827db839cd051cfde5a618fb4427e86dc91bc9aa
Author: Thomas Gleixner <t...@linutronix.de>
AuthorDate: Mon, 22 Feb 2016 22:19:20 +
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:35:21 +0100

perf/x86/intel/cqm: Get rid of the silly for_each_cpu() lookups

CQM is a strict per package facility. Use the proper cpumasks to lookup the
readers.

Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Andi Kleen <andi.kl...@intel.com>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Harish Chegondi <harish.chego...@intel.com>
Cc: Jacob Pan <jacob.jun@linux.intel.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Kan Liang <kan.li...@intel.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/2016021012.054916...@linutronix.de
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/events/intel/cqm.c | 32 +++-
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 1b064c4..93cb412 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -1244,15 +1244,12 @@ static struct pmu intel_cqm_pmu = {
 
 static inline void cqm_pick_event_reader(int cpu)
 {
-   int phys_id = topology_physical_package_id(cpu);
-   int i;
+   int reader;
 
-   for_each_cpu(i, &cqm_cpumask) {
-   if (phys_id == topology_physical_package_id(i))
-   return; /* already got reader for this socket */
-   }
-
-   cpumask_set_cpu(cpu, &cqm_cpumask);
+   /* First online cpu in package becomes the reader */
+   reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
+   if (reader >= nr_cpu_ids)
+   cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
 static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1270,24 +1267,17 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
 
 static void intel_cqm_cpu_exit(unsigned int cpu)
 {
-   int phys_id = topology_physical_package_id(cpu);
-   int i;
+   int target;
 
-   /*
-* Is @cpu a designated cqm reader?
-*/
+   /* Is @cpu the current cqm reader for this package ? */
if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
return;
 
-   for_each_online_cpu(i) {
-   if (i == cpu)
-   continue;
+   /* Find another online reader in this package */
+   target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
 
-   if (phys_id == topology_physical_package_id(i)) {
-   cpumask_set_cpu(i, &cqm_cpumask);
-   break;
-   }
-   }
+   if (target < nr_cpu_ids)
+   cpumask_set_cpu(target, &cqm_cpumask);
 }
 
 static int intel_cqm_cpu_notifier(struct notifier_block *nb,



[patch V3 17/28] x86/perf/cqm: Get rid of the silly for_each_cpu lookups

2016-02-22 Thread Thomas Gleixner
CQM is a strict per package facility. Use the proper cpumasks to lookup the
readers.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c |   34 ++---
 1 file changed, 12 insertions(+), 22 deletions(-)

Index: b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
===
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1244,15 +1244,12 @@ static struct pmu intel_cqm_pmu = {
 
 static inline void cqm_pick_event_reader(int cpu)
 {
-   int phys_id = topology_physical_package_id(cpu);
-   int i;
+   int reader;
 
-   for_each_cpu(i, &cqm_cpumask) {
-   if (phys_id == topology_physical_package_id(i))
-   return; /* already got reader for this socket */
-   }
-
-   cpumask_set_cpu(cpu, &cqm_cpumask);
+   /* First online cpu in package becomes the reader */
+   reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
+   if (reader >= nr_cpu_ids)
+   cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
 static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1270,24 +1267,17 @@ static void intel_cqm_cpu_starting(unsig
 
 static void intel_cqm_cpu_exit(unsigned int cpu)
 {
-   int phys_id = topology_physical_package_id(cpu);
-   int i;
+   int target;
 
-   /*
-* Is @cpu a designated cqm reader?
-*/
+   /* Is @cpu the current cqm reader for this package ? */
if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
return;
 
-   for_each_online_cpu(i) {
-   if (i == cpu)
-   continue;
-
-   if (phys_id == topology_physical_package_id(i)) {
-   cpumask_set_cpu(i, &cqm_cpumask);
-   break;
-   }
-   }
+   /* Find another online reader in this package */
+   target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+
+   if (target < nr_cpu_ids)
+   cpumask_set_cpu(target, &cqm_cpumask);
 }
 
 static int intel_cqm_cpu_notifier(struct notifier_block *nb,





[patch V2 17/28] x86/perf/cqm: Get rid of the silly for_each_cpu lookups

2016-02-22 Thread Thomas Gleixner
CQM is a strict per package facility. Use the proper cpumasks to lookup the
readers.

Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/cpu/perf_event_intel_cqm.c |   34 ++---
 1 file changed, 12 insertions(+), 22 deletions(-)

Index: b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
===
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1244,15 +1244,12 @@ static struct pmu intel_cqm_pmu = {
 
 static inline void cqm_pick_event_reader(int cpu)
 {
-   int phys_id = topology_physical_package_id(cpu);
-   int i;
+   int reader;
 
-   for_each_cpu(i, &cqm_cpumask) {
-   if (phys_id == topology_physical_package_id(i))
-   return; /* already got reader for this socket */
-   }
-
-   cpumask_set_cpu(cpu, &cqm_cpumask);
+   /* First online cpu in package becomes the reader */
+   reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
+   if (reader >= nr_cpu_ids)
+   cpumask_set_cpu(cpu, &cqm_cpumask);
 }
 
 static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1270,24 +1267,17 @@ static void intel_cqm_cpu_starting(unsig
 
 static void intel_cqm_cpu_exit(unsigned int cpu)
 {
-   int phys_id = topology_physical_package_id(cpu);
-   int i;
+   int target;
 
-   /*
-* Is @cpu a designated cqm reader?
-*/
+   /* Is @cpu the current cqm reader for this package ? */
if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
return;
 
-   for_each_online_cpu(i) {
-   if (i == cpu)
-   continue;
-
-   if (phys_id == topology_physical_package_id(i)) {
-   cpumask_set_cpu(i, &cqm_cpumask);
-   break;
-   }
-   }
+   /* Find another online reader in this package */
+   target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+
+   if (target < nr_cpu_ids)
+   cpumask_set_cpu(target, &cqm_cpumask);
 }
 
 static int intel_cqm_cpu_notifier(struct notifier_block *nb,





Re: [PATCH] x86/perf/intel/cqm: Get rid of the silly for_each_cpu lookups

2016-02-18 Thread Vikas Shivappa



On Thu, 18 Feb 2016, Thomas Gleixner wrote:


On Wed, 17 Feb 2016, Thomas Gleixner wrote:

On Wed, 17 Feb 2016, Vikas Shivappa wrote:

Please stop top posting, finally!


But we have an extra static - static to avoid having it in the stack..


It's not about the cpu mask on the stack. The reason was that with cpumask off
stack cpumask_and_mask() requires an allocation, which then can't be used in
the starting/dying callbacks.

Darn, you are right to remind me.

Now, the proper solution for this stuff is to provide a library function as we
need that for several drivers. No point to duplicate that functionality. I'll
cook something up and repost the uncore/cqm set tomorrow.


Second thoughts on that.

cpumask_any_but() is fine as is, if we feed it topology_core_cpumask(cpu). The
worst case search is two bitmap_find_next() if the first search returned cpu.

Now cpumask_any_and() does a search as well, but the number of
bitmap_find_next() invocations is limited to the number of sockets if we feed
the cqm_cpu_mask as first argument. So for 4 or 8 sockets that's still a
reasonable limit. If the people with insane large machines care, we can
revisit that topic. It's still faster than for_each_online_cpu() :)
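
To make that cost argument concrete, here is a small userspace model of
the two lookups (the bitmap helpers are stand-ins; the real cpumask
versions operate on struct cpumask and scan word-wise, not per-bit):

#include <stdio.h>

#define NCPUS 16

/* stand-in for bitmap_find_next(): first set bit >= from, or NCPUS */
static int find_next(unsigned int mask, int from)
{
        int i;

        for (i = from; i < NCPUS; i++)
                if (mask & (1u << i))
                        return i;
        return NCPUS;
}

/* cpumask_any_but() model: any set bit except @cpu, at most two scans */
static int any_but(unsigned int mask, int cpu)
{
        int i = find_next(mask, 0);

        if (i == cpu)
                i = find_next(mask, cpu + 1);
        return i;
}

/* cpumask_any_and() model: first bit set in both masks */
static int any_and(unsigned int a, unsigned int b)
{
        return find_next(a & b, 0);
}

int main(void)
{
        unsigned int pkg0 = 0x00ff;     /* CPUs 0-7 share a package */
        unsigned int readers = 0x0001;  /* CPU 0 is the current reader */

        /* CPU 0 goes offline: pick another reader in its package. */
        printf("new reader: %d\n", any_but(pkg0, 0));            /* -> 1 */
        /* Does this package already have a reader? */
        printf("existing reader: %d\n", any_and(readers, pkg0)); /* -> 0 */
        return 0;
}

The second scan in any_but() is the worst case mentioned above; the
any_and() variant stops at the first reader bit that falls in the package.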


Agree. If we don't care about the large number of sockets, this would still be 
far better than scanning each cpu. There are some branches we could 
avoid if we were more aggressive and removed 'all' loops (the 2nd search is always 
a success if the 1st one fails in cpumask_any_but) 
by using cpumask_and, but they should not be 
of much importance in this case.


Will send rapl patch separately.

Thanks,
Vikas



Thanks,

tglx


