Re: Race condition in Kernel

2021-04-01 Thread Ming Lei
On Thu, Apr 01, 2021 at 04:27:37PM +, Gulam Mohamed wrote:
> Hi Ming,
> 
>   Thanks for taking a look into this. Can you please see my inline 
> comments in below mail?
> 
> Regards,
> Gulam Mohamed.
> 
> -Original Message-
> From: Ming Lei  
> Sent: Thursday, March 25, 2021 7:16 AM
> To: Gulam Mohamed 
> Cc: h...@infradead.org; linux-kernel@vger.kernel.org; 
> linux-bl...@vger.kernel.org; Junxiao Bi ; Martin 
> Petersen ; ax...@kernel.dk
> Subject: Re: Race condition in Kernel
> 
> On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
> > Hi All,
> > 
> > We are facing a stale link (of the device) issue during the iscsi-logout 
> > process if we use parted command just before the iscsi logout. Here are the 
> > details:
> >  
> > As part of iscsi logout, the partitions and the disk will be removed. The 
> > parted command, used to list the partitions, will open the disk in RW mode 
> > which results in systemd-udevd re-reading the partitions. This will trigger 
> > the rescan partitions which will also delete and re-add the partitions. So, 
> > both iscsi logout processing and the parted (through systemd-udevd) will be 
> > involved in add/delete of partitions. In our case, the following sequence 
> > of operations happened (the iscsi device is /dev/sdb with partition sdb1):
> > 
> > 1. sdb1 was removed by PARTED
> > 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was 
> > already removed by PARTED
> > 3. sdb1 was added by parted
> 
> After kworker is started for logout, I guess all IOs are supposed to be 
> failed at that time, so just wondering why 'sdb1' is still added by 
> parted(systemd-udev)? 
> ioctl(BLKRRPART) needs to read partition table for adding back partitions, if 
> IOs are failed by iscsi logout, I guess the issue can be avoided too?
> 
> [GULAM]: Yes, the ioctl(BLKRRPART) reads the partition table for adding back 
> the partitions. I kept a printk in the code just after the partition table is 
> read. Noticed that the partition table was read before the iscsi-logout 
> kworker started the logout processing.

OK, I think I understand your issue now: what you want is to not allow
partitions to be added after step 1. So can you remove the disk right at the
beginning of step 2, if that is possible? Then step 1 isn't needed any more.

For your issue, my patch to 'not drop partitions if the partition table
isn't changed' can't fix it completely, since a new real partition may
still be added by parted during the sequence.


Thanks,
Ming
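
For reference, a minimal user-space sketch of the BLKRRPART rescan being
discussed above (an illustration, not code from the thread; the device path is
only an example):

/*
 * Opening the whole disk and issuing BLKRRPART asks the kernel to re-read the
 * partition table, which deletes and re-adds the partition devices; this is
 * the rescan path the thread attributes to parted/systemd-udevd.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
	int fd = open("/dev/sdb", O_RDWR);	/* example device, as in the report */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKRRPART) < 0)		/* triggers the partition rescan */
		perror("ioctl(BLKRRPART)");
	close(fd);
	return 0;
}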



RE: Race condition in Kernel

2021-04-01 Thread Gulam Mohamed
Hi Ming,

  Thanks for taking a look into this. Can you please see my inline comments 
in below mail?

Regards,
Gulam Mohamed.

-Original Message-
From: Ming Lei  
Sent: Thursday, March 25, 2021 7:16 AM
To: Gulam Mohamed 
Cc: h...@infradead.org; linux-kernel@vger.kernel.org; 
linux-bl...@vger.kernel.org; Junxiao Bi ; Martin 
Petersen ; ax...@kernel.dk
Subject: Re: Race condition in Kernel

On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
> Hi All,
> 
> We are facing a stale link (of the device) issue during the iscsi-logout 
> process if we use parted command just before the iscsi logout. Here are the 
> details:
>
> As part of iscsi logout, the partitions and the disk will be removed. The 
> parted command, used to list the partitions, will open the disk in RW mode 
> which results in systemd-udevd re-reading the partitions. This will trigger 
> the rescan partitions which will also delete and re-add the partitions. So, 
> both iscsi logout processing and the parted (through systemd-udevd) will be 
> involved in add/delete of partitions. In our case, the following sequence of 
> operations happened (the iscsi device is /dev/sdb with partition sdb1):
>   
>   1. sdb1 was removed by PARTED
>   2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was 
> already removed by PARTED
>   3. sdb1 was added by parted

After kworker is started for logout, I guess all IOs are supposed to be failed 
at that time, so just wondering why 'sdb1' is still added by 
parted(systemd-udev)? 
ioctl(BLKRRPART) needs to read partition table for adding back partitions, if 
IOs are failed by iscsi logout, I guess the issue can be avoided too?

[GULAM]: Yes, the ioctl(BLKRRPART) reads the partition table for adding back 
the partitions. I kept a printk in the code just after the partition table is 
read. Noticed that the partition table was read before the iscsi-logout kworker 
started the logout processing.
   Following are the logs for your reference:

 Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling 
sysfs_delete_link() for dev: sdb3 command: systemd-udevd   <== sdb3 
Removed by PARTED 
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: rescan_partitions() Read 
Complete to the disk: sdb command: systemd-udevd   <== Reading sdb completed, 
before iscsi-logout worker started
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling sysfs_delete_link() 
for dev: 3:0:0:0 command: kworker/u16:3
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: sdb: sdb1 sdb2 sdb3
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: device: 'sdb3': device_add 
command: systemd-udevd   <== sdb3 Added by PARTED 
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling sysfs_delete_link() 
for dev: 8:16 command: kworker/u16:3
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling sysfs_delete_link() 
for dev: sdb command: kworker/u16:3 <== sdb Removed by iscsi 
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: scsi 3:0:0:0: alua: Detached

--
Ming



Re: Race condition in Kernel

2021-03-24 Thread Ming Lei
On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
> Hi All,
> 
> We are facing a stale link (of the device) issue during the iscsi-logout 
> process if we use parted command just before the iscsi logout. Here are the 
> details:
>
> As part of iscsi logout, the partitions and the disk will be removed. The 
> parted command, used to list the partitions, will open the disk in RW mode 
> which results in systemd-udevd re-reading the partitions. This will trigger 
> the rescan partitions which will also delete and re-add the partitions. So, 
> both iscsi logout processing and the parted (through systemd-udevd) will be 
> involved in add/delete of partitions. In our case, the following sequence of 
> operations happened (the iscsi device is /dev/sdb with partition sdb1):
>   
>   1. sdb1 was removed by PARTED
>   2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was 
> already removed by PARTED
>   3. sdb1 was added by parted

After kworker is started for logout, I guess all IOs are supposed to be failed
at that time, so just wondering why 'sdb1' is still added by 
parted(systemd-udev)? 
ioctl(BLKRRPART) needs to read partition table for adding back partitions, if 
IOs
are failed by iscsi logout, I guess the issue can be avoided too?

-- 
Ming



Re: Race condition in Kernel

2021-03-24 Thread Junxiao Bi

On 3/24/21 5:37 PM, Ming Lei wrote:


On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:

Hi All,

We are facing a stale link (of the device) issue during the iscsi-logout 
process if we use parted command just before the iscsi logout. Here are the 
details:

As part of iscsi logout, the partitions and the disk will be removed. The 
parted command, used to list the partitions, will open the disk in RW mode 
which results in systemd-udevd re-reading the partitions. This will trigger the 
rescan partitions which will also delete and re-add the partitions. So, both 
iscsi logout processing and the parted (through systemd-udevd) will be involved 
in add/delete of partitions. In our case, the following sequence of operations 
happened (the iscsi device is /dev/sdb with partition sdb1):

1. sdb1 was removed by PARTED
2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was 
already removed by PARTED
3. sdb1 was added by parted
4. sdb was NOW removed as part of iscsi logout (the last part of the 
device removal after removing the partitions)

Since the symlink /sys/class/block/sdb1 points to 
/sys/class/devices/platform/hostx/sessionx/targetx:x:x:x/x:x:x:x/block/sdb/sdb1 
and since sdb is already removed, the symlink /sys/class/block/sdb1 will be 
orphan and stale. So, this stale link is a result of the race condition in 
kernel between the systemd-udevd and iscsi-logout processing as described 
above. We are able to reproduce this even with latest upstream kernel.

We have come across a patch from Ming Lei which was created for "avoid to drop & 
re-add partitions if partitions aren't changed":
https://lore.kernel.org/linux-block/20210216084430.ga23...@lst.de/T/

BTW,  there is a newer version of this patchset:

https://lore.kernel.org/linux-block/20210224081825.ga1...@lst.de/#r



This patch could resolve our problem of stale link but it just seems to be a 
work-around and not the actual fix for the race. We were looking for help to 
fix this race in kernel. Do you have any idea how to fix this race condition?


IMO, that isn't a work-around; the kernel shouldn't drop partitions if the
partition table isn't changed. But Christoph thought the current approach
has been taken since the beginning of the kernel, and he suggested fixing
systemd-udev instead.


This is a real kernel bug. Whatever BLKRRPART does, it should not cause 
this stale sysfs link issue. Once the issue happens, there is no way to 
remove the stale link except by rebooting. The situation is even worse when 
a new disk is logged in later: since it reuses the disk number of the old 
one, creating the symbolic link fails because the stale link is still there.


Thanks,

Junxiao.





Thanks,
Ming



Re: Race condition in Kernel

2021-03-24 Thread Ming Lei
On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
> Hi All,
> 
> We are facing a stale link (of the device) issue during the iscsi-logout 
> process if we use parted command just before the iscsi logout. Here are the 
> details:
>
> As part of iscsi logout, the partitions and the disk will be removed. The 
> parted command, used to list the partitions, will open the disk in RW mode 
> which results in systemd-udevd re-reading the partitions. This will trigger 
> the rescan partitions which will also delete and re-add the partitions. So, 
> both iscsi logout processing and the parted (through systemd-udevd) will be 
> involved in add/delete of partitions. In our case, the following sequence of 
> operations happened (the iscsi device is /dev/sdb with partition sdb1):
>   
>   1. sdb1 was removed by PARTED
>   2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was 
> already removed by PARTED
>   3. sdb1 was added by parted
>   4. sdb was NOW removed as part of iscsi logout (the last part of the 
> device removal after removing the partitions)
> 
> Since the symlink /sys/class/block/sdb1 points to 
> /sys/class/devices/platform/hostx/sessionx/targetx:x:x:x/x:x:x:x/block/sdb/sdb1
>  and since sdb is already removed, the symlink /sys/class/block/sdb1 will be 
> orphan and stale. So, this stale link is a result of the race condition in 
> kernel between the systemd-udevd and iscsi-logout processing as described 
> above. We are able to reproduce this even with latest upstream kernel.
>   
> We have come across a patch from Ming Lei which was created for "avoid to 
> drop & re-add partitions if partitions aren't changed":
> https://lore.kernel.org/linux-block/20210216084430.ga23...@lst.de/T/

BTW,  there is a newer version of this patchset:

https://lore.kernel.org/linux-block/20210224081825.ga1...@lst.de/#r

>   
> This patch could resolve our problem of stale link but it just seems to be a 
> work-around and not the actual fix for the race. We were looking for help to 
> fix this race in kernel. Do you have any idea how to fix this race condition?
>

IMO, that isn't a work-around; the kernel shouldn't drop partitions if the
partition table isn't changed. But Christoph thought the current approach
has been taken since the beginning of the kernel, and he suggested fixing
systemd-udev instead.



Thanks, 
Ming



Race condition in Kernel

2021-03-24 Thread Gulam Mohamed
Hi All,

We are facing a stale link (of the device) issue during the iscsi-logout 
process if we use parted command just before the iscsi logout. Here are the 
details:
 
As part of iscsi logout, the partitions and the disk will be removed. The 
parted command, used to list the partitions, will open the disk in RW mode 
which results in systemd-udevd re-reading the partitions. This will trigger the 
rescan partitions which will also delete and re-add the partitions. So, both 
iscsi logout processing and the parted (through systemd-udevd) will be involved 
in add/delete of partitions. In our case, the following sequence of operations 
happened (the iscsi device is /dev/sdb with partition sdb1):

1. sdb1 was removed by PARTED
2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was 
already removed by PARTED
3. sdb1 was added by parted
4. sdb was NOW removed as part of iscsi logout (the last part of the 
device removal after removing the partitions)

Since the symlink /sys/class/block/sdb1 points to 
/sys/class/devices/platform/hostx/sessionx/targetx:x:x:x/x:x:x:x/block/sdb/sdb1 
and since sdb is already removed, the symlink /sys/class/block/sdb1 will be 
orphan and stale. So, this stale link is a result of the race condition in 
kernel between the systemd-udevd and iscsi-logout processing as described 
above. We are able to reproduce this even with latest upstream kernel.

We have come across a patch from Ming Lei which was created for "avoid to drop 
& re-add partitions if partitions aren't changed":
https://lore.kernel.org/linux-block/20210216084430.ga23...@lst.de/T/

This patch could resolve our problem of stale link but it just seems to be a 
work-around and not the actual fix for the race. We were looking for help to 
fix this race in kernel. Do you have any idea how to fix this race condition?

Following is the script we are using to reproduce the issue:

#!/bin/bash

dir=/sys/class/block
iter_count=0
while [ $iter_count -lt 1000 ]; do
    iscsiadm -m node -T iqn.2016-01.com.example:target1 -p 100.100.242.162:3260 -l

    poll_loop=0
    while [ ! -e /sys/class/block/sdb1 ]; do
        ls -i -l /sys/class/block/sd* > /dev/null
        let poll_loop+=1
        if [ $poll_loop -gt 100 ]; then
            ls -i -l /sys/class/block/sd* --color
            exit 1
        fi
    done

    ls -i -l /sys/class/block/sd* --color
    mount /dev/sdb1 /mnt
    dd of=/mnt/sdb1 if=/dev/sdb2 bs=1M count=100 &
    pid_01=$!
    wait $pid_01
    umount -l /mnt &
    pid_02=$!
    wait $pid_02

    parted /dev/sdb -s print
    iscsiadm -m node -T iqn.2016-01.com.example:target1 -p 100.100.242.162:3260 -u &
    pid_1=$!

    iscsiadm -m node -T iqn.2016-01.com.example:target2 -p 100.100.242.162:3260 -l &
    pid_2=$!

    sleep 1
    ls -i -l /sys/class/block/sd* --color

    for i in `ls $dir`; do
        if [ ! -e $dir/$i ]; then
            echo "broken link: $dir/$i"
            exit 1
        fi
    done

    parted /dev/sdb -s print
    iscsiadm -m node -T iqn.2016-01.com.example:target2 -p 100.100.242.162:3260 -u
    iter_count=`expr $iter_count + 1`
done
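
For reference, a C equivalent of the script's broken-link check (an
illustration, not part of the original report):

/*
 * Walk /sys/class/block and stat() each entry; stat() follows the symlink,
 * so it fails once the link target under the removed disk is gone, which is
 * exactly the stale-link condition described above.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	DIR *d = opendir("/sys/class/block");
	struct dirent *e;
	char path[512];
	struct stat st;
	int ret = 0;

	if (!d)
		return 1;
	while ((e = readdir(d)) != NULL) {
		if (e->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "/sys/class/block/%s", e->d_name);
		if (stat(path, &st) < 0) {	/* dangling symlink */
			printf("broken link: %s\n", path);
			ret = 1;
		}
	}
	closedir(d);
	return ret;
}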


Regards,
Gulam Mohamed.


[PATCH AUTOSEL 5.6 223/606] net/tls: fix race condition causing kernel panic

2020-06-08 Thread Sasha Levin
From: Vinay Kumar Yadav 

[ Upstream commit 0cada33241d9de205522e3858b18e506ca5cce2c ]

tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.
// tls_sw_recvmsg()
if (atomic_read(&ctx->decrypt_pending))
crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
pending = atomic_dec_return(&ctx->decrypt_pending);

if (!pending && READ_ONCE(ctx->async_notify))
complete(&ctx->async_wait.completion);

Consider the scenario tls_decrypt_done() is about to run complete()

if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
then tls_decrypt_done() runs complete(). This sequence of execution
results in wrong completion. Consequently, for next decrypt request,
it will not wait for completion, eventually on connection close, crypto
resources freed, there is no way to handle pending decrypt response.

This race condition can be avoided by having atomic_read() mutually
exclusive with atomic_dec_return()/complete(). Introduced a spin lock to
ensure the mutual exclusion.

Addressed similar problem in tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix new race scenario.
- Removed barrier which is not needed now.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for 
performance")
Signed-off-by: Vinay Kumar Yadav 
Reviewed-by: Jakub Kicinski 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 include/net/tls.h |  4 
 net/tls/tls_sw.c  | 33 +++--
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index bf9eb4823933..18cd4f418464 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -135,6 +135,8 @@ struct tls_sw_context_tx {
struct tls_rec *open_rec;
struct list_head tx_list;
atomic_t encrypt_pending;
+   /* protect crypto_wait with encrypt_pending */
+   spinlock_t encrypt_compl_lock;
int async_notify;
u8 async_capable:1;
 
@@ -155,6 +157,8 @@ struct tls_sw_context_rx {
u8 async_capable:1;
u8 decrypted:1;
atomic_t decrypt_pending;
+   /* protect crypto_wait with decrypt_pending*/
+   spinlock_t decrypt_compl_lock;
bool async_notify;
 };
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e23f94a5549b..ffa3cbc5449d 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -206,10 +206,12 @@ static void tls_decrypt_done(struct crypto_async_request 
*req, int err)
 
kfree(aead_req);
 
+   spin_lock_bh(&ctx->decrypt_compl_lock);
pending = atomic_dec_return(&ctx->decrypt_pending);
 
-   if (!pending && READ_ONCE(ctx->async_notify))
+   if (!pending && ctx->async_notify)
complete(&ctx->async_wait.completion);
+   spin_unlock_bh(&ctx->decrypt_compl_lock);
 }
 
 static int tls_do_decryption(struct sock *sk,
@@ -467,10 +469,12 @@ static void tls_encrypt_done(struct crypto_async_request 
*req, int err)
ready = true;
}
 
+   spin_lock_bh(&ctx->encrypt_compl_lock);
pending = atomic_dec_return(&ctx->encrypt_pending);
 
-   if (!pending && READ_ONCE(ctx->async_notify))
+   if (!pending && ctx->async_notify)
complete(&ctx->async_wait.completion);
+   spin_unlock_bh(&ctx->encrypt_compl_lock);
 
if (!ready)
return;
@@ -926,6 +930,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
int num_zc = 0;
int orig_size;
int ret = 0;
+   int pending;
 
if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
return -EOPNOTSUPP;
@@ -1092,13 +1097,19 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
goto send_end;
} else if (num_zc) {
/* Wait for pending encryptions to get completed */
-   smp_store_mb(ctx->async_notify, true);
+   spin_lock_bh(&ctx->encrypt_compl_lock);
+   ctx->async_notify = true;
 
-   if (atomic_read(&ctx->encrypt_pending))
+   pending = atomic_read(&ctx->encrypt_pending);
+   spin_unlock_bh(&ctx->encrypt_compl_lock);
+   if (pending)
crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
reinit_completion(&ctx->async_wait.completion);
 
+   /* There can be no concurrent accesses, since we have no
+* pending encrypt operations
+*/
WRITE_ONCE(ctx->async_notify, false);
 
if (ctx->async_wait.err) {
@@ -1729,6 +1740,7 @@ int tls_sw_recvmsg(struct sock *sk,
bool is_kvec = iov_iter_is_kvec(&msg->msg_iter);
bool is_peek = flags & MSG_PEEK;
int num_async = 0;
+   int pending;
 
flags |= nonblock;
 
@@ -1891,8 +1903,11 

[PATCH 5.4 015/142] net/tls: fix race condition causing kernel panic

2020-06-01 Thread Greg Kroah-Hartman
From: Vinay Kumar Yadav 

[ Upstream commit 0cada33241d9de205522e3858b18e506ca5cce2c ]

tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.
// tls_sw_recvmsg()
if (atomic_read(&ctx->decrypt_pending))
crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
pending = atomic_dec_return(&ctx->decrypt_pending);

if (!pending && READ_ONCE(ctx->async_notify))
complete(&ctx->async_wait.completion);

Consider the scenario tls_decrypt_done() is about to run complete()

if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
then tls_decrypt_done() runs complete(). This sequence of execution
results in wrong completion. Consequently, for next decrypt request,
it will not wait for completion, eventually on connection close, crypto
resources freed, there is no way to handle pending decrypt response.

This race condition can be avoided by having atomic_read() mutually
exclusive with atomic_dec_return()/complete(). Introduced a spin lock to
ensure the mutual exclusion.

Addressed similar problem in tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix new race scenario.
- Removed barrier which is not needed now.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for 
performance")
Signed-off-by: Vinay Kumar Yadav 
Reviewed-by: Jakub Kicinski 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 include/net/tls.h |4 
 net/tls/tls_sw.c  |   33 +++--
 2 files changed, 31 insertions(+), 6 deletions(-)

--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -157,6 +157,8 @@ struct tls_sw_context_tx {
struct tls_rec *open_rec;
struct list_head tx_list;
atomic_t encrypt_pending;
+   /* protect crypto_wait with encrypt_pending */
+   spinlock_t encrypt_compl_lock;
int async_notify;
int async_capable;
 
@@ -177,6 +179,8 @@ struct tls_sw_context_rx {
int async_capable;
bool decrypted;
atomic_t decrypt_pending;
+   /* protect crypto_wait with decrypt_pending*/
+   spinlock_t decrypt_compl_lock;
bool async_notify;
 };
 
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -203,10 +203,12 @@ static void tls_decrypt_done(struct cryp
 
kfree(aead_req);
 
+   spin_lock_bh(&ctx->decrypt_compl_lock);
pending = atomic_dec_return(&ctx->decrypt_pending);
 
-   if (!pending && READ_ONCE(ctx->async_notify))
+   if (!pending && ctx->async_notify)
complete(&ctx->async_wait.completion);
+   spin_unlock_bh(&ctx->decrypt_compl_lock);
 }
 
 static int tls_do_decryption(struct sock *sk,
@@ -464,10 +466,12 @@ static void tls_encrypt_done(struct cryp
ready = true;
}
 
+   spin_lock_bh(&ctx->encrypt_compl_lock);
pending = atomic_dec_return(&ctx->encrypt_pending);
 
-   if (!pending && READ_ONCE(ctx->async_notify))
+   if (!pending && ctx->async_notify)
complete(&ctx->async_wait.completion);
+   spin_unlock_bh(&ctx->encrypt_compl_lock);
 
if (!ready)
return;
@@ -923,6 +927,7 @@ int tls_sw_sendmsg(struct sock *sk, stru
int num_zc = 0;
int orig_size;
int ret = 0;
+   int pending;
 
if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
return -EOPNOTSUPP;
@@ -1089,13 +1094,19 @@ trim_sgl:
goto send_end;
} else if (num_zc) {
/* Wait for pending encryptions to get completed */
-   smp_store_mb(ctx->async_notify, true);
+   spin_lock_bh(&ctx->encrypt_compl_lock);
+   ctx->async_notify = true;
 
-   if (atomic_read(&ctx->encrypt_pending))
+   pending = atomic_read(&ctx->encrypt_pending);
+   spin_unlock_bh(&ctx->encrypt_compl_lock);
+   if (pending)
crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
reinit_completion(&ctx->async_wait.completion);
 
+   /* There can be no concurrent accesses, since we have no
+* pending encrypt operations
+*/
WRITE_ONCE(ctx->async_notify, false);
 
if (ctx->async_wait.err) {
@@ -1724,6 +1735,7 @@ int tls_sw_recvmsg(struct sock *sk,
bool is_kvec = iov_iter_is_kvec(&msg->msg_iter);
bool is_peek = flags & MSG_PEEK;
int num_async = 0;
+   int pending;
 
flags |= nonblock;
 
@@ -1886,8 +1898,11 @@ pick_next_record:
 recv_end:
if (num_async) {
/* Wait for all previously submitted records to be decrypted */
-   smp_store_mb(ctx->async_notify, true);
-   if (atomic_read(&ctx->decrypt_pending)) {
+   spin_lock_bh(&ctx->decrypt_compl_lock);
+   ctx->async_notify = 

[PATCH 5.6 017/177] net/tls: fix race condition causing kernel panic

2020-06-01 Thread Greg Kroah-Hartman
From: Vinay Kumar Yadav 

[ Upstream commit 0cada33241d9de205522e3858b18e506ca5cce2c ]

tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.
// tls_sw_recvmsg()
if (atomic_read(&ctx->decrypt_pending))
crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
pending = atomic_dec_return(&ctx->decrypt_pending);

if (!pending && READ_ONCE(ctx->async_notify))
complete(&ctx->async_wait.completion);

Consider the scenario tls_decrypt_done() is about to run complete()

if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
then tls_decrypt_done() runs complete(). This sequence of execution
results in wrong completion. Consequently, for next decrypt request,
it will not wait for completion, eventually on connection close, crypto
resources freed, there is no way to handle pending decrypt response.

This race condition can be avoided by having atomic_read() mutually
exclusive with atomic_dec_return()/complete(). Introduced a spin lock to
ensure the mutual exclusion.

Addressed similar problem in tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix new race scenario.
- Removed barrier which is not needed now.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for 
performance")
Signed-off-by: Vinay Kumar Yadav 
Reviewed-by: Jakub Kicinski 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
---
 include/net/tls.h |4 
 net/tls/tls_sw.c  |   33 +++--
 2 files changed, 31 insertions(+), 6 deletions(-)

--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -135,6 +135,8 @@ struct tls_sw_context_tx {
struct tls_rec *open_rec;
struct list_head tx_list;
atomic_t encrypt_pending;
+   /* protect crypto_wait with encrypt_pending */
+   spinlock_t encrypt_compl_lock;
int async_notify;
u8 async_capable:1;
 
@@ -155,6 +157,8 @@ struct tls_sw_context_rx {
u8 async_capable:1;
u8 decrypted:1;
atomic_t decrypt_pending;
+   /* protect crypto_wait with decrypt_pending*/
+   spinlock_t decrypt_compl_lock;
bool async_notify;
 };
 
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -206,10 +206,12 @@ static void tls_decrypt_done(struct cryp
 
kfree(aead_req);
 
+   spin_lock_bh(&ctx->decrypt_compl_lock);
pending = atomic_dec_return(&ctx->decrypt_pending);
 
-   if (!pending && READ_ONCE(ctx->async_notify))
+   if (!pending && ctx->async_notify)
complete(&ctx->async_wait.completion);
+   spin_unlock_bh(&ctx->decrypt_compl_lock);
 }
 
 static int tls_do_decryption(struct sock *sk,
@@ -467,10 +469,12 @@ static void tls_encrypt_done(struct cryp
ready = true;
}
 
+   spin_lock_bh(&ctx->encrypt_compl_lock);
pending = atomic_dec_return(&ctx->encrypt_pending);
 
-   if (!pending && READ_ONCE(ctx->async_notify))
+   if (!pending && ctx->async_notify)
complete(&ctx->async_wait.completion);
+   spin_unlock_bh(&ctx->encrypt_compl_lock);
 
if (!ready)
return;
@@ -926,6 +930,7 @@ int tls_sw_sendmsg(struct sock *sk, stru
int num_zc = 0;
int orig_size;
int ret = 0;
+   int pending;
 
if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
return -EOPNOTSUPP;
@@ -1092,13 +1097,19 @@ trim_sgl:
goto send_end;
} else if (num_zc) {
/* Wait for pending encryptions to get completed */
-   smp_store_mb(ctx->async_notify, true);
+   spin_lock_bh(&ctx->encrypt_compl_lock);
+   ctx->async_notify = true;
 
-   if (atomic_read(&ctx->encrypt_pending))
+   pending = atomic_read(&ctx->encrypt_pending);
+   spin_unlock_bh(&ctx->encrypt_compl_lock);
+   if (pending)
crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
reinit_completion(&ctx->async_wait.completion);
 
+   /* There can be no concurrent accesses, since we have no
+* pending encrypt operations
+*/
WRITE_ONCE(ctx->async_notify, false);
 
if (ctx->async_wait.err) {
@@ -1729,6 +1740,7 @@ int tls_sw_recvmsg(struct sock *sk,
bool is_kvec = iov_iter_is_kvec(&msg->msg_iter);
bool is_peek = flags & MSG_PEEK;
int num_async = 0;
+   int pending;
 
flags |= nonblock;
 
@@ -1891,8 +1903,11 @@ pick_next_record:
 recv_end:
if (num_async) {
/* Wait for all previously submitted records to be decrypted */
-   smp_store_mb(ctx->async_notify, true);
-   if (atomic_read(&ctx->decrypt_pending)) {
+   spin_lock_bh(&ctx->decrypt_compl_lock);
+   ctx->async_notify 

Re: race condition in kernel/padata.c

2017-03-23 Thread Steffen Klassert
On Thu, Mar 23, 2017 at 12:03:43AM +0100, Jason A. Donenfeld wrote:
> Hey Steffen,
> 
> WireGuard makes really heavy use of padata, feeding it units of work
> from different cores in different contexts all at the same time. For
> the most part, everything has been fine, but one particular user has
> consistently run into mysterious bugs. He's using a rather old dual
> core CPU, which have a tendency to bring out race conditions
> sometimes. After struggling with getting a good backtrace, we finally
> managed to extract this from list debugging:
> 
> [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
> __list_add+0xae/0x130
> [87487.301868] list_add corruption. prev->next should be next
> (b17abfc043d0), but was 8dba70872c80. (prev=8dba70872b00).
> [87487.339011]  [] dump_stack+0x68/0xa3
> [87487.342198]  [] ? console_unlock+0x281/0x6d0
> [87487.345364]  [] __warn+0xff/0x140
> [87487.348513]  [] warn_slowpath_fmt+0x4a/0x50
> [87487.351659]  [] __list_add+0xae/0x130
> [87487.354772]  [] ? _raw_spin_lock+0x64/0x70
> [87487.357915]  [] padata_reorder+0x1e6/0x420
> [87487.361084]  [] padata_do_serial+0xa5/0x120
> 
> padata_reorder calls list_add_tail with the list to which its adding
> locked, which seems correct:
> 
> spin_lock(&squeue->serial.lock);
> list_add_tail(&padata->list, &squeue->serial.list);
> spin_unlock(&squeue->serial.lock);
> 
> This therefore leaves only place where such inconsistency could occur:
> if padata->list is added at the same time on two different threads.
> This pdata pointer comes from the function call to
> padata_get_next(pd), which has in it the following block:
> 
> next_queue = per_cpu_ptr(pd->pqueue, cpu);
> padata = NULL;
> reorder = &next_queue->reorder;
> if (!list_empty(&reorder->list)) {
>     padata = list_entry(reorder->list.next,
>                         struct padata_priv, list);
>     spin_lock(&reorder->lock);
>     list_del_init(&padata->list);
>     atomic_dec(&pd->reorder_objects);
>     spin_unlock(&reorder->lock);
> 
>     pd->processed++;
> 
>     goto out;
> }
> out:
> return padata;
> 
> I strongly suspect that the problem here is that two threads can race
> on reorder list. Even though the deletion is locked, call to
> list_entry is not locked, which means it's feasible that two threads
> pick up the same padata object and subsequently call list_add_tail on
> them at the same time. The fix would thus be to hoist that lock
> outside of that block.

Yes, looks like we should lock the whole list handling block here.

Thanks!
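
A sketch of the locking change being agreed on here, using the names from the
quoted padata_get_next() snippet (an illustration only, not the actual
committed fix):

/*
 * Take the per-queue reorder lock before the list_empty()/list_entry() pair,
 * so two CPUs cannot pick up the same padata object and later call
 * list_add_tail() on it twice.
 */
reorder = &next_queue->reorder;

spin_lock(&reorder->lock);
if (!list_empty(&reorder->list)) {
	padata = list_entry(reorder->list.next,
			    struct padata_priv, list);
	list_del_init(&padata->list);
	atomic_dec(&pd->reorder_objects);
	spin_unlock(&reorder->lock);

	pd->processed++;

	goto out;
}
spin_unlock(&reorder->lock);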



race condition in kernel/padata.c

2017-03-22 Thread Jason A. Donenfeld
Hey Steffen,

WireGuard makes really heavy use of padata, feeding it units of work
from different cores in different contexts all at the same time. For
the most part, everything has been fine, but one particular user has
consistently run into mysterious bugs. He's using a rather old dual
core CPU, which have a tendency to bring out race conditions
sometimes. After struggling with getting a good backtrace, we finally
managed to extract this from list debugging:

[87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
__list_add+0xae/0x130
[87487.301868] list_add corruption. prev->next should be next
(b17abfc043d0), but was 8dba70872c80. (prev=8dba70872b00).
[87487.339011]  [] dump_stack+0x68/0xa3
[87487.342198]  [] ? console_unlock+0x281/0x6d0
[87487.345364]  [] __warn+0xff/0x140
[87487.348513]  [] warn_slowpath_fmt+0x4a/0x50
[87487.351659]  [] __list_add+0xae/0x130
[87487.354772]  [] ? _raw_spin_lock+0x64/0x70
[87487.357915]  [] padata_reorder+0x1e6/0x420
[87487.361084]  [] padata_do_serial+0xa5/0x120

padata_reorder calls list_add_tail with the list to which its adding
locked, which seems correct:

spin_lock(&squeue->serial.lock);
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);

This therefore leaves only place where such inconsistency could occur:
if padata->list is added at the same time on two different threads.
This pdata pointer comes from the function call to
padata_get_next(pd), which has in it the following block:

next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;
if (!list_empty(&reorder->list)) {
    padata = list_entry(reorder->list.next,
                        struct padata_priv, list);
    spin_lock(&reorder->lock);
    list_del_init(&padata->list);
    atomic_dec(&pd->reorder_objects);
    spin_unlock(&reorder->lock);

    pd->processed++;

    goto out;
}
out:
return padata;

I strongly suspect that the problem here is that two threads can race
on reorder list. Even though the deletion is locked, call to
list_entry is not locked, which means it's feasible that two threads
pick up the same padata object and subsequently call list_add_tail on
them at the same time. The fix would thus be to hoist that lock
outside of that block.

This theory is unconfirmed at the moment, but I'll be playing with
some patches to see if this fixes the issue and then I'll get back to
you. In the meantime, if you have any insight before I potentially
waste some time, I'm all ears.

Thanks,
Jason



Re: [RFC]: Possible race condition in kernel futex code

2016-05-15 Thread Ben Hutchings
On Fri, 2015-10-09 at 10:06 +0100, Thomas Gleixner wrote:
[...]
> @stable: Can you please pick up ff47ab4ff3cd plus 
> 
> df90ca969035d x86, sparse: Do not force removal of __user when calling 
> copy_to/from_user_nocheck()
> 
> for stable kernels <= 3.12?
[...]

I've finally queued these up for 3.2, thanks.

Ben.

-- 
Ben Hutchings
For every action, there is an equal and opposite criticism. - Harrison

signature.asc
Description: This is a digitally signed message part



Re: [RFC]: Possible race condition in kernel futex code

2015-10-17 Thread Greg KH
On Fri, Oct 09, 2015 at 10:06:41AM +0100, Thomas Gleixner wrote:
> On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
> > We did some tests with different compilers, kernel versions and kernel
> > configs, with the following results:
> > 
> > Linux 3.12.48, x86_64_defconfig, GCC 4.6.1 :
> > copy_user_generic_unrolled being used, so race condition possible
> > Linux 3.12.48, x86_64_defconfig, GCC 4.9.1 :
> > copy_user_generic_unrolled being used, so race condition possible
> > Linux 4.2.3, x86_64_defconfig, GCC 4.6.1 : 32 bit read being used, no
> > race condition
> > Linux 4.2.3, x86_64_defconfig, GCC 4.9.1 : 32 bit read being used, no
> > race condition
> > 
> > 
> > Our idea to fix this problem is use an explicit 32 bit read in
> > get_futex_value_locked() instead of using the generic function
> > copy_from_user_inatomic() and hoping the compiler uses an atomic
> > access and the right access size.
> 
> You cannot use an explicit 32bit read. We need an access which handles
> the fault gracefully.
> 
> In current mainline this is done proper:
> 
> ret = __copy_from_user_inatomic(dst, src, size = sizeof(u32))
> 
> __copy_from_user_nocheck(dst, src, size)
> 
>  if (!__builtin_constant_p(size))
>  return copy_user_generic(dst, (__force void *)src, size);
>   
>  size is constant so we end up in the switch case
> 
>  switch(size) {
>  
>  case 4:
>   __get_user_asm(*(u32 *)dst, (u32 __user *)src,
>  ret, "l", "k", "=r", 4);
>   return ret;
> 
> 
> In 3.12 this is different:
> 
>__copy_from_user_inatomic()
>   copy_user_generic()
>   copy_user_generic_unrolled()
> 
> So this is only an issue for kernel versions < 3.13. It was fixed with
> 
> ff47ab4ff3cd: Add 1/2/4/8 byte optimization to 64bit 
> __copy_{from,to}_user_inatomic
> 
> but nobody noticed that the race you described can happen, so it was
> never backported to the stable kernels.
> 
> @stable: Can you please pick up ff47ab4ff3cd plus 
> 
> df90ca969035d x86, sparse: Do not force removal of __user when calling 
> copy_to/from_user_nocheck()
> 
> for stable kernels <= 3.12?
> 
> If that's too much of churn, then I can come up with an explicit fix
> for this. Let me know.

Now applied to 3.10-stable, thanks.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC]: Possible race condition in kernel futex code

2015-10-09 Thread Peter Zijlstra
On Fri, Oct 09, 2015 at 11:25:09AM +0100, Thomas Gleixner wrote:
> Hans,
> 
> On Fri, 9 Oct 2015, Hans Zuidam wrote:
> > On 9 okt. 2015, at 11:06, Thomas Gleixner  wrote:
> > > You cannot use an explicit 32bit read. We need an access which
> > > handles the fault gracefully.
> >
> > The reason for the explicit read suggestion is to avoid the
> > __builtin_constant_p() in __copy_from_user_nocheck().  The GCC manual
> > says that there may be situations where it returns 0 even though the
> > argument is a constant.
> 
> That's insane at best.

Right, but I bet that is for cases where constant propagation completely
fails, and this is a trivial case, I have no problem relying on it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC]: Possible race condition in kernel futex code

2015-10-09 Thread Thomas Gleixner
Hans,

On Fri, 9 Oct 2015, Hans Zuidam wrote:
> On 9 okt. 2015, at 11:06, Thomas Gleixner  wrote:
> > You cannot use an explicit 32bit read. We need an access which
> > handles the fault gracefully.
>
> The reason for the explicit read suggestion is to avoid the
> __builtin_constant_p() in __copy_from_user_nocheck().  The GCC manual
> says that there may be situations where it returns 0 even though the
> argument is a constant.

That's insane at best.

> Although none of the compiler/kernel combinations we have tried
> showed this happening, we think it is probably better to be safe
> than sorry.

So we would need something like:

   futex_copy_from_user()

which can be mapped to __copy_from_user_inatomic() first. Then go
through all architectures and the asm-generic stuff and provide the
specific variants which are guaranteed to use a 32bit access.

I really prefer that we don't have to do that and the compiler people
get their act together and fix that __builtin_constant_p() thingy.

Thanks,

tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
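
A sketch of the futex_copy_from_user() wrapper described above (an assumption
about its shape only, not code from any kernel tree):

/*
 * First step as suggested: simply map the helper to
 * __copy_from_user_inatomic(), so that architectures can later provide a
 * variant that is guaranteed to use a single 32-bit access.
 */
static int futex_copy_from_user(u32 *dst, u32 __user *from)
{
	if (__copy_from_user_inatomic(dst, from, sizeof(u32)))
		return -EFAULT;
	return 0;
}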


Re: [RFC]: Possible race condition in kernel futex code

2015-10-09 Thread Hans Zuidam
Hi Thomas,

On 9 okt. 2015, at 11:06, Thomas Gleixner  wrote:
On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
>> We did some tests with different compilers, kernel versions and kernel
>> configs, with the following results:

> You cannot use an explicit 32bit read. We need an access which handles the 
> fault gracefully.

The reason for the explicit read suggestion is to avoid the 
__builtin_constant_p() in __copy_from_user_nocheck().  The GCC manual says that 
there may be situations where it returns 0 even though the argument is a 
constant.  Although none of the compiler/kernel combinations we have tried 
showed this happening, we think it is probably better to be safe than sorry.

With kind regards,
Hans Zuidam

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC]: Possible race condition in kernel futex code

2015-10-09 Thread Thomas Gleixner
On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
> We did some tests with different compilers, kernel versions and kernel
> configs, with the following results:
> 
> Linux 3.12.48, x86_64_defconfig, GCC 4.6.1 :
> copy_user_generic_unrolled being used, so race condition possible
> Linux 3.12.48, x86_64_defconfig, GCC 4.9.1 :
> copy_user_generic_unrolled being used, so race condition possible
> Linux 4.2.3, x86_64_defconfig, GCC 4.6.1 : 32 bit read being used, no
> race condition
> Linux 4.2.3, x86_64_defconfig, GCC 4.9.1 : 32 bit read being used, no
> race condition
> 
> 
> Our idea to fix this problem is use an explicit 32 bit read in
> get_futex_value_locked() instead of using the generic function
> copy_from_user_inatomic() and hoping the compiler uses an atomic
> access and the right access size.

You cannot use an explicit 32bit read. We need an access which handles
the fault gracefully.

In current mainline this is done proper:

ret = __copy_from_user_inatomic(dst, src, size = sizeof(u32))

__copy_from_user_nocheck(dst, src, size)

   if (!__builtin_constant_p(size))
 return copy_user_generic(dst, (__force void *)src, size);

   size is constant so we end up in the switch case

   switch(size) {
   
   case 4:
__get_user_asm(*(u32 *)dst, (u32 __user *)src,
   ret, "l", "k", "=r", 4);
return ret;


In 3.12 this is different:

   __copy_from_user_inatomic()
copy_user_generic()
copy_user_generic_unrolled()

So this is only an issue for kernel versions < 3.13. It was fixed with

ff47ab4ff3cd: Add 1/2/4/8 byte optimization to 64bit 
__copy_{from,to}_user_inatomic

but nobody noticed that the race you described can happen, so it was
never backported to the stable kernels.

@stable: Can you please pick up ff47ab4ff3cd plus 

df90ca969035d x86, sparse: Do not force removal of __user when calling 
copy_to/from_user_nocheck()

for stable kernels <= 3.12?

If that's too much of churn, then I can come up with an explicit fix
for this. Let me know.

Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[RFC]: Possible race condition in kernel futex code

2015-10-05 Thread Jaccon Bastiaansen
Hello all,

For a while now, we see our application crashing occasionally (due to
an assert) in the middle of a pthread_mutex_lock() function call.

To be more precise, our application is threaded (50 threads) where all
threads share a custom memory allocator. This memory allocator is
guarded with a recursive, priority inheritance pthread mutex. The
application is running on an 8-core, 64-bit x86 system using the
3.12.42-rt58 kernel and glibc 2.13.

The assert occurs when the pthread_mutex_lock() function checks the
return value of the futex system call (see line 307 of the
pthread_mutex_lock() source code on
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
In case of the assert, the kernel has returned -EDEADLK which results
in a SIGABRT.

The strange thing is that the kernel returns -EDEADLK, because it
finds out that the calling thread is the owner of the mutex. But glibc
has already checked this before executing the futex system call (see
line 194 of 
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
The case where a recursive mutex is locked again is fully handled in
glibc. No futex system call is required.

The kernel futex code finds out that the calling thread is the owner
of the mutex by reading the value of the __data.__lock field of the
pthread_mutex_t. This field is read by the get_futex_value_locked()
function (see kernel/futex.c). The __data.__lock field in a
pthread_mutex_t is written by an atomic 32bit compare exchange when
locking the mutex (see line 288 of
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
Reading the value by the kernel should be an atomic action.
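
As a user-space illustration of the point above (not glibc source; the names
and values are placeholders):

/*
 * The lock word is taken on the fast path with a single 32-bit atomic
 * compare-and-exchange, so any reader, including the kernel, must use one
 * aligned 32-bit load to observe a consistent value.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t lock_word = 0;		/* stands in for __data.__lock */
	uint32_t expected = 0;
	uint32_t tid = 42;		/* placeholder owner id */

	if (__atomic_compare_exchange_n(&lock_word, &expected, tid, 0,
					__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		printf("fast-path lock taken, owner=%u\n", lock_word);
	return 0;
}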

Much to our surprise, the disassembly of the get_futex_value_locked()
function is:

81098890 <get_futex_value_locked>:
81098890:55   push   %rbp
81098891:48 89 e5 mov%rsp,%rbp
81098894:48 83 ec 10  sub$0x10,%rsp
81098898:48 89 5d f0  mov%rbx,-0x10(%rbp)
8109889c:48 89 fb mov%rdi,%rbx
8109889f:4c 89 65 f8  mov%r12,-0x8(%rbp)
810988a3:49 89 f4 mov%rsi,%r12
810988a6:e8 35 e9 06 00   callq  811071e0

810988ab:48 89 df mov%rbx,%rdi
810988ae:4c 89 e6 mov%r12,%rsi
810988b1:ba 04 00 00 00   mov$0x4,%edx
810988b6:e8 85 f0 2b 00   callq  81357940

810988bb:89 c3mov%eax,%ebx
810988bd:e8 fe e8 06 00   callq  811071c0

810988c2:83 fb 01 cmp$0x1,%ebx
810988c5:4c 8b 65 f8  mov-0x8(%rbp),%r12
810988c9:19 c0sbb%eax,%eax
810988cb:48 8b 5d f0  mov-0x10(%rbp),%rbx
810988cf:c9   leaveq
810988d0:f7 d0not%eax
810988d2:83 e0 f2 and$0xfff2,%eax
810988d5:c3   retq
810988d6:66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1)
810988dd:00 00 00

So the copy_user_generic_unrolled() function is called to read the
32bit __data.__lock field. The disassembly of the
copy_user_generic_unrolled()  function:


81357940 <copy_user_generic_unrolled>:
81357940:83 fa 08 cmp$0x8,%edx
81357943:0f 82 8c 00 00 00jb 813579d5

81357949:89 f9mov%edi,%ecx
8135794b:83 e1 07 and$0x7,%ecx
8135794e:74 15je 81357965

81357950:83 e9 08 sub$0x8,%ecx
81357953:f7 d9neg%ecx
81357955:29 casub%ecx,%edx
81357957:8a 06mov(%rsi),%al
81357959:88 07mov%al,(%rdi)
8135795b:48 ff c6 inc%rsi
8135795e:48 ff c7 inc%rdi
81357961:ff c9dec%ecx
81357963:75 f2jne81357957

81357965:89 d1mov%edx,%ecx
81357967:83 e2 3f and$0x3f,%edx
8135796a:c1 e9 06 shr$0x6,%ecx
8135796d:74 4aje 813579b9

8135796f:4c 8b 06 mov(%rsi),%r8
81357972:4c 8b 4e 08  mov0x8(%rsi),%r9
81357976:4c 8b 56 10  mov0x10(%rsi),%r10
8135797a:4c 8b 5e 18  mov0x18(%rsi),%r11
8135797e:4c 89 07 mov%r8,(%rdi)
81357981:4c 89 4f 08  
