Re: Race condition in Kernel
On Thu, Apr 01, 2021 at 04:27:37PM +, Gulam Mohamed wrote:
> Hi Ming,
>
> Thanks for taking a look into this. Can you please see my inline
> comments in the mail below?
>
> Regards,
> Gulam Mohamed.
>
> -----Original Message-----
> From: Ming Lei
> Sent: Thursday, March 25, 2021 7:16 AM
> To: Gulam Mohamed
> Cc: h...@infradead.org; linux-kernel@vger.kernel.org; linux-bl...@vger.kernel.org;
> Junxiao Bi; Martin Petersen; ax...@kernel.dk
> Subject: Re: Race condition in Kernel
>
> On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
> > Hi All,
> >
> > We are facing a stale link (of the device) issue during the iscsi-logout
> > process if we use the parted command just before the iscsi logout. Here
> > are the details:
> >
> > As part of iscsi logout, the partitions and the disk will be removed. The
> > parted command, used to list the partitions, will open the disk in RW
> > mode, which results in systemd-udevd re-reading the partitions. This will
> > trigger a partition rescan, which will also delete and re-add the
> > partitions. So both the iscsi logout processing and parted (through
> > systemd-udevd) will be involved in adding/deleting partitions. In our
> > case, the following sequence of operations happened (the iscsi device is
> > /dev/sdb with partition sdb1):
> >
> > 1. sdb1 was removed by PARTED
> > 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was
> >    already removed by PARTED
> > 3. sdb1 was added by parted
>
> After the kworker is started for logout, I guess all IOs are supposed to be
> failed at that time, so I am just wondering why 'sdb1' is still added by
> parted (systemd-udev)? ioctl(BLKRRPART) needs to read the partition table
> for adding back partitions; if IOs are failed by iscsi logout, I guess the
> issue can be avoided too?
>
> [GULAM]: Yes, the ioctl(BLKRRPART) reads the partition table for adding
> back the partitions. I kept a printk in the code just after the partition
> table is read, and noticed that the partition table was read before the
> iscsi-logout kworker started the logout processing.

OK, I guess I understand your issue now: what you want is to disallow adding
partitions from step 1 onwards. So can you remove the disk right at the
beginning of step 2, if that is possible? Then step 1 isn't needed any more.

For your issue, my patch of 'not drop partitions if partition table isn't
changed' can't fix it completely, since a new real partition may still come
from parted during the series.

Thanks,
Ming
RE: Race condition in Kernel
Hi Ming,

Thanks for taking a look into this. Can you please see my inline comments in
the mail below?

Regards,
Gulam Mohamed.

-----Original Message-----
From: Ming Lei
Sent: Thursday, March 25, 2021 7:16 AM
To: Gulam Mohamed
Cc: h...@infradead.org; linux-kernel@vger.kernel.org; linux-bl...@vger.kernel.org;
Junxiao Bi; Martin Petersen; ax...@kernel.dk
Subject: Re: Race condition in Kernel

On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
> Hi All,
>
> We are facing a stale link (of the device) issue during the iscsi-logout
> process if we use the parted command just before the iscsi logout. Here are
> the details:
>
> As part of iscsi logout, the partitions and the disk will be removed. The
> parted command, used to list the partitions, will open the disk in RW mode,
> which results in systemd-udevd re-reading the partitions. This will trigger
> a partition rescan, which will also delete and re-add the partitions. So
> both the iscsi logout processing and parted (through systemd-udevd) will be
> involved in adding/deleting partitions. In our case, the following sequence
> of operations happened (the iscsi device is /dev/sdb with partition sdb1):
>
> 1. sdb1 was removed by PARTED
> 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was
>    already removed by PARTED
> 3. sdb1 was added by parted

After the kworker is started for logout, I guess all IOs are supposed to be
failed at that time, so I am just wondering why 'sdb1' is still added by
parted (systemd-udev)? ioctl(BLKRRPART) needs to read the partition table for
adding back partitions; if IOs are failed by iscsi logout, I guess the issue
can be avoided too?

[GULAM]: Yes, the ioctl(BLKRRPART) reads the partition table for adding back
the partitions. I kept a printk in the code just after the partition table is
read, and noticed that the partition table was read before the iscsi-logout
kworker started the logout processing. Following are the logs for your
reference:

Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling sysfs_delete_link() for dev: sdb3 command: systemd-udevd    <== sdb3 removed by PARTED
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: rescan_partitions() Read Complete to the disk: sdb command: systemd-udevd    <== reading sdb completed, before the iscsi-logout worker started
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling sysfs_delete_link() for dev: 3:0:0:0 command: kworker/u16:3
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: sdb: sdb1 sdb2 sdb3
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: device: 'sdb3': device_add command: systemd-udevd    <== sdb3 added by PARTED
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling sysfs_delete_link() for dev: 8:16 command: kworker/u16:3
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: ORA:: Calling sysfs_delete_link() for dev: sdb command: kworker/u16:3    <== sdb removed by iscsi
Apr  1 09:23:27 gms-iscsi-initiator-2 kernel: scsi 3:0:0:0: alua: Detached

--
Ming
Re: Race condition in Kernel
On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote: > Hi All, > > We are facing a stale link (of the device) issue during the iscsi-logout > process if we use parted command just before the iscsi logout. Here are the > details: > > As part of iscsi logout, the partitions and the disk will be removed. The > parted command, used to list the partitions, will open the disk in RW mode > which results in systemd-udevd re-reading the partitions. This will trigger > the rescan partitions which will also delete and re-add the partitions. So, > both iscsi logout processing and the parted (through systemd-udevd) will be > involved in add/delete of partitions. In our case, the following sequence of > operations happened (the iscsi device is /dev/sdb with partition sdb1): > > 1. sdb1 was removed by PARTED > 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was > already removed by PARTED > 3. sdb1 was added by parted After kworker is started for logout, I guess all IOs are supposed to be failed at that time, so just wondering why 'sdb1' is still added by parted(systemd-udev)? ioctl(BLKRRPART) needs to read partition table for adding back partitions, if IOs are failed by iscsi logout, I guess the issue can be avoided too? -- Ming
Re: Race condition in Kernel
On 3/24/21 5:37 PM, Ming Lei wrote:
> On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
>> Hi All,
>>
>> We are facing a stale link (of the device) issue during the iscsi-logout
>> process if we use the parted command just before the iscsi logout. Here
>> are the details:
>>
>> As part of iscsi logout, the partitions and the disk will be removed. The
>> parted command, used to list the partitions, will open the disk in RW
>> mode, which results in systemd-udevd re-reading the partitions. This will
>> trigger a partition rescan, which will also delete and re-add the
>> partitions. So both the iscsi logout processing and parted (through
>> systemd-udevd) will be involved in adding/deleting partitions. In our
>> case, the following sequence of operations happened (the iscsi device is
>> /dev/sdb with partition sdb1):
>>
>> 1. sdb1 was removed by PARTED
>> 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was
>>    already removed by PARTED
>> 3. sdb1 was added by parted
>> 4. sdb was NOW removed as part of iscsi logout (the last part of the
>>    device removal, after removing the partitions)
>>
>> Since the symlink /sys/class/block/sdb1 points to
>> /sys/class/devices/platform/hostx/sessionx/targetx:x:x:x/x:x:x:x/block/sdb/sdb1
>> and since sdb is already removed, the symlink /sys/class/block/sdb1 will
>> be orphaned and stale. So this stale link is the result of a race
>> condition in the kernel between systemd-udevd and the iscsi-logout
>> processing, as described above. We are able to reproduce this even with
>> the latest upstream kernel.
>>
>> We have come across a patch from Ming Lei which was created to "avoid to
>> drop & re-add partitions if partitions aren't changed":
>> https://lore.kernel.org/linux-block/20210216084430.ga23...@lst.de/T/
>
> BTW, there is a newer version of this patchset:
> https://lore.kernel.org/linux-block/20210224081825.ga1...@lst.de/#r
>
>> This patch could resolve our problem of the stale link, but it just seems
>> to be a work-around and not the actual fix for the race. We were looking
>> for help to fix this race in the kernel. Do you have any idea how to fix
>> this race condition?
>
> IMO, that isn't a work-around: the kernel shouldn't drop partitions if the
> partition table isn't changed. But Christoph thought the current approach
> has been taken since the beginning of the kernel, and he suggested fixing
> systemd-udev.

This is a real kernel bug. Whatever BLKRRPART does, it should not cause this
stale sysfs link issue. After this issue happens, there is no way to remove
the stale link except to reboot. The situation is even worse when logging a
new disk back in: since it will reuse the disk number of the old one, it will
fail when it creates the symbolic link, because the stale link is still there.

Thanks,
Junxiao.

> Thanks,
> Ming
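The symptom Junxiao describes is easy to check for from userspace: a symlink for which lstat() succeeds while stat() fails with ENOENT is a dangling link. A minimal sketch (the path is an illustrative assumption, not from the report):

#include <stdio.h>
#include <errno.h>
#include <sys/stat.h>

static int is_dangling(const char *path)
{
	struct stat sb;

	if (lstat(path, &sb) < 0)
		return 0;	/* no link there at all */
	/* link exists, but its target cannot be resolved */
	return stat(path, &sb) < 0 && errno == ENOENT;
}

int main(void)
{
	const char *link = "/sys/class/block/sdb1";	/* assumed device */

	if (is_dangling(link))
		printf("stale link: %s\n", link);
	return 0;
}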
Re: Race condition in Kernel
On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote: > Hi All, > > We are facing a stale link (of the device) issue during the iscsi-logout > process if we use parted command just before the iscsi logout. Here are the > details: > > As part of iscsi logout, the partitions and the disk will be removed. The > parted command, used to list the partitions, will open the disk in RW mode > which results in systemd-udevd re-reading the partitions. This will trigger > the rescan partitions which will also delete and re-add the partitions. So, > both iscsi logout processing and the parted (through systemd-udevd) will be > involved in add/delete of partitions. In our case, the following sequence of > operations happened (the iscsi device is /dev/sdb with partition sdb1): > > 1. sdb1 was removed by PARTED > 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was > already removed by PARTED > 3. sdb1 was added by parted > 4. sdb was NOW removed as part of iscsi logout (the last part of the > device removal after remoing the partitions) > > Since the symlink /sys/class/block/sdb1 points to > /sys/class/devices/platform/hostx/sessionx/targetx:x:x:x/x:x:x:x/block/sdb/sdb1 > and since sdb is already removed, the symlink /sys/class/block/sdb1 will be > orphan and stale. So, this stale link is a result of the race condition in > kernel between the systemd-udevd and iscsi-logout processing as described > above. We are able to reproduce this even with latest upstream kernel. > > We have come across a patch from Ming Lei which was created for "avoid to > drop & re-add partitions if partitions aren't changed": > https://lore.kernel.org/linux-block/20210216084430.ga23...@lst.de/T/ BTW, there is a newer version of this patchset: https://lore.kernel.org/linux-block/20210224081825.ga1...@lst.de/#r > > This patch could resolve our problem of stale link but it just seems to be a > work-around and not the actual fix for the race. We were looking for help to > fix this race in kernel. Do you have any idea how to fix this race condition? > IMO, that isn't a work-around, kernel shouldn't drop partitions if partition table isn't changed. But Christoph thought the current approach is taken since beginning of kernel, and he suggested to fix systemd-udev. Thanks, Ming
Race condition in Kernel
Hi All,

We are facing a stale link (of the device) issue during the iscsi-logout
process if we use the parted command just before the iscsi logout. Here are
the details:

As part of iscsi logout, the partitions and the disk will be removed. The
parted command, used to list the partitions, will open the disk in RW mode,
which results in systemd-udevd re-reading the partitions. This will trigger a
partition rescan, which will also delete and re-add the partitions. So both
the iscsi logout processing and parted (through systemd-udevd) will be
involved in adding/deleting partitions. In our case, the following sequence of
operations happened (the iscsi device is /dev/sdb with partition sdb1):

1. sdb1 was removed by PARTED
2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was already
   removed by PARTED
3. sdb1 was added by parted
4. sdb was NOW removed as part of iscsi logout (the last part of the device
   removal, after removing the partitions)

Since the symlink /sys/class/block/sdb1 points to
/sys/class/devices/platform/hostx/sessionx/targetx:x:x:x/x:x:x:x/block/sdb/sdb1
and since sdb is already removed, the symlink /sys/class/block/sdb1 will be
orphaned and stale. So this stale link is the result of a race condition in
the kernel between systemd-udevd and the iscsi-logout processing, as described
above. We are able to reproduce this even with the latest upstream kernel.

We have come across a patch from Ming Lei which was created to "avoid to drop
& re-add partitions if partitions aren't changed":
https://lore.kernel.org/linux-block/20210216084430.ga23...@lst.de/T/

This patch could resolve our problem of the stale link, but it just seems to
be a work-around and not the actual fix for the race. We were looking for help
to fix this race in the kernel. Do you have any idea how to fix this race
condition?

Following is the script we are using to reproduce the issue:

#!/bin/bash

dir=/sys/class/block
iter_count=0
while [ $iter_count -lt 1000 ]; do
    iscsiadm -m node -T iqn.2016-01.com.example:target1 -p 100.100.242.162:3260 -l
    poll_loop=0
    while [ ! -e /sys/class/block/sdb1 ]; do
        ls -i -l /sys/class/block/sd* > /dev/null
        let poll_loop+=1
        if [ $poll_loop -gt 100 ]; then
            ls -i -l /sys/class/block/sd* --color
            exit 1
        fi
    done
    ls -i -l /sys/class/block/sd* --color
    mount /dev/sdb1 /mnt
    dd of=/mnt/sdb1 if=/dev/sdb2 bs=1M count=100 &
    pid_01=$!
    wait $pid_01
    umount -l /mnt &
    pid_02=$!
    wait $pid_02
    parted /dev/sdb -s print
    iscsiadm -m node -T iqn.2016-01.com.example:target1 -p 100.100.242.162:3260 -u &
    pid_1=$!
    iscsiadm -m node -T iqn.2016-01.com.example:target2 -p 100.100.242.162:3260 -l &
    pid_2=$!
    sleep 1
    ls -i -l /sys/class/block/sd* --color
    for i in `ls $dir`; do
        if [ ! -e $dir/$i ]; then
            echo "broken link: $dir/$i"
            exit 1
        fi
    done
    parted /dev/sdb -s print
    iscsiadm -m node -T iqn.2016-01.com.example:target2 -p 100.100.242.162:3260 -u
    iter_count=`expr $iter_count + 1`
done

Regards,
Gulam Mohamed.
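For context on the trigger path discussed in this thread: the partition rescan that parted/systemd-udevd provokes boils down to a BLKRRPART ioctl on the whole-disk node. A minimal sketch of that trigger (the device path and the bare-bones error handling are assumptions for illustration):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKRRPART */

int main(void)
{
	int fd = open("/dev/sdb", O_RDWR);	/* assumed test device */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The kernel drops and re-adds all partitions of the disk here,
	 * racing with any concurrent device teardown (e.g. iSCSI logout). */
	if (ioctl(fd, BLKRRPART) < 0)
		perror("ioctl(BLKRRPART)");
	close(fd);
	return 0;
}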
[PATCH AUTOSEL 5.6 223/606] net/tls: fix race condition causing kernel panic
From: Vinay Kumar Yadav

[ Upstream commit 0cada33241d9de205522e3858b18e506ca5cce2c ]

tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.

// tls_sw_recvmsg()
if (atomic_read(&ctx->decrypt_pending))
	crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
	reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
pending = atomic_dec_return(&ctx->decrypt_pending);
if (!pending && READ_ONCE(ctx->async_notify))
	complete(&ctx->async_wait.completion);

Consider the scenario where tls_decrypt_done() is about to run complete():

if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
and then tls_decrypt_done() runs complete(). This sequence of execution
results in a wrong completion. Consequently, the next decrypt request will
not wait for completion, and eventually, on connection close, the crypto
resources are freed and there is no way to handle the pending decrypt
response.

This race condition can be avoided by making atomic_read() mutually
exclusive with atomic_dec_return()/complete(). Introduced a spin lock to
ensure the mutual exclusion.

Addressed a similar problem in the tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix a new race scenario.
- Removed a barrier which is not needed now.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Vinay Kumar Yadav
Reviewed-by: Jakub Kicinski
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 include/net/tls.h |  4 ++++
 net/tls/tls_sw.c  | 33 +++++++++++++++++++++++++++------
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index bf9eb4823933..18cd4f418464 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -135,6 +135,8 @@ struct tls_sw_context_tx {
 	struct tls_rec *open_rec;
 	struct list_head tx_list;
 	atomic_t encrypt_pending;
+	/* protect crypto_wait with encrypt_pending */
+	spinlock_t encrypt_compl_lock;
 	int async_notify;
 	u8 async_capable:1;
 
@@ -155,6 +157,8 @@ struct tls_sw_context_rx {
 	u8 async_capable:1;
 	u8 decrypted:1;
 	atomic_t decrypt_pending;
+	/* protect crypto_wait with decrypt_pending*/
+	spinlock_t decrypt_compl_lock;
 	bool async_notify;
 };
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e23f94a5549b..ffa3cbc5449d 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -206,10 +206,12 @@ static void tls_decrypt_done(struct crypto_async_request *req, int err)
 
 	kfree(aead_req);
 
+	spin_lock_bh(&ctx->decrypt_compl_lock);
 	pending = atomic_dec_return(&ctx->decrypt_pending);
 
-	if (!pending && READ_ONCE(ctx->async_notify))
+	if (!pending && ctx->async_notify)
 		complete(&ctx->async_wait.completion);
+	spin_unlock_bh(&ctx->decrypt_compl_lock);
 }
 
 static int tls_do_decryption(struct sock *sk,
@@ -467,10 +469,12 @@ static void tls_encrypt_done(struct crypto_async_request *req, int err)
 			ready = true;
 	}
 
+	spin_lock_bh(&ctx->encrypt_compl_lock);
 	pending = atomic_dec_return(&ctx->encrypt_pending);
 
-	if (!pending && READ_ONCE(ctx->async_notify))
+	if (!pending && ctx->async_notify)
 		complete(&ctx->async_wait.completion);
+	spin_unlock_bh(&ctx->encrypt_compl_lock);
 
 	if (!ready)
 		return;
@@ -926,6 +930,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	int num_zc = 0;
 	int orig_size;
 	int ret = 0;
+	int pending;
 
 	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
 		return -EOPNOTSUPP;
@@ -1092,13 +1097,19 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		goto send_end;
 	} else if (num_zc) {
 		/* Wait for pending encryptions to get completed */
-		smp_store_mb(ctx->async_notify, true);
+		spin_lock_bh(&ctx->encrypt_compl_lock);
+		ctx->async_notify = true;
 
-		if (atomic_read(&ctx->encrypt_pending))
+		pending = atomic_read(&ctx->encrypt_pending);
+		spin_unlock_bh(&ctx->encrypt_compl_lock);
+		if (pending)
 			crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 		else
 			reinit_completion(&ctx->async_wait.completion);
 
+		/* There can be no concurrent accesses, since we have no
+		 * pending encrypt operations
+		 */
 		WRITE_ONCE(ctx->async_notify, false);
 
 		if (ctx->async_wait.err) {
@@ -1729,6 +1740,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	bool is_kvec = iov_iter_is_kvec(&msg->msg_iter);
 	bool is_peek = flags & MSG_PEEK;
 	int num_async = 0;
+	int pending;
 
 	flags |= nonblock;
 
@@ -1891,8 +1903,11
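To see why the spinlock closes the window, here is a userspace analogue of the pattern this patch fixes, with a pthread mutex and condition variable standing in for the kernel spinlock and completion. All names are illustrative, not the kernel API:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct tls_like_ctx {
	atomic_int pending;          /* in-flight async crypto requests */
	bool async_notify;
	pthread_mutex_t compl_lock;  /* stands in for decrypt_compl_lock */
	pthread_cond_t done;         /* stands in for the completion */
};

static struct tls_like_ctx ctx = {
	.compl_lock = PTHREAD_MUTEX_INITIALIZER,
	.done = PTHREAD_COND_INITIALIZER,
};

/* completion side: mirrors tls_decrypt_done() after the fix */
static void decrypt_done(void)
{
	pthread_mutex_lock(&ctx.compl_lock);
	/* decrement and test are now atomic w.r.t. the waiter below */
	if (atomic_fetch_sub(&ctx.pending, 1) == 1 && ctx.async_notify)
		pthread_cond_signal(&ctx.done);
	pthread_mutex_unlock(&ctx.compl_lock);
}

/* waiter side: mirrors the recv_end: block of tls_sw_recvmsg() */
static void wait_for_pending(void)
{
	pthread_mutex_lock(&ctx.compl_lock);
	ctx.async_notify = true;
	/* checking pending under the same lock closes the window in which
	 * decrypt_done() could complete() against a wait state the waiter
	 * is still about to (re)initialize */
	while (atomic_load(&ctx.pending) != 0)
		pthread_cond_wait(&ctx.done, &ctx.compl_lock);
	ctx.async_notify = false;
	pthread_mutex_unlock(&ctx.compl_lock);
}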
[PATCH 5.4 015/142] net/tls: fix race condition causing kernel panic
From: Vinay Kumar Yadav

[ Upstream commit 0cada33241d9de205522e3858b18e506ca5cce2c ]

tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.

// tls_sw_recvmsg()
if (atomic_read(&ctx->decrypt_pending))
	crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
	reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
pending = atomic_dec_return(&ctx->decrypt_pending);
if (!pending && READ_ONCE(ctx->async_notify))
	complete(&ctx->async_wait.completion);

Consider the scenario where tls_decrypt_done() is about to run complete():

if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
and then tls_decrypt_done() runs complete(). This sequence of execution
results in a wrong completion. Consequently, the next decrypt request will
not wait for completion, and eventually, on connection close, the crypto
resources are freed and there is no way to handle the pending decrypt
response.

This race condition can be avoided by making atomic_read() mutually
exclusive with atomic_dec_return()/complete(). Introduced a spin lock to
ensure the mutual exclusion.

Addressed a similar problem in the tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix a new race scenario.
- Removed a barrier which is not needed now.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Vinay Kumar Yadav
Reviewed-by: Jakub Kicinski
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 include/net/tls.h |    4 ++++
 net/tls/tls_sw.c  |   33 +++++++++++++++++++++++++++------
 2 files changed, 31 insertions(+), 6 deletions(-)

--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -157,6 +157,8 @@ struct tls_sw_context_tx {
 	struct tls_rec *open_rec;
 	struct list_head tx_list;
 	atomic_t encrypt_pending;
+	/* protect crypto_wait with encrypt_pending */
+	spinlock_t encrypt_compl_lock;
 	int async_notify;
 	int async_capable;
 
@@ -177,6 +179,8 @@ struct tls_sw_context_rx {
 	int async_capable;
 	bool decrypted;
 	atomic_t decrypt_pending;
+	/* protect crypto_wait with decrypt_pending*/
+	spinlock_t decrypt_compl_lock;
 	bool async_notify;
 };
 
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -203,10 +203,12 @@ static void tls_decrypt_done(struct cryp
 
 	kfree(aead_req);
 
+	spin_lock_bh(&ctx->decrypt_compl_lock);
 	pending = atomic_dec_return(&ctx->decrypt_pending);
 
-	if (!pending && READ_ONCE(ctx->async_notify))
+	if (!pending && ctx->async_notify)
 		complete(&ctx->async_wait.completion);
+	spin_unlock_bh(&ctx->decrypt_compl_lock);
 }
 
 static int tls_do_decryption(struct sock *sk,
@@ -464,10 +466,12 @@ static void tls_encrypt_done(struct cryp
 			ready = true;
 	}
 
+	spin_lock_bh(&ctx->encrypt_compl_lock);
 	pending = atomic_dec_return(&ctx->encrypt_pending);
 
-	if (!pending && READ_ONCE(ctx->async_notify))
+	if (!pending && ctx->async_notify)
 		complete(&ctx->async_wait.completion);
+	spin_unlock_bh(&ctx->encrypt_compl_lock);
 
 	if (!ready)
 		return;
@@ -923,6 +927,7 @@ int tls_sw_sendmsg(struct sock *sk, stru
 	int num_zc = 0;
 	int orig_size;
 	int ret = 0;
+	int pending;
 
 	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
 		return -EOPNOTSUPP;
@@ -1089,13 +1094,19 @@ trim_sgl:
 		goto send_end;
 	} else if (num_zc) {
 		/* Wait for pending encryptions to get completed */
-		smp_store_mb(ctx->async_notify, true);
+		spin_lock_bh(&ctx->encrypt_compl_lock);
+		ctx->async_notify = true;
 
-		if (atomic_read(&ctx->encrypt_pending))
+		pending = atomic_read(&ctx->encrypt_pending);
+		spin_unlock_bh(&ctx->encrypt_compl_lock);
+		if (pending)
 			crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 		else
 			reinit_completion(&ctx->async_wait.completion);
 
+		/* There can be no concurrent accesses, since we have no
+		 * pending encrypt operations
+		 */
 		WRITE_ONCE(ctx->async_notify, false);
 
 		if (ctx->async_wait.err) {
@@ -1724,6 +1735,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	bool is_kvec = iov_iter_is_kvec(&msg->msg_iter);
 	bool is_peek = flags & MSG_PEEK;
 	int num_async = 0;
+	int pending;
 
 	flags |= nonblock;
 
@@ -1886,8 +1898,11 @@ pick_next_record:
 
 recv_end:
 	if (num_async) {
 		/* Wait for all previously submitted records to be decrypted */
-		smp_store_mb(ctx->async_notify, true);
-		if (atomic_read(&ctx->decrypt_pending)) {
+		spin_lock_bh(&ctx->decrypt_compl_lock);
+		ctx->async_notify =
[PATCH 5.6 017/177] net/tls: fix race condition causing kernel panic
From: Vinay Kumar Yadav

[ Upstream commit 0cada33241d9de205522e3858b18e506ca5cce2c ]

tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.

// tls_sw_recvmsg()
if (atomic_read(&ctx->decrypt_pending))
	crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
else
	reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
pending = atomic_dec_return(&ctx->decrypt_pending);
if (!pending && READ_ONCE(ctx->async_notify))
	complete(&ctx->async_wait.completion);

Consider the scenario where tls_decrypt_done() is about to run complete():

if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
and then tls_decrypt_done() runs complete(). This sequence of execution
results in a wrong completion. Consequently, the next decrypt request will
not wait for completion, and eventually, on connection close, the crypto
resources are freed and there is no way to handle the pending decrypt
response.

This race condition can be avoided by making atomic_read() mutually
exclusive with atomic_dec_return()/complete(). Introduced a spin lock to
ensure the mutual exclusion.

Addressed a similar problem in the tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix a new race scenario.
- Removed a barrier which is not needed now.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Vinay Kumar Yadav
Reviewed-by: Jakub Kicinski
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
---
 include/net/tls.h |    4 ++++
 net/tls/tls_sw.c  |   33 +++++++++++++++++++++++++++------
 2 files changed, 31 insertions(+), 6 deletions(-)

--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -135,6 +135,8 @@ struct tls_sw_context_tx {
 	struct tls_rec *open_rec;
 	struct list_head tx_list;
 	atomic_t encrypt_pending;
+	/* protect crypto_wait with encrypt_pending */
+	spinlock_t encrypt_compl_lock;
 	int async_notify;
 	u8 async_capable:1;
 
@@ -155,6 +157,8 @@ struct tls_sw_context_rx {
 	u8 async_capable:1;
 	u8 decrypted:1;
 	atomic_t decrypt_pending;
+	/* protect crypto_wait with decrypt_pending*/
+	spinlock_t decrypt_compl_lock;
 	bool async_notify;
 };
 
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -206,10 +206,12 @@ static void tls_decrypt_done(struct cryp
 
 	kfree(aead_req);
 
+	spin_lock_bh(&ctx->decrypt_compl_lock);
 	pending = atomic_dec_return(&ctx->decrypt_pending);
 
-	if (!pending && READ_ONCE(ctx->async_notify))
+	if (!pending && ctx->async_notify)
 		complete(&ctx->async_wait.completion);
+	spin_unlock_bh(&ctx->decrypt_compl_lock);
 }
 
 static int tls_do_decryption(struct sock *sk,
@@ -467,10 +469,12 @@ static void tls_encrypt_done(struct cryp
 			ready = true;
 	}
 
+	spin_lock_bh(&ctx->encrypt_compl_lock);
 	pending = atomic_dec_return(&ctx->encrypt_pending);
 
-	if (!pending && READ_ONCE(ctx->async_notify))
+	if (!pending && ctx->async_notify)
 		complete(&ctx->async_wait.completion);
+	spin_unlock_bh(&ctx->encrypt_compl_lock);
 
 	if (!ready)
 		return;
@@ -926,6 +930,7 @@ int tls_sw_sendmsg(struct sock *sk, stru
 	int num_zc = 0;
 	int orig_size;
 	int ret = 0;
+	int pending;
 
 	if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL))
 		return -EOPNOTSUPP;
@@ -1092,13 +1097,19 @@ trim_sgl:
 		goto send_end;
 	} else if (num_zc) {
 		/* Wait for pending encryptions to get completed */
-		smp_store_mb(ctx->async_notify, true);
+		spin_lock_bh(&ctx->encrypt_compl_lock);
+		ctx->async_notify = true;
 
-		if (atomic_read(&ctx->encrypt_pending))
+		pending = atomic_read(&ctx->encrypt_pending);
+		spin_unlock_bh(&ctx->encrypt_compl_lock);
+		if (pending)
 			crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 		else
 			reinit_completion(&ctx->async_wait.completion);
 
+		/* There can be no concurrent accesses, since we have no
+		 * pending encrypt operations
+		 */
 		WRITE_ONCE(ctx->async_notify, false);
 
 		if (ctx->async_wait.err) {
@@ -1729,6 +1740,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	bool is_kvec = iov_iter_is_kvec(&msg->msg_iter);
 	bool is_peek = flags & MSG_PEEK;
 	int num_async = 0;
+	int pending;
 
 	flags |= nonblock;
 
@@ -1891,8 +1903,11 @@ pick_next_record:
 
 recv_end:
 	if (num_async) {
 		/* Wait for all previously submitted records to be decrypted */
-		smp_store_mb(ctx->async_notify, true);
-		if (atomic_read(&ctx->decrypt_pending)) {
+		spin_lock_bh(&ctx->decrypt_compl_lock);
+		ctx->async_notify
Re: race condition in kernel/padata.c
On Thu, Mar 23, 2017 at 12:03:43AM +0100, Jason A. Donenfeld wrote:
> Hey Steffen,
>
> WireGuard makes really heavy use of padata, feeding it units of work
> from different cores in different contexts all at the same time. For
> the most part, everything has been fine, but one particular user has
> consistently run into mysterious bugs. He's using a rather old dual
> core CPU, which have a tendency to bring out race conditions
> sometimes. After struggling with getting a good backtrace, we finally
> managed to extract this from list debugging:
>
> [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33 __list_add+0xae/0x130
> [87487.301868] list_add corruption. prev->next should be next (b17abfc043d0), but was 8dba70872c80. (prev=8dba70872b00).
> [87487.339011] [] dump_stack+0x68/0xa3
> [87487.342198] [] ? console_unlock+0x281/0x6d0
> [87487.345364] [] __warn+0xff/0x140
> [87487.348513] [] warn_slowpath_fmt+0x4a/0x50
> [87487.351659] [] __list_add+0xae/0x130
> [87487.354772] [] ? _raw_spin_lock+0x64/0x70
> [87487.357915] [] padata_reorder+0x1e6/0x420
> [87487.361084] [] padata_do_serial+0xa5/0x120
>
> padata_reorder calls list_add_tail with the list to which it's adding
> locked, which seems correct:
>
> spin_lock(&squeue->serial.lock);
> list_add_tail(&padata->list, &squeue->serial.list);
> spin_unlock(&squeue->serial.lock);
>
> This therefore leaves only one place where such an inconsistency could
> occur: if padata->list is added at the same time on two different
> threads. This padata pointer comes from the function call to
> padata_get_next(pd), which has in it the following block:
>
> next_queue = per_cpu_ptr(pd->pqueue, cpu);
> padata = NULL;
> reorder = &next_queue->reorder;
> if (!list_empty(&reorder->list)) {
>         padata = list_entry(reorder->list.next,
>                             struct padata_priv, list);
>         spin_lock(&reorder->lock);
>         list_del_init(&padata->list);
>         atomic_dec(&pd->reorder_objects);
>         spin_unlock(&reorder->lock);
>
>         pd->processed++;
>
>         goto out;
> }
> out:
> return padata;
>
> I strongly suspect that the problem here is that two threads can race
> on the reorder list. Even though the deletion is locked, the call to
> list_entry is not locked, which means it's feasible that two threads
> pick up the same padata object and subsequently call list_add_tail on
> them at the same time. The fix would thus be to hoist that lock
> outside of that block.

Yes, looks like we should lock the whole list handling block here.

Thanks!
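For concreteness, hoisting the lock as Jason suggests would make the quoted block of padata_get_next() look roughly like this. This is a sketch against the code quoted above, not a tested patch:

/* Take reorder->lock before the list_empty() check, so that looking up
 * the first entry and deleting it from the reorder list is one atomic
 * step; two CPUs can then no longer pick up the same padata object. */
next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;

spin_lock(&reorder->lock);
if (!list_empty(&reorder->list)) {
	padata = list_entry(reorder->list.next,
			    struct padata_priv, list);
	list_del_init(&padata->list);
	atomic_dec(&pd->reorder_objects);

	pd->processed++;

	spin_unlock(&reorder->lock);
	goto out;
}
spin_unlock(&reorder->lock);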
race condition in kernel/padata.c
Hey Steffen,

WireGuard makes really heavy use of padata, feeding it units of work from
different cores in different contexts all at the same time. For the most
part, everything has been fine, but one particular user has consistently run
into mysterious bugs. He's using a rather old dual core CPU, which have a
tendency to bring out race conditions sometimes. After struggling with
getting a good backtrace, we finally managed to extract this from list
debugging:

[87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33 __list_add+0xae/0x130
[87487.301868] list_add corruption. prev->next should be next (b17abfc043d0), but was 8dba70872c80. (prev=8dba70872b00).
[87487.339011] [] dump_stack+0x68/0xa3
[87487.342198] [] ? console_unlock+0x281/0x6d0
[87487.345364] [] __warn+0xff/0x140
[87487.348513] [] warn_slowpath_fmt+0x4a/0x50
[87487.351659] [] __list_add+0xae/0x130
[87487.354772] [] ? _raw_spin_lock+0x64/0x70
[87487.357915] [] padata_reorder+0x1e6/0x420
[87487.361084] [] padata_do_serial+0xa5/0x120

padata_reorder calls list_add_tail with the list to which it's adding locked,
which seems correct:

spin_lock(&squeue->serial.lock);
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);

This therefore leaves only one place where such an inconsistency could occur:
if padata->list is added at the same time on two different threads. This
padata pointer comes from the function call to padata_get_next(pd), which has
in it the following block:

next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;
if (!list_empty(&reorder->list)) {
        padata = list_entry(reorder->list.next,
                            struct padata_priv, list);
        spin_lock(&reorder->lock);
        list_del_init(&padata->list);
        atomic_dec(&pd->reorder_objects);
        spin_unlock(&reorder->lock);

        pd->processed++;

        goto out;
}
out:
return padata;

I strongly suspect that the problem here is that two threads can race on the
reorder list. Even though the deletion is locked, the call to list_entry is
not locked, which means it's feasible that two threads pick up the same
padata object and subsequently call list_add_tail on them at the same time.
The fix would thus be to hoist that lock outside of that block.

This theory is unconfirmed at the moment, but I'll be playing with some
patches to see if this fixes the issue, and then I'll get back to you. In the
meantime, if you have any insight before I potentially waste some time, I'm
all ears.

Thanks,
Jason
Re: [RFC]: Possible race condition in kernel futex code
On Fri, 2015-10-09 at 10:06 +0100, Thomas Gleixner wrote:
[...]
> @stable: Can you please pick up ff47ab4ff3cd plus
>
>   df90ca969035d x86, sparse: Do not force removal of __user when calling
>   copy_to/from_user_nocheck()
>
> for stable kernels <= 3.12?
[...]

I've finally queued these up for 3.2, thanks.

Ben.

--
Ben Hutchings
For every action, there is an equal and opposite criticism. - Harrison
Re: [RFC]: Possible race condition in kernel futex code
On Fri, Oct 09, 2015 at 10:06:41AM +0100, Thomas Gleixner wrote:
> On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
> > We did some tests with different compilers, kernel versions and kernel
> > configs, with the following results:
> >
> > Linux 3.12.48, x86_64_defconfig, GCC 4.6.1: copy_user_generic_unrolled
> > being used, so race condition possible
> > Linux 3.12.48, x86_64_defconfig, GCC 4.9.1: copy_user_generic_unrolled
> > being used, so race condition possible
> > Linux 4.2.3, x86_64_defconfig, GCC 4.6.1: 32 bit read being used, no
> > race condition
> > Linux 4.2.3, x86_64_defconfig, GCC 4.9.1: 32 bit read being used, no
> > race condition
> >
> > Our idea to fix this problem is to use an explicit 32 bit read in
> > get_futex_value_locked() instead of using the generic function
> > copy_from_user_inatomic() and hoping the compiler uses an atomic
> > access and the right access size.
>
> You cannot use an explicit 32bit read. We need an access which handles
> the fault gracefully.
>
> In current mainline this is done properly:
>
>   ret = __copy_from_user_inatomic(dst, src, size = sizeof(u32))
>
>     __copy_from_user_nocheck(dst, src, size)
>
>       if (!__builtin_constant_p(size))
>               return copy_user_generic(dst, (__force void *)src, size);
>
> size is constant, so we end up in the switch case:
>
>       switch (size) {
>       case 4:
>               __get_user_asm(*(u32 *)dst, (u32 __user *)src,
>                              ret, "l", "k", "=r", 4);
>               return ret;
>
> In 3.12 this is different:
>
>   __copy_from_user_inatomic()
>     copy_user_generic()
>       copy_user_generic_unrolled()
>
> So this is only an issue for kernel versions < 3.13. It was fixed with
>
>   ff47ab4ff3cd: Add 1/2/4/8 byte optimization to 64bit
>   __copy_{from,to}_user_inatomic
>
> but nobody noticed that the race you described can happen, so it was
> never backported to the stable kernels.
>
> @stable: Can you please pick up ff47ab4ff3cd plus
>
>   df90ca969035d x86, sparse: Do not force removal of __user when calling
>   copy_to/from_user_nocheck()
>
> for stable kernels <= 3.12?
>
> If that's too much of churn, then I can come up with an explicit fix
> for this. Let me know.

Now applied to 3.10-stable, thanks.

greg k-h
Re: [RFC]: Possible race condition in kernel futex code
On Fri, Oct 09, 2015 at 11:25:09AM +0100, Thomas Gleixner wrote:
> Hans,
>
> On Fri, 9 Oct 2015, Hans Zuidam wrote:
> > On 9 okt. 2015, at 11:06, Thomas Gleixner wrote:
> > > You cannot use an explicit 32bit read. We need an access which
> > > handles the fault gracefully.
> >
> > The reason for the explicit read suggestion is to avoid the
> > __builtin_constant_p() in __copy_from_user_nocheck(). The GCC manual
> > says that there may be situations where it returns 0 even though the
> > argument is a constant.
>
> That's insane at best.

Right, but I bet that is for cases where constant propagation completely
fails, and this is a trivial case; I have no problem relying on it.
Re: [RFC]: Possible race condition in kernel futex code
Hans,

On Fri, 9 Oct 2015, Hans Zuidam wrote:
> On 9 okt. 2015, at 11:06, Thomas Gleixner wrote:
> > You cannot use an explicit 32bit read. We need an access which
> > handles the fault gracefully.
>
> The reason for the explicit read suggestion is to avoid the
> __builtin_constant_p() in __copy_from_user_nocheck(). The GCC manual
> says that there may be situations where it returns 0 even though the
> argument is a constant.

That's insane at best.

> Although none of the compiler/kernel combinations we have tried
> showed this happening, we think it is probably better to be safe
> than sorry.

So we would need something like:

   futex_copy_from_user()

which can be mapped to __copy_from_user_inatomic() first. Then go through
all architectures and the asm-generic stuff and provide the specific
variants which are guaranteed to use a 32bit access.

I really prefer that we don't have to do that and the compiler people get
their act together and fix that __builtin_constant_p() thingy.

Thanks,

	tglx
Re: [RFC]: Possible race condition in kernel futex code
Hi Thomas,

On 9 okt. 2015, at 11:06, Thomas Gleixner wrote:
> On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
> > We did some tests with different compilers, kernel versions and kernel
> > configs, with the following results:
>
> You cannot use an explicit 32bit read. We need an access which handles
> the fault gracefully.

The reason for the explicit read suggestion is to avoid the
__builtin_constant_p() in __copy_from_user_nocheck(). The GCC manual says
that there may be situations where it returns 0 even though the argument is
a constant. Although none of the compiler/kernel combinations we have tried
showed this happening, we think it is probably better to be safe than sorry.

With kind regards,
Hans Zuidam
Re: [RFC]: Possible race condition in kernel futex code
On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
> We did some tests with different compilers, kernel versions and kernel
> configs, with the following results:
>
> Linux 3.12.48, x86_64_defconfig, GCC 4.6.1: copy_user_generic_unrolled
> being used, so race condition possible
> Linux 3.12.48, x86_64_defconfig, GCC 4.9.1: copy_user_generic_unrolled
> being used, so race condition possible
> Linux 4.2.3, x86_64_defconfig, GCC 4.6.1: 32 bit read being used, no
> race condition
> Linux 4.2.3, x86_64_defconfig, GCC 4.9.1: 32 bit read being used, no
> race condition
>
> Our idea to fix this problem is to use an explicit 32 bit read in
> get_futex_value_locked() instead of using the generic function
> copy_from_user_inatomic() and hoping the compiler uses an atomic
> access and the right access size.

You cannot use an explicit 32bit read. We need an access which handles
the fault gracefully.

In current mainline this is done properly:

  ret = __copy_from_user_inatomic(dst, src, size = sizeof(u32))

    __copy_from_user_nocheck(dst, src, size)

      if (!__builtin_constant_p(size))
              return copy_user_generic(dst, (__force void *)src, size);

size is constant, so we end up in the switch case:

      switch (size) {
      case 4:
              __get_user_asm(*(u32 *)dst, (u32 __user *)src,
                             ret, "l", "k", "=r", 4);
              return ret;

In 3.12 this is different:

  __copy_from_user_inatomic()
    copy_user_generic()
      copy_user_generic_unrolled()

So this is only an issue for kernel versions < 3.13. It was fixed with

  ff47ab4ff3cd: Add 1/2/4/8 byte optimization to 64bit
  __copy_{from,to}_user_inatomic

but nobody noticed that the race you described can happen, so it was
never backported to the stable kernels.

@stable: Can you please pick up ff47ab4ff3cd plus

  df90ca969035d x86, sparse: Do not force removal of __user when calling
  copy_to/from_user_nocheck()

for stable kernels <= 3.12?

If that's too much of churn, then I can come up with an explicit fix
for this. Let me know.

Thanks,

	tglx
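For reference, one possible shape of the "explicit fix" Thomas offers would be a read that is both explicitly 32 bits wide and fault-graceful, by going through __get_user() and the exception-table machinery rather than the generic copy path. This is a sketch only, not the actual mainline or stable patch:

/* get_futex_value_locked(), sketched with an explicit-width access:
 * __get_user() on a u32 __user pointer emits a single 4-byte load with
 * exception-table fixup, so a fault still returns -EFAULT gracefully
 * and the load can never be split into byte accesses. */
static int get_futex_value_locked(u32 *dest, u32 __user *from)
{
	int ret;

	pagefault_disable();
	ret = __get_user(*dest, from);	/* single 32-bit load with fixup */
	pagefault_enable();

	return ret ? -EFAULT : 0;
}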
[RFC]: Possible race condition in kernel futex code
Hello all, For a while now, we see our application crashing occasionally (due to an assert) in the middle of a pthread_mutex_lock() function call. To be more precise, our application is threaded (50 threads) where all threads share a custom memory allocator. This memory allocator is guarded with a recursive, priority inheritance pthread mutex. The application is running on a 8 core, 64 bit x86 system using the 3.12.42-rt58 kernel and glibc 2.13. The assert occurs when the pthread_mutex_lock() function checks the return value of the futex system call (see line 307 of the pthread_mutex_lock() source code on http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13). In case of the assert, the kernel has returned -EDEADLK which results in a SIGABRT. The strange thing is that the kernel returns -EDEADLK, because it finds out that the calling thread is the owner of the mutex. But glibc has already checked this before executing the futex system call (see line 194 of http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13). The case where a recursive mutex is locked again is fully handled in glibc. No futex system call is required. The kernel futex code finds out that the calling thread is the owner of the mutex by reading the value of the __data.__lock field of the pthread_mutex_t. This field is read by the get_futex_value_locked() function (see kernel/futex.c). The __data.__lock field in a pthread_mutex_t is written by an atomic 32bit compare exchange when locking the mutex (see line 288 of http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13). Reading the value by the kernel should be an atomic action. Much to our surprise, the disassembly of the get_futex_value_locked() function is: 81098890 : 81098890:55 push %rbp 81098891:48 89 e5 mov%rsp,%rbp 81098894:48 83 ec 10 sub$0x10,%rsp 81098898:48 89 5d f0 mov%rbx,-0x10(%rbp) 8109889c:48 89 fb mov%rdi,%rbx 8109889f:4c 89 65 f8 mov%r12,-0x8(%rbp) 810988a3:49 89 f4 mov%rsi,%r12 810988a6:e8 35 e9 06 00 callq 811071e0 810988ab:48 89 df mov%rbx,%rdi 810988ae:4c 89 e6 mov%r12,%rsi 810988b1:ba 04 00 00 00 mov$0x4,%edx 810988b6:e8 85 f0 2b 00 callq 81357940 810988bb:89 c3mov%eax,%ebx 810988bd:e8 fe e8 06 00 callq 811071c0 810988c2:83 fb 01 cmp$0x1,%ebx 810988c5:4c 8b 65 f8 mov-0x8(%rbp),%r12 810988c9:19 c0sbb%eax,%eax 810988cb:48 8b 5d f0 mov-0x10(%rbp),%rbx 810988cf:c9 leaveq 810988d0:f7 d0not%eax 810988d2:83 e0 f2 and$0xfff2,%eax 810988d5:c3 retq 810988d6:66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 810988dd:00 00 00 So the copy_user_generic_unrolled() function is called to read the 32bit __data.__lock field. The disassembly of the copy_user_generic_unrolled() function: 81357940 : 81357940:83 fa 08 cmp$0x8,%edx 81357943:0f 82 8c 00 00 00jb 813579d5 81357949:89 f9mov%edi,%ecx 8135794b:83 e1 07 and$0x7,%ecx 8135794e:74 15je 81357965 81357950:83 e9 08 sub$0x8,%ecx 81357953:f7 d9neg%ecx 81357955:29 casub%ecx,%edx 81357957:8a 06mov(%rsi),%al 81357959:88 07mov%al,(%rdi) 8135795b:48 ff c6 inc%rsi 8135795e:48 ff c7 inc%rdi 81357961:ff c9dec%ecx 81357963:75 f2jne81357957 81357965:89 d1mov%edx,%ecx 81357967:83 e2 3f and$0x3f,%edx 8135796a:c1 e9 06 shr$0x6,%ecx 8135796d:74 4aje 813579b9 8135796f:4c 8b 06 mov(%rsi),%r8 81357972:4c 8b 4e 08 mov0x8(%rsi),%r9 81357976:4c 8b 56 10 mov0x10(%rsi),%r10 8135797a:4c 8b 5e 18 mov0x18(%rsi),%r11 8135797e:4c 89 07 mov%r8,(%rdi) 81357981:4c 89 4f 08
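The torn-read hazard described above is easy to demonstrate from userspace: copy a 32-bit word byte by byte while another thread rewrites it with aligned 32-bit stores, and mixed values appear. A self-contained illustration (userspace analogue, not kernel code; the byte loop deliberately mimics the tail-copy loop of copy_user_generic_unrolled()):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint32_t lock_word;

/* flip the word between two values with atomic 32-bit stores */
static void *writer(void *arg)
{
	(void)arg;
	for (;;) {
		atomic_store_explicit(&lock_word, 0x00000000u, memory_order_relaxed);
		atomic_store_explicit(&lock_word, 0xffffffffu, memory_order_relaxed);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, writer, NULL);
	for (;;) {
		uint32_t snap = 0;
		const unsigned char *src = (const unsigned char *)&lock_word;
		unsigned char *dst = (unsigned char *)&snap;

		/* deliberately byte-wise: this copy is not atomic, so it
		 * can interleave with the writer's 32-bit stores */
		for (int i = 0; i < 4; i++)
			dst[i] = src[i];

		if (snap != 0x00000000u && snap != 0xffffffffu) {
			printf("torn read observed: 0x%08x\n", snap);
			return 1;
		}
	}
}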