Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Mike Christie
On 5/1/24 2:50 AM, Hillf Danton wrote:
> On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
>>
>> and then it failed testing.
>>
> So did my patch [1] but then the reason was spotted [2,3]
> 
> [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/

Just to make sure I understand the conclusion.

Edward's patch that just swaps the order of the calls:

https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

fixes the UAF. I tested the same in my setup. However, when you guys tested it
with syzbot, it also triggered a softirq/RCU warning.

The softirq/RCU part of the issue is fixed with this commit:

https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/

commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
Author: Zqiang 
Date:   Sat Apr 27 18:28:08 2024 +0800

softirq: Fix suspicious RCU usage in __do_softirq()

The problem was that I was testing with -next master which has that patch.
It looks like you guys were testing against bb7a2467e6be which didn't have
the patch, and so that's why you guys still hit the softirq/RCU issue. Later
when you added that patch to your patch, it worked with syzbot.

So is it safe to assume that the softirq/RCU patch above will be upstream
when the vhost changes go in or is there a tag I need to add to my patches?



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Mike Christie
On 4/30/24 7:15 PM, Hillf Danton wrote:
> On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
>> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
>>>  static int vhost_task_fn(void *data)
>>>  {
>>> struct vhost_task *vtsk = data;
>>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
>>> schedule();
>>> }
>>>  
>>> -   mutex_lock(&vtsk->exit_mutex);
>>> +   mutex_lock(&exit_mutex);
>>> /*
>>>  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
>>>  * When the vhost layer has called vhost_task_stop it's already stopped
>>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
>>> vtsk->handle_sigkill(vtsk->data);
>>> }
>>> complete(&vtsk->exited);
>>> -   mutex_unlock(&vtsk->exit_mutex);
>>> +   mutex_unlock(&exit_mutex);
>>>  
>>
>> Edward, thanks for the patch. I think though I just needed to swap the
>> order of the calls above.
>>
>> Instead of:
>>
>> complete(&vtsk->exited);
>> mutex_unlock(&vtsk->exit_mutex);
>>
>> it should have been:
>>
>> mutex_unlock(&vtsk->exit_mutex);
>> complete(&vtsk->exited);
> 
> JFYI Edward did it [1]
> 
> [1] 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

Thanks.

I tested the code with that change and it no longer triggers the UAF.

I've fixed up the original patch that had the bug and am going to
resubmit the patchset like how Michael requested.




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Mike Christie
On 4/30/24 8:05 AM, Edward Adam Davis wrote:
>  static int vhost_task_fn(void *data)
>  {
>   struct vhost_task *vtsk = data;
> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
>   schedule();
>   }
>  
> - mutex_lock(&vtsk->exit_mutex);
> + mutex_lock(&exit_mutex);
>   /*
>* If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
>* When the vhost layer has called vhost_task_stop it's already stopped
> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
>   vtsk->handle_sigkill(vtsk->data);
>   }
>   complete(&vtsk->exited);
> - mutex_unlock(&vtsk->exit_mutex);
> + mutex_unlock(&exit_mutex);
>  

Edward, thanks for the patch. I think though I just needed to swap the
order of the calls above.

Instead of:

complete(&vtsk->exited);
mutex_unlock(&vtsk->exit_mutex);

it should have been:

mutex_unlock(&vtsk->exit_mutex);
complete(&vtsk->exited);
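In other words, the tail of vhost_task_fn() should end up roughly like this
(sketch only, based on the patchset the bug came from; vhost_task_stop() does
wait_for_completion(&vtsk->exited) and then frees vtsk, so nothing may touch
vtsk after the complete()):

    mutex_lock(&vtsk->exit_mutex);
    /* the SIGKILL vs vhost_task_stop race handling from the patch goes here */
    mutex_unlock(&vtsk->exit_mutex);
    complete(&vtsk->exited);    /* last access to vtsk: the stop path can now free it */
    do_exit(0);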

If my analysis is correct, then Michael do you want me to resubmit a
patch on top of your vhost branch or resubmit the entire patchset?




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Mike Christie
8 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
>>> RSP: :c9066e18 EFLAGS: 00010246
>>> RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
>>> RDX:  RSI:  RDI: 
>>> RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
>>> R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
>>> R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
>>> FS:  () GS:8880b940() knlGS:
>>> CS:  0010 DS:  ES:  CR0: 80050033
>>> CR2: 88823000 CR3: 0e132000 CR4: 003506f0
>>> DR0:  DR1:  DR2: 
>>> DR3:  DR6: fffe0ff0 DR7: 0400
>>> Call Trace:
>>>  
>>>  reset_page_owner include/linux/page_owner.h:25 [inline]
>>>  free_pages_prepare mm/page_alloc.c:1141 [inline]
>>>  __free_pages_ok+0xc54/0xd80 mm/page_alloc.c:1270
>>>  make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
>>>  vring_alloc_queue drivers/virtio/virtio_ring.c:319 [inline]
>>>  vring_alloc_queue_split+0x20a/0x600 drivers/virtio/virtio_ring.c:1108
>>>  vring_create_virtqueue_split+0xc6/0x310 drivers/virtio/virtio_ring.c:1158
>>>  vring_create_virtqueue+0xca/0x110 drivers/virtio/virtio_ring.c:2683
>>>  setup_vq+0xe9/0x2d0 drivers/virtio/virtio_pci_legacy.c:131
>>>  vp_setup_vq+0xbf/0x330 drivers/virtio/virtio_pci_common.c:189
>>>  vp_find_vqs_msix+0x8b2/0xc80 drivers/virtio/virtio_pci_common.c:331
>>>  vp_find_vqs+0x4c/0x4e0 drivers/virtio/virtio_pci_common.c:408
>>>  virtio_find_vqs include/linux/virtio_config.h:233 [inline]
>>>  virtscsi_init+0x8db/0xd00 drivers/scsi/virtio_scsi.c:887
>>>  virtscsi_probe+0x3ea/0xf60 drivers/scsi/virtio_scsi.c:945
>>>  virtio_dev_probe+0x991/0xaf0 drivers/virtio/virtio.c:311
>>>  really_probe+0x29e/0xc50 drivers/base/dd.c:658
>>>  __driver_probe_device+0x1a2/0x3e0 drivers/base/dd.c:800
>>>  driver_probe_device+0x50/0x430 drivers/base/dd.c:830
>>>  __driver_attach+0x45f/0x710 drivers/base/dd.c:1216
>>>  bus_for_each_dev+0x239/0x2b0 drivers/base/bus.c:368
>>>  bus_add_driver+0x347/0x620 drivers/base/bus.c:673
>>>  driver_register+0x23a/0x320 drivers/base/driver.c:246
>>>  virtio_scsi_init+0x65/0xe0 drivers/scsi/virtio_scsi.c:1083
>>>  do_one_initcall+0x248/0x880 init/main.c:1238
>>>  do_initcall_level+0x157/0x210 init/main.c:1300
>>>  do_initcalls+0x3f/0x80 init/main.c:1316
>>>  kernel_init_freeable+0x435/0x5d0 init/main.c:1548
>>>  kernel_init+0x1d/0x2b0 init/main.c:1437
>>>  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
>>>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
>>>  
>>>
>>
>> I think I saw this already and also with virtio scsi. virtio
>> core does not seem to be doing anything special here,
>> Cc virtio scsi maintainers.
> 
> The oldest commit that syzkaller found is a memory management pull
> request:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?id=e5eb28f6d1afebed4bb7d740a797d0390bd3a357
> 
> I can't reproduce the issue locally with QEMU 8.2.0 so I don't have a
> way to bisect.
> 
> I reviewed the virtio_scsi.c git log and there have been few changes
> over the last several months. I couldn't spot an issue in this patch,
> but the most likely virtio-scsi commit is:
> 
>   commit 95e7249691f082a5178d4d6f60fcdee91da458ab
>   Author: Mike Christie 
>   Date:   Wed Dec 13 23:26:49 2023 -0600
> 
>   scsi: virtio_scsi: Add mq_poll support
> 
> Stefan

I also tested the current kernel and didn't hit it.

In this mail:

https://lore.kernel.org/all/ZfKPf_pGxv-xtSPN@localhost.localdomain/

from this thread:

https://lore.kernel.org/all/37cb2e7c-97f1-4179-a715-84cc02096...@i-love.sakura.ne.jp/T/

it looks like Oscar is saying he has a fix right?




Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

2024-01-11 Thread Mike Christie
On 1/10/24 9:09 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare wrote:
>>
>> To pass ownership of a live vdpa device to a new process, the user
>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>> mm.  Flush workers in suspend to guarantee that no worker sees the new
>> mm and old VA in between.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  drivers/vhost/vdpa.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 8fe1562d24af..9673e8e20d11 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>>  {
>> struct vdpa_device *vdpa = v->vdpa;
>> const struct vdpa_config_ops *ops = vdpa->config;
>> struct vhost_dev *vdev = &v->vdev;
>>
>> if (!ops->suspend)
>> return -EOPNOTSUPP;
>>
>> +   if (vdev->use_worker)
>> +   vhost_dev_flush(vdev);
> 
> It looks to me like it's better to check use_worker in vhost_dev_flush.
> 

You can now just call vhost_dev_flush and it will do the right thing.
The xa_for_each loop will only flush workers if they have been setup,
so for vdpa it will not find/flush anything.
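For reference, the flush path is now roughly the following (sketch from memory,
helper name approximate):

    void vhost_dev_flush(struct vhost_dev *dev)
    {
            struct vhost_worker *worker;
            unsigned long i;

            /* Only workers that were actually created are in the xarray,
             * so for vdpa, which never sets up a worker, there is nothing
             * to iterate and the call is a no-op.
             */
            xa_for_each(&dev->worker_xa, i, worker)
                    vhost_worker_flush(worker);
    }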






Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-08 Thread Mike Christie
On 12/8/23 3:24 AM, Tobias Huschle wrote:
> On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
>> On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
>>> 3. vhost looping endlessly, waiting for kworker to be scheduled
>>>
>>> I dug a little deeper on what the vhost is doing. I'm not an expert on
>>> virtio whatsoever, so these are just educated guesses that maybe
>>> someone can verify/correct. Please bear with me probably messing up 
>>> the terminology.
>>>
>>> - vhost is looping through available queues.
>>> - vhost wants to wake up a kworker to process a found queue.
>>> - kworker does something with that queue and terminates quickly.
>>>
>>> What I found by throwing in some very noisy trace statements was that,
>>> if the kworker is not woken up, the vhost just keeps looping across
>>> all available queues (and seems to repeat itself). So it essentially
>>> relies on the scheduler to schedule the kworker fast enough. Otherwise
>>> it will just keep on looping until it is migrated off the CPU.
>>
>>
>> Normally it takes the buffers off the queue and is done with it.
>> I am guessing that at the same time guest is running on some other
>> CPU and keeps adding available buffers?
>>
> 
> It seems to do just that, there are multiple other vhost instances
> involved which might keep filling up those queues. 
> 
> Unfortunately, this makes the problematic vhost instance stay on
> the CPU and prevents said kworker from getting scheduled. The kworker is
> explicitly woken up by vhost, so it wants it to do something.
> 
> At this point it seems that there is an assumption about the scheduler
> in place which is no longer fulfilled by EEVDF. From the discussion so
> far, it seems like EEVDF does what it is intended to do.
> 
> Shouldn't there be a more explicit mechanism in use that allows the
> kworker to be scheduled in favor of the vhost?
> 
> It is also concerning that the vhost seemingly cannot be preempted by the
> scheduler while executing that loop.
> 

Hey,

I recently noticed this change:

commit 05bfb338fa8dd40b008ce443e397fc374f6bd107
Author: Josh Poimboeuf 
Date:   Fri Feb 24 08:50:01 2023 -0800

vhost: Fix livepatch timeouts in vhost_worker()

We used to do:

while (1)
for each vhost work item in list
execute work item
if (need_resched())
schedule();

and after that patch we do:

while (1)
for each vhost work item in list
execute work item
cond_resched()


Would the need_resched check we used to have give you what
you wanted?
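i.e. something roughly like this in the vhost work loop (untested sketch, the
surrounding llist handling elided):

    llist_for_each_entry_safe(work, work_next, node, node) {
            clear_bit(VHOST_WORK_QUEUED, &work->flags);
            work->fn(work);
            /* only yield when the scheduler says someone is waiting */
            if (need_resched())
                    schedule();
    }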



Re: [PATCH 2/2] scsi: iscsi_tcp: Fix use-after-free in iscsi_sw_tcp_host_get_param()

2021-04-12 Thread Mike Christie
On 4/6/21 8:24 PM, Wenchao Hao wrote:
> iscsi_sw_tcp_host_get_param() would access struct iscsi_session, while
> struct iscsi_session might be freed by session destroy flow in
> iscsi_free_session(). This commit fixes this condition by freeing the session
> after the host has already been removed.
> 
> Signed-off-by: Wenchao Hao 
> ---
>  drivers/scsi/iscsi_tcp.c | 27 ++-
>  1 file changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
> index dd33ce0e3737..d559abd3694c 100644
> --- a/drivers/scsi/iscsi_tcp.c
> +++ b/drivers/scsi/iscsi_tcp.c
> @@ -839,6 +839,18 @@ iscsi_sw_tcp_conn_get_stats(struct iscsi_cls_conn 
> *cls_conn,
>   iscsi_tcp_conn_get_stats(cls_conn, stats);
>  }
>  
> +static void
> +iscsi_sw_tcp_session_teardown(struct iscsi_cls_session *cls_session)
> +{
> + struct Scsi_Host *shost = iscsi_session_to_shost(cls_session);
> +
> + iscsi_session_destroy(cls_session);
> + iscsi_host_remove(shost);
> +
> + iscsi_free_session(cls_session);
> + iscsi_host_free(shost);
> +}

Can you add a comment about the iscsi_tcp dependency with the host
and session or maybe convert ib_iser too?

ib_iser does the same session per host scheme and so if you were
just scanning the code and going to make an API change, it's not
really clear why the drivers do it differently.


Re: [PATCH AUTOSEL 5.10 07/22] scsi: iscsi: Fix race condition between login and sync thread

2021-04-06 Thread Mike Christie
On 4/5/21 11:04 AM, Sasha Levin wrote:
> From: Gulam Mohamed 
> 
> [ Upstream commit 9e67600ed6b8565da4b85698ec659b5879a6c1c6 ]
> 
> A kernel panic was observed due to a timing issue between the sync thread
> and the initiator processing a login response from the target. The session
> reopen can be invoked both from the session sync thread when iscsid
> restarts and from iscsid through the error handler. Before the initiator
> receives the response to a login, another reopen request can be sent from
> the error handler/sync session. When the initial login response is
> subsequently processed, the connection has been closed and the socket has
> been released.
> 
> To fix this a new connection state, ISCSI_CONN_BOUND, is added:
> 
>  - Set the connection state value to ISCSI_CONN_DOWN upon
>iscsi_if_ep_disconnect() and iscsi_if_stop_conn()
> 
>  - Set the connection state to the newly created value ISCSI_CONN_BOUND
>after bind connection (transport->bind_conn())
> 
>  - In iscsi_set_param(), return -ENOTCONN if the connection state is not
>either ISCSI_CONN_BOUND or ISCSI_CONN_UP
> 
> Link: https://lore.kernel.org/r/20210325093248.284678-1-gulam.mohamed@oracle.com
> Reviewed-by: Mike Christie 


There was a mistake in my review of this patch. It will also require
this "[PATCH 1/1] scsi: iscsi: fix iscsi cls conn state":

https://lore.kernel.org/linux-scsi/20210406171746.5016-1-michael.chris...@oracle.com/T/#u




Re: [PATCH] target: Fix a double put in transport_free_session

2021-03-26 Thread Mike Christie
On 3/26/21 7:31 AM, Maurizio Lombardi wrote:
> 
> 
> Dne 23. 03. 21 v 3:58 Lv Yunlong napsal(a):
>> In transport_free_session, se_nacl is got from se_sess
>> with the initial reference. If se_nacl->acl_sess_list is
>> empty, se_nacl->dynamic_stop is set to true. Then the first
>> target_put_nacl(se_nacl) will drop the initial reference
>> and free se_nacl. Later there is a second target_put_nacl()
>> to put se_nacl. It may cause error in race.
>>
>> My patch sets se_nacl->dynamic_stop to false to avoid the
>> double put.
>>
>> Signed-off-by: Lv Yunlong 
>> ---
>>  drivers/target/target_core_transport.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/target/target_core_transport.c 
>> b/drivers/target/target_core_transport.c
>> index 5ecb9f18a53d..c266defe694f 100644
>> --- a/drivers/target/target_core_transport.c
>> +++ b/drivers/target/target_core_transport.c
>> @@ -584,8 +584,10 @@ void transport_free_session(struct se_session *se_sess)
>>  }
>>  mutex_unlock(&se_tpg->acl_node_mutex);
>>  
>> -if (se_nacl->dynamic_stop)
>> +if (se_nacl->dynamic_stop) {
>>  target_put_nacl(se_nacl);
>> +se_nacl->dynamic_stop = false;
>> +}
>>  
>>  target_put_nacl(se_nacl);
>>  }
>>
> 
> FYI,
> 
> I have received a bug report against the 5.8 kernel about task hangs that 
> seems to involve the nacl "dynamic_stop" code
> 
> Mar  4 16:49:44 gzboot kernel: [186322.177819] INFO: task targetcli:2359053 
> blocked for more than 120 seconds.
> Mar  4 16:49:44 gzboot kernel: [186322.178862]   Tainted: P   O   
>5.8.0-44-generic #50-Ubuntu
> Mar  4 16:49:44 gzboot kernel: [186322.179746] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar  4 16:49:44 gzboot kernel: [186322.180583] targetcli   D0 2359053 
> 2359052 0x
> Mar  4 16:49:44 gzboot kernel: [186322.180586] Call Trace:
> Mar  4 16:49:44 gzboot kernel: [186322.180592]  __schedule+0x212/0x5d0
> Mar  4 16:49:44 gzboot kernel: [186322.180595]  ? usleep_range+0x90/0x90
> Mar  4 16:49:44 gzboot kernel: [186322.180596]  schedule+0x55/0xc0
> Mar  4 16:49:44 gzboot kernel: [186322.180597]  schedule_timeout+0x10f/0x160
> Mar  4 16:49:44 gzboot kernel: [186322.180601]  ? evict+0x14c/0x1b0
> Mar  4 16:49:44 gzboot kernel: [186322.180602]  __wait_for_common+0xa8/0x150
> Mar  4 16:49:44 gzboot kernel: [186322.180603]  wait_for_completion+0x24/0x30
> Mar  4 16:49:44 gzboot kernel: [186322.180637]  
> core_tpg_del_initiator_node_acl+0x8e/0x120 [target_core_mod]

We would need more details. We can hit that normally when the target driver
waits for stuck IO to complete before calling transport_deregister_session.
I think if you hit the put bug, then we would have seen the refcount warning
in refcount.h fire before the hung task because we do an extra put.

Did the user switch to ACLs? Did you see my comment about how it looks like there
is a bug if the user were to add an acl while dynamic was used for the same
initiatorname? In that case, we do not do the put to match the one taken in
core_tpg_check_initiator_node_acl, and we would hit your hang since
no one ever does the last put.




Re: [PATCH] target: Fix a double put in transport_free_session

2021-03-25 Thread Mike Christie
On 3/25/21 2:48 AM, lyl2...@mail.ustc.edu.cn wrote:
> 
> 
> 
>> -----Original Message-----
>> From: michael.chris...@oracle.com
>> Sent: 2021-03-24 00:28:35 (Wednesday)
>> To: "Lv Yunlong" , martin.peter...@oracle.com
>> Cc: linux-s...@vger.kernel.org, target-de...@vger.kernel.org, 
>> linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] target: Fix a double put in transport_free_session
>>
>> On 3/22/21 9:58 PM, Lv Yunlong wrote:
>>> In transport_free_session, se_nacl is got from se_sess
>>> with the initial reference. If se_nacl->acl_sess_list is
>>> empty, se_nacl->dynamic_stop is set to true. Then the first
>>> target_put_nacl(se_nacl) will drop the initial reference
>>> and free se_nacl. Later there is a second target_put_nacl()
>>> to put se_nacl. It may cause error in race.
>>> My patch sets se_nacl->dynamic_stop to false to avoid the
>>> double put.
>>>
>>> Signed-off-by: Lv Yunlong 
>>> ---
>>>  drivers/target/target_core_transport.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/target/target_core_transport.c 
>>> b/drivers/target/target_core_transport.c
>>> index 5ecb9f18a53d..c266defe694f 100644
>>> --- a/drivers/target/target_core_transport.c
>>> +++ b/drivers/target/target_core_transport.c
>>> @@ -584,8 +584,10 @@ void transport_free_session(struct se_session *se_sess)
>>> }
>>> mutex_unlock(&se_tpg->acl_node_mutex);
>>>  
>>> -   if (se_nacl->dynamic_stop)
>>> +   if (se_nacl->dynamic_stop) {
>>> target_put_nacl(se_nacl);
>>> +   se_nacl->dynamic_stop = false;
>>> +   }
>>>  
>>> target_put_nacl(se_nacl);
>> Could you describe the race a little more?
>>
>> Is the race:
>>
>> 1. thread1 called core_tpg_check_initiator_node_acl and found the acl.
>> sess->se_node_acl is set to the found acl.
>> 2. thread2 is running transport_free_session. It now grabs the acl_node_mutex
>> and sees se_nacl->acl_sess_list is empty.
>> 3. thread2 does the dynamic_stop=true operations in transport_free_session.
>> 4. thread1 now calls transport_register_session, which adds the sess to the
>> acl's acl_sess_list.
>>
>> Later, when the session that thread1 created is deleted, dynamic_stop is still
>> set, so we do an extra target_put_nacl?
>>
>> I'm not sure your patch will handle this race. When we delete the session
>> thread1 created, dynamic_node_acl is still set, so this:
>>
>> mutex_lock(&se_tpg->acl_node_mutex);
>> if (se_nacl->dynamic_node_acl &&
>> !se_tfo->tpg_check_demo_mode_cache(se_tpg)) {
>> spin_lock_irqsave(&se_nacl->nacl_sess_lock, flags);
>> if (list_empty(&se_nacl->acl_sess_list))
>> se_nacl->dynamic_stop = true;
>>
>> can set dynamic_stop to true again and we can end up doing the extra put 
>> still.
>>
>> On top of the extra put we also do
>>
>> list_del(&se_nacl->acl_list);
>>
>> twice so we have to handle that as well.
>>
>> Is there also another bug in this code? If someone adds an acl while there
>> is a dynamic acl in place, core_tpg_add_initiator_node_acl will clear
>> dynamic_node_acl but we leave the extra reference, so later when
>> transport_free_session is called we will not do the extra put.
>>
> 
> Ok, thanks for your answer. According to the description above, I think it is
> a false positive now.
> 

Did you hit this bug, are you using an inspection tool, or did you find this by
code review?

I think there was a misunderstanding. I was saying it looks like a race could 
happen.
There is no protection in lio core.

I think it's hard to hit because most drivers do not allow the combo:

tpg_check_demo_mode == true
tpg_check_demo_mode_cache = false

It looks like those settings are allowed with tcm_qla2xxx and usb, but:

usb - has a mutex around creation and removal so we can't race.
tcm qla - I don't know this driver well enough, but I cc'd the maintainer.


Re: [PATCH] iscsi: Do Not set param when sock is NULL

2021-01-18 Thread Mike Christie
On 1/7/21 9:48 AM, Gulam Mohamed wrote:
> Hi Michael,
> 
>  As per your suggestions in below mail, I have completed the 
> suggested changes and tested them. Can you please review and let me know your 
> comments? Here is the patch:
> 
> Description
> ===
> 1. This kernel panic could be due to a timing issue when there is a race 
> between the sync thread and the initiator processing a login response 
> from the target. The session re-open can be invoked from two places:
>   a. The session sync thread when iscsid restarts
>   b. From iscsid through the iscsi error handler
> 2. The session reopen sequence is as follows in user-space (iscsi-initiator-utils):
>a. Disconnect the connection
>b. Then send the stop connection request to the kernel which releases the 
> connection (releases the socket)
>c. Queues the reopen for 2 seconds delay
>d. Once the delay expires, create the TCP connection again by calling the 
> connect() call
>e. Poll for the connection
>f. When poll is successful i.e when the TCP connection is established, it 
> performs
>   i. Creation of session and connection data structures
>   ii. Bind the connection to the session. This is the place where we 
> assign the sock to tcp_sw_conn->sock
>   iii. Sets the parameters like target name, persistent address etc .
>   iv. Creates the login pdu
>v. Sends the login pdu to kernel
>   vi. Returns to the main loop to process further events. The kernel then 
> sends the login request over to the target node
>g. Once login response with success is received, it enters into full 
> feature phase and sets the negotiable parameters like max_recv_data_length, 
> max_transmit_length, data_digest etc .
> 3. While setting the negotiable parameters by calling 
> "iscsi_session_set_neg_params()", kernel panicked as sock was NULL
> 
> What happened here is
> -
> 1. Before initiator received the login response mentioned in above point 
> 2.f.v, another reopen request was sent from the error handler/sync session 
> for the same session, as the initiator utils was in main loop to process 
> further events (as mentioned in point 2.f.vi above).
> 2. While processing this reopen, it stopped the connection which released the 
> socket and queued this connection and at this point of time the login 
> response was received for the earlier on
> 
> Fix
> ---
> 
> 1. Create a flag "set_param_fail" in the iscsi_cls_conn structure
> 2. On ep_disconnect and stop_conn, set this flag to indicate set_param calls for 
> connection level settings should fail
> 3. This way, scsi_transport_iscsi can set and check this bit for all drivers
> 4. On bind_conn, clear the bit
> 
> Signed-off-by: Gulam Mohamed 
> ---
>  drivers/scsi/scsi_transport_iscsi.c | 6 ++++++
> include/scsi/scsi_transport_iscsi.h | 3 +++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_transport_iscsi.c 
> b/drivers/scsi/scsi_transport_iscsi.c
> index 2e68c0a87698..15c5a7223a40 100644
> --- a/drivers/scsi/scsi_transport_iscsi.c
> +++ b/drivers/scsi/scsi_transport_iscsi.c
> @@ -2473,6 +2473,8 @@ static void iscsi_if_stop_conn(struct iscsi_cls_conn 
> *conn, int flag)
>* it works.
>*/
>   mutex_lock(&conn_mutex);
> + if (!test_bit(ISCSI_SET_PARAM_FAIL_BIT, &conn->set_param_fail))
> + set_bit(ISCSI_SET_PARAM_FAIL_BIT, &conn->set_param_fail);

You can just do a test_and_set_bit.
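i.e. just (untested):

    test_and_set_bit(ISCSI_SET_PARAM_FAIL_BIT, &conn->set_param_fail);

which does the check and the set atomically, so the separate test_bit is not
needed.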

>   conn->transport->stop_conn(conn, flag);
>   mutex_unlock(&conn_mutex);
>  
> @@ -2895,6 +2897,8 @@ iscsi_set_param(struct iscsi_transport *transport, 
> struct iscsi_uevent *ev)
>   session->recovery_tmo = value;
>   break;
>   default:
> + if (test_bit(ISCSI_SET_PARAM_FAIL_BIT, &conn->set_param_fail))
> + return -ENOTCONN;
>   err = transport->set_param(conn, ev->u.set_param.param,
>  data, ev->u.set_param.len);
>   }
> @@ -2956,6 +2960,7 @@ static int iscsi_if_ep_disconnect(struct 
> iscsi_transport *transport,
>   mutex_lock(&conn->ep_mutex);
>   conn->ep = NULL;
>   mutex_unlock(&conn->ep_mutex);
> + set_bit(ISCSI_SET_PARAM_FAIL_BIT, &conn->set_param_fail);
>   }
>  
>   transport->ep_disconnect(ep);
> @@ -3716,6 +3721,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr 
> *nlh, uint32_t *group)
>   ev->r.retcode = transport->bind_conn(session, conn,
>   ev->u.b_conn.transport_eph,
>   ev->u.b_conn.is_leading);
> + clear_bit(ISCSI_SET_PARAM_FAIL_BIT, &conn->set_param_fail);

You should check retcode and only clear if it indicates success.
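Something like (untested):

    if (!ev->r.retcode)
            clear_bit(ISCSI_SET_PARAM_FAIL_BIT, &conn->set_param_fail);

so a failed bind does not re-enable set_param on a connection that is still
torn down.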


>   mutex_unlock(&conn_mutex);
>  
>   if (ev->r.retcode || !transport->ep_connect) diff --git 
> a/include/scsi/scsi_transport_iscsi.h b/include/scsi/scsi_transport_iscsi.h
> 

Re: [PATCH] scsi: target: tcmu: Fix wrong uio handling causing big memory leak

2021-01-13 Thread Mike Christie
On 1/13/21 11:59 AM, Bodo Stroesser wrote:
> On 12.01.21 19:36, Mike Christie wrote:
>> On 12/18/20 8:15 AM, Bodo Stroesser wrote:
>>> tcmu calls uio_unregister_device from tcmu_destroy_device.
>>> After that uio will never call tcmu_release for this device.
>>> If userspace still had the uio device open and / or mmap'ed
>>> during uio_unregister_device, tcmu_release will not be called and
>>> udev->kref will never go down to 0.
>>>
>>
>> I didn't get why the release function is not called if you call
>> uio_unregister_device while a device is open. Does the device_destroy call in
>> uio_unregister_device completely free the device or does it set some bits so
>> uio_release is not called later?
> 
> uio_unregister_device() resets the pointer (idev->info) to the struct 
> uio_info which tcmu provided in uio_register_device().
> The uio device itself AFAICS is kept while it is open / mmap'ed.
> But no matter what userspace does, uio will not call tcmu's callbacks
> since info pointer now is NULL.
> 
> When userspace finally closes the uio device, uio_release is called, but
> tcmu_release can not be called.
> 
>>
>> Do other drivers hit this? Should uio have refcounting so uio_release is 
>> called
>> when the last ref (from userspace open/close/mmap calls and from the kernel 
>> by
>> drivers like target_core_user) is done?
>>
> 
> To be honest I don't know exactly.
> tcmu seems to be a special case in that it has its own mmap callback.
> That allows us to map pages allocated by tcmu.
> As long as userspace still holds the mapping, we should not unmap those
> pages, because userspace then could get killed by SIGSEGV.
> So we have to wait for userspace closing uio before we may unmap and
> free the pages.


If we removed the clearing of idev->info in uio_unregister_device, and
then moved the idev->info->release call from uio_release to
uio_device_release, it would work like you need, right? The release callback
would always be called, and called only when userspace has dropped its refs.
You need to also fix up the module refcount and some other bits because
it looks like uio uses the uio->info check to determine if the device is
being removed.
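Roughly what I am picturing (completely untested sketch; it assumes the
driver's release callback, like tcmu_release, does not need a real inode, and
it ignores the module refcount issue mentioned above):

    static void uio_device_release(struct device *dev)
    {
            struct uio_device *idev = dev_get_drvdata(dev);

            /*
             * The last reference is gone (driver unregistered and userspace
             * has closed/unmapped), so now it is safe to call the driver
             * back and free the uio device.
             */
            if (idev->info && idev->info->release)
                    idev->info->release(idev->info, NULL);
            kfree(idev);
    }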

I don't know if that is the correct approach. It looks like non
target_core_user drivers could hit a similar issue. It seems like drivers
like qedi/bnx2i could hit the issue if their port is removed from the
kernel before their uio daemon closes the device. However, they also
could do a driver specific fix and handle the issue by adding some extra
kernel/userspace bits to sync the port removal.


Re: [PATCH] scsi: target: tcmu: Fix wrong uio handling causing big memory leak

2021-01-12 Thread Mike Christie
On 12/18/20 8:15 AM, Bodo Stroesser wrote:
> tcmu calls uio_unregister_device from tcmu_destroy_device.
> After that uio will never call tcmu_release for this device.
> If userspace still had the uio device open and / or mmap'ed
> during uio_unregister_device, tcmu_release will not be called and
> udev->kref will never go down to 0.
> 

I didn't get why the release function is not called if you call
uio_unregister_device while a device is open. Does the device_destroy call in
uio_unregister_device completely free the device or does it set some bits so
uio_release is not called later?

Do other drivers hit this? Should uio have refcounting so uio_release is called
when the last ref (from userspace open/close/mmap calls and from the kernel by
drivers like target_core_user) is done?


Re: [PATCH] vhost scsi: fix error return code in vhost_scsi_set_endpoint()

2020-12-04 Thread Mike Christie

On 12/4/20 2:43 AM, Zhang Changzhong wrote:

> Fix to return a negative error code from the error handling
> case instead of 0, as done elsewhere in this function.
> 
> Fixes: 25b98b64e284 ("vhost scsi: alloc cmds per vq instead of session")
> Reported-by: Hulk Robot 
> Signed-off-by: Zhang Changzhong 
> ---
>  drivers/vhost/scsi.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 6ff8a5096..4ce9f00 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1643,7 +1643,8 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs,
> 			if (!vhost_vq_is_setup(vq))
> 				continue;
> 
> -			if (vhost_scsi_setup_vq_cmds(vq, vq->num))
> +			ret = vhost_scsi_setup_vq_cmds(vq, vq->num);
> +			if (ret)
> 				goto destroy_vq_cmds;
> 		}
> 


Reviewed-by: Mike Christie 


Re: [PATCH] scsi: iscsi: fix inappropriate use of put_device

2020-12-02 Thread Mike Christie
On 11/20/20 1:48 AM, Qinglang Miao wrote:
> kfree(conn) is called inside put_device(&conn->dev) so that
> another one would cause use-after-free. Besides, device_unregister
> should be used here rather than put_device.
> 
> Fixes: f3c893e3dbb5 ("scsi: iscsi: Fail session and connection on transport 
> registration failure")
> Reported-by: Hulk Robot 
> Signed-off-by: Qinglang Miao 
> ---
>  drivers/scsi/scsi_transport_iscsi.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_transport_iscsi.c 
> b/drivers/scsi/scsi_transport_iscsi.c
> index 2eb3e4f93..2e68c0a87 100644
> --- a/drivers/scsi/scsi_transport_iscsi.c
> +++ b/drivers/scsi/scsi_transport_iscsi.c
> @@ -2313,7 +2313,9 @@ iscsi_create_conn(struct iscsi_cls_session *session, 
> int dd_size, uint32_t cid)
>   return conn;
>  
>  release_conn_ref:
> - put_device(&conn->dev);
> + device_unregister(&conn->dev);
> + put_device(&conn->dev);
> +     return NULL;
>  release_parent_ref:
>   put_device(&session->dev);
>  free_conn:
> 

Reviewed-by: Mike Christie 



Re: [PATCH] scsi: qedi: fix missing destroy_workqueue() on error in __qedi_probe

2020-12-02 Thread Mike Christie
On 11/9/20 3:15 AM, Qinglang Miao wrote:
> Add the missing destroy_workqueue() before return from
> __qedi_probe in the error handling case when fails to
> create workqueue qedi->offload_thread.
> 
> Fixes: ace7f46ba5fd ("scsi: qedi: Add QLogic FastLinQ offload iSCSI driver 
> framework.")
> Signed-off-by: Qinglang Miao 
> ---
>  drivers/scsi/qedi/qedi_main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
> index 61fab01d2d52..f5fc7f518f8a 100644
> --- a/drivers/scsi/qedi/qedi_main.c
> +++ b/drivers/scsi/qedi/qedi_main.c
> @@ -2766,7 +2766,7 @@ static int __qedi_probe(struct pci_dev *pdev, int mode)
>   QEDI_ERR(>dbg_ctx,
>"Unable to start offload thread!\n");
>   rc = -ENODEV;
> - goto free_cid_que;
> + goto free_tmf_thread;
>   }
>  
>   INIT_DELAYED_WORK(>recovery_work, qedi_recovery_handler);
> @@ -2790,6 +2790,8 @@ static int __qedi_probe(struct pci_dev *pdev, int mode)
>  
>   return 0;
>  
> +free_tmf_thread:
> + destroy_workqueue(qedi->tmf_thread);
>  free_cid_que:
>   qedi_release_cid_que(qedi);
>  free_uio:
> 

Reviewed-by: Mike Christie 


Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-30 Thread Mike Christie
On 11/30/20 11:52 AM, Paolo Bonzini wrote:
> On 30/11/20 18:38, Sasha Levin wrote:
>>> I am not aware of any public CI being done _at all_ on vhost-scsi, by 
>>> CKI or anyone else.  So autoselection should be done only on subsystems 
>>> that have very high coverage in CI.
>>
>> Where can I find a testsuite for virtio/vhost? I see one for KVM, but
>> where is the one that the maintainers of virtio/vhost run on patches
>> that come in?
> 
> I don't know of any, especially for vhost-scsi.  MikeC?
> 

Sorry for the late reply on the thread. I was out of the office.

I have never seen a public/open-source vhost-scsi testsuite.

For patch 23 (the one that adds the lun reset support which is built on
patch 22), we can't add it to stable right now if you wanted to, because
it has a bug in it. Michael T. sent the fix:

https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?h=linux-next&id=b4fffc177fad3c99ee049611a508ca9561bb6871

to Linus today.


Re: [PATCH v2 1/1] scsi: libiscsi: fix NOP race condition

2020-10-08 Thread Mike Christie
On 10/8/20 12:11 PM, Mike Christie wrote:
> On 9/25/20 1:41 PM, ldun...@suse.com wrote:
>> From: Lee Duncan 
>>
>> iSCSI NOPs are sometimes "lost", mistakenly sent to the
>> user-land iscsid daemon instead of handled in the kernel,
>> as they should be, resulting in a message from the daemon like:
>>
>>> iscsid: Got nop in, but kernel supports nop handling.
>>
>> This can occur because of the forward- and back-locks
>> in the kernel iSCSI code, and the fact that an iSCSI NOP
>> response can be processed before processing of the NOP send
>> is complete. This can result in "conn->ping_task" being NULL
>> in iscsi_nop_out_rsp(), when the pointer is actually in
>> the process of being set.
>>
>> To work around this, we add a new state to the "ping_task"
>> pointer. In addition to NULL (not assigned) and a pointer
>> (assigned), we add the state "being set", which is signaled
>> with an INVALID pointer (using "-1").
>>
>> Signed-off-by: Lee Duncan 
>> ---
>>  drivers/scsi/libiscsi.c | 13 ++---
>>  include/scsi/libiscsi.h |  3 +++
>>  2 files changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
>> index 1e9c3171fa9f..cade108c33b6 100644
>> --- a/drivers/scsi/libiscsi.c
>> +++ b/drivers/scsi/libiscsi.c
>> @@ -738,6 +738,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct 
>> iscsi_hdr *hdr,
>> task->conn->session->age);
>>  }
>>  
>> +if (unlikely(READ_ONCE(conn->ping_task) == INVALID_SCSI_TASK))
>> +WRITE_ONCE(conn->ping_task, task);
>> +
>>  if (!ihost->workq) {
>>  if (iscsi_prep_mgmt_task(conn, task))
>>  goto free_task;
> 
> I think the API gets a little weird now where in some cases
> __iscsi_conn_send_pdu checks the opcode to see what type of request
> it is, but above the caller sets the ping_task.
> 
> For login, tmfs and passthrough, we assume the __iscsi_conn_send_pdu
> has sent or cleaned up everything. I think it might be nicer to just
> have __iscsi_conn_send_pdu set the ping_task field before doing the
> xmit/queue call. It would then work similar to the conn->login_task
> case where that function knows about that special task too.
> 
> So in __iscsi_conn_send_pdu add a "if (opcode == ISCSI_OP_NOOP_OUT)",
> and check if it's a nop we need to track. If so set conn->ping_task.
> 
Ignore this. It won't work nicely either. To figure out if the nop is
our internal transport test ping vs a userspace ping that also needs
a reply, we would need to do something like you did above so there is
no point.


Re: [PATCH v2 1/1] scsi: libiscsi: fix NOP race condition

2020-10-08 Thread Mike Christie
On 9/25/20 1:41 PM, ldun...@suse.com wrote:
> From: Lee Duncan 
> 
> iSCSI NOPs are sometimes "lost", mistakenly sent to the
> user-land iscsid daemon instead of handled in the kernel,
> as they should be, resulting in a message from the daemon like:
> 
>> iscsid: Got nop in, but kernel supports nop handling.
> 
> This can occur because of the forward- and back-locks
> in the kernel iSCSI code, and the fact that an iSCSI NOP
> response can be processed before processing of the NOP send
> is complete. This can result in "conn->ping_task" being NULL
> in iscsi_nop_out_rsp(), when the pointer is actually in
> the process of being set.
> 
> To work around this, we add a new state to the "ping_task"
> pointer. In addition to NULL (not assigned) and a pointer
> (assigned), we add the state "being set", which is signaled
> with an INVALID pointer (using "-1").
> 
> Signed-off-by: Lee Duncan 
> ---
>  drivers/scsi/libiscsi.c | 13 ++---
>  include/scsi/libiscsi.h |  3 +++
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
> index 1e9c3171fa9f..cade108c33b6 100644
> --- a/drivers/scsi/libiscsi.c
> +++ b/drivers/scsi/libiscsi.c
> @@ -738,6 +738,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct 
> iscsi_hdr *hdr,
>  task->conn->session->age);
>   }
>  
> + if (unlikely(READ_ONCE(conn->ping_task) == INVALID_SCSI_TASK))
> + WRITE_ONCE(conn->ping_task, task);
> +
>   if (!ihost->workq) {
>   if (iscsi_prep_mgmt_task(conn, task))
>   goto free_task;

I think the API gets a little weird now where in some cases
__iscsi_conn_send_pdu checks the opcode to see what type of request
it is, but above the caller sets the ping_task.

For login, tmfs and passthrough, we assume the __iscsi_conn_send_pdu
has sent or cleaned up everything. I think it might be nicer to just
have __iscsi_conn_send_pdu set the ping_task field before doing the
xmit/queue call. It would then work similar to the conn->login_task
case where that function knows about that special task too.

So in __iscsi_conn_send_pdu add a "if (opcode == ISCSI_OP_NOOP_OUT)",
and check if it's a nop we need to track. If so set conn->ping_task.
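i.e. in __iscsi_conn_send_pdu, something along the lines of (rough sketch; how
we distinguish our internal transport-test ping from a userspace nop that also
needs a reply is hand-waved here):

    if ((hdr->opcode & ISCSI_OPCODE_MASK) == ISCSI_OP_NOOP_OUT &&
        READ_ONCE(conn->ping_task) == INVALID_SCSI_TASK)
            WRITE_ONCE(conn->ping_task, task);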


Re: [PATCH ] scsi: page warning: 'page' may be used uninitialized

2020-10-02 Thread Mike Christie
On 9/23/20 7:19 PM, john.p.donne...@oracle.com wrote:
> From: John Donnelly 
> 
> corrects: drivers/target/target_core_user.c:688:6: warning: 'page' may be used
> uninitialized
> 
> Fixes: 3c58f737231e ("scsi: target: tcmu: Optimize use of
> flush_dcache_page")
> 
> To: linux-s...@vger.kernel.org
> Cc: Mike Christie 
> Signed-off-by: John Donnelly 
> ---
>  drivers/target/target_core_user.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/target/target_core_user.c 
> b/drivers/target/target_core_user.c
> index 9b7592350502..86b28117787e 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -681,7 +681,7 @@ static void scatter_data_area(struct tcmu_dev *udev,
>   void *from, *to = NULL;
>   size_t copy_bytes, to_offset, offset;
>   struct scatterlist *sg;
> - struct page *page;
> + struct page *page = NULL;
>  
>   for_each_sg(data_sg, sg, data_nents, i) {
>   int sg_remaining = sg->length;
> 

Looks ok for now. In the next kernel we can do the more invasive approach and
add a new struct/helpers to make the code cleaner and fix it properly.

Acked-by: Mike Christie 


Re: [PATCH] scsi: iscsi: Do not put host in iscsi_set_flashnode_param()

2020-07-25 Thread Mike Christie
On 6/15/20 3:12 AM, Jing Xiangfeng wrote:
> If scsi_host_lookup() fails we will jump to put_host, which may
> cause panic. Jump to exit_set_fnode to fix it.
> 
> Signed-off-by: Jing Xiangfeng 
> ---
>  drivers/scsi/scsi_transport_iscsi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_transport_iscsi.c 
> b/drivers/scsi/scsi_transport_iscsi.c
> index f4cc08e..c5e99f9 100644
> --- a/drivers/scsi/scsi_transport_iscsi.c
> +++ b/drivers/scsi/scsi_transport_iscsi.c
> @@ -3291,7 +3291,7 @@ static int iscsi_set_flashnode_param(struct 
> iscsi_transport *transport,
>   pr_err("%s could not find host no %u\n",
>  __func__, ev->u.set_flashnode.host_no);
>   err = -ENODEV;
> - goto put_host;
> + goto exit_set_fnode;
>   }
>  
>   idx = ev->u.set_flashnode.flashnode_idx;
> 

Reviewed-by: Mike Christie 


Re: [PATCH 2/2] scsi: register sysfs for scsi/iscsi workqueues

2020-06-22 Thread Mike Christie

On 6/11/20 5:07 AM, Bob Liu wrote:

> This patch enables setting cpu affinity through "cpumask" for the below
> scsi/iscsi workqueues, so as to get better isolation.
> - scsi_wq_*
> - scsi_tmf_*
> - iscsi_q_xx
> - iscsi_eh
> 
> Signed-off-by: Bob Liu 
> ---
>  drivers/scsi/hosts.c| 4 ++--
>  drivers/scsi/libiscsi.c | 2 +-
>  drivers/scsi/scsi_transport_iscsi.c | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 1d669e4..4b9f80d 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -272,7 +272,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
> 	if (shost->transportt->create_work_queue) {
> 		snprintf(shost->work_q_name, sizeof(shost->work_q_name),
> 			 "scsi_wq_%d", shost->host_no);
> -		shost->work_q = create_singlethread_workqueue(
> +		shost->work_q = create_singlethread_workqueue_noorder(
> 			shost->work_q_name);
> 		if (!shost->work_q) {
> 			error = -EINVAL;


This patch seems ok for the iscsi, fc, tmf, and non transport class scan 
uses. We are either heavy handed with flushes or did not need ordering.


I don't know about the zfcp use though, so I cc'd  the developers listed 
as maintainers. It looks like for zfcp we can do:


zfcp_scsi_rport_register->fc_remote_port_add->fc_remote_port_create->scsi_queue_work 
to scan the scsi target on the rport.


and then zfcp_scsi_rport_register can call zfcp_unit_queue_scsi_scan->
scsi_queue_work which will scan for a specific lun.

It looks ok if those are not ordered, but I would get their review to 
make sure.


Re: [PATCH v1 target] target: Add initiatorname to NON_EXISTENT_LUN error

2020-05-18 Thread Mike Christie
On 5/17/20 8:02 PM, Lance Digby wrote:
> The NON_EXISTENT_LUN error can be written without an error condition
> on the initiator responsible. Adding the initiatorname to this message
> will reduce the effort required to fix this when many initiators are
> supported by a target.
> 
> This version ensures the initiator name is also printed on the same
> message in transport_lookup_tmr_lun for consistency.
> 

Reviewed-by: Mike Christie 



Re: [PATCH target] target: Add initiatorname to NON_EXISTENT_LUN error

2020-05-17 Thread Mike Christie
On 5/16/20 6:29 PM, Lance Digby wrote:
> Mike,  Thanks for the review!
>   The pr_err  Detected NON_EXISTENT_LUN is the error messages issued
> for the TCM_NON_EXISTENT_LUN retcode so I believe they are the same.
>   Simply scanning for the wrong lun on an initiator will generate this
> error on the target but not generate an error on the initiator. And I
> have seen installs, with a lot of initiators, automate the scanning of
> such luns incorrectly deemed missing.
>   While this looks like a simple problem it can take days to get
> access or the tcp traces to sort it out.
> 
>Within the same routine there is another pr_err for
> TCM_WRITE_PROTECTED that I did not add the initiatorname to as I
> thought this would leave a heavy footprint on the initiator. If you

I'm not sure what the heavy footprint on the initiator part means.

I would say do whatever is helpful to you to debug the problem. For
TCM_WRITE_PROTECTED I'm not sure the initiatorname is helpful. I think
the target name and tpg would be useful, because I think you sometimes
set it at the tpg level then it gets inherited by the LU. But I think
it's a pain to get to the target name from this code path, so I wouldn't
worry about adding it now.

> believe this should be changed for consistency please let me know and
> I will add this and change to nacl->initiatorname.

Just to make sure we are on the same page. I was just commenting about
the other NON_EXISTENT_LUN instance in transport_lookup_tmr_lun. I just
thought we would want/need the same info there.


> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Sat, May 16, 2020 at 9:50 AM Mike Christie  wrote:
>>
>> On 5/13/20 11:01 PM, Lance Digby wrote:
>>> The NON_EXISTENT_LUN error can be written without an error condition
>>>  on the initiator responsible. Adding the initiatorname to this message
>>>  will reduce the effort required to fix this when many initiators are
>>> supported by a target.
>>>
>>> Signed-off-by: Lance Digby 
>>> ---
>>>  drivers/target/target_core_device.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/target/target_core_device.c 
>>> b/drivers/target/target_core_device.c
>>> index 4cee113..604dea0 100644
>>> --- a/drivers/target/target_core_device.c
>>> +++ b/drivers/target/target_core_device.c
>>> @@ -100,9 +100,10 @@
>>>*/
>>>   if (unpacked_lun != 0) {
>>>   pr_err("TARGET_CORE[%s]: Detected NON_EXISTENT_LUN"
>>> - " Access for 0x%08llx\n",
>>> + " Access for 0x%08llx from %s\n",
>>>   se_cmd->se_tfo->fabric_name,
>>> - unpacked_lun);
>>> + unpacked_lun,
>>> + se_sess->se_node_acl->initiatorname);
>>
>> You can do nacl->initiatorname.
>>
>> Do you also want to add the name to the tmr case? It's probably not common,
>> but the error message would be consistent.
>>
>>>   return TCM_NON_EXISTENT_LUN;
>>>   }
>>>
>>
> 



Re: [PATCH target] target: Add initiatorname to NON_EXISTENT_LUN error

2020-05-15 Thread Mike Christie
On 5/13/20 11:01 PM, Lance Digby wrote:
> The NON_EXISTENT_LUN error can be written without an error condition
>  on the initiator responsible. Adding the initiatorname to this message
>  will reduce the effort required to fix this when many initiators are
> supported by a target.
> 
> Signed-off-by: Lance Digby 
> ---
>  drivers/target/target_core_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/target/target_core_device.c 
> b/drivers/target/target_core_device.c
> index 4cee113..604dea0 100644
> --- a/drivers/target/target_core_device.c
> +++ b/drivers/target/target_core_device.c
> @@ -100,9 +100,10 @@
>*/
>   if (unpacked_lun != 0) {
>   pr_err("TARGET_CORE[%s]: Detected NON_EXISTENT_LUN"
> - " Access for 0x%08llx\n",
> + " Access for 0x%08llx from %s\n",
>   se_cmd->se_tfo->fabric_name,
> - unpacked_lun);
> + unpacked_lun,
> + se_sess->se_node_acl->initiatorname);

You can do nacl->initiatorname.

Do you also want to add the name to the tmr case? It's probably not common,
but the error message would be consistent.

>   return TCM_NON_EXISTENT_LUN;
>   }
>  



Re: INFO: task hung in do_read_cache_page (3)

2020-05-11 Thread Mike Christie
On 5/11/20 8:19 AM, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 2da22da573481cc4837e246d0eee4d518b3f715e
> Author: Mike Christie 
> Date:   Tue Aug 13 16:39:52 2019 +
> 
> nbd: fix zero cmd timeout handling v2
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=11d6ab1410
> start commit:   e99332e7 gcc-10: mark more functions __init to avoid secti..
> git tree:   upstream
> final crash:https://syzkaller.appspot.com/x/report.txt?x=13d6ab1410
> console output: https://syzkaller.appspot.com/x/log.txt?x=15d6ab1410
> kernel config:  https://syzkaller.appspot.com/x/.config?x=8a96cf498e199d8b
> dashboard link: https://syzkaller.appspot.com/bug?extid=518c54e255b5031adde4
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=146e45ec10
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16a410
> 
> Reported-by: syzbot+518c54e255b5031ad...@syzkaller.appspotmail.com
> Fixes: 2da22da57348 ("nbd: fix zero cmd timeout handling v2")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> 

How do you adjust/modify what is expected from the test or what is
reported as an error?

The patch added back behavior that got removed. With the patch we expect
the hung task warnings, because we specifically want to hold onto
running commands for longer than 120/hung_task_timeout seconds.



Re: [PATCH] Add prctl support for controlling PF_MEMALLOC V2

2019-10-22 Thread Mike Christie
On 10/22/2019 06:24 AM, Michal Hocko wrote:
> On Mon 21-10-19 16:41:37, Mike Christie wrote:
>> There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
>> and nbd that have userspace components that can run in the IO path. For
>> example, iscsi and nbd's userspace daemons may need to recreate a socket
>> and/or send IO on it, and dm-multipath's daemon multipathd may need to
>> send IO to figure out the state of paths and re-set them up.
>>
>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>> memalloc_*_save/restore functions to control the allocation behavior,
>> but for userspace we would end up hitting a allocation that ended up
>> writing data back to the same device we are trying to allocate for.
> 
> Which code paths are we talking about here? Any ioctl or is this a
> general syscall path? Can we mark the process in a more generic way?

It depends on the daemon. The common ones, for example iscsi and nbd,
need network related calls like sendmsg, recvmsg, socket, etc.
tcmu-runner could need the network ones and also read and write when it
does IO to a FS or device. dm-multipath needs the sg io ioctls.


> E.g. we have PF_LESS_THROTTLE (used by nfsd). It doesn't affect the
> reclaim recursion but it shows a pattern that doesn't really exhibit
> too many internals. Maybe we need PF_IO_FLUSHER or similar?

I am not familiar with PF_IO_FLUSHER. If it prevents the recursion
problem then please send me details and I will look into it for the next
posting.

> 
>> This patch allows the userspace daemon to set the PF_MEMALLOC* flags
>> with prctl during their initialization so later allocations cannot
>> end up calling back into them.
> 
> TBH I am not really happy to export these to the userspace. They are
> an internal implementation detail and the userspace shouldn't really

They care in these cases, because block/fs drivers must be able to make
forward progress during writes. To meet this guarantee kernel block
drivers use mempools and memalloc/GFP flags.

For these userspace components of the block/fs drivers they already do
things normal daemons do not to meet that guarantee like mlock their
memory, disable oom killer, and preallocate resources they have control
over. They have no control over reclaim like the kernel drivers do, so
it's easy for us to deadlock when memory gets low.
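For example, with the proposed interface an nbd or tcmu daemon thread that
sits in the IO path would do something like this during init (sketch;
PR_SET_MEMALLOC and PR_MEMALLOC_SET_NOIO only exist with this patch applied):

    #include <sys/prctl.h>
    #include <err.h>

    /* mark the whole task NOIO before it can be called back for writeback */
    if (prctl(PR_SET_MEMALLOC, PR_MEMALLOC_SET_NOIO, 0, 0, 0))
            err(1, "PR_SET_MEMALLOC");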

> care. So if this is really necessary then we need a very good argumnets
> and documentation to make the usage clear.
>  
>> Signed-off-by: Mike Christie 
>> ---
>>
>> V2:
>> - Use prctl instead of procfs.
>> - Add support for NOFS for fuse.
>> - Check permissions.
>>
>>  include/uapi/linux/prctl.h |  8 +++
>>  kernel/sys.c   | 44 ++
>>  2 files changed, 52 insertions(+)
>>
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 7da1b37b27aa..6f6b3af6633a 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -234,4 +234,12 @@ struct prctl_mm_map {
>>  #define PR_GET_TAGGED_ADDR_CTRL 56
>>  # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
>>  
>> +/* Control reclaim behavior when allocating memory */
>> +#define PR_SET_MEMALLOC 57
>> +#define PR_GET_MEMALLOC 58
>> +#define PR_MEMALLOC_SET_NOIO(1UL << 0)
>> +#define PR_MEMALLOC_CLEAR_NOIO  (1UL << 1)
>> +#define PR_MEMALLOC_SET_NOFS(1UL << 2)
>> +#define PR_MEMALLOC_CLEAR_NOFS  (1UL << 3)
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index a611d1d58c7d..34fedc9fc7e4 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2486,6 +2486,50 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
>> arg2, unsigned long, arg3,
>>  return -EINVAL;
>>  error = GET_TAGGED_ADDR_CTRL();
>>  break;
>> +case PR_SET_MEMALLOC:
>> +if (!capable(CAP_SYS_ADMIN))
>> +return -EPERM;
>> +
>> +if (arg3 || arg4 || arg5)
>> +return -EINVAL;
>> +
>> +switch (arg2) {
>> +case PR_MEMALLOC_SET_NOIO:
>> +if (current->flags & PF_MEMALLOC_NOFS)
>> +return -EINVAL;
>> +
>> +current->flags |= PF_MEMALLOC_NOIO;
>> +break;
>> +case PR_MEMALLOC_CLEAR_NOIO:
>> +current->flags &= ~PF_MEMAL

Re: INFO: task hung in nbd_ioctl

2019-10-01 Thread Mike Christie
On 09/30/2019 05:39 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:bb2aee77 Add linux-next specific files for 20190926
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=13385ca360
> kernel config:  https://syzkaller.appspot.com/x/.config?x=e60af4ac5a01e964
> dashboard link:
> https://syzkaller.appspot.com/bug?extid=24c12fa8d218ed26011a
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12abc2a360
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11712c0560
> 
> The bug was bisected to:
> 
> commit e9e006f5fcf2bab59149cb38a48a4817c1b538b4
> Author: Mike Christie 
> Date:   Sun Aug 4 19:10:06 2019 +
> 
> nbd: fix max number of supported devs
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1226f3c560
> final crash:https://syzkaller.appspot.com/x/report.txt?x=1126f3c560
> console output: https://syzkaller.appspot.com/x/log.txt?x=1626f3c560
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+24c12fa8d218ed260...@syzkaller.appspotmail.com
> Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs")
> 
> INFO: task syz-executor390:8778 can't die for more than 143 seconds.
> syz-executor390 D27432  8778   8777 0x4004
> Call Trace:
>  context_switch kernel/sched/core.c:3384 [inline]
>  __schedule+0x828/0x1c20 kernel/sched/core.c:4065
>  schedule+0xd9/0x260 kernel/sched/core.c:4132
>  schedule_timeout+0x717/0xc50 kernel/time/timer.c:1871
>  do_wait_for_common kernel/sched/completion.c:83 [inline]
>  __wait_for_common kernel/sched/completion.c:104 [inline]
>  wait_for_common kernel/sched/completion.c:115 [inline]
>  wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
>  flush_workqueue+0x40f/0x14c0 kernel/workqueue.c:2826
>  nbd_start_device_ioctl drivers/block/nbd.c:1272 [inline]
>  __nbd_ioctl drivers/block/nbd.c:1347 [inline]
>  nbd_ioctl+0xb2e/0xc44 drivers/block/nbd.c:1387
>  __blkdev_driver_ioctl block/ioctl.c:304 [inline]
>  blkdev_ioctl+0xedb/0x1c20 block/ioctl.c:606
>  block_ioctl+0xee/0x130 fs/block_dev.c:1954
>  vfs_ioctl fs/ioctl.c:47 [inline]
>  file_ioctl fs/ioctl.c:539 [inline]
>  do_vfs_ioctl+0xdb6/0x13e0 fs/ioctl.c:726
>  ksys_ioctl+0xab/0xd0 fs/ioctl.c:743
>  __do_sys_ioctl fs/ioctl.c:750 [inline]
>  __se_sys_ioctl fs/ioctl.c:748 [inline]
>  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:748
>  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4452d9
> Code: Bad RIP value.
> RSP: 002b:7ffde928d288 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX:  RCX: 004452d9
> RDX:  RSI: ab03 RDI: 0004
> RBP:  R08: 004025b0 R09: 004025b0
> R10:  R11: 0246 R12: 00402520
> R13: 004025b0 R14:  R15: 
> INFO: task syz-executor390:8778 blocked for more than 143 seconds.
>   Not tainted 5.3.0-next-20190926 #0
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> syz-executor390 D27432  8778   8777 0x4004
> Call Trace:
>  context_switch kernel/sched/core.c:3384 [inline]
>  __schedule+0x828/0x1c20 kernel/sched/core.c:4065
>  schedule+0xd9/0x260 kernel/sched/core.c:4132
>  schedule_timeout+0x717/0xc50 kernel/time/timer.c:1871
>  do_wait_for_common kernel/sched/completion.c:83 [inline]
>  __wait_for_common kernel/sched/completion.c:104 [inline]
>  wait_for_common kernel/sched/completion.c:115 [inline]
>  wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
>  flush_workqueue+0x40f/0x14c0 kernel/workqueue.c:2826
>  nbd_start_device_ioctl drivers/block/nbd.c:1272 [inline]
>  __nbd_ioctl drivers/block/nbd.c:1347 [inline]
>  nbd_ioctl+0xb2e/0xc44 drivers/block/nbd.c:1387
>  __blkdev_driver_ioctl block/ioctl.c:304 [inline]
>  blkdev_ioctl+0xedb/0x1c20 block/ioctl.c:606
>  block_ioctl+0xee/0x130 fs/block_dev.c:1954
>  vfs_ioctl fs/ioctl.c:47 [inline]
>  file_ioctl fs/ioctl.c:539 [inline]
>  do_vfs_ioctl+0xdb6/0x13e0 fs/ioctl.c:726
>  ksys_ioctl+0xab/0xd0 fs/ioctl.c:743
>  __do_sys_ioctl fs/ioctl.c:750 [inline]
>  __se_sys_ioctl fs/ioctl.c:748 [inline]
>  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:748
>  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4452d9
> Code: Bad RIP value.
> RSP: 002b:7ffde928d288 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 000

Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags

2019-09-11 Thread Mike Christie
On 09/11/2019 03:40 AM, Martin Raiber wrote:
> On 10.09.2019 10:35 Damien Le Moal wrote:
>> Mike,
>>
>> On 2019/09/09 19:26, Mike Christie wrote:
>>> Forgot to cc linux-mm.
>>>
>>> On 09/09/2019 11:28 AM, Mike Christie wrote:
>>>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>>>> have userspace components that can run in the IO path. For example,
>>>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>>>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>>>> to figure out the state of paths and re-set them up.
>>>>
>>>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>>>> memalloc_*_save/restore functions to control the allocation behavior,
>>>> but for userspace we would end up hitting a allocation that ended up
>>>> writing data back to the same device we are trying to allocate for.
>>>>
>>>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>>>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>>>> depending on what other drivers and userspace file systems need, for
>>>> the final version I can add the other flags for that file or do a file
>>>> per flag or just do a memalloc_noio file.
>> Awesome. That probably will be the perfect solution for the problem we hit 
>> with
>> tcmu-runner a while back (please see this thread:
>> https://www.spinics.net/lists/linux-fsdevel/msg148912.html).
>>
>> I think we definitely need nofs as well for dealing with cases where the 
>> backend
>> storage for the user daemon is a file.
>>
>> I will give this patch a try as soon as possible (I am traveling currently).
>>
>> Best regards.
> 
> I had issues with this as well, and work on this is appreciated! In my
> case it is a loop block device on a fuse file system.
> Setting PF_LESS_THROTTLE was the one that helped the most, though, so
> add an option for that as well? I set this via prctl() for the thread
> calling it (was easiest to add to).
> 
> Sorry, I have no idea about the current rationale, but wouldn't it be
> better to have a way to mask a set of block devices/file systems not to
> write-back to in a thread. So in my case I'd specify that the fuse
> daemon threads cannot write-back to the file system and loop device
> running on top of the fuse file system, while all other block
> devices/file systems can be write-back to (causing less swapping/OOM
> issues).

I'm not sure I understood you.

The storage daemons I mentioned normally kick off N threads per M
devices. The threads handle duties like IO and error handling for those
devices. Those threads would set the flag, so those IO/error-handler
related operations do not end up writing back to them. So it works
similar to how storage drivers work in the kernel where iscsi_tcp has an
xmit thread and that does memalloc_noreclaim_save. Only the threads for
those specific devices would set the flag.
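
To make that concrete, here is a minimal userspace sketch of how one of
those per-device worker threads could opt in through the interface in the
RFC patch above. The /proc path and the 4-byte "noio" keyword come from
that patch; whether the final interface ends up per-process or per-thread
is still open, so treat the path as an assumption:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Mark the calling daemon's allocations as NOIO via the proposed file. */
static int set_memalloc_noio(void)
{
	int fd = open("/proc/self/memalloc", O_WRONLY);

	if (fd < 0) {
		perror("open /proc/self/memalloc");
		return -1;
	}
	/* The RFC's memalloc_write() only accepts exactly "noio" (4 bytes). */
	if (write(fd, "noio", 4) != 4) {
		perror("write noio");
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}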

In your case, it sounds like you have a thread/threads that would
operate on multiple devices and some need the behavior and some do not.
Is that right?




Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags

2019-09-10 Thread Mike Christie
On 09/10/2019 05:00 AM, Kirill A. Shutemov wrote:
> On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
>> There are several storage drivers like dm-multipath, iscsi, and nbd that
>> have userspace components that can run in the IO path. For example,
>> iscsi and nbd's userspace deamons may need to recreate a socket and/or
>> send IO on it, and dm-multipath's daemon multipathd may need to send IO
>> to figure out the state of paths and re-set them up.
>>
>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>> memalloc_*_save/restore functions to control the allocation behavior,
>> but for userspace we would end up hitting a allocation that ended up
>> writing data back to the same device we are trying to allocate for.
>>
>> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
>> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
>> depending on what other drivers and userspace file systems need, for
>> the final version I can add the other flags for that file or do a file
>> per flag or just do a memalloc_noio file.
>>
>> Signed-off-by: Mike Christie 
>> ---
>>  Documentation/filesystems/proc.txt |  6 
>>  fs/proc/base.c | 53 ++
>>  2 files changed, 59 insertions(+)
>>
>> diff --git a/Documentation/filesystems/proc.txt 
>> b/Documentation/filesystems/proc.txt
>> index 99ca040e3f90..b5456a61a013 100644
>> --- a/Documentation/filesystems/proc.txt
>> +++ b/Documentation/filesystems/proc.txt
>> @@ -46,6 +46,7 @@ Table of Contents
>>3.10  /proc//timerslack_ns - Task timerslack value
>>3.11  /proc//patch_state - Livepatch patch operation state
>>3.12  /proc//arch_status - Task architecture specific information
>> +  3.13  /proc//memalloc - Control task's memory reclaim behavior
>>  
>>4 Configuring procfs
>>4.1   Mount options
>> @@ -1980,6 +1981,11 @@ Example
>>   $ cat /proc/6753/arch_status
>>   AVX512_elapsed_ms:  8
>>  
>> +3.13 /proc//memalloc - Control task's memory reclaim behavior
>> +---
>> +A value of "noio" indicates that when a task allocates memory it will not
>> +reclaim memory that requires starting phisical IO.
>> +
>>  Description
>>  ---
>>  
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index ebea9501afb8..c4faa3464602 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -1223,6 +1223,57 @@ static const struct file_operations 
>> proc_oom_score_adj_operations = {
>>  .llseek = default_llseek,
>>  };
>>  
>> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t 
>> count,
>> + loff_t *ppos)
>> +{
>> +struct task_struct *task;
>> +ssize_t rc = 0;
>> +
>> +task = get_proc_task(file_inode(file));
>> +if (!task)
>> +return -ESRCH;
>> +
>> +if (task->flags & PF_MEMALLOC_NOIO)
>> +rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
>> +put_task_struct(task);
>> +return rc;
>> +}
>> +
>> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
>> +  size_t count, loff_t *ppos)
>> +{
>> +struct task_struct *task;
>> +char buffer[5];
>> +int rc = count;
>> +
>> +memset(buffer, 0, sizeof(buffer));
>> +if (count != sizeof(buffer) - 1)
>> +return -EINVAL;
>> +
>> +if (copy_from_user(buffer, buf, count))
>> +return -EFAULT;
>> +buffer[count] = '\0';
>> +
>> +task = get_proc_task(file_inode(file));
>> +if (!task)
>> +return -ESRCH;
>> +
>> +if (!strcmp(buffer, "noio")) {
>> +task->flags |= PF_MEMALLOC_NOIO;
>> +} else {
>> +rc = -EINVAL;
>> +}
> 
> Really? Without any privilege check? So any random user can tap into
> __GFP_NOIO allocations?

That was a mistake on my part. I will add it in.
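
For the next version I am thinking of something along these lines at the
top of memalloc_write(); whether CAP_SYS_RESOURCE, CAP_SYS_ADMIN, or a
ptrace_may_access() style check against the target task is the right test
is still open, so the capability below is just a placeholder:

	task = get_proc_task(file_inode(file));
	if (!task)
		return -ESRCH;

	/* Placeholder privilege check -- final capability to be decided. */
	if (!capable(CAP_SYS_ADMIN)) {
		put_task_struct(task);
		return -EPERM;
	}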



Re: [PATCH] nbd: add a missed nbd_config_put() in nbd_xmit_timeout()

2019-08-14 Thread Mike Christie
Josef had acked my patch for the same thing here:

https://www.spinics.net/lists/linux-block/msg43800.html

so maybe Jens will pick that up with the rest of the set Josef had acked:

https://www.spinics.net/lists/linux-block/msg43809.html

to make it easier.

On 08/14/2019 08:27 PM, sunke (E) wrote:
> ping
> 
> 在 2019/8/12 20:31, Sun Ke 写道:
>> When try to get the lock failed, before return, execute the
>> nbd_config_put() to decrease the nbd->config_refs.
>>
>> If the nbd->config_refs is added but not decreased. Then will not
>> execute nbd_clear_sock() in nbd_config_put(). bd->task_setup will
>> not be cleared away. Finally, print"Device being setup by another
>> task" in nbd_add_sock() and nbd device can not be reused.
>>
>> Fixes: 8f3ea35929a0 ("nbd: handle unexpected replies better")
>> Signed-off-by: Sun Ke 
>> ---
>>   drivers/block/nbd.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
>> index e21d2de..a69a90a 100644
>> --- a/drivers/block/nbd.c
>> +++ b/drivers/block/nbd.c
>> @@ -357,8 +357,10 @@ static enum blk_eh_timer_return
>> nbd_xmit_timeout(struct request *req,
>>   }
>>   config = nbd->config;
>>   -if (!mutex_trylock(>lock))
>> +if (!mutex_trylock(>lock)) {
>> +nbd_config_put(nbd);
>>   return BLK_EH_RESET_TIMER;
>> +}
>> if (config->num_connections > 1) {
>>   dev_err_ratelimited(nbd_to_dev(nbd),
>>
> 



Re: [PATCH v2] target: fix a missing check of match_int

2019-01-19 Thread Mike Christie
On 01/11/2019 11:31 PM, Kangjie Lu wrote:
> When match_int fails, "arg" is left uninitialized and may contain random
> value, thus should not be used.
> The fix checks if match_int fails, and if so, returns its error code.
> 
> Signed-off-by: Kangjie Lu 
> ---
>  drivers/target/target_core_rd.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/target/target_core_rd.c b/drivers/target/target_core_rd.c
> index a6e8106abd6f..3b7657b2f2f1 100644
> --- a/drivers/target/target_core_rd.c
> +++ b/drivers/target/target_core_rd.c
> @@ -559,6 +559,7 @@ static ssize_t rd_set_configfs_dev_params(struct 
> se_device *dev,
>   char *orig, *ptr, *opts;
>   substring_t args[MAX_OPT_ARGS];
>   int arg, token;
> + int ret;
>  
>   opts = kstrdup(page, GFP_KERNEL);
>   if (!opts)
> @@ -573,14 +574,24 @@ static ssize_t rd_set_configfs_dev_params(struct 
> se_device *dev,
>   token = match_token(ptr, tokens, args);
>   switch (token) {
>   case Opt_rd_pages:
> - match_int(args, );
> + ret = match_int(args, );
> + if (ret) {
> + kfree(orig);
> + return ret;

Just set ret to the return value and then break, so all the error and
success paths go through the same code path.
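
Roughly like this (sketch only -- assuming the function's normal exit path
already does the kfree(orig) and can return ret on failure):

	case Opt_rd_pages:
		ret = match_int(args, &arg);
		if (ret)
			break;	/* common exit frees orig and returns ret */
		rd_dev->rd_page_count = arg;
		pr_debug("RAMDISK: Referencing Page"
			" Count: %u\n", rd_dev->rd_page_count);
		break;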


Re: [PATCH] target: fix a missing check for match_int

2019-01-11 Thread Mike Christie
On 12/26/2018 12:48 AM, Kangjie Lu wrote:
> When match_int fails, "arg" is left uninitialized and may contain random
> value, thus should not be used.
> The fix checks if match_int fails, and if so, break.
> 
> Signed-off-by: Kangjie Lu 
> ---
>  drivers/target/target_core_rd.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/target/target_core_rd.c b/drivers/target/target_core_rd.c
> index a6e8106abd6f..3138123143e8 100644
> --- a/drivers/target/target_core_rd.c
> +++ b/drivers/target/target_core_rd.c
> @@ -573,7 +573,8 @@ static ssize_t rd_set_configfs_dev_params(struct 
> se_device *dev,
>   token = match_token(ptr, tokens, args);
>   switch (token) {
>   case Opt_rd_pages:
> - match_int(args, );
> + if (match_int(args, ))
> + break;

I think if this fails you would want to return an error.

Also, I think you want to add a similar check for the Opt_rd_nullio call
below this chunk because arg may be initialized to junk.
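
Something like the below for that arm too (sketch; the existing
Opt_rd_nullio statements are elided since they are not in the quoted
hunk):

	case Opt_rd_nullio:
		if (match_int(args, &arg))
			break;	/* or return an error, as above */
		/* ... existing Opt_rd_nullio handling of arg ... */
		break;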


>   rd_dev->rd_page_count = arg;
>   pr_debug("RAMDISK: Referencing Page"
>   " Count: %u\n", rd_dev->rd_page_count);
> 



Re: [PATCHv3] iscsi-target: Don't use stack buffer for scatterlist

2018-09-04 Thread Mike Christie
On 09/04/2018 01:47 PM, Laura Abbott wrote:
> Fedora got a bug report of a crash with iSCSI:
> 
> kernel BUG at include/linux/scatterlist.h:143!
> ...
> RIP: 0010:iscsit_do_crypto_hash_buf+0x154/0x180 [iscsi_target_mod]
> ...
>  Call Trace:
>   ? iscsi_target_tx_thread+0x200/0x200 [iscsi_target_mod]
>   iscsit_get_rx_pdu+0x4cd/0xa90 [iscsi_target_mod]
>   ? native_sched_clock+0x3e/0xa0
>   ? iscsi_target_tx_thread+0x200/0x200 [iscsi_target_mod]
>   iscsi_target_rx_thread+0x81/0xf0 [iscsi_target_mod]
>   kthread+0x120/0x140
>   ? kthread_create_worker_on_cpu+0x70/0x70
>   ret_from_fork+0x3a/0x50
> 
> This is a BUG_ON for using a stack buffer with a scatterlist.
> There are two cases that trigger this bug. Switch to using a
> dynamically allocated buffer for one case and do not assign
> a NULL buffer in another case.
> 
> Signed-off-by: Laura Abbott 
> ---
> v3: Switched to kcalloc to drop an extra memset call. I was asked
> offline about kmalloc vs. kmalloc_array and ended up asking about this
> at the security summit since I was there last week. The preferred style
> is kmalloc_array for all arrays, even those of one byte characters.
> I will also switch it if people feel really strongly about this or I've
> misunderstood the array guidelines.
> ---
>  drivers/target/iscsi/iscsi_target.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/target/iscsi/iscsi_target.c 
> b/drivers/target/iscsi/iscsi_target.c
> index 94bad43c41ff..2a586b58a011 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -1416,7 +1416,8 @@ static void iscsit_do_crypto_hash_buf(struct 
> ahash_request *hash,
>  
>   sg_init_table(sg, ARRAY_SIZE(sg));
>   sg_set_buf(sg, buf, payload_length);
> - sg_set_buf(sg + 1, pad_bytes, padding);
> + if (padding)
> + sg_set_buf(sg + 1, pad_bytes, padding);
>  
>   ahash_request_set_crypt(hash, sg, data_crc, payload_length + padding);
>  
> @@ -3910,10 +3911,14 @@ static bool iscsi_target_check_conn_state(struct 
> iscsi_conn *conn)
>  static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>  {
>   int ret;
> - u8 buffer[ISCSI_HDR_LEN], opcode;
> + u8 *buffer, opcode;
>   u32 checksum = 0, digest = 0;
>   struct kvec iov;
>  
> + buffer = kcalloc(ISCSI_HDR_LEN, sizeof(*buffer), GFP_KERNEL);
> + if (!buffer)
> + return;
> +
>   while (!kthread_should_stop()) {
>   /*
>* Ensure that both TX and RX per connection kthreads
> @@ -3921,7 +3926,6 @@ static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>*/
>   iscsit_thread_check_cpumask(conn, current, 0);
>  
> - memset(buffer, 0, ISCSI_HDR_LEN);
>   memset(, 0, sizeof(struct kvec));
>  
>   iov.iov_base= buffer;
> @@ -3930,7 +3934,7 @@ static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>   ret = rx_data(conn, , 1, ISCSI_HDR_LEN);
>   if (ret != ISCSI_HDR_LEN) {
>   iscsit_rx_thread_wait_for_tcp(conn);
> - return;
> + break;
>   }
>  
>   if (conn->conn_ops->HeaderDigest) {
> @@ -3940,7 +3944,7 @@ static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>   ret = rx_data(conn, , 1, ISCSI_CRC_LEN);
>   if (ret != ISCSI_CRC_LEN) {
>   iscsit_rx_thread_wait_for_tcp(conn);
> - return;
> + break;
>   }
>  
>   iscsit_do_crypto_hash_buf(conn->conn_rx_hash, buffer,
> @@ -3964,7 +3968,7 @@ static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>   }
>  
>   if (conn->conn_state == TARG_CONN_STATE_IN_LOGOUT)
> - return;
> + break;
>  
>   opcode = buffer[0] & ISCSI_OPCODE_MASK;
>  
> @@ -3975,13 +3979,15 @@ static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>   " while in Discovery Session, rejecting.\n", opcode);
>   iscsit_add_reject(conn, ISCSI_REASON_PROTOCOL_ERROR,
> buffer);
> -     return;
> + break;
>   }
>  
>   ret = iscsi_target_rx_opcode(conn, buffer);
>   if (ret < 0)
> - return;
> + break;
>   }
> +
> + kfree(buffer);
>  }
>  
>  int iscsi_target_rx_thread(void *arg)
> 

Ok.

Reviewed-by: Mike Christie 


Re: [PATCH] iscsi-target: Don't use stack buffer for scatterlist

2018-08-22 Thread Mike Christie
ccing Maurizio because he was working on the same issue.

On 08/22/2018 12:37 PM, Laura Abbott wrote:
> Fedora got a bug report of a crash with iSCSI:
> 
> kernel BUG at include/linux/scatterlist.h:143!
> ...
> RIP: 0010:iscsit_do_crypto_hash_buf+0x154/0x180 [iscsi_target_mod]
> ...
>  Call Trace:
>   ? iscsi_target_tx_thread+0x200/0x200 [iscsi_target_mod]
>   iscsit_get_rx_pdu+0x4cd/0xa90 [iscsi_target_mod]
>   ? native_sched_clock+0x3e/0xa0
>   ? iscsi_target_tx_thread+0x200/0x200 [iscsi_target_mod]
>   iscsi_target_rx_thread+0x81/0xf0 [iscsi_target_mod]
>   kthread+0x120/0x140
>   ? kthread_create_worker_on_cpu+0x70/0x70
>   ret_from_fork+0x3a/0x50
> 
> This is a BUG_ON for using a stack buffer with a scatterlist.
> There are two cases that trigger this bug. Switch to using a
> dynamically allocated buffer for one case and do not assign
> a NULL buffer in another case.
> 
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1615258
> Signed-off-by: Laura Abbott 
> ---
>  drivers/target/iscsi/iscsi_target.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/target/iscsi/iscsi_target.c 
> b/drivers/target/iscsi/iscsi_target.c
> index 8e223799347a..8b40f0e99e2c 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -1419,7 +1419,8 @@ static void iscsit_do_crypto_hash_buf(struct 
> ahash_request *hash,
>  
>   sg_init_table(sg, ARRAY_SIZE(sg));
>   sg_set_buf(sg, buf, payload_length);
> - sg_set_buf(sg + 1, pad_bytes, padding);
> + if (padding)
> + sg_set_buf(sg + 1, pad_bytes, padding);
>  
>   ahash_request_set_crypt(hash, sg, data_crc, payload_length + padding);
>  
> @@ -3913,10 +3914,14 @@ static bool iscsi_target_check_conn_state(struct 
> iscsi_conn *conn)
>  static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>  {
>   int ret;
> - u8 buffer[ISCSI_HDR_LEN], opcode;
> + u8 *buffer, opcode;
>   u32 checksum = 0, digest = 0;
>   struct kvec iov;
>  
> + buffer = kmalloc_array(ISCSI_HDR_LEN, sizeof(*buffer), GFP_KERNEL);
> + if (!buffer)
> + return;
> +
>   while (!kthread_should_stop()) {
>   /*
>* Ensure that both TX and RX per connection kthreads
> @@ -3985,6 +3990,8 @@ static void iscsit_get_rx_pdu(struct iscsi_conn *conn)
>   if (ret < 0)
>   return;
>   }

You need to also change all the returns to breaks between the kmalloc
above and kfree here.
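
In other words the shape should end up like this (sketch trimmed to the
relevant lines from the patch above):

	buffer = kmalloc_array(ISCSI_HDR_LEN, sizeof(*buffer), GFP_KERNEL);
	if (!buffer)
		return;

	while (!kthread_should_stop()) {
		/* ... */
		ret = rx_data(conn, &iov, 1, ISCSI_HDR_LEN);
		if (ret != ISCSI_HDR_LEN) {
			iscsit_rx_thread_wait_for_tcp(conn);
			break;		/* was: return -- would leak buffer */
		}
		/* ... every other early return in the loop becomes a break ... */
	}

	kfree(buffer);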

> + kfree(buffer);
>  }
>  
>  int iscsi_target_rx_thread(void *arg)
> 



Re: [PATCH v3 3/3] uio: fix crash after the device is unregistered

2018-07-09 Thread Mike Christie
On 07/06/2018 08:28 PM, Xiubo Li wrote:
> On 2018/7/7 2:23, Mike Christie wrote:
>> On 07/05/2018 09:57 PM, xiu...@redhat.com wrote:
>>>   static irqreturn_t uio_interrupt(int irq, void *dev_id)
>>>   {
>>>   struct uio_device *idev = (struct uio_device *)dev_id;
>>> -irqreturn_t ret = idev->info->handler(irq, idev->info);
>>> +irqreturn_t ret;
>>> +
>>> +mutex_lock(>info_lock);
>>> +if (!idev->info) {
>>> +ret = IRQ_NONE;
>>> +goto out;
>>> +}
>>>   +ret = idev->info->handler(irq, idev->info);
>>>   if (ret == IRQ_HANDLED)
>>>   uio_event_notify(idev->info);
>>>   +out:
>>> +mutex_unlock(>info_lock);
>>>   return ret;
>>>   }
>>
>> Do you need the interrupt related changes in this patch and the first
>> one?
> Actually, the NULL checking is not a must, we can remove this. But the
> lock/unlock is needed.
>>   When we do uio_unregister_device -> free_irq does free_irq return
>> when there are no longer running interrupt handlers that we requested?
>>
>> If that is not the case then I think we can hit a similar bug. We do:
>>
>> __uio_register_device -> device_register -> device's refcount goes to
>> zero so we do -> uio_device_release -> kfree(idev)
>>
>> and if it is possible the interrupt handler could still run after
>> free_irq then we would end up doing:
>>
>> uio_interrupt -> mutex_lock(>info_lock) -> idev access freed
>> memory.
> 
> I think this shouldn't happen. Because the free_irq function does not
> return until any executing interrupts for this IRQ have completed.
> 

If free_irq returns after executing interrupts and does not allow new
executions, what is the lock protecting in uio_interrupt?



Re: [PATCH v3 3/3] uio: fix crash after the device is unregistered

2018-07-09 Thread Mike Christie
On 07/06/2018 08:47 PM, Xiubo Li wrote:
> On 2018/7/7 2:58, Mike Christie wrote:
>> On 07/05/2018 09:57 PM, xiu...@redhat.com wrote:
>>>   void uio_event_notify(struct uio_info *info)
>>>   {
>>> -struct uio_device *idev = info->uio_dev;
>>> +struct uio_device *idev;
>>> +
>>> +if (!info)
>>> +return;
>>> +
>>> +idev = info->uio_dev;
>>>   
>> For this one too, I am not sure if it is needed.
>>
>> uio_interrupt -> uio_event_notify. See other mail.
>>
>> driver XYZ -> uio_event_notify. I think drivers need to handle this and
>> set some bits and/or perform some cleanup to make sure they are not
>> calling uio_event_notify after it has called uio_unregister_device. The
>> problem with the above test is if they do not they could have called
>> uio_unregister_device right after the info test so you could still hit
>> the problem.
> 
> When we are tcmu_destroy_device(), if the netlink notify event to
> userspace is not successful then the TCMU will call the uio unregister,
> which will set the idev->info = NULL, without close and deleting the
> device in userspace. But the TCMU could still queue cmds to the ring
> buffer, then the uio_event_notify will be called.

Before tcmu_destroy_device is called, LIO will have stopped new IO and
waited for outstanding IO to be completed, so uio_event_notify would never
be called after uio_unregister_device. If it is, that's a bug in LIO.


Re: [PATCH v3 3/3] uio: fix crash after the device is unregistered

2018-07-06 Thread Mike Christie
On 07/05/2018 09:57 PM, xiu...@redhat.com wrote:
>  void uio_event_notify(struct uio_info *info)
>  {
> - struct uio_device *idev = info->uio_dev;
> + struct uio_device *idev;
> +
> + if (!info)
> + return;
> +
> + idev = info->uio_dev;
>  

For this one too, I am not sure if it is needed.

uio_interrupt -> uio_event_notify. See other mail.

driver XYZ -> uio_event_notify. I think drivers need to handle this and
set some bits and/or perform some cleanup to make sure they are not
calling uio_event_notify after it has called uio_unregister_device. The
problem with the above test is that if they do not, they could have called
uio_unregister_device right after the info test, so you could still hit
the problem.


Re: [PATCH v3 3/3] uio: fix crash after the device is unregistered

2018-07-06 Thread Mike Christie
On 07/05/2018 09:57 PM, xiu...@redhat.com wrote:
>  static irqreturn_t uio_interrupt(int irq, void *dev_id)
>  {
>   struct uio_device *idev = (struct uio_device *)dev_id;
> - irqreturn_t ret = idev->info->handler(irq, idev->info);
> + irqreturn_t ret;
> +
> + mutex_lock(>info_lock);
> + if (!idev->info) {
> + ret = IRQ_NONE;
> + goto out;
> + }
>  
> + ret = idev->info->handler(irq, idev->info);
>   if (ret == IRQ_HANDLED)
>   uio_event_notify(idev->info);
>  
> +out:
> + mutex_unlock(>info_lock);
>   return ret;
>  }


Do you need the interrupt-related changes in this patch and the first
one? When we do uio_unregister_device -> free_irq, does free_irq only return
once there are no longer any running interrupt handlers that we requested?

If that is not the case then I think we can hit a similar bug. We do:

__uio_register_device -> device_register -> device's refcount goes to
zero so we do -> uio_device_release -> kfree(idev)

and if it is possible the interrupt handler could still run after
free_irq then we would end up doing:

uio_interrupt -> mutex_lock(>info_lock) -> idev access freed memory.


uio regression in 4.18-rc1 due to "uio: Prevent device destruction while fds are open"

2018-06-20 Thread Mike Christie
Hi Hamish,

I am hitting a regression with your patch:

commit a93e7b331568227500186a465fee3c2cb5dffd1f
Author: Hamish Martin 
Date:   Mon May 14 13:32:23 2018 +1200

uio: Prevent device destruction while fds are open

The problem is the addition of spin_lock_irqsave in uio_write. This
leads to hitting uio_write -> copy_from_user -> _copy_from_user ->
might_fault, and the logs fill up with sleeping warnings.

I also noticed some uio drivers allocate memory, sleep, and grab mutexes
from callouts like open() and release, and uio is now doing
spin_lock_irqsave while calling them.

Note target_core_user's irqcontrol looks buggy and was just doing more
work than it should in that callout. I can fix that driver.
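
For reference, the warning comes from the classic pattern below:
copy_from_user() can fault in a user page and sleep, so calling it with a
spinlock held and interrupts disabled trips might_fault()/might_sleep().
This is only an illustration of the pattern (hypothetical lock and
function names), not the actual uio_write() code:

static DEFINE_SPINLOCK(some_lock);	/* hypothetical */

static ssize_t broken_write(struct file *filep, const char __user *buf,
			    size_t count, loff_t *ppos)
{
	unsigned long flags;
	s32 irq_on;

	spin_lock_irqsave(&some_lock, flags);	/* atomic context from here */

	/*
	 * copy_from_user() may sleep -> "sleeping function called from
	 * invalid context" splats once might_fault() notices.
	 */
	if (copy_from_user(&irq_on, buf, sizeof(irq_on))) {
		spin_unlock_irqrestore(&some_lock, flags);
		return -EFAULT;
	}

	spin_unlock_irqrestore(&some_lock, flags);
	return count;
}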


Re: [PATCH] target: make device_mutex and device_list static

2017-07-05 Thread Mike Christie
On 07/04/2017 03:44 AM, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> Variables device_mutex and device_list static are local to the source,
> so make them static.
> 
> Cleans up sparse warnings:
> "symbol 'device_list' was not declared. Should it be static?"
> "symbol 'device_mutex' was not declared. Should it be static?"
> 
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---
>  drivers/target/target_core_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/target/target_core_device.c 
> b/drivers/target/target_core_device.c
> index 3ae8fbf01bdf..bbcef3bc66c8 100644
> --- a/drivers/target/target_core_device.c
> +++ b/drivers/target/target_core_device.c
> @@ -49,8 +49,8 @@
>  #include "target_core_pr.h"
>  #include "target_core_ua.h"
>  
> -DEFINE_MUTEX(device_mutex);
> -LIST_HEAD(device_list);
> +static DEFINE_MUTEX(device_mutex);
> +static LIST_HEAD(device_list);
>  static DEFINE_IDR(devices_idr);
>  
>  static struct se_hba *lun0_hba;
> 

My fault. Thanks.

Reviewed-by: Mike Christie <mchri...@redhat.com>


Re: [PATCH 1/1] uio: Fix uio_device memory leak

2017-06-13 Thread Mike Christie
On 06/13/2017 07:16 PM, Mike Christie wrote:
> On 06/13/2017 09:01 AM, Greg KH wrote:
>> > On Wed, Jun 07, 2017 at 03:06:44PM -0500, Mike Christie wrote:
>>> >> It looks like there might be 2 issues with the uio_device allocation, or 
>>> >> it
>>> >> looks like we are leaking the device for possibly a specific type of 
>>> >> device
>>> >> case that I could not find but one of you may know about.
>>> >>
>>> >> Issues:
>>> >> 1. We use devm_kzalloc to allocate the uio_device, but the release
>>> >> function, devm_kmalloc_release, is just a noop, so the memory is never 
>>> >> freed.
>> > 
>> > What do you mean by this?  If the release function is a noop, lots of
>> > memory in the kernel is leaking.  UIO shouldn't have to do anything
>> > special here, is the devm api somehow broken?
> Sorry. I misdiagnosed the problem. It's a noop, but we did kfree on the
> entire devres and its data later.
> 
> The problem I was hitting is that memory is not freed until the parent
> is removed. __uio_register_device does:
> 
> idev = devm_kzalloc(parent, sizeof(*idev), GFP_KERNEL);
> if (!idev) {
> return -ENOMEM;
> }
> 
> so the devres's memory is associated with the parent. Is that intentional?
> 

What I meant is that I can send a patch to just fix up the
devm_kzalloc use in uio.c, so it gets the device struct for the uio
device instead of the parent.

However, it looks like the existing code using the parent prevents a
crash. If the child is hot unplugged/removed, and uio_unregister_device
ends up freeing the idev, then later when the userspace application does
a close on the uio device we would try to access the freed idev in
uio_release.

If the devm_kzalloc parent use was meant for that hot unplug case, then
I can also look into how to fix the drivers too.


Re: [PATCH 1/1] uio: Fix uio_device memory leak

2017-06-13 Thread Mike Christie
On 06/13/2017 09:01 AM, Greg KH wrote:
> On Wed, Jun 07, 2017 at 03:06:44PM -0500, Mike Christie wrote:
>> It looks like there might be 2 issues with the uio_device allocation, or it
>> looks like we are leaking the device for possibly a specific type of device
>> case that I could not find but one of you may know about.
>>
>> Issues:
>> 1. We use devm_kzalloc to allocate the uio_device, but the release
>> function, devm_kmalloc_release, is just a noop, so the memory is never freed.
> 
> What do you mean by this?  If the release function is a noop, lots of
> memory in the kernel is leaking.  UIO shouldn't have to do anything
> special here, is the devm api somehow broken?

Sorry. I misdiagnosed the problem. It's a noop, but we did kfree on the
entire devres and its data later.

The problem I was hitting is that memory is not freed until the parent
is removed. __uio_register_device does:

idev = devm_kzalloc(parent, sizeof(*idev), GFP_KERNEL);
if (!idev) {
return -ENOMEM;
}

so the devres's memory is associated with the parent. Is that intentional?




[PATCH 1/1] uio: Fix uio_device memory leak

2017-06-07 Thread Mike Christie
It looks like there might be 2 issues with the uio_device allocation, or it
looks like we are leaking the device for possibly a specific type of device
case that I could not find but one of you may know about.

Issues:
1. We use devm_kzalloc to allocate the uio_device, but the release
function, devm_kmalloc_release, is just a noop, so the memory is never freed.

2. We are actually assigning the resource allocated by devm_kzalloc to
the parent device and not the device that goes with the uio_device.
If devm_kmalloc_release did free the memory, it would not actually be
freed until the parent is, which in a lot of cases is not until module
removal, so there could have been thousands of uio_register/unregister_device
sequences.

It seems like this is either a bug, or was done deliberately to support
some type of device that needs that memory to not be freed. I
checked upstream uio users, but did not see any type of device like this
though.

If this is a bug, this patch, made over Linus's tree, fixes the problems by 
just allocating the uio_device with the uio_info struct since they need to be
allocated/freed/accessed at the same time.

Signed-off-by: Mike Christie <mchri...@redhat.com>
---
 drivers/target/target_core_user.c |  6 +++---
 drivers/uio/uio.c | 17 +
 include/linux/uio_driver.h|  2 +-
 3 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/drivers/target/target_core_user.c 
b/drivers/target/target_core_user.c
index beb5f09..00fde52 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1299,7 +1299,7 @@ static int tcmu_configure_device(struct se_device *dev)
	kref_get(&udev->kref);
 
ret = tcmu_netlink_event(TCMU_CMD_ADDED_DEVICE, udev->uio_info.name,
-udev->uio_info.uio_dev->minor);
+udev->uio_info.uio_dev.minor);
if (ret)
goto err_netlink;
 
@@ -1332,7 +1332,7 @@ static int tcmu_check_and_free_pending_cmd(struct 
tcmu_cmd *cmd)
 
 static bool tcmu_dev_configured(struct tcmu_dev *udev)
 {
-   return udev->uio_info.uio_dev ? true : false;
+   return !!(udev->se_dev.dev_flags & DF_CONFIGURED);
 }
 
 static void tcmu_blocks_release(struct tcmu_dev *udev)
@@ -1381,7 +1381,7 @@ static void tcmu_free_device(struct se_device *dev)
 
if (tcmu_dev_configured(udev)) {
tcmu_netlink_event(TCMU_CMD_REMOVED_DEVICE, udev->uio_info.name,
-  udev->uio_info.uio_dev->minor);
+  udev->uio_info.uio_dev.minor);
 
	uio_unregister_device(&udev->uio_info);
}
diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
index ff04b7f..7593db6 100644
--- a/drivers/uio/uio.c
+++ b/drivers/uio/uio.c
@@ -399,7 +399,7 @@ static void uio_free_minor(struct uio_device *idev)
  */
 void uio_event_notify(struct uio_info *info)
 {
-   struct uio_device *idev = info->uio_dev;
+   struct uio_device *idev = &info->uio_dev;
 
	atomic_inc(&idev->event);
	wake_up_interruptible(&idev->wait);
@@ -812,13 +812,8 @@ int __uio_register_device(struct module *owner,
if (!parent || !info || !info->name || !info->version)
return -EINVAL;
 
-   info->uio_dev = NULL;
-
-   idev = devm_kzalloc(parent, sizeof(*idev), GFP_KERNEL);
-   if (!idev) {
-   return -ENOMEM;
-   }
-
+   idev = &info->uio_dev;
+   memset(idev, 0, sizeof(*idev));
idev->owner = owner;
idev->info = info;
init_waitqueue_head(>wait);
@@ -841,8 +836,6 @@ int __uio_register_device(struct module *owner,
if (ret)
goto err_uio_dev_add_attributes;
 
-   info->uio_dev = idev;
-
if (info->irq && (info->irq != UIO_IRQ_CUSTOM)) {
/*
 * Note that we deliberately don't use devm_request_irq
@@ -879,10 +872,10 @@ void uio_unregister_device(struct uio_info *info)
 {
struct uio_device *idev;
 
-   if (!info || !info->uio_dev)
+   if (!info)
return;
 
-   idev = info->uio_dev;
+   idev = &info->uio_dev;
 
uio_free_minor(idev);
 
diff --git a/include/linux/uio_driver.h b/include/linux/uio_driver.h
index 3c85c81..4b1b1c8 100644
--- a/include/linux/uio_driver.h
+++ b/include/linux/uio_driver.h
@@ -95,7 +95,7 @@ struct uio_device {
  * @irqcontrol:disable/enable irqs when 0/1 is written to 
/dev/uioX
  */
 struct uio_info {
-   struct uio_device   *uio_dev;
+   struct uio_device   uio_dev;
const char  *name;
const char  *version;
struct uio_mem  mem[MAX_UIO_MAPS];
-- 
2.7.2



Re: [PATCH] tcmu: Add fifo type waiter list support to avoid starvation

2017-06-03 Thread Mike Christie
On 05/04/2017 09:51 PM, lixi...@cmss.chinamobile.com wrote:
> From: Xiubo Li 
> 
> The fifo type waiter list will hold the udevs who are waiting for the
> blocks from the data global pool. The unmap thread will try to feed the
> first udevs in waiter list, if the global free blocks available are
> not enough, it will stop traversing the list and abort waking up the
> others.
> 
> Signed-off-by: Xiubo Li 
> ---
>  drivers/target/target_core_user.c | 82 
> +++
>  1 file changed, 66 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/target/target_core_user.c 
> b/drivers/target/target_core_user.c
> index 9045837..10c99e7 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -97,6 +97,7 @@ struct tcmu_hba {
>  
>  struct tcmu_dev {
>   struct list_head node;
> + struct list_head waiter;
>  
>   struct se_device se_dev;
>  
> @@ -123,7 +124,7 @@ struct tcmu_dev {
>   wait_queue_head_t wait_cmdr;
>   struct mutex cmdr_lock;
>  
> - bool waiting_global;
> + uint32_t waiting_blocks;
>   uint32_t dbi_max;
>   uint32_t dbi_thresh;
>   DECLARE_BITMAP(data_bitmap, DATA_BLOCK_BITS);
> @@ -165,6 +166,10 @@ struct tcmu_cmd {
>  static DEFINE_MUTEX(root_udev_mutex);
>  static LIST_HEAD(root_udev);
>  
> +/* The data blocks global pool waiter list */
> +static DEFINE_MUTEX(root_udev_waiter_mutex);

Sorry for the delay.


Could you just add a comment about the lock ordering? It will be helpful
for new engineers/reviewers checking for errors.
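
For example, something like the block below next to the new mutex; the
ordering shown is only a guess to illustrate what I mean and needs to be
corrected to whatever the patch actually requires:

/*
 * Lock ordering (example only -- verify against the code):
 *
 *   udev->cmdr_lock
 *     -> root_udev_waiter_mutex
 */
static DEFINE_MUTEX(root_udev_waiter_mutex);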


> +static LIST_HEAD(root_udev_waiter);
> +
>  static atomic_t global_db_count = ATOMIC_INIT(0);
>  
>  static struct kmem_cache *tcmu_cmd_cache;
> @@ -195,6 +200,11 @@ enum tcmu_multicast_groups {
>  #define tcmu_cmd_set_dbi(cmd, index) ((cmd)->dbi[(cmd)->dbi_cur++] = (index))
>  #define tcmu_cmd_get_dbi(cmd) ((cmd)->dbi[(cmd)->dbi_cur++])
>  
> +static inline bool is_in_waiter_list(struct tcmu_dev *udev)
> +{
> + return !!udev->waiting_blocks;
> +}
> +
>  static void tcmu_cmd_free_data(struct tcmu_cmd *tcmu_cmd, uint32_t len)
>  {
>   struct tcmu_dev *udev = tcmu_cmd->tcmu_dev;
> @@ -250,8 +260,6 @@ static bool tcmu_get_empty_blocks(struct tcmu_dev *udev,
>  {
>   int i;
>  
> - udev->waiting_global = false;
> -
>   for (i = tcmu_cmd->dbi_cur; i < tcmu_cmd->dbi_cnt; i++) {
>   if (!tcmu_get_empty_block(udev, tcmu_cmd))
>   goto err;
> @@ -259,9 +267,7 @@ static bool tcmu_get_empty_blocks(struct tcmu_dev *udev,
>   return true;
>  
>  err:
> - udev->waiting_global = true;
> - /* Try to wake up the unmap thread */
> - wake_up(_wait);
> + udev->waiting_blocks += tcmu_cmd->dbi_cnt - i;
>   return false;
>  }
>  
> @@ -671,6 +677,7 @@ static inline size_t tcmu_cmd_get_cmd_size(struct 
> tcmu_cmd *tcmu_cmd,
>  
>   while (!is_ring_space_avail(udev, tcmu_cmd, command_size, data_length)) 
> {
>   int ret;
> + bool is_waiting_blocks = !!udev->waiting_blocks;

You can use your helper is_in_waiter_list().

>   DEFINE_WAIT(__wait);
>  
>   prepare_to_wait(>wait_cmdr, &__wait, TASK_INTERRUPTIBLE);
> @@ -688,6 +695,18 @@ static inline size_t tcmu_cmd_get_cmd_size(struct 
> tcmu_cmd *tcmu_cmd,
>   return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
>   }
>  
> + /*
> +  * Try to insert the current udev to waiter list and
> +  * then wake up the unmap thread
> +  */
> + if (is_waiting_blocks) {
> + mutex_lock(_udev_waiter_mutex);
> + list_add_tail(>waiter, _udev_waiter);
> + mutex_unlock(_udev_waiter_mutex);
> +
> + wake_up(_wait);
> + }

Is this supposed to go before the schedule_timeout call?

If we are here and schedule_timeout returned non zero then our wait_cmdr
has been woken up from the unmap thread because it freed some space, so
we do not want to again add ourself and call unmap again.


> +
>   mutex_lock(>cmdr_lock);
>  
>   /* We dropped cmdr_lock, cmd_head is stale */
> @@ -902,8 +921,6 @@ static unsigned int tcmu_handle_completions(struct 
> tcmu_dev *udev)
>   if (mb->cmd_tail == mb->cmd_head)
>   del_timer(>timeout); /* no more pending cmds */
>  
> - wake_up(>wait_cmdr);
> -
>   return handled;
>  }
>  
> @@ -996,7 +1013,17 @@ static int tcmu_irqcontrol(struct uio_info *info, s32 
> irq_on)
>   struct tcmu_dev *tcmu_dev = container_of(info, struct tcmu_dev, 
> uio_info);
>  
>   mutex_lock(_dev->cmdr_lock);
> - tcmu_handle_completions(tcmu_dev);
> + /*
> +  * If the current udev is also in waiter list, this will
> +  * make sure that the other waiters in list be feeded ahead
> +  * of it.
> +  */
> + if 

Re: [PATCH] tcmu: Add fifo type waiter list support to avoid starvation

2017-06-03 Thread Mike Christie
On 05/04/2017 09:51 PM, lixi...@cmss.chinamobile.com wrote:
> From: Xiubo Li 
> 
> The fifo type waiter list will hold the udevs who are waiting for the
> blocks from the data global pool. The unmap thread will try to feed the
> first udevs in waiter list, if the global free blocks available are
> not enough, it will stop traversing the list and abort waking up the
> others.
> 
> Signed-off-by: Xiubo Li 
> ---
>  drivers/target/target_core_user.c | 82 
> +++
>  1 file changed, 66 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/target/target_core_user.c 
> b/drivers/target/target_core_user.c
> index 9045837..10c99e7 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -97,6 +97,7 @@ struct tcmu_hba {
>  
>  struct tcmu_dev {
>   struct list_head node;
> + struct list_head waiter;
>  
>   struct se_device se_dev;
>  
> @@ -123,7 +124,7 @@ struct tcmu_dev {
>   wait_queue_head_t wait_cmdr;
>   struct mutex cmdr_lock;
>  
> - bool waiting_global;
> + uint32_t waiting_blocks;
>   uint32_t dbi_max;
>   uint32_t dbi_thresh;
>   DECLARE_BITMAP(data_bitmap, DATA_BLOCK_BITS);
> @@ -165,6 +166,10 @@ struct tcmu_cmd {
>  static DEFINE_MUTEX(root_udev_mutex);
>  static LIST_HEAD(root_udev);
>  
> +/* The data blocks global pool waiter list */
> +static DEFINE_MUTEX(root_udev_waiter_mutex);

Sorry for the delay.


Could you just add a comment about the lock ordering? It will help
new engineers/reviewers check for errors.
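
For illustration, the kind of comment being asked for might look roughly
like the sketch below. The concrete nesting shown is an assumption made
only for the example and would have to match whatever the final code does:

/*
 * Sketch only -- the real ordering must be taken from the final code.
 *
 * Lock ordering:
 *   udev->cmdr_lock
 *     root_udev_waiter_mutex
 *
 * i.e. root_udev_waiter_mutex may be taken while cmdr_lock is held,
 * never the reverse.
 */
static DEFINE_MUTEX(root_udev_waiter_mutex);
static LIST_HEAD(root_udev_waiter);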


> +static LIST_HEAD(root_udev_waiter);
> +
>  static atomic_t global_db_count = ATOMIC_INIT(0);
>  
>  static struct kmem_cache *tcmu_cmd_cache;
> @@ -195,6 +200,11 @@ enum tcmu_multicast_groups {
>  #define tcmu_cmd_set_dbi(cmd, index) ((cmd)->dbi[(cmd)->dbi_cur++] = (index))
>  #define tcmu_cmd_get_dbi(cmd) ((cmd)->dbi[(cmd)->dbi_cur++])
>  
> +static inline bool is_in_waiter_list(struct tcmu_dev *udev)
> +{
> + return !!udev->waiting_blocks;
> +}
> +
>  static void tcmu_cmd_free_data(struct tcmu_cmd *tcmu_cmd, uint32_t len)
>  {
>   struct tcmu_dev *udev = tcmu_cmd->tcmu_dev;
> @@ -250,8 +260,6 @@ static bool tcmu_get_empty_blocks(struct tcmu_dev *udev,
>  {
>   int i;
>  
> - udev->waiting_global = false;
> -
>   for (i = tcmu_cmd->dbi_cur; i < tcmu_cmd->dbi_cnt; i++) {
>   if (!tcmu_get_empty_block(udev, tcmu_cmd))
>   goto err;
> @@ -259,9 +267,7 @@ static bool tcmu_get_empty_blocks(struct tcmu_dev *udev,
>   return true;
>  
>  err:
> - udev->waiting_global = true;
> - /* Try to wake up the unmap thread */
> - wake_up(_wait);
> + udev->waiting_blocks += tcmu_cmd->dbi_cnt - i;
>   return false;
>  }
>  
> @@ -671,6 +677,7 @@ static inline size_t tcmu_cmd_get_cmd_size(struct 
> tcmu_cmd *tcmu_cmd,
>  
>   while (!is_ring_space_avail(udev, tcmu_cmd, command_size, data_length)) 
> {
>   int ret;
> + bool is_waiting_blocks = !!udev->waiting_blocks;

You can use your helper is_in_waiter_list().

>   DEFINE_WAIT(__wait);
>  
>   prepare_to_wait(>wait_cmdr, &__wait, TASK_INTERRUPTIBLE);
> @@ -688,6 +695,18 @@ static inline size_t tcmu_cmd_get_cmd_size(struct 
> tcmu_cmd *tcmu_cmd,
>   return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
>   }
>  
> + /*
> +  * Try to insert the current udev to waiter list and
> +  * then wake up the unmap thread
> +  */
> + if (is_waiting_blocks) {
> + mutex_lock(_udev_waiter_mutex);
> + list_add_tail(>waiter, _udev_waiter);
> + mutex_unlock(_udev_waiter_mutex);
> +
> + wake_up(_wait);
> + }

Is this supposed to go before the schedule_timeout call?

If we are here and schedule_timeout returned nonzero, then our wait_cmdr
has been woken up by the unmap thread because it freed some space, so
we do not want to add ourselves to the waiter list and call unmap again.
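
Concretely, the reordering being suggested would look roughly like the
sketch below (names are taken from the patch quoted above; the cmdr_lock
handling around the wait is elided, so treat this as an illustration
rather than a drop-in replacement):

prepare_to_wait(&udev->wait_cmdr, &__wait, TASK_INTERRUPTIBLE);

/* Queue ourselves and kick the unmap thread before sleeping, so we
 * never re-add ourselves after being woken because space was freed.
 */
if (is_in_waiter_list(udev)) {
	mutex_lock(&root_udev_waiter_mutex);
	list_add_tail(&udev->waiter, &root_udev_waiter);
	mutex_unlock(&root_udev_waiter_mutex);

	wake_up(&unmap_wait);
}

ret = schedule_timeout(msecs_to_jiffies(TCMU_TIME_OUT));
finish_wait(&udev->wait_cmdr, &__wait);
if (!ret) {
	pr_warn("tcmu: command timed out\n");
	return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
}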


> +
>   mutex_lock(>cmdr_lock);
>  
>   /* We dropped cmdr_lock, cmd_head is stale */
> @@ -902,8 +921,6 @@ static unsigned int tcmu_handle_completions(struct 
> tcmu_dev *udev)
>   if (mb->cmd_tail == mb->cmd_head)
>   del_timer(>timeout); /* no more pending cmds */
>  
> - wake_up(>wait_cmdr);
> -
>   return handled;
>  }
>  
> @@ -996,7 +1013,17 @@ static int tcmu_irqcontrol(struct uio_info *info, s32 
> irq_on)
>   struct tcmu_dev *tcmu_dev = container_of(info, struct tcmu_dev, 
> uio_info);
>  
>   mutex_lock(_dev->cmdr_lock);
> - tcmu_handle_completions(tcmu_dev);
> + /*
> +  * If the current udev is also in waiter list, this will
> +  * make sure that the other waiters in list be feeded ahead
> +  * of it.
> +  */
> + if (is_in_waiter_list(tcmu_dev)) {
> + wake_up(_wait);
> + } 

Re: [PATCH] iscsi-target: Fix initial login PDU asynchronous socket close OOPs

2017-05-31 Thread Mike Christie
On 05/30/2017 11:58 PM, Nicholas A. Bellinger wrote:
> Hey MNC,
> 
> On Fri, 2017-05-26 at 22:14 -0500, Mike Christie wrote:
>> Thanks for the patch.
>>
> 
> Btw, after running DATERA's internal longevity and scale tests across
> ~20 racks on v4.1.y with this patch over the long weekend, there haven't
> been any additional regressions.
> 
>> On 05/26/2017 12:32 AM, Nicholas A. Bellinger wrote:
>>>  
>>> -   state = iscsi_target_sk_state_check(sk);
>>> -   write_unlock_bh(>sk_callback_lock);
>>> -
>>> -   pr_debug("iscsi_target_sk_state_change: state: %d\n", state);
>>> +   orig_state_change(sk);
>>>  
>>> -   if (!state) {
>>> -   pr_debug("iscsi_target_sk_state_change got failed state\n");
>>> -   schedule_delayed_work(>login_cleanup_work, 0);
>>
>> I think login_cleanup_work is no longer used so you can also remove it
>> and its code.
> 
> Yep, since this needs to goto stable, I left that part out for now..
> 
> Will take care of that post -rc4.
> 
>>
>> The patch fixes the crash for me. However, is there a possible
>> regression where if the initiator attempts new relogins we could run out
>> of memory? With the old code, we would free the login attempts resources
>> at this time, but with the new code the initiator will send more login
>> attempts and so we just keep allocating more memory for each attempt
>> until we run out or the login is finally able to complete.
> 
> AFAICT, no. For the two cases in question:
> 
>  - Initial login request PDU processing done within iscsi_np kthread
> context in iscsi_target_start_negotiation(), and
>  - subsequent login request PDU processing done by delayed work-queue
> kthread context in iscsi_target_do_login_rx() 
> 
> this patch doesn't change how aggressively connection cleanup happens
> for failed login attempts in the face of new connection login attempts
> for either case.
> 
> For the first case when iscsi_np process context invokes
> iscsi_target_start_negotiation() -> iscsi_target_do_login() ->
> iscsi_check_for_session_reinstatement() to wait for backend I/O to
> complete, it still blocks other new connections from being accepted on
> the specific iscsi_np process context.
> 
> This patch doesn't change this behavior.
> 
> What it does change is when the host closes the connection and
> iscsi_target_sk_state_change() gets invoked, iscsi_np process context
> waits for iscsi_check_for_session_reinstatement() to complete before
> releasing the connection resources.
> 
> However since iscsi_np process context is blocked, new connections won't
> be accepted until the new connection forcing session reinstatement
> finishes waiting for outstanding backend I/O to complete.

I was seeing this. My original mail asked about iscsi login resources
incorrectly, but like you said we do not get that far. I get a giant
backlog (1 connection request per 5 seconds that we waited) of tcp level
connection requests and drops. When the wait is done I get a flood of
"iSCSI Login negotiation failed" due to the target handling all those
now stale requests/drops.

If we do not care about the memory use at the network level for this
case (it seems small and reconnects are not aggressive), then the
patch works ok for me. I am guessing it gets nasty to handle, so maybe
not worth it to handle right now? I tried to do it in my patch which is
why it got all crazy with the waits/wakeups :)

Thanks, and you can add a tested-by or reviewed-by from me.


Re: [PATCH] iscsi-target: Fix initial login PDU asynchronous socket close OOPs

2017-05-26 Thread Mike Christie
Thanks for the patch.

On 05/26/2017 12:32 AM, Nicholas A. Bellinger wrote:
>  
> - state = iscsi_target_sk_state_check(sk);
> - write_unlock_bh(>sk_callback_lock);
> -
> - pr_debug("iscsi_target_sk_state_change: state: %d\n", state);
> + orig_state_change(sk);
>  
> - if (!state) {
> - pr_debug("iscsi_target_sk_state_change got failed state\n");
> - schedule_delayed_work(>login_cleanup_work, 0);

I think login_cleanup_work is no longer used so you can also remove it
and its code.

The patch fixes the crash for me. However, is there a possible
regression where if the initiator attempts new relogins we could run out
of memory? With the old code, we would free the login attempt's resources
at this time, but with the new code the initiator will send more login
attempts and so we just keep allocating more memory for each attempt
until we run out or the login is finally able to complete.


Re: [PATCH] tcmu: Recalculate the tcmu_cmd size to save cmd area memories

2017-05-02 Thread Mike Christie
On 05/02/2017 02:54 AM, lixi...@cmss.chinamobile.com wrote:
> From: Xiubo Li <lixi...@cmss.chinamobile.com>
> 
> For the "struct tcmu_cmd_entry" in cmd area, the minimum size
> will be sizeof(struct tcmu_cmd_entry) == 112 Bytes. And it could
> fill about (sizeof(struct rsp) - sizeof(struct req)) /
> sizeof(struct iovec) == 68 / 16 ~= 4 data regions(iov[4]) by
> default.
> 
> For most tcmu_cmds, the data block indexes allocated from the
> data area will be continuous. And for the continuous blocks they
> will be merged into the same region using only one iovec. For
> the current code, it will always allocates the same number of
> iovecs with blocks for each tcmu_cmd, and it will wastes much
> memories.
> 
> For example, when the block size is 4K and the DATA_OUT buffer
> size is 64K, and the regions needed is less than 5(on my
> environment is almost 99.7%). The current code will allocate
> about 16 iovecs, and there will be (16 - 4) * sizeof(struct
> iovec) = 192 Bytes cmd area memories wasted.
> 
> Here adds two helpers to calculate the base size and full size
> of the tcmu_cmd. And will recalculate them again when it make sure
> how many iovs is needed before insert it to cmd area.
> 
> Signed-off-by: Xiubo Li <lixi...@cmss.chinamobile.com>

Looks ok to me. Thanks.

Acked-by: Mike Christie <mchri...@redhat.com>



Re: [PATCH v6 2/2] tcmu: Add global data block pool support

2017-05-01 Thread Mike Christie
On 04/30/2017 06:29 AM, Xiubo Li wrote:
> [...]
>>> +static struct page *tcmu_try_get_block_page(struct tcmu_dev *udev,
>>> uint32_t dbi)
>>> +{
>>> +struct page *page;
>>> +int ret;
>>> +
>>> +mutex_lock(>cmdr_lock);
>>> +page = tcmu_get_block_page(udev, dbi);
>>> +if (likely(page)) {
>>> +mutex_unlock(>cmdr_lock);
>>> +return page;
>>> +}
>>> +
>>> +/*
>>> + * Normally it shouldn't be here:
>>> + * Only when the userspace has touched the blocks which
>>> + * are out of the tcmu_cmd's data iov[], and will return
>>> + * one zeroed page.
>>
>> Is it a userspace bug when this happens? Do you know when it is
>> occcuring?
> Since the UIO will map the whole ring buffer to the user space at the
> beginning, and the userspace is allowed and legal to access any block
> within the limits of the mapped ring area.
> 
> But actually when this happens, it normally will be one bug of the
> userspace. Without this checking the kernel will output many page fault
> dump traces.
> 
> Maybe here outputing some warning message is a good idea, and will be
> easy to debug for userspace.

Yeah.

> 
> 
> [...]
>>> @@ -1388,6 +1509,81 @@ static ssize_t tcmu_cmd_time_out_store(struct
>>> config_item *item, const char *pag
>>>   .tb_dev_attrib_attrs= NULL,
>>>   };
>>>   +static int unmap_thread_fn(void *data)
>>> +{
>>> +struct tcmu_dev *udev;
>>> +loff_t off;
>>> +uint32_t start, end, block;
>>> +struct page *page;
>>> +int i;
>>> +
>>> +while (1) {
>>> +DEFINE_WAIT(__wait);
>>> +
>>> +prepare_to_wait(_wait, &__wait, TASK_INTERRUPTIBLE);
>>> +schedule();
>>> +finish_wait(_wait, &__wait);
>>> +
>>> +mutex_lock(_udev_mutex);
>>> +list_for_each_entry(udev, _udev, node) {
>>> +mutex_lock(>cmdr_lock);
>>> +
>>> +/* Try to complete the finished commands first */
>>> +tcmu_handle_completions(udev);
>>> +
>>> +/* Skip the udevs waiting the global pool or in idle */
>>> +if (udev->waiting_global || !udev->dbi_thresh) {
>>> +mutex_unlock(>cmdr_lock);
>>> +continue;
>>> +}
>>> +
>>> +end = udev->dbi_max + 1;
>>> +block = find_last_bit(udev->data_bitmap, end);
>>> +if (block == udev->dbi_max) {
>>> +/*
>>> + * The last bit is dbi_max, so there is
>>> + * no need to shrink any blocks.
>>> + */
>>> +mutex_unlock(>cmdr_lock);
>>> +continue;
>>> +} else if (block == end) {
>>> +/* The current udev will goto idle state */
>>> +udev->dbi_thresh = start = 0;
>>> +udev->dbi_max = 0;
>>> +} else {
>>> +udev->dbi_thresh = start = block + 1;
>>> +udev->dbi_max = block;
>>> +}
>>> +
>>> +/* Here will truncate the data area from off */
>>> +off = udev->data_off + start * DATA_BLOCK_SIZE;
>>> +unmap_mapping_range(udev->inode->i_mapping, off, 0, 1);
>>> +
>>> +/* Release the block pages */
>>> +for (i = start; i < end; i++) {
>>> +page = radix_tree_delete(>data_blocks, i);
>>> +if (page) {
>>> +__free_page(page);
>>> +atomic_dec(_db_count);
>>> +}
>>> +}
>>> +mutex_unlock(>cmdr_lock);
>>> +}
>>> +
>>> +/*
>>> + * Try to wake up the udevs who are waiting
>>> + * for the global data pool.
>>> + */
>>> +list_for_each_entry(udev, _udev, node) {
>>> +if (udev->waiting_global)
>>> +wake_up(>wait_cmdr);
>>> +}
>>
>> To avoid starvation, I think you want a second list/fifo that holds the
>> watiers. In tcmu_get_empty_block if the list is not empty, record how
>> many pages we needed and then add the device to the list and wait in
>> tcmu_queue_cmd_ring.
>>
>> Above if we freed enough pages for the device at head then wake up the
>> device.
>>
>> I think you also need a wake_up call in the completion path in case the
>> initial call could not free enough pages. It could probably check if the
>> completion was going to free enough pages for a waiter and then call
>> wake.
>>
> Yes, I meant to introduce this later after this series to not let the
> patches too
> complex to review.
> 
> If you agree I will do this later, or in V7 series ?


Yeah, I am ok with adding it after the initial patches go in.
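
For readers following the thread, a very rough sketch of the FIFO waiter
bookkeeping being discussed might look like the following. The names and
fields are illustrative only (they borrow from the separate fifo-waiter
patch reviewed elsewhere in this archive) and are not the code that was
eventually merged:

/* Sketch only: FIFO of devices waiting for global data blocks.
 * Assumes udev->waiter was set up with INIT_LIST_HEAD() and that
 * udev->waiting_blocks / udev->wait_cmdr exist as in the patches above.
 */
static DEFINE_MUTEX(waiter_mutex);
static LIST_HEAD(waiter_list);

static void tcmu_add_waiter(struct tcmu_dev *udev, u32 blocks_needed)
{
	mutex_lock(&waiter_mutex);
	udev->waiting_blocks = blocks_needed;
	if (list_empty(&udev->waiter))
		list_add_tail(&udev->waiter, &waiter_list);
	mutex_unlock(&waiter_mutex);
}

/* Called after blocks are freed: wake only the head of the queue, and
 * only if it can now be satisfied, so waiters are served in order.
 */
static void tcmu_wake_first_waiter(u32 blocks_freed)
{
	struct tcmu_dev *udev;

	mutex_lock(&waiter_mutex);
	udev = list_first_entry_or_null(&waiter_list, struct tcmu_dev, waiter);
	if (udev && blocks_freed >= udev->waiting_blocks) {
		list_del_init(&udev->waiter);
		udev->waiting_blocks = 0;
		wake_up(&udev->wait_cmdr);
	}
	mutex_unlock(&waiter_mutex);
}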


Re: [PATCH v6 1/2] tcmu: Add dynamic growing data area feature support

2017-05-01 Thread Mike Christie
On 04/30/2017 05:22 AM, Xiubo Li wrote:
> On 2017-04-30 13:48, Mike Christie wrote:
>> On 04/26/2017 01:25 AM, lixi...@cmss.chinamobile.com wrote:
>>>   for_each_sg(data_sg, sg, data_nents, i) {
>>> @@ -275,22 +371,26 @@ static void alloc_and_scatter_data_area(struct
>>> tcmu_dev *udev,
>>>   from = kmap_atomic(sg_page(sg)) + sg->offset;
>>>   while (sg_remaining > 0) {
>>>   if (block_remaining == 0) {
>>> -block = find_first_zero_bit(udev->data_bitmap,
>>> -DATA_BLOCK_BITS);
>>>   block_remaining = DATA_BLOCK_SIZE;
>>> -set_bit(block, udev->data_bitmap);
>>> +dbi = tcmu_get_empty_block(udev, );
>>> +if (dbi < 0)
>>
>> I know it you fixed the missing kunmap_atomic here and missing unlock in
>> tcmu_queue_cmd_ring in the next patch, but I think normally people
>> prefer that one patch does not add a bug, then the next patch fixes it.
> Do you mean the following kmap_atomic() ?
> 
> from = kmap_atomic(sg_page(sg)) + sg->offset;
> 
> In this patch there has no new kmap/kunmap introduced. This is the old
> code and
> the kunmap is at the end of aasda().

You added a new return in the error path in this patch in the if case
above, but did not add a kunmap_atomic.
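
Concretely, the shape of the error path being pointed at is roughly the
following (a sketch only, built from the hunk quoted above; it is not
the exact fix that was later applied):

from = kmap_atomic(sg_page(sg)) + sg->offset;
/* ... loop over the sg entry ... */
if (dbi < 0) {
	/* don't leak the atomic mapping on the new early exit */
	kunmap_atomic(from - sg->offset);
	return false;	/* or however the function reports the error */
}
/* ... rest of the copy loop ... */
kunmap_atomic(from - sg->offset);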


Re: [PATCH v6 2/2] tcmu: Add global data block pool support

2017-04-30 Thread Mike Christie
On 04/26/2017 01:25 AM, lixi...@cmss.chinamobile.com wrote:
> From: Xiubo Li 
> 
> For each target there will be one ring, when the target number
> grows larger and larger, it could eventually runs out of the
> system memories.
> 
> In this patch for each target ring, currently for the cmd area
> the size will be fixed to 8MB and for the data area the size
> will grow from 0 to max 256K * PAGE_SIZE(1G for 4K page size).
> 
> For all the targets' data areas, they will get empty blocks
> from the "global data block pool", which has limited to 512K *
> PAGE_SIZE(2G for 4K page size) for now.
> 
> When the "global data block pool" has been used up, then any
> target could wake up the unmap thread routine to shrink other
> targets' data area memories. And the unmap thread routine will
> always try to truncate the ring vma from the last using block
> offset.
> 
> When user space has touched the data blocks out of tcmu_cmd
> iov[], the tcmu_page_fault() will try to return one zeroed blocks.
> 
> Here we move the timeout's tcmu_handle_completions() into unmap
> thread routine, that's to say when the timeout fired, it will
> only do the tcmu_check_expired_cmd() and then wake up the unmap
> thread to do the completions() and then try to shrink its idle
> memories. Then the cmdr_lock could be a mutex and could simplify
> this patch because the unmap_mapping_range() or zap_* may go to
> sleep.
> 
> Signed-off-by: Xiubo Li 
> Signed-off-by: Jianfei Hu 
> ---
>  drivers/target/target_core_user.c | 446 
> --
>  1 file changed, 327 insertions(+), 119 deletions(-)
> 
> diff --git a/drivers/target/target_core_user.c 
> b/drivers/target/target_core_user.c
> index 8491752..46f5a1c 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -31,6 +31,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -67,17 +69,24 @@
>  
>  #define TCMU_TIME_OUT (30 * MSEC_PER_SEC)
>  
> -/* For cmd area, the size is fixed 2M */
> -#define CMDR_SIZE (2 * 1024 * 1024)
> +/* For cmd area, the size is fixed 8MB */
> +#define CMDR_SIZE (8 * 1024 * 1024)
>  
> -/* For data area, the size is fixed 32M */
> -#define DATA_BLOCK_BITS (8 * 1024)
> -#define DATA_BLOCK_SIZE 4096
> +/*
> + * For data area, the block size is PAGE_SIZE and
> + * the total size is 256K * PAGE_SIZE.
> + */
> +#define DATA_BLOCK_SIZE PAGE_SIZE
> +#define DATA_BLOCK_BITS (256 * 1024)
>  #define DATA_SIZE (DATA_BLOCK_BITS * DATA_BLOCK_SIZE)
> +#define DATA_BLOCK_INIT_BITS 128
>  
> -/* The ring buffer size is 34M */
> +/* The total size of the ring is 8M + 256K * PAGE_SIZE */
>  #define TCMU_RING_SIZE (CMDR_SIZE + DATA_SIZE)
>  
> +/* Default maximum of the global data blocks(512K * PAGE_SIZE) */
> +#define TCMU_GLOBAL_MAX_BLOCKS (512 * 1024)
> +
>  static struct device *tcmu_root_device;
>  
>  struct tcmu_hba {
> @@ -87,6 +96,8 @@ struct tcmu_hba {
>  #define TCMU_CONFIG_LEN 256
>  
>  struct tcmu_dev {
> + struct list_head node;
> +
>   struct se_device se_dev;
>  
>   char *name;
> @@ -98,6 +109,8 @@ struct tcmu_dev {
>  
>   struct uio_info uio_info;
>  
> + struct inode *inode;
> +
>   struct tcmu_mailbox *mb_addr;
>   size_t dev_size;
>   u32 cmdr_size;
> @@ -108,10 +121,11 @@ struct tcmu_dev {
>   size_t data_size;
>  
>   wait_queue_head_t wait_cmdr;
> - /* TODO should this be a mutex? */
> - spinlock_t cmdr_lock;
> + struct mutex cmdr_lock;
>  
> + bool waiting_global;
>   uint32_t dbi_max;
> + uint32_t dbi_thresh;
>   DECLARE_BITMAP(data_bitmap, DATA_BLOCK_BITS);
>   struct radix_tree_root data_blocks;
>  
> @@ -146,6 +160,13 @@ struct tcmu_cmd {
>   unsigned long flags;
>  };
>  
> +static struct task_struct *unmap_thread;
> +static wait_queue_head_t unmap_wait;
> +static DEFINE_MUTEX(root_udev_mutex);
> +static LIST_HEAD(root_udev);
> +
> +static atomic_t global_db_count = ATOMIC_INIT(0);
> +
>  static struct kmem_cache *tcmu_cmd_cache;
>  
>  /* multicast group */
> @@ -174,48 +195,79 @@ enum tcmu_multicast_groups {
>  #define tcmu_cmd_set_dbi(cmd, index) ((cmd)->dbi[(cmd)->dbi_cur++] = (index))
>  #define tcmu_cmd_get_dbi(cmd) ((cmd)->dbi[(cmd)->dbi_cur++])
>  
> -static void tcmu_cmd_free_data(struct tcmu_cmd *tcmu_cmd)
> +static void tcmu_cmd_free_data(struct tcmu_cmd *tcmu_cmd, uint32_t len)
>  {
>   struct tcmu_dev *udev = tcmu_cmd->tcmu_dev;
>   uint32_t i;
>  
> - for (i = 0; i < tcmu_cmd->dbi_cnt; i++)
> + for (i = 0; i < len; i++)
>   clear_bit(tcmu_cmd->dbi[i], udev->data_bitmap);
>  }
>  
> -static int tcmu_get_empty_block(struct tcmu_dev *udev, void **addr)
> +static inline bool tcmu_get_empty_block(struct tcmu_dev *udev,
> + struct tcmu_cmd *tcmu_cmd)
>  {
> - void *p;
> - uint32_t dbi;
> - int ret;
> + struct page *page;
> + int ret, dbi;
>  
> - dbi = 

Re: [PATCH v6 1/2] tcmu: Add dynamic growing data area feature support

2017-04-29 Thread Mike Christie
On 04/26/2017 01:25 AM, lixi...@cmss.chinamobile.com wrote:
>   for_each_sg(data_sg, sg, data_nents, i) {
> @@ -275,22 +371,26 @@ static void alloc_and_scatter_data_area(struct tcmu_dev 
> *udev,
>   from = kmap_atomic(sg_page(sg)) + sg->offset;
>   while (sg_remaining > 0) {
>   if (block_remaining == 0) {
> - block = find_first_zero_bit(udev->data_bitmap,
> - DATA_BLOCK_BITS);
>   block_remaining = DATA_BLOCK_SIZE;
> - set_bit(block, udev->data_bitmap);
> + dbi = tcmu_get_empty_block(udev, );
> + if (dbi < 0)


I know you fixed the missing kunmap_atomic here and the missing unlock in
tcmu_queue_cmd_ring in the next patch, but normally people prefer that
one patch does not add a bug that the next patch then fixes.


Re: [PATCH] target/user: Add daynmic growing data area feature support

2017-02-28 Thread Mike Christie
On 02/27/2017 07:22 PM, Xiubo Li wrote:
> Hi Mike
> 
> Thanks verrry much for your work and test cases.
> 
> 
> From: Xiubo Li 
>
> Currently for the TCMU, the ring buffer size is fixed to 64K cmd
> area + 1M data area, and this will be bottlenecks for high iops.
>>> Hi Xiubo, thanks for your work.
>>>
>>> daynmic -> dynamic
>>>
>>> Have you benchmarked this patch and determined what kind of iops
>>> improvement it allows? Do you see the data area reaching its
>>> fully-allocated size?
>>>
>> I tested this patch with Venky's tcmu-runner rbd aio patches, with one
>> 10 gig iscsi session, and for pretty basic fio direct io (64 -256K
>> read/writes with a queue depth of 64 numjobs between 1 and 4) tests read
>> throughput goes from about 80 to 500 MB/s.
> Looks nice.
> 
>> Write throughput is pretty
>> low at around 150 MB/s.
> What's the original write throughput without this patch? Is it also
> around 80 MB/s ?

It is around 20-30 MB/s. Same fio args except using --rw=write.


Re: [PATCH] target/user: Add daynmic growing data area feature support

2017-02-27 Thread Mike Christie
On 02/22/2017 02:32 PM, Andy Grover wrote:
> On 02/17/2017 01:24 AM, lixi...@cmss.chinamobile.com wrote:
>> > From: Xiubo Li 
>> > 
>> > Currently for the TCMU, the ring buffer size is fixed to 64K cmd
>> > area + 1M data area, and this will be bottlenecks for high iops.
> Hi Xiubo, thanks for your work.
> 
> daynmic -> dynamic
> 
> Have you benchmarked this patch and determined what kind of iops
> improvement it allows? Do you see the data area reaching its
> fully-allocated size?
> 

I tested this patch with Venky's tcmu-runner rbd aio patches, with one
10 gig iscsi session, and for pretty basic fio direct io (64-256K
read/writes with a queue depth of 64 and numjobs between 1 and 4) tests read
throughput goes from about 80 to 500 MB/s. Write throughput is pretty
low at around 150 MB/s.

I did not hit the fully allocated size. I did not drive a lot of IO though.


Re: [PATCH] target/user: Fix use-after-free cmd->se_cmd if the cmd is expired

2017-01-03 Thread Mike Christie
On 01/03/2017 02:46 AM, lixi...@cmss.chinamobile.com wrote:
> From: Xiubo Li 
> 
> This is another use-after-free bug, the crash Call Trace is like:
> [  368.909498] RIP: 0010:[]  []
> memcpy+0x16/0x110
> ..
> [  368.909547] Call Trace:
> [  368.909550]  [] ?gather_data_area+0x109/0x180
> [  368.909552]  [] tcmu_handle_completions+0x2ff/0x450
> [  368.909554]  [] tcmu_irqcontrol+0x15/0x20
> [  368.909555]  [] uio_write+0x7b/0xc0
> [  368.909558]  [] vfs_write+0xbd/0x1e0
> [  368.909559]  [] SyS_write+0x7f/0xe0
> [  368.909562]  [] system_call_fastpath+0x16/0x1b
> 
> Don't free se_cmd of the expired cmds in tcmu_check_expired_cmd(),
> it will be dereferenced by tcmu_handle_completions()--->
> tcmu_handle_completion(), after userspace ever resumes processing.
> 
> It will be freed by tcmu_handle_completion() if userspace ever recovers,
> or tcmu_free_device if not.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Xiubo Li 
> Signed-off-by: Jianfei Hu 
> ---
>  drivers/target/target_core_user.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/target/target_core_user.c 
> b/drivers/target/target_core_user.c
> index 2e33100..6396581 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -684,7 +684,6 @@ static int tcmu_check_expired_cmd(int id, void *p, void 
> *data)
>  
>   set_bit(TCMU_CMD_BIT_EXPIRED, >flags);
>   target_complete_cmd(cmd->se_cmd, SAM_STAT_CHECK_CONDITION);
> - cmd->se_cmd = NULL;
>  

How did tcmu_handle_completion get to the point where it was accessing the
se_cmd if the TCMU_CMD_BIT_EXPIRED bit was set? Were memory accesses out
of order? CPU1 set the TCMU_CMD_BIT_EXPIRED bit then cleared
cmd->se_cmd, but CPU2 copied cmd->se_cmd to se_cmd and saw it was NULL
but did not yet see the TCMU_CMD_BIT_EXPIRED bit set?

It looks like, if you do the above patch, the above function will call
target_complete_cmd and tcmu_handle_completion will call it again, so we
will have a double free issue.
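
To make the ordering question above concrete, the suspected interleaving
would be something like the following (an illustration of the question
only, not a confirmed analysis of the crash):

/*
 * CPU1 (timeout path)                     CPU2 (completion path)
 * -------------------                     ----------------------
 * set_bit(TCMU_CMD_BIT_EXPIRED,
 *         &cmd->flags);
 * cmd->se_cmd = NULL;
 *                                         se_cmd = cmd->se_cmd;   // sees NULL
 *                                         test_bit(TCMU_CMD_BIT_EXPIRED,
 *                                                  &cmd->flags);  // still sees 0
 *                                         // -> uses se_cmd anyway
 *
 * If that is the failure mode, the fix would be ordering (for example
 * smp_mb__after_atomic() after the set_bit() plus a matching barrier on
 * the reader) rather than dropping the cmd->se_cmd = NULL assignment.
 */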


Re: [PATCHv2] MAINTAINERS: Update open-iscsi maintainers

2016-09-27 Thread Mike Christie
On 09/26/2016 05:25 PM, Lee Duncan wrote:
> Chris Leech and I are taking over as open-iscsi maintainers.
> 
> Changes since v1:
>  * Updated URL to open-iscsi.com
>  * Removed git repository, since code in tree
> ---
>  MAINTAINERS | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 01bff8ea28d8..81384a2562e7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6448,10 +6448,10 @@ S:Maintained
>  F:   drivers/firmware/iscsi_ibft*
>  
>  ISCSI
> -M:   Mike Christie <micha...@cs.wisc.edu>
> +M:   Lee Duncan <ldun...@suse.com>
> +M:   Chris Leech <cle...@redhat.com>
>  L:   open-is...@googlegroups.com
> -W:   www.open-iscsi.org
> -T:   git 
> git://git.kernel.org/pub/scm/linux/kernel/git/mnc/linux-2.6-iscsi.git
> +W:   www.open-iscsi.com
>  S:   Maintained
>  F:   drivers/scsi/*iscsi*
>  F:   include/scsi/*iscsi*
> 

After over 10 years, I will get to take a vacation where I do not have
to think about iSCSI :)

Reviewed-by: Mike Christie <mchri...@redhat.com>


Re: ERROR: "__ucmpdi2" [drivers/scsi/sd_mod.ko] undefined!

2016-09-06 Thread Mike Christie
Adding Steven and adi-buildroot-devel list.

On 09/04/2016 12:12 PM, kbuild test robot wrote:
> Hi Mike,
> 
> FYI, the error/warning still remains.
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   9ca581b50dab6103183396852cc08e440fcda18e
> commit: 4e1b2d52a80d79296a5d899d73249748dea71a53 block, fs, drivers: remove 
> REQ_OP compat defs and related code
> date:   3 months ago
> config: blackfin-CM-BF548_defconfig (attached as .config)
> compiler: bfin-uclinux-gcc (GCC) 4.6.3
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> git checkout 4e1b2d52a80d79296a5d899d73249748dea71a53
> # save the attached .config to linux build tree
> make.cross ARCH=blackfin 
> 
> All errors (new ones prefixed by >>):
> 
>>> ERROR: "__ucmpdi2" [drivers/scsi/sd_mod.ko] undefined!
> 

Blackfin is still maintained by Steven Miao, right?

I had sent mail about this before but did not hear a response, so I am
not sure what to do next. I had seen in the past there was a similar error
for blackfin and someone submitted a patch

https://lkml.org/lkml/2015/4/5/4

to add __ucmpdi2. Were there issues with that patch or did you guys want
it fixed differently?
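
For context, __ucmpdi2 is the libgcc helper gcc emits calls to when it
cannot inline a 64-bit unsigned comparison on a 32-bit target. A typical
out-of-line implementation is along these lines (a sketch of the usual
convention, not necessarily the exact patch linked above):

int __ucmpdi2(unsigned long long a, unsigned long long b)
{
	/* libgcc convention: 0 if a < b, 1 if a == b, 2 if a > b */
	if (a < b)
		return 0;
	if (a > b)
		return 2;
	return 1;
}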



Re: ERROR: "__ucmpdi2" [drivers/scsi/sd_mod.ko] undefined!

2016-09-06 Thread Mike Christie
Adding Steven and adi-buildroot-devel list.

On 09/04/2016 12:12 PM, kbuild test robot wrote:
> Hi Mike,
> 
> FYI, the error/warning still remains.
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   9ca581b50dab6103183396852cc08e440fcda18e
> commit: 4e1b2d52a80d79296a5d899d73249748dea71a53 block, fs, drivers: remove 
> REQ_OP compat defs and related code
> date:   3 months ago
> config: blackfin-CM-BF548_defconfig (attached as .config)
> compiler: bfin-uclinux-gcc (GCC) 4.6.3
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> git checkout 4e1b2d52a80d79296a5d899d73249748dea71a53
> # save the attached .config to linux build tree
> make.cross ARCH=blackfin 
> 
> All errors (new ones prefixed by >>):
> 
>>> ERROR: "__ucmpdi2" [drivers/scsi/sd_mod.ko] undefined!
> 

Blackfin is still maintained by Steven Miao right?

I had sent mail about this before, but did not hear a response, so I am
not sure what to next. I had seen in the past there was a similar error
for blackfin and someone submitted a patch

https://lkml.org/lkml/2015/4/5/4

to add __ucmpdi2. Were there issues with that patch or did you guys want
it fixed differently?



Re: [PATCH 37/45] drivers: use req op accessor

2016-08-03 Thread Mike Christie
On 08/03/2016 07:30 PM, Shaun Tancheff wrote:
> On Wed, Aug 3, 2016 at 6:47 PM, Mike Christie <mchri...@redhat.com> wrote:
>> On 08/03/2016 05:33 PM, Ross Zwisler wrote:
>>> On Sun, Jun 5, 2016 at 1:32 PM,  <mchri...@redhat.com> wrote:
>>>> From: Mike Christie <mchri...@redhat.com>
>>>>
>>>> The req operation REQ_OP is separated from the rq_flag_bits
>>>> definition. This converts the block layer drivers to
>>>> use req_op to get the op from the request struct.
>>>>
>>>> Signed-off-by: Mike Christie <mchri...@redhat.com>
>>>> ---
>>>>  drivers/block/loop.c  |  6 +++---
>>>>  drivers/block/mtip32xx/mtip32xx.c |  2 +-
>>>>  drivers/block/nbd.c   |  2 +-
>>>>  drivers/block/rbd.c   |  4 ++--
>>>>  drivers/block/xen-blkfront.c  |  8 +---
>>>>  drivers/ide/ide-floppy.c  |  2 +-
>>>>  drivers/md/dm.c   |  2 +-
>>>>  drivers/mmc/card/block.c  |  7 +++
>>>>  drivers/mmc/card/queue.c  |  6 ++
>>>
>>> Dave Chinner reported a deadlock with XFS + DAX, which I reproduced
>>> and bisected to this commit:
>>>
>>> commit c2df40dfb8c015211ec55f4b1dd0587f875c7b34
>>> Author: Mike Christie <mchri...@redhat.com>
>>> Date:   Sun Jun 5 14:32:17 2016 -0500
>>> drivers: use req op accessor
>>>
>>> Here are the steps to reproduce the deadlock with a BRD ramdisk:
>>>
>>> mkfs.xfs -f /dev/ram0
>>> mount -o dax /dev/ram0 /mnt/scratch
>>
>> When using ramdisks, we need the attached patch like in your other bug
>> report. I think it will fix some hangs people are seeing.
>>
>> I do not think that it should cause the failure to run issue you saw
>> when doing generic/008 and ext2.
>>
> 
> I think the translation in loop.c is suspicious here:
> 
> "if use DIO && not (a flush_flag or discard_flag)"
> should translate to:
> "if use DIO && not ((a flush_flag) || op == discard)"
> 
> But in the patch I read:
> "if use DIO && ((not a flush_flag) || op == discard)
> 
> Which would have DIO && discards follow the AIO path?
> 
> So I would humbly suggest something like the following
> (on top of commit c2df40dfb8c015211ec55f4b1dd0587f875c7b34):
> [Please excuse the messed up patch format ... gmail eats tabs]
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index b9b737c..0754d83 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1659,8 +1659,9 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> if (lo->lo_state != Lo_bound)
> return -EIO;
> 
> -   if (lo->use_dio && (!(cmd->rq->cmd_flags & REQ_FLUSH) ||
> -   req_op(cmd->rq) == REQ_OP_DISCARD))
> +   if (lo->use_dio && !(
> +   (cmd->rq->cmd_flags & REQ_FLUSH) ||
> +req_op(cmd->rq) == REQ_OP_DISCARD))
> cmd->use_aio = true;
> else
> cmd->use_aio = false;
> 

You are right. The translation was bad and your code above is correct.

I think we also need my patch from the other mail, because for the
rw_page user case, if WB_SYNC_ALL is set, the IO gets sent down as a
read instead of a write.


Re: [PATCH 37/45] drivers: use req op accessor

2016-08-03 Thread Mike Christie
On 08/03/2016 05:33 PM, Ross Zwisler wrote:
> On Sun, Jun 5, 2016 at 1:32 PM,  <mchri...@redhat.com> wrote:
>> From: Mike Christie <mchri...@redhat.com>
>>
>> The req operation REQ_OP is separated from the rq_flag_bits
>> definition. This converts the block layer drivers to
>> use req_op to get the op from the request struct.
>>
>> Signed-off-by: Mike Christie <mchri...@redhat.com>
>> ---
>>  drivers/block/loop.c  |  6 +++---
>>  drivers/block/mtip32xx/mtip32xx.c |  2 +-
>>  drivers/block/nbd.c   |  2 +-
>>  drivers/block/rbd.c   |  4 ++--
>>  drivers/block/xen-blkfront.c  |  8 +---
>>  drivers/ide/ide-floppy.c  |  2 +-
>>  drivers/md/dm.c   |  2 +-
>>  drivers/mmc/card/block.c  |  7 +++
>>  drivers/mmc/card/queue.c  |  6 ++
> 
> Dave Chinner reported a deadlock with XFS + DAX, which I reproduced
> and bisected to this commit:
> 
> commit c2df40dfb8c015211ec55f4b1dd0587f875c7b34
> Author: Mike Christie <mchri...@redhat.com>
> Date:   Sun Jun 5 14:32:17 2016 -0500
> drivers: use req op accessor
> 
> Here are the steps to reproduce the deadlock with a BRD ramdisk:
> 
> mkfs.xfs -f /dev/ram0
> mount -o dax /dev/ram0 /mnt/scratch

When using ramdisks, we need the attached patch like in your other bug
report. I think it will fix some hangs people are seeing.

I do not think that it should cause the failure-to-run issue you saw
when doing generic/008 and ext2.

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3022dad..9fbbeba 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -300,20 +300,20 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
  * Process a single bvec of a bio.
  */
 static int brd_do_bvec(struct brd_device *brd, struct page *page,
-			unsigned int len, unsigned int off, int rw,
+			unsigned int len, unsigned int off, int op,
 			sector_t sector)
 {
 	void *mem;
 	int err = 0;
 
-	if (rw != READ) {
+	if (op_is_write(op)) {
 		err = copy_to_brd_setup(brd, sector, len);
 		if (err)
 			goto out;
 	}
 
 	mem = kmap_atomic(page);
-	if (rw == READ) {
+	if (!op_is_write(op)) {
 		copy_from_brd(mem + off, brd, sector, len);
 		flush_dcache_page(page);
 	} else {
@@ -330,7 +330,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
 	struct brd_device *brd = bdev->bd_disk->private_data;
-	int rw;
 	struct bio_vec bvec;
 	sector_t sector;
 	struct bvec_iter iter;
@@ -347,14 +346,12 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 		goto out;
 	}
 
-	rw = bio_data_dir(bio);
-
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 		int err;
 
 		err = brd_do_bvec(brd, bvec.bv_page, len,
-	bvec.bv_offset, rw, sector);
+	bvec.bv_offset, bio_op(bio), sector);
 		if (err)
 			goto io_error;
 		sector += len >> SECTOR_SHIFT;
@@ -369,11 +366,11 @@ io_error:
 }
 
 static int brd_rw_page(struct block_device *bdev, sector_t sector,
-		   struct page *page, int rw)
+		   struct page *page, int op, int op_flags)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
-	int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, rw, sector);
-	page_endio(page, rw & WRITE, err);
+	int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, op, sector);
+	page_endio(page, op, err);
 	return err;
 }
 
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7454cf1..f0e126c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -843,15 +843,15 @@ static void zram_bio_discard(struct zram *zram, u32 index,
 }
 
 static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
-			int offset, int rw)
+			int offset, int op)
 {
 	unsigned long start_time = jiffies;
 	int ret;
 
-	generic_start_io_acct(rw, bvec->bv_len >> SECTOR_SHIFT,
+	generic_start_io_acct(op, bvec->bv_len >> SECTOR_SHIFT,
 			&zram->disk->part0);
 
-	if (rw == READ) {
+	if (!op_is_write(op)) {
 		atomic64_inc(&zram->stats.num_reads);
 		ret = zram_bvec_read(zram, bvec, index, offset);
 	} else {
@@ -859,10 +859,10 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
 		ret = zram_bvec_write(zram, bvec, index, offset);
 	}
 
-	generic_end_io_acct(rw, &zram->disk->part0, start_time);
+	generic_end_io_acct(op, &zram->disk->part0, start_time);
 
 	if (unlikely(ret)) {
-		if (rw == READ)
+		if (!op_is_write(op))
 			atomic64_inc(&zram->stats.failed_reads);
 		else
 			atomic64_inc(&zram->stats.failed_writes);
@@ -873,7 +873,7 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
 
 static void __zram_make_request(struct zram *zram, struct bio *bio)
 {
-	int offset, rw;
+	int offset;
 	u32 index;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
@@ -888,7 +888,6 @@ static void __zram_make_request(struct zram *zram, st


Re: [PATCH 42/45] block, fs, drivers: remove REQ_OP compat defs and related code

2016-08-03 Thread Mike Christie
On 08/03/2016 11:25 AM, Ross Zwisler wrote:
>  run fstests generic/008 at 2016-08-03 09:54:56
> page:ea0017af04c0 count:3 mapcount:0 mapping:8805eb059200 index:0x0
> flags: 0x3fff802828(uptodate|lru|private|writeback)
> page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
> page->mem_cgroup:8806098e0800
> [ cut here ]
> kernel BUG at mm/filemap.c:833!
> invalid opcode:  [#1] SMP
> Modules linked in: brd dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm
> CPU: 0 PID: 2522 Comm: xfs_io Not tainted 4.7.0-rc2-00042-g4e1b2d52 #18
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> task: 8805ebae4ec0 ti: 8805eba3c000 task.ti: 8805eba3c000
> RIP: 0010:[] [] unlock_page+0xa5/0xb0
> RSP: 0018:8805eba3fa60 EFLAGS: 00010282
> RAX: 0021 RBX:  RCX: 0006
> RDX:  RSI: 0001 RDI: 8806109ce200
> RBP: 8805eba3fa60 R08: 0001 R09: 0001
> R10: 8805ebae4ec0 R11: 0001 R12: ea0017af04c0
> R13: 00028000 R14: a00202c0 R15: 88060eff1200
> FS: 7f87a31cf700() GS:88061080() knlGS:
> CS: 0010 DS:  ES:  CR0: 80050033
> CR2: 7f87a31e6000 CR3: 00060da31000 CR4: 001406f0
> Stack:
> 8805eba3fa98 812bd782 8805eba3fdb0 1000
> ea0017af04c0  0088 8805eba3fbe0
> 812c3ff1 8805eba3fd00 00028000 000c
> Call Trace:
> [] bdev_write_page+0xb2/0xe0 fs/block_dev.c:462
> [] __mpage_writepage+0x5c1/0x750 fs/mpage.c:604
> [] write_cache_pages+0x20d/0x5f0 mm/page-writeback.c:2261
> [] mpage_writepages+0x75/0xe0 fs/mpage.c:703
> [] ext2_writepages+0x3b/0x40 fs/ext2/inode.c:887
> [] do_writepages+0x21/0x30 mm/page-writeback.c:2361
> [] __filemap_fdatawrite_range+0xc6/0x100 mm/filemap.c:300
> [] filemap_write_and_wait_range+0x44/0x90 mm/filemap.c:490
> [] __generic_file_fsync+0x27/0x90 fs/libfs.c:937
> [] generic_file_fsync+0x19/0x40 fs/libfs.c:974
> [] ext2_fsync+0x2e/0x70 fs/ext2/file.c:149
> [] vfs_fsync_range+0x4b/0xb0 fs/sync.c:195
> [< inline >] vfs_fsync fs/sync.c:209
> [] do_fsync+0x3d/0x70 fs/sync.c:219
> [< inline >] SYSC_fsync fs/sync.c:227
> [] SyS_fsync+0x10/0x20 fs/sync.c:225
> [] entry_SYSCALL_64_fastpath+0x1f/0xbd
> arch/x86/entry/entry_64.S:207
> Code: 00 00 48 d3 ea 89 d2 48 8d 0c 92 48 8d 14 4a 48 8d 3c d0 31 d2
> e8 bc fc f1 ff 5d c3 48 c7 c6 20 1d ec 81 4c 89 c7 e8 bb 8d 03 00 <0f>
> 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 b9 08 00 00
> RIP [] unlock_page+0xa5/0xb0 mm/filemap.c:833
> RSP 
> ---[ end trace d419bf59bba263fb ]---


Thanks for testing and the detailed bug report. Looks like I missed the
rw_page callback. Testing a patch right now. Should be done in a couple
of hours.




Re: [PATCH 28/45] target: use bio op accessors

2016-06-06 Thread Mike Christie
On 06/06/2016 01:46 AM, Hannes Reinecke wrote:
> On 06/05/2016 09:32 PM, mchri...@redhat.com wrote:
>> From: Mike Christie <mchri...@redhat.com>
>>
>> Separate the op from the rq_flag_bits and have the target layer
>> set/get the bio using bio_set_op_attrs/bio_op.
>>
>> Signed-off-by: Mike Christie <mchri...@redhat.com>
>> ---
>>  drivers/target/target_core_iblock.c | 29 ++---
>>  drivers/target/target_core_pscsi.c  |  2 +-
>>  2 files changed, 15 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/target/target_core_iblock.c 
>> b/drivers/target/target_core_iblock.c
>> index c25109c..22af12f 100644
>> --- a/drivers/target/target_core_iblock.c
>> +++ b/drivers/target/target_core_iblock.c
> [ .. ]
>> @@ -689,18 +690,15 @@ iblock_execute_rw(struct se_cmd *cmd, struct 
>> scatterlist *sgl, u32 sgl_nents,
>>   * Force writethrough using WRITE_FUA if a volatile write cache
>>   * is not enabled, or if initiator set the Force Unit Access 
>> bit.
>>   */
>> +op = REQ_OP_WRITE;
>>  if (test_bit(QUEUE_FLAG_FUA, &q->queue_flags)) {
>>  if (cmd->se_cmd_flags & SCF_FUA)
>> -rw = WRITE_FUA;
>> +op_flags = WRITE_FUA;
>>  else if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags))
>> -rw = WRITE_FUA;
>> -else
>> -rw = WRITE;
>> -} else {
>> -rw = WRITE;
>> +op_flags = WRITE_FUA;
>>  }
> Wrong indentation.

It should be ok. That line is for the QUEUE_FLAG_WC test. We end up with:

op = REQ_OP_WRITE;
if (test_bit(QUEUE_FLAG_FUA, &q->queue_flags)) {
	if (cmd->se_cmd_flags & SCF_FUA)
		op_flags = WRITE_FUA;
	else if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags))
		op_flags = WRITE_FUA;
}
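
For context, a sketch of how that op/op_flags pair is then applied to each
bio using the accessor named in the commit message (the exact call site in
iblock_execute_rw() is an assumption here, not shown in the hunk):

	/* op and op_flags computed as above; bio allocation unchanged */
	bio_set_op_attrs(bio, op, op_flags);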






Re: [PATCH 00/42] v7: separate operations from flags in the bio/request structs

2016-05-04 Thread Mike Christie
On 05/04/2016 12:58 PM, Jeff Moyer wrote:
> Mike Christie <mchri...@redhat.com> writes:
> 
>> On 05/03/2016 03:44 PM, Jeff Moyer wrote:
>>> Hi, Mike,
>>>
>>> That git tree doesn't seem to exist.  I did manage to apply your patch
>>> set on top of next-20160415, though.
>>>
>>> So... what testing did you do? ;-) I ran into the following problems
>>
>> I normally run xfstests and run it on my daily workstation and laptop. I
>> did not do this for every FS this time and hit a regression.
>>
>> What FS were you using?
> 
> I'm using xfs, scsi disk, no blk-mq, no dm.
> 
>>> - git clone fails
>>> - yum segfaults
>>
>> In v7/v6, I missed a new submit_bio call, so I hit issues like the two
>> above. I have this fixed in the next version.
> 
> OK, does this mean you're posting another version, or you already did
> and I somehow missed it?
> 

I did not repost yet. I am still testing. Jens had me add some wrappers
around the operation access, and because my last screw-up and the wrapper
change affected all my patches, I am redoing all my testing.


Re: [PATCH 00/42] v7: separate operations from flags in the bio/request structs

2016-05-04 Thread Mike Christie
On 05/03/2016 03:44 PM, Jeff Moyer wrote:
> mchri...@redhat.com writes:
> 
>> The following patches begin to cleanup the request->cmd_flags and
>> bio->bi_rw mess. We currently use cmd_flags to specify the operation,
>> attributes and state of the request. For bi_rw we use it for similar
>> info and also the priority but then also have another bi_flags field
>> for state. At some point, we abused them so much we just made cmd_flags
>> 64 bits, so we could add more.
>>
>> The following patches separate the operation (read, write, discard,
>> flush, etc) from cmd_flags/bi_rw.
>>
>> This patchset was made against linux-next from today April 15
>> (git tag next-20160415).
>>
>> I put a git tree here:
>> https://github.com/mikechristie/linux-kernel.git
>> The patches are in the op branch.
> 
> Hi, Mike,
> 
> That git tree doesn't seem to exist.  I did manage to apply your patch
> set on top of next-20160415, though.
> 
> So... what testing did you do? ;-) I ran into the following problems

I normally run xfstests on my daily workstation and laptop. I did not do
this for every FS this time and hit a regression.

What FS were you using?

> - git clone fails
> - yum segfaults


In v7/v6, I missed a new submit_bio call, so I hit issues like the two
above. I have this fixed in the next version.

> - many blktrace/blkparse issues, including incorrect cpu recorded in
>   traces, null task names, and blkparse outputting nothing for a trace
>   file several gigabytes in size.

I will double check for these issues.


Re: [PATCH 41/42] block: do not use REQ_FLUSH for tracking flush support

2016-04-15 Thread Mike Christie
On 04/15/2016 05:50 AM, Juergen Gross wrote:
> On 15/04/16 12:40, mchri...@redhat.com wrote:
>> From: Mike Christie <mchri...@redhat.com>
>>
>> The last patch added a REQ_OP_FLUSH for request_fn drivers
>> and the next patch renames REQ_FLUSH to REQ_PREFLUSH which
>> will be used by file systems and make_request_fn drivers so
>> they can send a write/flush combo.
>>
>> This patch drops xen's use of REQ_FLUSH to track if it supports
>> REQ_OP_FLUSH requests, so REQ_FLUSH can be deleted.
>>
>> v6:
>> - Dropped parts of patch handled by Jens's QUEUE_FLAG_WC/FUA
>> patches and modified patch to check feature_flush/fua bits.
>>
>> Signed-off-by: Mike Christie <mchri...@redhat.com>
>> Reviewed-by: Hannes Reinecke <h...@suse.com>
>> ---
>>  drivers/block/xen-blkfront.c | 47 
>> ++--
>>  1 file changed, 24 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index f01691a..d6429e7 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
> 
> ...
> 
>> @@ -985,24 +981,22 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, 
>> u16 sector_size,
>>  return 0;
>>  }
>>  
>> -static const char *flush_info(unsigned int feature_flush)
>> +static const char *flush_info(struct blkfront_info *info)
>>  {
>> -switch (feature_flush & ((REQ_FLUSH | REQ_FUA))) {
>> -case REQ_FLUSH|REQ_FUA:
>> +if (info->feature_flush && info->feature_fua)
>>  return "barrier: enabled;";
>> -case REQ_FLUSH:
>> +else if (info->feature_fua)
> 
> Shouldn't this test feature_flush?
> 
>>  return "flush diskcache: enabled;";
>> -default:
>> +else
>>  return "barrier or flush: disabled;";
>> -}
>>  }
>>  
>>  static void xlvbd_flush(struct blkfront_info *info)
>>  {
>> -blk_queue_write_cache(info->rq, info->feature_flush & REQ_FLUSH,
>> -info->feature_flush & REQ_FUA);
>> +blk_queue_write_cache(info->rq, info->feature_flush ? true : false,
>> +  info->feature_flush ? true : false);
> 
> And here the second test should be feature_fua?
> 

You are right. Will fix up and resend.
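
For clarity, the resend should end up along these lines (a sketch of just
the two corrections, not the final patch):

static const char *flush_info(struct blkfront_info *info)
{
	if (info->feature_flush && info->feature_fua)
		return "barrier: enabled;";
	else if (info->feature_flush)
		return "flush diskcache: enabled;";
	else
		return "barrier or flush: disabled;";
}

static void xlvbd_flush(struct blkfront_info *info)
{
	blk_queue_write_cache(info->rq,
			      info->feature_flush ? true : false,
			      info->feature_fua ? true : false);
	/* rest of xlvbd_flush() unchanged */
}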



Re: [PATCH 1/6] block: ensure we don't truncate top bits of the request command flags

2016-03-24 Thread Mike Christie
On 03/22/2016 02:01 PM, Jens Axboe wrote:
> On 03/22/2016 12:59 PM, Christoph Hellwig wrote:
>> On Tue, Mar 22, 2016 at 11:55:15AM -0600, Jens Axboe wrote:
>>> Some of the flags that we want to use from the make_request_fn path
>>> are now larger than 32-bit, so change the functions involved to
>>> accept an u64 instead of an unsigned int.
>>
>> When did we start doing that?  We really should merge Mike's split
>> of the operation style flags into the cmd_type before making things
>> even worse in the flags area.
> 
> Just now, and I ran into it last week as well, for a test patch on cfq
> that passed in higher flags for get_request -> may_queue() as well. We
> can do Mike's split first, I think it's a good cleanup. As a standalone
> series, I needed it though.
> 

Hey, did you want any changes on that patchset? I was going to repost it
with the kbuild fix against linux-next, but I can make any changes you
wanted first.


Re: [PATCH 00/35 v4] separate operations from flags in the bio/request structs

2016-02-29 Thread Mike Christie
On 02/29/2016 11:13 AM, Christoph Hellwig wrote:
> Any reason you've dropped my ACK for the previous version?

Sorry. I think I accidentally dropped it when I rebased back over to
linux-next. It looks like it only got stuck on the first patch. I will
fix that up on the resend for those kbuild errors.


Re: [PATCH] Use ida_simple for SCSI iSCSI transport session id

2016-02-16 Thread Mike Christie
On 02/15/2016 12:26 PM, Chris Leech wrote:
> On Fri, Feb 12, 2016 at 09:54:51AM -0800, James Bottomley wrote:
>> On Fri, 2016-02-12 at 09:38 -0800, Lee Duncan wrote:
>>> The scsi_transport_iscsi module already uses the ida_simple
>>> routines for managing the target ID, if requested to do
>>> so. This change replaces an ever-increasing atomic integer
>>> that tracks the session ID itself with the ida_simple
>>> family of routines. This means that the session ID
>>> will be reclaimed and can be reused when the session
>>> is freed.
>>
>> Is reusing session ID's really a good idea?  For sequential sessions it
>> means that the ID of the next session will be re-used, i.e. the same as
>> the previous sessions, which could lead to target confusion.  I think
>> local uniqueness of session IDs is more important than wrap around
>> because sessions are short lived entities and the chances of the same
>> session being alive by the time we've wrapped is pretty tiny.
> 
> I've got a few complaints about target resources being tied up because
> we don't reuse session IDs.  The ISID becomes a component in the
> I_T nexus identifier, so changing it invalidates persistent reservations.
> 
>> If you can demostrate a multi-target problem, perhaps we should rather
>> fix this by making the next session id a target local quantity?
> 
> Mike's got a good point that we don't really need to base the ISID off
> of our local session identifier (kobject name).  I think getting reuse
> right may be a bit trickier than being a target local value, because it
> needs to be unique across target portal groups.  Which probably furthers
> the argument that we should deal with that in the userspace tools.
> 
> If we plan to split the protocol ISID cleanly from the kobject name,
> I guess the question is if aggressive reuse of the local identifier is
> better than dealing with the unlikely collision on rollover?

I thought Lee's patch to convert the host_no from an atomic_t to being
ida-based was merged in Martin's tree. If that is going upstream, then I
thought you would want to fix the session id too.

Is the concern similar to /dev/sdX reuse and bad apps? In this case,
some app might do a logout and login, but not update the sysfs mapping.
You could then hit corruption due to the sysfs session id now mapping to
a different target.
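
For reference, the allocation pattern Lee's patch is built on looks roughly
like this (the names below are illustrative, not the exact patch):

static DEFINE_IDA(iscsi_sess_ida);

static int iscsi_alloc_session_id(void)
{
	/* hand out the lowest free id so ids get reused after logout */
	return ida_simple_get(&iscsi_sess_ida, 0, 0, GFP_KERNEL);
}

static void iscsi_free_session_id(int id)
{
	ida_simple_remove(&iscsi_sess_ida, id);
}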

