Re: [PATCH] qedi: Fix truncation of name and secret

2018-02-05 Thread Javali, Nilesh

On 2/6/18, 12:51 AM, "Lee Duncan"  wrote:

>On 02/05/2018 11:15 AM, Lee Duncan wrote:
>> On 01/31/2018 10:57 PM, Nilesh Javali wrote:
>>> Adjust the NULL byte added by snprintf.
>>>
>>> Signed-off-by: Nilesh Javali 
>>> ---
>>>  drivers/scsi/qedi/qedi_main.c | 12 ++--
>>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/scsi/qedi/qedi_main.c
>>>b/drivers/scsi/qedi/qedi_main.c
>>> index 34a..cf8badb 100644
>>> --- a/drivers/scsi/qedi/qedi_main.c
>>> +++ b/drivers/scsi/qedi/qedi_main.c
>>> @@ -1840,7 +1840,7 @@ static ssize_t qedi_show_boot_ini_info(void
>>>*data, int type, char *buf)
>>>  
>>> switch (type) {
>>> case ISCSI_BOOT_INI_INITIATOR_NAME:
>>> -   rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN, "%s\n",
>>> +   rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN + 1, "%s",
>>>   initiator->initiator_name.byte);
>>> break;
>>> default:
>>> @@ -1908,7 +1908,7 @@ static umode_t qedi_ini_get_attr_visibility(void
>>>*data, int type)
>>>  
>>> switch (type) {
>>> case ISCSI_BOOT_TGT_NAME:
>>> -   rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN, "%s\n",
>>> +   rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN + 1, "%s",
>>>   block->target[idx].target_name.byte);
>>> break;
>>> case ISCSI_BOOT_TGT_IP_ADDR:
>>> @@ -1930,19 +1930,19 @@ static umode_t
>>>qedi_ini_get_attr_visibility(void *data, int type)
>>>   block->target[idx].lun.value[0]);
>>> break;
>>> case ISCSI_BOOT_TGT_CHAP_NAME:
>>> -   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN, "%s\n",
>>> +   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN + 1, "%s",
>>>   chap_name);
>>> break;
>>> case ISCSI_BOOT_TGT_CHAP_SECRET:
>>> -   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN, "%s\n",
>>> +   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN + 1, "%s",
>>>   chap_secret);
>>> break;
>>> case ISCSI_BOOT_TGT_REV_CHAP_NAME:
>>> -   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN, "%s\n",
>>> +   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN + 1, "%s",
>>>   mchap_name);
>>> break;
>>> case ISCSI_BOOT_TGT_REV_CHAP_SECRET:
>>> -   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN, "%s\n",
>>> +   rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN + 1, "%s",
>>>   mchap_secret);
>>> break;
>>> case ISCSI_BOOT_TGT_FLAGS:
>>>
>> 
>> Reviewed-by: Lee Duncan 
>> 
>
>Assuming you address Bart's comments.
>-- 
>Lee Duncan
>SUSE Labs

Bart, Lee,

Thanks for the review and comments. Please ignore the current patch.

The data in NVRAM is not guaranteed to be NUL terminated.
Hence, in the V2 solution, the data will be copied up to the element size or
to the first NUL in the byte stream, and a NUL will then be appended.

I will post V2 of the solution shortly.
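
For illustration, a minimal sketch of that copy logic (the helper name
qedi_nvm_str_copy() is made up here; this is not the actual V2 patch):

/* Copy at most 'size' bytes from a fixed-size, possibly unterminated
 * NVRAM field, stopping at the first NUL; always NUL-terminate 'dst'.
 * 'dst' is assumed to hold at least size + 1 bytes.
 */
static void qedi_nvm_str_copy(char *dst, const u8 *src, size_t size)
{
	size_t i;

	for (i = 0; i < size && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
}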


Thanks,
Nilesh




RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Kashyap Desai
> > We still have more than one reply queue ending up completion one CPU.
>
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> smp_affinity_enable has to be set as 1, but seems it is the default setting.
>
> Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors()
> which figures out an optimal number of vectors, and the computation is
> based on cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> mapped to some of reply queues, these queues won't be active (no request
> submitted to these queues). The mechanism of PCI_IRQ_AFFINITY basically
> makes sure that more than one irq vector won't be handled by one same
> CPU, and the irq vector spread is done in irq_create_affinity_masks().
>
> > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > module parameter to simulate the issue. We need more number of Online
> > CPU than reply-queue.
>
> IMO, you don't need to simulate the issue; pci_alloc_irq_vectors(
> PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
> irq vector number, num_possible_cpus()/num_online_cpus() and each irq
> vector's affinity assignment.
>
> > We may see completion redirected to original CPU because of
> > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one CPU
> > busy in local ISR routine.
>
> Could you dump each irq vector's affinity assignment of your megaraid
> in your test?

To quickly reproduce the issue, I restricted the megaraid_sas driver to a
single MSI-x vector. The system has 16 online CPUs in total.

Output of affinity hints:
kernel version:
Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64 x86_64
x86_64 GNU/Linux
PCI name is 83:00.0, dump its irq affinity:
irq 105, cpu list 0-3,8-11

The affinity mask is created properly, but only CPU 0 is overloaded with
interrupt processing.

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 47861 MB
node 0 free: 46516 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 64491 MB
node 1 free: 62805 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Output of system activity (sar); %gnice is 100% and is consumed in the
megaraid_sas ISR routine:

12:44:40 PM  CPU   %usr  %nice   %sys  %iowait  %steal  %irq  %soft  %guest  %gnice  %idle
12:44:41 PM  all   6.03   0.00  29.98     0.00    0.00  0.00   0.00    0.00    0.00  63.99
12:44:41 PM    0   0.00   0.00   0.00     0.00    0.00  0.00   0.00    0.00  100.00   0.00


In my test, rq_affinity was set to 2 (QUEUE_FLAG_SAME_FORCE). I also used
the "host_tagset" V2 patch set for megaraid_sas.

Using the RFC requested in
"https://marc.info/?l=linux-scsi&m=151601833418346&w=2", the lockup is
avoided (you can notice that %gnice shifts to %soft). Even though it is
100% consumed, the completion loop always has an exit because of the
irqpoll_weight passed to irq_poll_init().

Average:     CPU   %usr  %nice   %sys  %iowait  %steal  %irq   %soft  %guest  %gnice  %idle
Average:     all   4.25   0.00  21.61     0.00    0.00  0.00    6.61    0.00    0.00  67.54
Average:       0   0.00   0.00   0.00     0.00    0.00  0.00  100.00    0.00    0.00   0.00


Hope this clarifies. We need a different fix to avoid the lockups. Can we
consider using the irq poll interface when the number of CPUs is greater
than the number of reply queues/MSI-x vectors?
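
For reference, the irq_poll pattern being suggested looks roughly like the
sketch below. All names except the irq_poll_* API are hypothetical (this is
not actual megaraid_sas code), and mr_complete_cmds() is a made-up reaping
helper:

#include <linux/irq_poll.h>
#include <linux/interrupt.h>

struct mr_reply_queue {
	struct irq_poll iopoll;
	/* ... reply descriptor ring, etc. ... */
};

static int mr_irqpoll_fn(struct irq_poll *iop, int budget)
{
	struct mr_reply_queue *rq =
		container_of(iop, struct mr_reply_queue, iopoll);
	int done = mr_complete_cmds(rq, budget);  /* reap up to 'budget' replies */

	if (done < budget)
		irq_poll_complete(iop);  /* ring drained, back to interrupt mode */
	return done;
}

static irqreturn_t mr_isr(int irq, void *data)
{
	struct mr_reply_queue *rq = data;

	irq_poll_sched(&rq->iopoll);  /* defer completion work to softirq */
	return IRQ_HANDLED;
}

/* at init time: irq_poll_init(&rq->iopoll, MR_IRQPOLL_WEIGHT, mr_irqpoll_fn);
 * the weight bounds each softirq pass, so one CPU is never monopolized. */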

>
> And the following script can do it easily, and the pci path (the 1st
> column of lspci output) needs to be passed, such as: 00:1c.4.
>
> #!/bin/sh
> if [ $# -ge 1 ]; then
>     PCID=$1
> else
>     PCID=`lspci | grep "Non-Volatile memory" | cut -c1-7`
> fi
> PCIP=`find /sys/devices -name *$PCID | grep pci`
> IRQS=`ls $PCIP/msi_irqs`
>
> echo "kernel version: "
> uname -a
>
> echo "PCI name is $PCID, dump its irq affinity:"
> for IRQ in $IRQS; do
>     CPUS=`cat /proc/irq/$IRQ/smp_affinity_list`
>     echo "\tirq $IRQ, cpu list $CPUS"
> done
>
>
> Thanks,
> Ming


Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread chenxiang (M)

On 2018/2/5 23:20, Ming Lei wrote:

This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, and meanwhile maps
each reply queue to a blk-mq hw queue, so that .queuecommand can always
choose the hw queue as the reply queue. If no online CPU is mapped to a
hw queue, no request can be submitted to that hw queue at all, which
finally solves the irq affinity issue.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
  drivers/scsi/hpsa.c | 51 ++-
  1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 443eabf63a9f..e517a4c74a28 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -51,6 +51,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include "hpsa_cmd.h"
@@ -956,6 +957,13 @@ static struct device_attribute *hpsa_shost_attrs[] = {
  #define HPSA_NRESERVED_CMDS   (HPSA_CMDS_RESERVED_FOR_DRIVER +\
 HPSA_MAX_CONCURRENT_PASSTHRUS)
  
+static int hpsa_map_queues(struct Scsi_Host *shost)
+{
+	struct ctlr_info *h = shost_to_hba(shost);
+
+	return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
+}
+


Hi Ming Lei,
It is okay to use blk_mq_pci_map_queues to solve the automatic irq affinity
issue when the first interrupt vector for the queues is 0. But if the first
interrupt vector for the queues is not 0, it seems we cannot use
blk_mq_pci_map_queues directly; blk_mq_virtio_map_queues, for example,
implements its own interface. Is it possible to provide a general interface
for those situations?
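
For context, blk_mq_pci_map_queues() at that time was essentially the loop
sketched below; a generic variant would only need a callback to fetch the
affinity mask for "first_vec + queue" instead of assuming the queue index
equals the vector index. The helper name and signature here are
hypothetical, just to illustrate the idea:

static int blk_mq_map_queues_by_affinity(struct blk_mq_tag_set *set,
					 void *dev, int first_vec,
					 const struct cpumask *
					 (*get_mask)(void *dev, int vec))
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		/* caller supplies pci_irq_get_affinity(), a virtio
		 * equivalent, etc., with the right vector offset */
		mask = get_mask(dev, first_vec + queue);
		if (!mask)
			return -EINVAL;
		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;	/* cpu -> hw queue */
	}
	return 0;
}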



  static struct scsi_host_template hpsa_driver_template = {
.module = THIS_MODULE,
.name   = HPSA,
@@ -974,10 +982,13 @@ static struct scsi_host_template hpsa_driver_template = {
  #ifdef CONFIG_COMPAT
.compat_ioctl   = hpsa_compat_ioctl,
  #endif
+   .map_queues = hpsa_map_queues,
.sdev_attrs = hpsa_sdev_attrs,
.shost_attrs = hpsa_shost_attrs,
.max_sectors = 1024,
.no_write_same = 1,
+   .force_blk_mq = 1,
+   .host_tagset = 1,
  };
  
  static inline u32 next_command(struct ctlr_info *h, u8 q)

@@ -1045,11 +1056,7 @@ static void set_performant_mode(struct ctlr_info *h, 
struct CommandList *c,
c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
if (unlikely(!h->msix_vectors))
return;
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   c->Header.ReplyQueue =
-   raw_smp_processor_id() % h->nreply_queues;
-   else
-   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
+   c->Header.ReplyQueue = reply_queue;
}
  }
  
@@ -1063,10 +1070,7 @@ static void set_ioaccel1_performant_mode(struct ctlr_info *h,

 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->ReplyQueue = reply_queue % h->nreply_queues;
+   cp->ReplyQueue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit (bit 0)
@@ -1087,10 +1091,7 @@ static void set_ioaccel2_tmf_performant_mode(struct 
ctlr_info *h,
/* Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/* Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
 *  - pull count (bits 0-3)
@@ -1109,10 +1110,7 @@ static void set_ioaccel2_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % 

RE: [PATCH V2 4/8] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-05 Thread Don Brace
> This patch introduces the 'g_global_tags' parameter so that we can
> test this feature via null_blk easily.
> 
> No obvious performance drop is seen with global_tags when the whole hw
> queue depth is kept the same:
> 
> 1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0
> submit_queues=4 hw_queue_depth=1
> 
> 2) 'global_tags', global hw queue depth is 4, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1
> submit_queues=4 hw_queue_depth=4
> 
> 3) fio test done in above two settings:
>fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --
> ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  --
> name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 --
> name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3
> 
> 1M IOPS can be reached in both of the above tests, which were done in one VM.
> 
I am getting 1.1M IOPS for both cases.

Tested-by: Don Brace 




Re: [PATCH] qedi: Fix truncation of name and secret

2018-02-05 Thread Lee Duncan
On 02/05/2018 11:15 AM, Lee Duncan wrote:
> On 01/31/2018 10:57 PM, Nilesh Javali wrote:
>> Adjust the NULL byte added by snprintf.
>>
>> Signed-off-by: Nilesh Javali 
>> ---
>>  drivers/scsi/qedi/qedi_main.c | 12 ++--
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
>> index 34a..cf8badb 100644
>> --- a/drivers/scsi/qedi/qedi_main.c
>> +++ b/drivers/scsi/qedi/qedi_main.c
>> @@ -1840,7 +1840,7 @@ static ssize_t qedi_show_boot_ini_info(void *data, int 
>> type, char *buf)
>>  
>>  switch (type) {
>>  case ISCSI_BOOT_INI_INITIATOR_NAME:
>> -rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN, "%s\n",
>> +rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN + 1, "%s",
>>initiator->initiator_name.byte);
>>  break;
>>  default:
>> @@ -1908,7 +1908,7 @@ static umode_t qedi_ini_get_attr_visibility(void 
>> *data, int type)
>>  
>>  switch (type) {
>>  case ISCSI_BOOT_TGT_NAME:
>> -rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN, "%s\n",
>> +rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN + 1, "%s",
>>block->target[idx].target_name.byte);
>>  break;
>>  case ISCSI_BOOT_TGT_IP_ADDR:
>> @@ -1930,19 +1930,19 @@ static umode_t qedi_ini_get_attr_visibility(void 
>> *data, int type)
>>block->target[idx].lun.value[0]);
>>  break;
>>  case ISCSI_BOOT_TGT_CHAP_NAME:
>> -rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN, "%s\n",
>> +rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN + 1, "%s",
>>chap_name);
>>  break;
>>  case ISCSI_BOOT_TGT_CHAP_SECRET:
>> -rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN, "%s\n",
>> +rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN + 1, "%s",
>>chap_secret);
>>  break;
>>  case ISCSI_BOOT_TGT_REV_CHAP_NAME:
>> -rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN, "%s\n",
>> +rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN + 1, "%s",
>>mchap_name);
>>  break;
>>  case ISCSI_BOOT_TGT_REV_CHAP_SECRET:
>> -rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN, "%s\n",
>> +rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN + 1, "%s",
>>mchap_secret);
>>  break;
>>  case ISCSI_BOOT_TGT_FLAGS:
>>
> 
> Reviewed-by: Lee Duncan 
> 

Assuming you address Bart's comments.
-- 
Lee Duncan
SUSE Labs


Re: [PATCH] qedi: Fix truncation of name and secret

2018-02-05 Thread Lee Duncan
On 01/31/2018 10:57 PM, Nilesh Javali wrote:
> Adjust the NULL byte added by snprintf.
> 
> Signed-off-by: Nilesh Javali 
> ---
>  drivers/scsi/qedi/qedi_main.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
> index 34a..cf8badb 100644
> --- a/drivers/scsi/qedi/qedi_main.c
> +++ b/drivers/scsi/qedi/qedi_main.c
> @@ -1840,7 +1840,7 @@ static ssize_t qedi_show_boot_ini_info(void *data, int 
> type, char *buf)
>  
>   switch (type) {
>   case ISCSI_BOOT_INI_INITIATOR_NAME:
> - rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN, "%s\n",
> + rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN + 1, "%s",
> initiator->initiator_name.byte);
>   break;
>   default:
> @@ -1908,7 +1908,7 @@ static umode_t qedi_ini_get_attr_visibility(void *data, 
> int type)
>  
>   switch (type) {
>   case ISCSI_BOOT_TGT_NAME:
> - rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN, "%s\n",
> + rc = snprintf(str, NVM_ISCSI_CFG_ISCSI_NAME_MAX_LEN + 1, "%s",
> block->target[idx].target_name.byte);
>   break;
>   case ISCSI_BOOT_TGT_IP_ADDR:
> @@ -1930,19 +1930,19 @@ static umode_t qedi_ini_get_attr_visibility(void 
> *data, int type)
> block->target[idx].lun.value[0]);
>   break;
>   case ISCSI_BOOT_TGT_CHAP_NAME:
> - rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN, "%s\n",
> + rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN + 1, "%s",
> chap_name);
>   break;
>   case ISCSI_BOOT_TGT_CHAP_SECRET:
> - rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN, "%s\n",
> + rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN + 1, "%s",
> chap_secret);
>   break;
>   case ISCSI_BOOT_TGT_REV_CHAP_NAME:
> - rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN, "%s\n",
> + rc = snprintf(str, NVM_ISCSI_CFG_CHAP_NAME_MAX_LEN + 1, "%s",
> mchap_name);
>   break;
>   case ISCSI_BOOT_TGT_REV_CHAP_SECRET:
> - rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN, "%s\n",
> + rc = snprintf(str, NVM_ISCSI_CFG_CHAP_PWD_MAX_LEN + 1, "%s",
> mchap_secret);
>   break;
>   case ISCSI_BOOT_TGT_FLAGS:
> 

Reviewed-by: Lee Duncan 
-- 
Lee


[PATCH] libiscsi: ensure session spin lock usage consistent

2018-02-05 Thread Lee Duncan
The libiscsi code was using both spin_lock()/spin_unlock()
and spin_lock_bh()/spin_unlock_bh() on its session lock.
In addition, lock validation found that libiscsi.c was taking a
HARDIRQ-unsafe lock while holding a HARDIRQ-irq-safe lock:

> [ 2528.738454] =
> [ 2528.744679] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> [ 2528.749891] 4.15.0-6-g4548e6f42022-dirty #418 Not tainted
> [ 2528.754356] -
> [ 2528.759643] kworker/0:1H/100 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 2528.764608]  (&(&session->frwd_lock)->rlock){+.-.}, at: 
> [<97d8de0f>] iscsi_eh_cmd_timed_out+0x3d/0x2b0
> [ 2528.769164]
>and this task is already holding:
> [ 2528.771313]  (&(&q->__queue_lock)->rlock){-.-.}, at: [] 
> blk_timeout_work+0x22/0x120
> [ 2528.773825] which would create a new lock dependency:
> [ 2528.775216]  (&(&q->__queue_lock)->rlock){-.-.} ->
> (&(&session->frwd_lock)->rlock){+.-.}
> [ 2528.777071]
>but this new dependency connects a HARDIRQ-irq-safe lock:
> [ 2528.778945]  (&(&q->__queue_lock)->rlock){-.-.}
> [ 2528.778948]
>... which became HARDIRQ-irq-safe at:
> [ 2528.781204]   _raw_spin_lock_irqsave+0x3c/0x4b
> [ 2528.781966]   cfq_idle_slice_timer+0x2f/0x100
> [ 2528.782705]   __hrtimer_run_queues+0xdc/0x440
> [ 2528.783448]   hrtimer_interrupt+0xb1/0x210
> [ 2528.784149]   smp_apic_timer_interrupt+0x6d/0x260
> [ 2528.784954]   apic_timer_interrupt+0xac/0xc0
> [ 2528.785679]   native_safe_halt+0x2/0x10
> [ 2528.786280]   default_idle+0x1f/0x180
> [ 2528.786806]   do_idle+0x166/0x1e0
> [ 2528.787288]   cpu_startup_entry+0x6f/0x80
> [ 2528.787874]   start_secondary+0x186/0x1e0
> [ 2528.788448]   secondary_startup_64+0xa5/0xb0
> [ 2528.789070]
>to a HARDIRQ-irq-unsafe lock:
> [ 2528.789913]  (&(&session->frwd_lock)->rlock){+.-.}
> [ 2528.789915]
>... which became HARDIRQ-irq-unsafe at:
> [ 2528.791548] ...
> [ 2528.791553]   _raw_spin_lock_bh+0x34/0x40
> [ 2528.792366]   iscsi_conn_setup+0x166/0x220
> [ 2528.792960]   iscsi_tcp_conn_setup+0x10/0x40
> [ 2528.793526]   iscsi_sw_tcp_conn_create+0x1b/0x160
> [ 2528.794111]   iscsi_if_rx+0xf9f/0x1580
> [ 2528.794542]   netlink_unicast+0x1f7/0x2f0
> [ 2528.795105]   netlink_sendmsg+0x2c9/0x3c0
> [ 2528.795636]   sock_sendmsg+0x30/0x40
> [ 2528.796068]   ___sys_sendmsg+0x269/0x2c0
> [ 2528.796544]   __sys_sendmsg+0x51/0x90
> [ 2528.797028]   entry_SYSCALL_64_fastpath+0x25/0x9c
> [ 2528.797595]
> ...

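For background, a minimal generic sketch (not libiscsi code) of why mixing
plain spin_lock() with spin_lock_bh() on one lock is dangerous: the plain
path leaves softirqs enabled, so a softirq that takes the same lock can
interrupt it on the same CPU and self-deadlock:

static DEFINE_SPINLOCK(demo_lock);

void process_context_path(void)
{
	spin_lock(&demo_lock);		/* BHs still enabled here ... */
	/* ... critical section ... */
	spin_unlock(&demo_lock);	/* using spin_lock_bh() consistently
					 * closes this window */
}

void softirq_path(void)			/* e.g. a timer callback */
{
	spin_lock(&demo_lock);		/* ... can run now and spin forever */
	/* ... */
	spin_unlock(&demo_lock);
}
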
This commit makes the session lock usage consistent and avoids the possible
reverse deadlock.

Signed-off-by: Lee Duncan 
---
 drivers/scsi/libiscsi.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 82c3fd4bc938..055357b2fe9e 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1248,9 +1248,9 @@ int iscsi_complete_pdu(struct iscsi_conn *conn, struct 
iscsi_hdr *hdr,
 {
int rc;
 
-   spin_lock(&conn->session->lock);
+   spin_lock_bh(&conn->session->lock);
rc = __iscsi_complete_pdu(conn, hdr, data, datalen);
-   spin_unlock(&conn->session->lock);
+   spin_unlock_bh(&conn->session->lock);
return rc;
 }
 EXPORT_SYMBOL_GPL(iscsi_complete_pdu);
@@ -1749,14 +1749,14 @@ static void iscsi_tmf_timedout(unsigned long data)
struct iscsi_conn *conn = (struct iscsi_conn *)data;
struct iscsi_session *session = conn->session;
 
-   spin_lock(&session->lock);
+   spin_lock_bh(&session->lock);
if (conn->tmf_state == TMF_QUEUED) {
conn->tmf_state = TMF_TIMEDOUT;
ISCSI_DBG_EH(session, "tmf timedout\n");
/* unblock eh_abort() */
wake_up(&conn->ehwait);
}
-   spin_unlock(&session->lock);
+   spin_unlock_bh(&session->lock);
 }
 
 static int iscsi_exec_task_mgmt_fn(struct iscsi_conn *conn,
@@ -1908,7 +1908,7 @@ static enum blk_eh_timer_return 
iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 
ISCSI_DBG_EH(session, "scsi cmd %p timedout\n", sc);
 
-   spin_lock(&session->lock);
+   spin_lock_bh(&session->lock);
task = (struct iscsi_task *)sc->SCp.ptr;
if (!task) {
/*
@@ -2022,7 +2022,7 @@ static enum blk_eh_timer_return 
iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 done:
if (task)
task->last_timeout = jiffies;
-   spin_unlock(&session->lock);
+   spin_unlock_bh(&session->lock);
ISCSI_DBG_EH(session, "return %s\n", rc == BLK_EH_RESET_TIMER ?
 "timer reset" : "nh");
return rc;
@@ -2034,7 +2034,7 @@ static void iscsi_check_transport_timeouts(unsigned long 
data)
struct iscsi_session *session = conn->session;
unsigned long recv_timeout, next_timeout = 0, last_recv;
 
-   spin_lock(&session->lock);
+   spin_lock_bh(&session->lock);
if (session->state != ISCSI_STATE_LOGGED_IN)
goto done;
 
@@ -2051,7 +2051,7 @@ static void 

RE: [PATCH V2 7/8] scsi: hpsa: call hpsa_hba_inquiry() after adding host

2018-02-05 Thread Don Brace


> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Monday, February 05, 2018 9:21 AM
> To: Jens Axboe ; linux-bl...@vger.kernel.org; Christoph
> Hellwig ; Mike Snitzer 
> Cc: linux-scsi@vger.kernel.org; Hannes Reinecke ; Arun Easi
> ; Omar Sandoval ; Martin K .
> Petersen ; James Bottomley
> ; Christoph Hellwig ;
> Don Brace ; Kashyap Desai
> ; Peter Rivera ;
> Paolo Bonzini ; Laurence Oberman
> ; Ming Lei 
> Subject: [PATCH V2 7/8] scsi: hpsa: call hpsa_hba_inquiry() after adding host
> 
> EXTERNAL EMAIL
> 
> 
> So that we can decide the default reply queue by the map created
> during adding host.
> 
> Cc: Hannes Reinecke 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/hpsa.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 287e5eb0723f..443eabf63a9f 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -8691,12 +8691,9 @@ static int hpsa_init_one(struct pci_dev *pdev, const
> struct pci_device_id *ent)
> /* Disable discovery polling.*/
> h->discovery_polling = 0;
> 
> -
> /* Turn the interrupts on so we can service requests */
> h->access.set_intr_mask(h, HPSA_INTR_ON);
> 
> -   hpsa_hba_inquiry(h);
> -
> h->lastlogicals = kzalloc(sizeof(*(h->lastlogicals)), GFP_KERNEL);
> if (!h->lastlogicals)
> dev_info(&h->pdev->dev,
> @@ -8707,6 +8704,8 @@ static int hpsa_init_one(struct pci_dev *pdev, const
> struct pci_device_id *ent)
> if (rc)
> goto clean7; /* perf, sg, cmd, irq, shost, pci, lu, aer/h */
> 
> +   hpsa_hba_inquiry(h);
> +
> /* Monitor the controller for firmware lockups */
> h->heartbeat_sample_interval = HEARTBEAT_SAMPLE_INTERVAL;
> INIT_DELAYED_WORK(&h->monitor_ctlr_work, hpsa_monitor_ctlr_worker);
> --
> 2.9.5

Tested-by: Don Brace 
P441, P431, P830i, H240

Acked-by: Don Brace 




RE: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Don Brace
> -Original Message-
> From: Laurence Oberman [mailto:lober...@redhat.com]
> Sent: Monday, February 05, 2018 9:58 AM
> To: Ming Lei ; Jens Axboe ; linux-
> bl...@vger.kernel.org; Christoph Hellwig ; Mike Snitzer
> ; Don Brace 
> Cc: linux-scsi@vger.kernel.org; Hannes Reinecke ; Arun Easi
> ; Omar Sandoval ; Martin K .
> Petersen ; James Bottomley
> ; Christoph Hellwig ;
> Don Brace ; Kashyap Desai
> ; Peter Rivera ;
> Paolo Bonzini 
> Subject: Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue
> 
> EXTERNAL EMAIL
> 
> 
> On Mon, 2018-02-05 at 23:20 +0800, Ming Lei wrote:
> > This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, and meanwhile
> > maps each reply queue to a blk-mq hw queue, so that .queuecommand can
> > always choose the hw queue as the reply queue. If no online CPU is
> > mapped to a hw queue, no request can be submitted to that hw queue at
> > all, which finally solves the irq affinity issue.
> >
> > Cc: Hannes Reinecke 
> > Cc: Arun Easi 
> > Cc: Omar Sandoval ,
> > Cc: "Martin K. Petersen" ,
> > Cc: James Bottomley ,
> > Cc: Christoph Hellwig ,
> > Cc: Don Brace 
> > Cc: Kashyap Desai 
> > Cc: Peter Rivera 
> > Cc: Paolo Bonzini 
> > Cc: Mike Snitzer 
> > Tested-by: Laurence Oberman 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/scsi/hpsa.c | 51 ++-
> This is a critical issue on HPSA because Linus' tree already has the
> original commit that causes the system to fail to boot.
> 
> All my testing was on DL380 G7 servers with:
> 
> Hewlett-Packard Company Smart Array G6 controllers
> Vendor: HP   Model: P410i   Rev: 6.64
> 
> Ming's patch fixes this, so we need to try to move this along.
> 
> I have a DL380 G8 as well which is also likely exposed here and I added
>  Don Brace for FYI to this list.
> 
> Thanks Ming

Tested-by: Don Brace 
P441, P431, P830i, H240

Acked-by: Don Brace 





RE: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Don Brace
> -Original Message-
> This is a critical issue on HPSA because Linus' tree already has the
> original commit that causes the system to fail to boot.
> 
> All my testing was on DL380 G7 servers with:
> 
> Hewlett-Packard Company Smart Array G6 controllers
> Vendor: HP   Model: P410i   Rev: 6.64
> 
> Ming's patch fixes this, so we need to try to move this along.
> 
> I have a DL380 G8 as well which is also likely exposed here and I added
>  Don Brace for FYI to this list.
> 
> Thanks Ming

Running some tests now.


Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Laurence Oberman
On Mon, 2018-02-05 at 23:20 +0800, Ming Lei wrote:
> This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, and meanwhile
> maps each reply queue to a blk-mq hw queue, so that .queuecommand can
> always choose the hw queue as the reply queue. If no online CPU is
> mapped to a hw queue, no request can be submitted to that hw queue at
> all, which finally solves the irq affinity issue.
> 
> Cc: Hannes Reinecke 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/hpsa.c | 51 ++-
> 
>  1 file changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 443eabf63a9f..e517a4c74a28 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "hpsa_cmd.h"
> @@ -956,6 +957,13 @@ static struct device_attribute
> *hpsa_shost_attrs[] = {
>  #define HPSA_NRESERVED_CMDS  (HPSA_CMDS_RESERVED_FOR_DRIVER +\
>    HPSA_MAX_CONCURRENT_PASSTHRUS)
>  
> +static int hpsa_map_queues(struct Scsi_Host *shost)
> +{
> +	struct ctlr_info *h = shost_to_hba(shost);
> +
> +	return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
> +}
> +
>  static struct scsi_host_template hpsa_driver_template = {
>   .module = THIS_MODULE,
>   .name   = HPSA,
> @@ -974,10 +982,13 @@ static struct scsi_host_template
> hpsa_driver_template = {
>  #ifdef CONFIG_COMPAT
>   .compat_ioctl   = hpsa_compat_ioctl,
>  #endif
> + .map_queues = hpsa_map_queues,
>   .sdev_attrs = hpsa_sdev_attrs,
>   .shost_attrs = hpsa_shost_attrs,
>   .max_sectors = 1024,
>   .no_write_same = 1,
> + .force_blk_mq = 1,
> + .host_tagset = 1,
>  };
>  
>  static inline u32 next_command(struct ctlr_info *h, u8 q)
> @@ -1045,11 +1056,7 @@ static void set_performant_mode(struct
> ctlr_info *h, struct CommandList *c,
>   c->busaddr |= 1 | (h->blockFetchTable[c-
> >Header.SGList] << 1);
>   if (unlikely(!h->msix_vectors))
>   return;
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - c->Header.ReplyQueue =
> - raw_smp_processor_id() % h-
> >nreply_queues;
> - else
> - c->Header.ReplyQueue = reply_queue % h-
> >nreply_queues;
> + c->Header.ReplyQueue = reply_queue;
>   }
>  }
>  
> @@ -1063,10 +1070,7 @@ static void
> set_ioaccel1_performant_mode(struct ctlr_info *h,
>    * Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->ReplyQueue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->ReplyQueue = reply_queue % h->nreply_queues;
> + cp->ReplyQueue = reply_queue;
>   /*
>    * Set the bits in the address sent down to include:
>    *  - performant mode bit (bit 0)
> @@ -1087,10 +1091,7 @@ static void
> set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
>   /* Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->reply_queue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->reply_queue = reply_queue % h->nreply_queues;
> + cp->reply_queue = reply_queue;
>   /* Set the bits in the address sent down to include:
>    *  - performant mode bit not used in ioaccel mode 2
>    *  - pull count (bits 0-3)
> @@ -1109,10 +1110,7 @@ static void
> set_ioaccel2_performant_mode(struct ctlr_info *h,
>    * Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->reply_queue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->reply_queue = reply_queue % h->nreply_queues;
> + cp->reply_queue = reply_queue;
>   /*
>    * Set the bits in the address sent down to include:
>    *  - performant mode bit not used in ioaccel mode 2
> @@ -1152,11 +1150,27 @@ static void
> 

Re: [PATCH V2 6/8] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

2018-02-05 Thread Paolo Bonzini
On 05/02/2018 16:20, Ming Lei wrote:
> Now commit 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all
> possible CPUs") has been merged into v4.16-rc, and it can easily assign
> all-offline CPUs to some irq vectors; this can't be avoided even though
> the allocation is improved.
> 
> For example, on an 8-core VM where CPUs 4~7 are not present/offline and
> virtio-scsi has 4 queues, the assigned irq affinity can take the
> following shape:
> 
>   irq 36, cpu list 0-7
>   irq 37, cpu list 0-7
>   irq 38, cpu list 0-7
>   irq 39, cpu list 0-1
>   irq 40, cpu list 4,6
>   irq 41, cpu list 2-3
>   irq 42, cpu list 5,7
> 
> Then an IO hang is triggered in the non-SCSI_MQ case.
> 
> Given storage IO is always a client/server model, there isn't such an
> issue with SCSI_MQ (blk-mq), because no IO can be submitted to a hw
> queue if that hw queue has no online CPUs.
> 
> Fix this issue by forcing virtio-scsi to use blk-mq.
> 
> BTW, I have been using virtio-scsi (scsi_mq) for several years, and it
> has been quite stable, so this shouldn't add extra risk.

I think that's ok now that we have I/O schedulers for blk-mq.

Acked-by: Paolo Bonzini 

Paolo

> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Reviewed-by: Hannes Reinecke 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/virtio_scsi.c | 59 
> +++---
>  1 file changed, 3 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 7c28e8d4955a..54e3a0f6844c 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -91,9 +91,6 @@ struct virtio_scsi_vq {
>  struct virtio_scsi_target_state {
>   seqcount_t tgt_seq;
>  
> - /* Count of outstanding requests. */
> - atomic_t reqs;
> -
>   /* Currently active virtqueue for requests sent to this target. */
>   struct virtio_scsi_vq *req_vq;
>  };
> @@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
> *vscsi, void *buf)
>   struct virtio_scsi_cmd *cmd = buf;
>   struct scsi_cmnd *sc = cmd->sc;
>   struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
> - struct virtio_scsi_target_state *tgt =
> - scsi_target(sc->device)->hostdata;
>  
>   dev_dbg(&sc->device->sdev_gendev,
>   "cmd %p response %u status %#02x sense_len %u\n",
> @@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
> *vscsi, void *buf)
>   }
>  
>   sc->scsi_done(sc);
> -
> - atomic_dec(&tgt->reqs);
>  }
>  
>  static void virtscsi_vq_done(struct virtio_scsi *vscsi,
> @@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host 
> *sh,
>   struct scsi_cmnd *sc)
>  {
>   struct virtio_scsi *vscsi = shost_priv(sh);
> - struct virtio_scsi_target_state *tgt =
> - scsi_target(sc->device)->hostdata;
>  
> - atomic_inc(&tgt->reqs);
>   return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
>  }
>  
> @@ -596,55 +586,11 @@ static struct virtio_scsi_vq 
> *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
>   return &vscsi->req_vqs[hwq];
>  }
>  
> -static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
> -struct virtio_scsi_target_state 
> *tgt)
> -{
> - struct virtio_scsi_vq *vq;
> - unsigned long flags;
> - u32 queue_num;
> -
> - local_irq_save(flags);
> - if (atomic_inc_return(&tgt->reqs) > 1) {
> - unsigned long seq;
> -
> - do {
> - seq = read_seqcount_begin(&tgt->tgt_seq);
> - vq = tgt->req_vq;
> - } while (read_seqcount_retry(&tgt->tgt_seq, seq));
> - } else {
> - /* no writes can be concurrent because of atomic_t */
> - write_seqcount_begin(&tgt->tgt_seq);
> -
> - /* keep previous req_vq if a reader just arrived */
> - if (unlikely(atomic_read(&tgt->reqs) > 1)) {
> - vq = tgt->req_vq;
> - goto unlock;
> - }
> -
> - queue_num = smp_processor_id();
> - while (unlikely(queue_num >= vscsi->num_queues))
> - queue_num -= vscsi->num_queues;
> - tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
> - unlock:
> - write_seqcount_end(&tgt->tgt_seq);
> - }
> - local_irq_restore(flags);
> -
> - return vq;
> -}
> -
>  static int virtscsi_queuecommand_multi(struct Scsi_Host 

[PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Ming Lei
This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, and meanwhile maps
each reply queue to a blk-mq hw queue, so that .queuecommand can always
choose the hw queue as the reply queue. If no online CPU is mapped to a
hw queue, no request can be submitted to that hw queue at all, which
finally solves the irq affinity issue.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hpsa.c | 51 ++-
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 443eabf63a9f..e517a4c74a28 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "hpsa_cmd.h"
@@ -956,6 +957,13 @@ static struct device_attribute *hpsa_shost_attrs[] = {
 #define HPSA_NRESERVED_CMDS(HPSA_CMDS_RESERVED_FOR_DRIVER +\
 HPSA_MAX_CONCURRENT_PASSTHRUS)
 
+static int hpsa_map_queues(struct Scsi_Host *shost)
+{
+	struct ctlr_info *h = shost_to_hba(shost);
+
+	return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
+}
+
 static struct scsi_host_template hpsa_driver_template = {
.module = THIS_MODULE,
.name   = HPSA,
@@ -974,10 +982,13 @@ static struct scsi_host_template hpsa_driver_template = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = hpsa_compat_ioctl,
 #endif
+   .map_queues = hpsa_map_queues,
.sdev_attrs = hpsa_sdev_attrs,
.shost_attrs = hpsa_shost_attrs,
.max_sectors = 1024,
.no_write_same = 1,
+   .force_blk_mq = 1,
+   .host_tagset = 1,
 };
 
 static inline u32 next_command(struct ctlr_info *h, u8 q)
@@ -1045,11 +1056,7 @@ static void set_performant_mode(struct ctlr_info *h, 
struct CommandList *c,
c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
if (unlikely(!h->msix_vectors))
return;
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   c->Header.ReplyQueue =
-   raw_smp_processor_id() % h->nreply_queues;
-   else
-   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
+   c->Header.ReplyQueue = reply_queue;
}
 }
 
@@ -1063,10 +1070,7 @@ static void set_ioaccel1_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->ReplyQueue = reply_queue % h->nreply_queues;
+   cp->ReplyQueue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit (bit 0)
@@ -1087,10 +1091,7 @@ static void set_ioaccel2_tmf_performant_mode(struct 
ctlr_info *h,
/* Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/* Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
 *  - pull count (bits 0-3)
@@ -1109,10 +1110,7 @@ static void set_ioaccel2_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
@@ -1152,11 +1150,27 @@ static void 
dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
h->heartbeat_sample_interval = HEARTBEAT_SAMPLE_INTERVAL;
 }
 
+static unsigned get_reply_queue(struct ctlr_info *h, struct CommandList *c)
+{
+  

[PATCH V2 7/8] scsi: hpsa: call hpsa_hba_inquiry() after adding host

2018-02-05 Thread Ming Lei
So that we can decide the default reply queue by the map created
during adding host.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hpsa.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 287e5eb0723f..443eabf63a9f 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -8691,12 +8691,9 @@ static int hpsa_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
/* Disable discovery polling.*/
h->discovery_polling = 0;
 
-
/* Turn the interrupts on so we can service requests */
h->access.set_intr_mask(h, HPSA_INTR_ON);
 
-   hpsa_hba_inquiry(h);
-
h->lastlogicals = kzalloc(sizeof(*(h->lastlogicals)), GFP_KERNEL);
if (!h->lastlogicals)
dev_info(&h->pdev->dev,
@@ -8707,6 +8704,8 @@ static int hpsa_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (rc)
goto clean7; /* perf, sg, cmd, irq, shost, pci, lu, aer/h */
 
+   hpsa_hba_inquiry(h);
+
/* Monitor the controller for firmware lockups */
h->heartbeat_sample_interval = HEARTBEAT_SAMPLE_INTERVAL;
INIT_DELAYED_WORK(&h->monitor_ctlr_work, hpsa_monitor_ctlr_worker);
-- 
2.9.5



[PATCH V2 6/8] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

2018-02-05 Thread Ming Lei
Now commit 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all
possible CPUs") has been merged into v4.16-rc, and it can easily assign
all-offline CPUs to some irq vectors; this can't be avoided even though
the allocation is improved.

For example, on an 8-core VM where CPUs 4~7 are not present/offline and
virtio-scsi has 4 queues, the assigned irq affinity can take the following
shape:

irq 36, cpu list 0-7
irq 37, cpu list 0-7
irq 38, cpu list 0-7
irq 39, cpu list 0-1
irq 40, cpu list 4,6
irq 41, cpu list 2-3
irq 42, cpu list 5,7

Then an IO hang is triggered in the non-SCSI_MQ case.

Given storage IO is always a client/server model, there isn't such an
issue with SCSI_MQ (blk-mq), because no IO can be submitted to a hw queue
if that hw queue has no online CPUs.

Fix this issue by forcing virtio-scsi to use blk-mq.

BTW, I have been using virtio-scsi (scsi_mq) for several years, and it has
been quite stable, so this shouldn't add extra risk.

Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Reviewed-by: Hannes Reinecke 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/virtio_scsi.c | 59 +++---
 1 file changed, 3 insertions(+), 56 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 7c28e8d4955a..54e3a0f6844c 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
 struct virtio_scsi_target_state {
seqcount_t tgt_seq;
 
-   /* Count of outstanding requests. */
-   atomic_t reqs;
-
/* Currently active virtqueue for requests sent to this target. */
struct virtio_scsi_vq *req_vq;
 };
@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
struct virtio_scsi_cmd *cmd = buf;
struct scsi_cmnd *sc = cmd->sc;
struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
dev_dbg(&sc->device->sdev_gendev,
"cmd %p response %u status %#02x sense_len %u\n",
@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
}
 
sc->scsi_done(sc);
-
-   atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi,
@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host 
*sh,
struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
-   atomic_inc(&tgt->reqs);
return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
 }
 
@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct 
virtio_scsi *vscsi,
return &vscsi->req_vqs[hwq];
 }
 
-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
-  struct virtio_scsi_target_state 
*tgt)
-{
-   struct virtio_scsi_vq *vq;
-   unsigned long flags;
-   u32 queue_num;
-
-   local_irq_save(flags);
-   if (atomic_inc_return(&tgt->reqs) > 1) {
-   unsigned long seq;
-
-   do {
-   seq = read_seqcount_begin(&tgt->tgt_seq);
-   vq = tgt->req_vq;
-   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
-   } else {
-   /* no writes can be concurrent because of atomic_t */
-   write_seqcount_begin(&tgt->tgt_seq);
-
-   /* keep previous req_vq if a reader just arrived */
-   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
-   vq = tgt->req_vq;
-   goto unlock;
-   }
-
-   queue_num = smp_processor_id();
-   while (unlikely(queue_num >= vscsi->num_queues))
-   queue_num -= vscsi->num_queues;
-   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
- unlock:
-   write_seqcount_end(&tgt->tgt_seq);
-   }
-   local_irq_restore(flags);
-
-   return vq;
-}
-
 static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
   struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
-   struct virtio_scsi_vq *req_vq;
-
-   if (shost_use_blk_mq(sh))
-  

[PATCH V2 5/8] scsi: introduce force_blk_mq

2018-02-05 Thread Ming Lei
From the SCSI driver's point of view, it is a bit troublesome to support
both blk-mq and non-blk-mq at the same time, especially when drivers need
to support multiple hw queues.

This patch introduces 'force_blk_mq' in scsi_host_template so that drivers
can provide blk-mq-only support and avoid the trouble of supporting both
paths.

Providing blk-mq-only support may clean up drivers a lot; especially, it
becomes easier to convert multiple reply queues into blk-mq hw queues for
the following purposes:

1) use blk-mq's multiple hw queues to deal with irq vectors whose assigned
CPU affinity consists entirely of offline CPUs [1]:

[1] https://marc.info/?l=linux-kernel&m=151748144730409&w=2

Now commit 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all
possible CPUs") has been merged into v4.16-rc, and it can easily assign
all-offline CPUs to some irq vectors; this can't be avoided even though
the allocation is improved.

So all these drivers have to avoid asking the HBA to complete requests in
a reply queue that has no online CPUs assigned.

This issue can be solved generically and easily via blk-mq (scsi_mq)
multiple hw queues by mapping each reply queue onto an hctx.

2) some drivers [2] need to complete requests on the submission CPU to
avoid hard/soft lockups, which is easily done with blk-mq, so it is not
necessary to reinvent the wheel to solve the problem.

[2] https://marc.info/?t=15160185141&r=1&w=2

Solving the above issues for the non-MQ path may not be easy, or may
introduce unnecessary work, especially as we plan to enable SCSI_MQ soon,
as discussed recently [3]:

[3] https://marc.info/?l=linux-scsi&m=151727684915589&w=2

Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Mike Snitzer 
Reviewed-by: Hannes Reinecke 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hosts.c | 1 +
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index fe3a0da3ec97..c75cebd7911d 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template 
*sht, int privsize)
shost->dma_boundary = 0x;
 
shost->use_blk_mq = scsi_use_blk_mq;
+   shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
 
device_initialize(&shost->shost_gendev);
dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index f6623f887ee4..d67b2eaff8f1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -452,6 +452,9 @@ struct scsi_host_template {
/* True if the controller does not support WRITE SAME */
unsigned no_write_same:1;
 
+   /* True if the low-level driver supports blk-mq only */
+   unsigned force_blk_mq:1;
+
/*
 * Countdown for host blocking with no commands outstanding.
 */
-- 
2.9.5



[PATCH V2 4/8] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-05 Thread Ming Lei
This patch introduces the 'g_global_tags' parameter so that we can test
this feature via null_blk easily.

No obvious performance drop is seen with global_tags when the whole hw
queue depth is kept the same:

1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 
submit_queues=4 hw_queue_depth=1

2) 'global_tags', global hw queue depth is 4, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 
submit_queues=4 hw_queue_depth=4

3) fio test done in above two settings:
   fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 
--ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  
--name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 
--name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3

1M IOPS can be reached in both of the above tests, which were done in one VM.

Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Ming Lei 
---
 drivers/block/null_blk.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 287a09611c0f..ad0834efad42 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -163,6 +163,10 @@ static int g_submit_queues = 1;
 module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
 MODULE_PARM_DESC(submit_queues, "Number of submission queues");
 
+static int g_global_tags = 0;
+module_param_named(global_tags, g_global_tags, int, S_IRUGO);
+MODULE_PARM_DESC(global_tags, "All submission queues share one tags");
+
 static int g_home_node = NUMA_NO_NODE;
 module_param_named(home_node, g_home_node, int, S_IRUGO);
 MODULE_PARM_DESC(home_node, "Home node for the device");
@@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, struct 
blk_mq_tag_set *set)
set->flags = BLK_MQ_F_SHOULD_MERGE;
if (g_no_sched)
set->flags |= BLK_MQ_F_NO_SCHED;
+   if (g_global_tags)
+   set->flags |= BLK_MQ_F_GLOBAL_TAGS;
set->driver_data = NULL;
 
if ((nullb && nullb->dev->blocking) || g_blocking)
-- 
2.9.5



[PATCH V2 3/8] scsi: Add template flag 'host_tagset'

2018-02-05 Thread Ming Lei
From: Hannes Reinecke 

Add a host template flag 'host_tagset' to enable the use of a global
tagset for block-mq.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Hannes Reinecke 
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 55be2550c555..9ab74ac634ea 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2274,6 +2274,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
shost->tag_set.flags |=
BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
+   if (shost->hostt->host_tagset)
+   shost->tag_set.flags |= BLK_MQ_F_GLOBAL_TAGS;
shost->tag_set.driver_data = shost;
 
return blk_mq_alloc_tag_set(&shost->tag_set);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index a8b7bf879ced..f6623f887ee4 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -457,6 +457,9 @@ struct scsi_host_template {
 */
unsigned int max_host_blocked;
 
+   /* True if the host supports a host-wide tagspace */
+   unsigned host_tagset:1;
+
/*
 * Default value for the blocking.  If the queue is empty,
 * host_blocked counts down in the request_fn until it restarts
-- 
2.9.5



[PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-05 Thread Ming Lei
Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ...) support multiple
reply queues, but the tag space is often HBA wide.

These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
for automatic affinity assignment.

Now commit 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all
possible CPUs") has been merged into v4.16-rc, and it can easily assign
all-offline CPUs to some irq vectors; this can't be avoided even though
the allocation is improved.

So all these drivers have to avoid asking the HBA to complete requests in
a reply queue that has no online CPUs assigned, and HPSA has been broken
with v4.15+:

https://marc.info/?l=linux-kernel&m=151748144730409&w=2

This issue can be solved generically and easily via blk-mq (scsi_mq)
multiple hw queues by mapping each reply queue onto an hctx, but one
tricky thing is the HBA-wide (instead of hw-queue-wide) tags.

This patch is based on the following patch from Hannes:

https://marc.info/?l=linux-block&m=149132580511346&w=2

One big difference from Hannes's patch is that this one only makes the tag
sbitmap and the active_queues data structure HBA wide; the others, such as
the requests, hctx, and tags, keep their NUMA locality.

The following patch will support global tags on null_blk, and performance
data is also provided; no obvious performance loss is observed when the
whole hw queue depth is the same.
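
As a rough illustration of that split (a simplified view, not the literal
patch), the per-hctx tags end up pointing at one shared bitmap pair while
the request pool stays NUMA local:

/* simplified sketch of struct blk_mq_tags after patches 1-2 */
struct blk_mq_tags {
	unsigned int nr_tags;
	unsigned int nr_reserved_tags;
	atomic_t *active_queues;		/* shared across hctxs when global */
	struct sbitmap_queue *bitmap_tags;	/* shared with BLK_MQ_F_GLOBAL_TAGS */
	struct sbitmap_queue *breserved_tags;	/* shared with BLK_MQ_F_GLOBAL_TAGS */
	struct request **rqs;			/* still per-hctx, NUMA local */
	/* ... */
};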

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 13 -
 block/blk-mq-tag.c | 23 ++-
 block/blk-mq-tag.h |  5 -
 block/blk-mq.c | 29 -
 block/blk-mq.h |  3 ++-
 include/linux/blk-mq.h |  2 ++
 7 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0dfafa4b655a..0f0fafe03f5d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
HCTX_FLAG_NAME(SG_MERGE),
+   HCTX_FLAG_NAME(GLOBAL_TAGS),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55c0a745b427..385bbec73804 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -81,6 +81,17 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
*hctx)
} else
clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
+   /* need to restart all hw queues for global tags */
+   if (hctx->flags & BLK_MQ_F_GLOBAL_TAGS) {
+   struct blk_mq_hw_ctx *hctx2;
+   int i;
+
+   queue_for_each_hw_ctx(hctx->queue, hctx2, i)
+   if (blk_mq_run_hw_queue(hctx2, true))
+   return true;
+   return false;
+   }
+
return blk_mq_run_hw_queue(hctx, true);
 }
 
@@ -495,7 +506,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
int ret;
 
hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
-  set->reserved_tags);
+  set->reserved_tags, false);
if (!hctx->sched_tags)
return -ENOMEM;
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 571797dc36cb..66377d09eaeb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -379,9 +379,11 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct 
blk_mq_tags *tags,
return NULL;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+unsigned int total_tags,
 unsigned int reserved_tags,
-int node, int alloc_policy)
+int node, int alloc_policy,
+bool global_tag)
 {
struct blk_mq_tags *tags;
 
@@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
total_tags,
tags->nr_tags = total_tags;
tags->nr_reserved_tags = reserved_tags;
 
+   WARN_ON(global_tag && !set->global_tags);
+   if (global_tag && set->global_tags) {
+   tags->bitmap_tags = set->global_tags->bitmap_tags;
+   tags->breserved_tags = set->global_tags->breserved_tags;
+   tags->active_queues = 

[PATCH V2 1/8] blk-mq: tags: define several fields of tags as pointer

2018-02-05 Thread Ming Lei
This patch changes tags->breserved_tags, tags->bitmap_tags and
tags->active_queues as pointer, and prepares for supporting global tags.

No functional change.

Tested-by: Laurence Oberman 
Reviewed-by: Hannes Reinecke 
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bfq-iosched.c|  4 ++--
 block/blk-mq-debugfs.c | 10 +-
 block/blk-mq-tag.c | 48 ++--
 block/blk-mq-tag.h | 10 +++---
 block/blk-mq.c |  2 +-
 block/kyber-iosched.c  |  2 +-
 6 files changed, 42 insertions(+), 34 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..1e1211814a57 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -534,9 +534,9 @@ static void bfq_limit_depth(unsigned int op, struct 
blk_mq_alloc_data *data)
WARN_ON_ONCE(1);
return;
}
-   bt = &tags->breserved_tags;
+   bt = tags->breserved_tags;
} else
-   bt = &tags->bitmap_tags;
+   bt = tags->bitmap_tags;
 
if (unlikely(bfqd->sb_shift != bt->sb.shift))
bfq_update_depths(bfqd, bt);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 21cbc1f071c6..0dfafa4b655a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -433,14 +433,14 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
seq_printf(m, "active_queues=%d\n",
-  atomic_read(&tags->active_queues));
+  atomic_read(tags->active_queues));
 
seq_puts(m, "\nbitmap_tags:\n");
-   sbitmap_queue_show(&tags->bitmap_tags, m);
+   sbitmap_queue_show(tags->bitmap_tags, m);
 
if (tags->nr_reserved_tags) {
seq_puts(m, "\nbreserved_tags:\n");
-   sbitmap_queue_show(&tags->breserved_tags, m);
+   sbitmap_queue_show(tags->breserved_tags, m);
}
 }
 
@@ -471,7 +471,7 @@ static int hctx_tags_bitmap_show(void *data, struct 
seq_file *m)
if (res)
goto out;
if (hctx->tags)
-   sbitmap_bitmap_show(>tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(>tags->bitmap_tags->sb, m);
mutex_unlock(>sysfs_lock);
 
 out:
@@ -505,7 +505,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct 
seq_file *m)
if (res)
goto out;
if (hctx->sched_tags)
-   sbitmap_bitmap_show(>sched_tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(>sched_tags->bitmap_tags->sb, m);
mutex_unlock(>sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..571797dc36cb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -18,7 +18,7 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
if (!tags)
return true;
 
-   return sbitmap_any_bit_clear(>bitmap_tags.sb);
+   return sbitmap_any_bit_clear(>bitmap_tags->sb);
 }
 
 /*
@@ -28,7 +28,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, >state) &&
!test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, >state))
-   atomic_inc(>tags->active_queues);
+   atomic_inc(hctx->tags->active_queues);
 
return true;
 }
@@ -38,9 +38,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-   sbitmap_queue_wake_all(>bitmap_tags);
+   sbitmap_queue_wake_all(tags->bitmap_tags);
if (include_reserve)
-   sbitmap_queue_wake_all(>breserved_tags);
+   sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -54,7 +54,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, >state))
return;
 
-   atomic_dec(>active_queues);
+   atomic_dec(tags->active_queues);
 
blk_mq_tag_wakeup_all(tags, false);
 }
@@ -79,7 +79,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
if (bt->sb.depth == 1)
return true;
 
-   users = atomic_read(>tags->active_queues);
+   users = atomic_read(hctx->tags->active_queues);
if (!users)
return true;
 
@@ -117,10 +117,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data 
*data)
WARN_ON_ONCE(1);
return BLK_MQ_TAG_FAIL;
}
-   bt = >breserved_tags;
+   bt = tags->breserved_tags;
tag_offset = 0;
} else {
-   bt = >bitmap_tags;
+   bt = tags->bitmap_tags;
tag_offset = tags->nr_reserved_tags;
}
 
@@ 

[PATCH V2 0/8] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Ming Lei
Hi All,

This patchset supports global tags which was started by Hannes originally:

https://marc.info/?l=linux-block=149132580511346=2

Also inroduce 'force_blk_mq' and 'host_tagset' to 'struct scsi_host_template',
so that driver can avoid to support two IO paths(legacy and blk-mq), especially
recent discusion mentioned that SCSI_MQ will be enabled at default soon.

https://marc.info/?l=linux-scsi=151727684915589=2

With the above changes, it should be easier to convert SCSI drivers'
reply queue into blk-mq's hctx, then the automatic irq affinity issue can
be solved easily, please see detailed descrption in commit log and the
8th patch for converting HPSA.

Also drivers may require to complete request on the submission CPU
for avoiding hard/soft deadlock, which can be done easily with blk_mq
too.

https://marc.info/?t=15160185141=1=2

The 6th patch uses the introduced 'force_blk_mq' to fix virtio_scsi
so that IO hang issue can be avoided inside legacy IO path, this issue is
a bit generic, at least HPSA/virtio-scsi are found broken with v4.15+.

The 7th & 8th patch fixes HPSA's IO issue which is caused by the reply
queue selection algorithem.

Laurence has verified that this patch makes HPSA working with the linus
tree with this patchset.

The V2 can be found in the following tree/branch:

https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2

V2:
- fix restart code for global tags
- fixes HPSA's IO hang issue
- add 'scsi: Add template flag 'host_tagset''
- reorder patch

Thanks
Ming

Hannes Reinecke (1):
  scsi: Add template flag 'host_tagset'

Ming Lei (7):
  blk-mq: tags: define several fields of tags as pointer
  blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  block: null_blk: introduce module parameter of 'g_global_tags'
  scsi: introduce force_blk_mq
  scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
  scsi: hpsa: call hpsa_hba_inquiry() after adding host
  scsi: hpsa: use blk_mq to solve irq affinity issue

 block/bfq-iosched.c|  4 +--
 block/blk-mq-debugfs.c | 11 
 block/blk-mq-sched.c   | 13 -
 block/blk-mq-tag.c | 67 +-
 block/blk-mq-tag.h | 15 ---
 block/blk-mq.c | 31 -
 block/blk-mq.h |  3 ++-
 block/kyber-iosched.c  |  2 +-
 drivers/block/null_blk.c   |  6 +
 drivers/scsi/hosts.c   |  1 +
 drivers/scsi/hpsa.c| 56 --
 drivers/scsi/scsi_lib.c|  2 ++
 drivers/scsi/virtio_scsi.c | 59 +++-
 include/linux/blk-mq.h |  2 ++
 include/scsi/scsi_host.h   |  6 +
 15 files changed, 157 insertions(+), 121 deletions(-)

-- 
2.9.5



Re: All scsi_debug devices in the scsi_debug driver share the same RAM space

2018-02-05 Thread Laurence Oberman
On Sat, 2018-02-03 at 14:43 -0500, Laurence Oberman wrote:
> Hello Doug
> 
> I had emailed you earlier about this issue forgetting to copy others.
> 
> All test devices in the scsi_debug driver share the same ram space so
> we cannot really have individual devices for testing stuff like md-
> raid.
> 
> I bumped into this a few times already and I think it would be useful
> to make each device an individual RAM entry so we can create multiple
> unique devices.
> 
> Of course this means also now having to maybe start using attributes
> for each device do we can selectively choose which devices get the
> fault injections.
> 
> I know its adding complexity here but wanted to get input about the
> possibility of changing this.
> 
> For a particular fault injection case this morning I had to use 5
> regular drives and 1 scsi-debug drive to inject faults to reproduce
> an
> md-raid issue.
> 
> I also wanted a different start sector and range and sent an earlier
> patch for that.
> 
> So is this something worth considering ?
> If so, i will start working on it.
> 
> Regards
> Laurence
> 
> 
> 
> 

I chatted with Doug about this.

Seems that many of the use cases for scsi_debug involved creating high
count devices for testing etc. so having each with its own ram space
could land up chewing a bunch of memory.

I am going to drop this idea for now, but the patch I sent for the
choice of sector for MEDIUM errors and count is very useful so I hope
that one gets accepted.
That patch changes nothing if the two new parameters are not set so
should be low risk for acceptance.

Thanks
Laurence


[PATCH 2/2] scsi_debug: reset injection flags for every_nth > 0

2018-02-05 Thread Martin Wilck
If every_nth > 0, the injection flags must be reset for commands
that aren't supposed to fail (i.e. that aren't "nth"). Otherwise,
commands will continue to fail, like in the every_nth < 0 case.

Signed-off-by: Martin Wilck 
---
 drivers/scsi/scsi_debug.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 86c48797381e..820b98853e69 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -3988,8 +3988,13 @@ static void clear_queue_stats(void)
 static void setup_inject(struct sdebug_queue *sqp,
 struct sdebug_queued_cmd *sqcp)
 {
-   if ((atomic_read(_cmnd_count) % abs(sdebug_every_nth)) > 0)
+   if ((atomic_read(_cmnd_count) % abs(sdebug_every_nth)) > 0) {
+   if (sdebug_every_nth > 0)
+   sqcp->inj_recovered = sqcp->inj_transport
+   = sqcp->inj_dif
+   = sqcp->inj_dix = sqcp->inj_short = 0;
return;
+   }
sqcp->inj_recovered = !!(SDEBUG_OPT_RECOVERED_ERR & sdebug_opts);
sqcp->inj_transport = !!(SDEBUG_OPT_TRANSPORT_ERR & sdebug_opts);
sqcp->inj_dif = !!(SDEBUG_OPT_DIF_ERR & sdebug_opts);
-- 
2.16.1



RE: [PATCH 1/1] scsi: ufs: make sure all interrupts are processed

2018-02-05 Thread Avri Altman


> >>> -Original Message-
> >>> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> >>> ow...@vger.kernel.org] On Behalf Of Asutosh Das
> >>> Sent: Tuesday, January 30, 2018 6:54 AM
> >>> To: subha...@codeaurora.org; c...@codeaurora.org;
> >>> vivek.gau...@codeaurora.org; rna...@codeaurora.org;
> >>> vinholika...@gmail.com; j...@linux.vnet.ibm.com;
> >>> martin.peter...@oracle.com
> >>> Cc: linux-scsi@vger.kernel.org; Venkat Gopalakrishnan
> >>> ; Asutosh Das ;
> >>> open list 
> >>> Subject: [PATCH 1/1] scsi: ufs: make sure all interrupts are
> >>> processed
> >>>
> >>> From: Venkat Gopalakrishnan 
> >>>
> >>> As multiple requests are submitted to the ufs host controller in
> >>> parallel there could be instances where the command completion
> >>> interrupt arrives later for a request that is already processed
> >>> earlier as the corresponding doorbell was cleared when handling the
> >>> previous interrupt. Read the interrupt status in a loop after
> >>> processing the received interrupt to catch such interrupts and
> >>> handle it.
> >>>
> >>> Signed-off-by: Venkat Gopalakrishnan 
> >>> Signed-off-by: Asutosh Das 
Tested-by: avri.alt...@wdc.com

Tested on kirin960 (mate9)  and msm8998 (htc11), both where interrupt 
aggregation is not allowed.

As a side note, I noticed that this patch is part of several patches, fixing 
some qcom-staff.
Maybe you want to put them all in a patchset?

Thanks,
Avri


[PATCH] scsi: ufs-qcom: add number of lanes per direction

2018-02-05 Thread Can Guo
From: Gilad Broner 

Different platforms may have different number of lanes for the UFS link.
Add parameter to device tree specifying how many lanes should be
configured for the UFS link. And don't print err message for clocks
that are optional, this leads to unnecessary confusion about failure.

Signed-off-by: Gilad Broner 
Signed-off-by: Subhash Jadavani 
Signed-off-by: Can Guo 

diff --git a/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt 
b/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt
index 5357919..4cee3f9 100644
--- a/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt
+++ b/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt
@@ -31,6 +31,9 @@ Optional properties:
  defined or a value in the array is "0" then it is 
assumed
  that the frequency is set by the parent clock or a
  fixed rate clock source.
+- lanes-per-direction: number of lanes available per direction - either 1 or 2.
+   Note that it is assume same number of lanes is used 
both directions at once.
+   If not specified, default is 2 lanes per direction.
 
 Note: If above properties are not defined it can be assumed that the supply
 regulators or clocks are always on.
diff --git a/drivers/scsi/ufs/ufs-qcom.c b/drivers/scsi/ufs/ufs-qcom.c
index 4cdffa4..84d37e9 100644
--- a/drivers/scsi/ufs/ufs-qcom.c
+++ b/drivers/scsi/ufs/ufs-qcom.c
@@ -50,13 +50,10 @@ static int ufs_qcom_host_clk_get(struct device *dev,
int err = 0;
 
clk = devm_clk_get(dev, name);
-   if (IS_ERR(clk)) {
+   if (IS_ERR(clk))
err = PTR_ERR(clk);
-   dev_err(dev, "%s: failed to get %s err %d",
-   __func__, name, err);
-   } else {
+   else
*clk_out = clk;
-   }
 
return err;
 }
@@ -78,9 +75,11 @@ static void ufs_qcom_disable_lane_clks(struct ufs_qcom_host 
*host)
if (!host->is_lane_clks_enabled)
return;
 
-   clk_disable_unprepare(host->tx_l1_sync_clk);
+   if (host->tx_l1_sync_clk)
+   clk_disable_unprepare(host->tx_l1_sync_clk);
clk_disable_unprepare(host->tx_l0_sync_clk);
-   clk_disable_unprepare(host->rx_l1_sync_clk);
+   if (host->rx_l1_sync_clk)
+   clk_disable_unprepare(host->rx_l1_sync_clk);
clk_disable_unprepare(host->rx_l0_sync_clk);
 
host->is_lane_clks_enabled = false;
@@ -104,21 +103,21 @@ static int ufs_qcom_enable_lane_clks(struct ufs_qcom_host 
*host)
if (err)
goto disable_rx_l0;
 
-   err = ufs_qcom_host_clk_enable(dev, "rx_lane1_sync_clk",
-   host->rx_l1_sync_clk);
-   if (err)
-   goto disable_tx_l0;
+   if (host->hba->lanes_per_direction > 1) {
+   err = ufs_qcom_host_clk_enable(dev, "rx_lane1_sync_clk",
+   host->rx_l1_sync_clk);
+   if (err)
+   goto disable_tx_l0;
 
-   err = ufs_qcom_host_clk_enable(dev, "tx_lane1_sync_clk",
-   host->tx_l1_sync_clk);
-   if (err)
-   goto disable_rx_l1;
+   /* The tx lane1 clk could be muxed, hence keep this optional */
+   if (host->tx_l1_sync_clk)
+   ufs_qcom_host_clk_enable(dev, "tx_lane1_sync_clk",
+host->tx_l1_sync_clk);
+   }
 
host->is_lane_clks_enabled = true;
goto out;
 
-disable_rx_l1:
-   clk_disable_unprepare(host->rx_l1_sync_clk);
 disable_tx_l0:
clk_disable_unprepare(host->tx_l0_sync_clk);
 disable_rx_l0:
@@ -134,21 +133,34 @@ static int ufs_qcom_init_lane_clks(struct ufs_qcom_host 
*host)
 
err = ufs_qcom_host_clk_get(dev,
"rx_lane0_sync_clk", >rx_l0_sync_clk);
-   if (err)
+   if (err) {
+   dev_err(dev, "%s: failed to get rx_lane0_sync_clk, err %d",
+   __func__, err);
goto out;
+   }
 
err = ufs_qcom_host_clk_get(dev,
"tx_lane0_sync_clk", >tx_l0_sync_clk);
-   if (err)
+   if (err) {
+   dev_err(dev, "%s: failed to get tx_lane0_sync_clk, err %d",
+   __func__, err);
goto out;
+   }
 
-   err = ufs_qcom_host_clk_get(dev, "rx_lane1_sync_clk",
-   >rx_l1_sync_clk);
-   if (err)
-   goto out;
+   /* In case of single lane per direction, don't read lane1 clocks */
+   if (host->hba->lanes_per_direction > 1) {
+   err = ufs_qcom_host_clk_get(dev, "rx_lane1_sync_clk",
+   >rx_l1_sync_clk);
+   if (err) {
+   dev_err(dev, "%s: failed to get rx_lane1_sync_clk, err 
%d",
+

Re: [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-05 Thread Ming Lei
On Mon, Feb 05, 2018 at 07:54:29AM +0100, Hannes Reinecke wrote:
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Quite a few HBAs(such as HPSA, megaraid, mpt3sas, ..) support multiple
> > reply queues, but tags is often HBA wide.
> > 
> > These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> > for automatic affinity assignment.
> > 
> > Now 84676c1f21e8ff5(genirq/affinity: assign vectors to all possible CPUs)
> > has been merged to V4.16-rc, and it is easy to allocate all offline CPUs
> > for some irq vectors, this can't be avoided even though the allocation
> > is improved.
> > 
> > So all these drivers have to avoid to ask HBA to complete request in
> > reply queue which hasn't online CPUs assigned, and HPSA has been broken
> > with v4.15+:
> > 
> > https://marc.info/?l=linux-kernel=151748144730409=2
> > 
> > This issue can be solved generically and easily via blk_mq(scsi_mq) multiple
> > hw queue by mapping each reply queue into hctx, but one tricky thing is
> > the HBA wide(instead of hw queue wide) tags.
> > 
> > This patch is based on the following Hannes's patch:
> > 
> > https://marc.info/?l=linux-block=149132580511346=2
> > 
> > One big difference with Hannes's is that this patch only makes the tags 
> > sbitmap
> > and active_queues data structure HBA wide, and others are kept as NUMA 
> > locality,
> > such as request, hctx, tags, ...
> > 
> > The following patch will support global tags on null_blk, also the 
> > performance
> > data is provided, no obvious performance loss is observed when the whole
> > hw queue depth is same.
> > 
> > Cc: Hannes Reinecke 
> > Cc: Arun Easi 
> > Cc: Omar Sandoval ,
> > Cc: "Martin K. Petersen" ,
> > Cc: James Bottomley ,
> > Cc: Christoph Hellwig ,
> > Cc: Don Brace 
> > Cc: Kashyap Desai 
> > Cc: Peter Rivera 
> > Cc: Laurence Oberman 
> > Cc: Mike Snitzer 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq-debugfs.c |  1 +
> >  block/blk-mq-sched.c   |  2 +-
> >  block/blk-mq-tag.c | 23 ++-
> >  block/blk-mq-tag.h |  5 -
> >  block/blk-mq.c | 29 -
> >  block/blk-mq.h |  3 ++-
> >  include/linux/blk-mq.h |  2 ++
> >  7 files changed, 52 insertions(+), 13 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 0dfafa4b655a..0f0fafe03f5d 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
> > HCTX_FLAG_NAME(SHOULD_MERGE),
> > HCTX_FLAG_NAME(TAG_SHARED),
> > HCTX_FLAG_NAME(SG_MERGE),
> > +   HCTX_FLAG_NAME(GLOBAL_TAGS),
> > HCTX_FLAG_NAME(BLOCKING),
> > HCTX_FLAG_NAME(NO_SCHED),
> >  };
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 55c0a745b427..191d4bc95f0e 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -495,7 +495,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue 
> > *q,
> > int ret;
> >  
> > hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
> > -  set->reserved_tags);
> > +  set->reserved_tags, false);
> > if (!hctx->sched_tags)
> > return -ENOMEM;
> >  
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 571797dc36cb..66377d09eaeb 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -379,9 +379,11 @@ static struct blk_mq_tags 
> > *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
> > return NULL;
> >  }
> >  
> > -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> > +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> > +unsigned int total_tags,
> >  unsigned int reserved_tags,
> > -int node, int alloc_policy)
> > +int node, int alloc_policy,
> > +bool global_tag)
> >  {
> > struct blk_mq_tags *tags;
> >  
> > @@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
> > total_tags,
> > tags->nr_tags = total_tags;
> > tags->nr_reserved_tags = reserved_tags;
> >  
> > +   WARN_ON(global_tag && !set->global_tags);
> > +   if (global_tag && set->global_tags) {
> > +   tags->bitmap_tags = set->global_tags->bitmap_tags;
> > +   tags->breserved_tags = set->global_tags->breserved_tags;
> > +   tags->active_queues = set->global_tags->active_queues;
> > +   tags->global_tags = true;
> > +   return tags;
> > +   }
> > tags->bitmap_tags = >__bitmap_tags;
> > 

Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Ming Lei
On Mon, Feb 05, 2018 at 07:58:29AM +0100, Hannes Reinecke wrote:
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Hi All,
> > 
> > This patchset supports global tags which was started by Hannes originally:
> > 
> > https://marc.info/?l=linux-block=149132580511346=2
> > 
> > Also inroduce 'force_blk_mq' to 'struct scsi_host_template', so that
> > driver can avoid to support two IO paths(legacy and blk-mq), especially
> > recent discusion mentioned that SCSI_MQ will be enabled at default soon.
> > 
> > https://marc.info/?l=linux-scsi=151727684915589=2
> > 
> > With the above two changes, it should be easier to convert SCSI drivers'
> > reply queue into blk-mq's hctx, then the automatic irq affinity issue can
> > be solved easily, please see detailed descrption in commit log.
> > 
> > Also drivers may require to complete request on the submission CPU
> > for avoiding hard/soft deadlock, which can be done easily with blk_mq
> > too.
> > 
> > https://marc.info/?t=15160185141=1=2
> > 
> > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > so that IO hang issue can be avoided inside legacy IO path, this issue is
> > a bit generic, at least HPSA/virtio-scsi are found broken with v4.15+.
> > 
> > Thanks
> > Ming
> > 
> > Ming Lei (5):
> >   blk-mq: tags: define several fields of tags as pointer
> >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> >   block: null_blk: introduce module parameter of 'g_global_tags'
> >   scsi: introduce force_blk_mq
> >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> > 
> >  block/bfq-iosched.c|  4 +--
> >  block/blk-mq-debugfs.c | 11 
> >  block/blk-mq-sched.c   |  2 +-
> >  block/blk-mq-tag.c | 67 
> > +-
> >  block/blk-mq-tag.h | 15 ---
> >  block/blk-mq.c | 31 -
> >  block/blk-mq.h |  3 ++-
> >  block/kyber-iosched.c  |  2 +-
> >  drivers/block/null_blk.c   |  6 +
> >  drivers/scsi/hosts.c   |  1 +
> >  drivers/scsi/virtio_scsi.c | 59 +++-
> >  include/linux/blk-mq.h |  2 ++
> >  include/scsi/scsi_host.h   |  3 +++
> >  13 files changed, 105 insertions(+), 101 deletions(-)
> > 
> Thanks Ming for picking this up.
> 
> I'll give it a shot and see how it behaves on other hardware.

Hi Hannes,

Thanks for looking at it.

I am working on V2, which has fixed some issues, and added your patch
of 'scsi: Add template flag 'host_tagset', but causes a HPSA kernel
oops. Once it is fixed, I will posted V2 out, then there will be one
real example about how to use global tags for converting reply queue
to blk-mq hctx.

https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-V2

Thanks,
Ming


Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Ming Lei
Hi Kashyap,

On Mon, Feb 05, 2018 at 12:35:13PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Hannes Reinecke [mailto:h...@suse.de]
> > Sent: Monday, February 5, 2018 12:28 PM
> > To: Ming Lei; Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig;
> > Mike Snitzer
> > Cc: linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; Martin K .
> > Petersen;
> > James Bottomley; Christoph Hellwig; Don Brace; Kashyap Desai; Peter
> > Rivera;
> > Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On 02/03/2018 05:21 AM, Ming Lei wrote:
> > > Hi All,
> > >
> > > This patchset supports global tags which was started by Hannes
> > > originally:
> > >
> > >   https://marc.info/?l=linux-block=149132580511346=2
> > >
> > > Also inroduce 'force_blk_mq' to 'struct scsi_host_template', so that
> > > driver can avoid to support two IO paths(legacy and blk-mq),
> > > especially recent discusion mentioned that SCSI_MQ will be enabled at
> > default soon.
> > >
> > >   https://marc.info/?l=linux-scsi=151727684915589=2
> > >
> > > With the above two changes, it should be easier to convert SCSI drivers'
> > > reply queue into blk-mq's hctx, then the automatic irq affinity issue
> > > can be solved easily, please see detailed descrption in commit log.
> > >
> > > Also drivers may require to complete request on the submission CPU for
> > > avoiding hard/soft deadlock, which can be done easily with blk_mq too.
> > >
> > >   https://marc.info/?t=15160185141=1=2
> > >
> > > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > > so that IO hang issue can be avoided inside legacy IO path, this issue
> > > is a bit generic, at least HPSA/virtio-scsi are found broken with
> > > v4.15+.
> > >
> > > Thanks
> > > Ming
> > >
> > > Ming Lei (5):
> > >   blk-mq: tags: define several fields of tags as pointer
> > >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> > >   block: null_blk: introduce module parameter of 'g_global_tags'
> > >   scsi: introduce force_blk_mq
> > >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> > >
> > >  block/bfq-iosched.c|  4 +--
> > >  block/blk-mq-debugfs.c | 11 
> > >  block/blk-mq-sched.c   |  2 +-
> > >  block/blk-mq-tag.c | 67
> > > +-
> > >  block/blk-mq-tag.h | 15 ---
> > >  block/blk-mq.c | 31 -
> > >  block/blk-mq.h |  3 ++-
> > >  block/kyber-iosched.c  |  2 +-
> > >  drivers/block/null_blk.c   |  6 +
> > >  drivers/scsi/hosts.c   |  1 +
> > >  drivers/scsi/virtio_scsi.c | 59
> > > +++-
> > >  include/linux/blk-mq.h |  2 ++
> > >  include/scsi/scsi_host.h   |  3 +++
> > >  13 files changed, 105 insertions(+), 101 deletions(-)
> > >
> > Thanks Ming for picking this up.
> >
> > I'll give it a shot and see how it behaves on other hardware.
> 
> Ming -
> 
> There is no way we can enable global tags from SCSI stack in this patch

It has been done in V2 of this patchset, which will be posted out after
HPSA's issue is fixed:

https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-V2

Global tags will be enabled easily via .host_tagset of scsi_host_template.

> series.   I still think we have no solution for issue described below in
> this patch series.
> https://marc.info/?t=15160185141=1=2
> 
> What we will be doing is just use global tag HBA wide instead of h/w queue
> based.

Right, that is just what the 1st three patches are doing.

> We still have more than one reply queue ending up completion one CPU.

pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
smp_affinity_enable has to be set as 1, but seems it is the default
setting.

Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors()
which figures out an optimal number of vectors, and the computation is
based on cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
mapped to some of reply queues, these queues won't be active(no request
submitted to these queues). The mechanism of PCI_IRQ_AFFINITY basically
makes sure that more than one irq vector won't be handled by one same CPU,
and the irq vector spread is done in irq_create_affinity_masks().

> Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via module
> parameter to simulate the issue. We need more number of Online CPU than
> reply-queue.

IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
irq vector number, num_possible_cpus()/num_online_cpus() and each
irq vector's affinity assignment.

> We may see completion redirected to original CPU because of
> "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one CPU busy
> in local ISR routine.

Could you dump each irq vector's affinity assignment of your 

Re: [PATCH 2/3] virtio-scsi: Add FC transport class

2018-02-05 Thread Hannes Reinecke
On 02/02/2018 06:39 PM, Steffen Maier wrote:
> 
> On 02/02/2018 05:00 PM, Hannes Reinecke wrote:
>> On 01/26/2018 05:54 PM, Steffen Maier wrote:
>>> On 12/18/2017 09:31 AM, Hannes Reinecke wrote:
 On 12/15/2017 07:08 PM, Steffen Maier wrote:
> On 12/14/2017 11:11 AM, Hannes Reinecke wrote:
> 
>>> To me, this raises the question which properties of the host's FC
>>> (driver core) objects should be mirrored to the guest. Ideally all (and
>>> that's a lot).
>>> This in turn makes me wonder if mirroring is really desirable (e.g.
>>> considering the effort) or if only the guest should have its own FC
>>> object hierarchy which does _not_ exist on the KVM host in case an
>>> fc_host is passed through with virtio-(v)fc.
> 
>>> A few more thoughts on your presentation [1]:
>>>
>>> "Devices on the vport will not be visible on the host"
>>> I could not agree more to the design point that devices (or at least
>>> their descendant object subtree) passed through to a guest should not
>>> appear on the host!
>>> With virtio-blk or virtio-scsi, we have SCSI devices and thus disks
>>> visible in the host, which needlessly scans partitions, or even worse
>>> automatically scans for LVM and maybe even activates PVs/VGs/LVs. It's
>>> hard for a KVM host admin to suppress this (and not break the devices
>>> the host needs itself).
>>> If we mirror the host's scsi_transport_fc tree including fc_rports and
>>> thus SCSI devices etc., we would have the same problems?
>>> Even more so, dev_loss_tmo and fast_io_fail_tmo would run independently
>>> on the host and in the guest on the same mirrored scsi_transport_fc
>>> object tree. I can envision user confusion having configured timeouts on
>>> the "wrong" side (host vs. guest). Also we would still need a mechanism
>>> to mirror fc_rport (un)block from host to guest for proper transport
>>> recovery. In zfcp we try to recover on transport rather than scsi_eh
>>> whenever possible because it is so much smoother.
>>>
>> As similar thing can be achieved event today, by setting the
>> 'no_uld_attach' parameter when scanning the scsi device
>> (that's what some RAID HBAs do).
>> However, there currently is no way of modifying it from user-space, and
>> certainly not to change the behaviour for existing devices.
>> It should be relatively simple to set this flag whenever the host is
>> exposed to a VM; we would still see the scsi devices, but the 'sd'
>> driver won't be attached so nothing will scan the device on the host.
> 
> Ah, nice, didn't know that. It would solve the undesired I/O problem in
> the host.
> But it would not solve the so far somewhat unsynchronized state
> transitions of fc_rports on the host and their mirrors in the guest?
> 
> I would be very interested in how you intend to do transport recovery.
> 
So am I :-)

Current plan is to check fc-transport events; probably implementing a
listener which then will manage the interface to the guest.
Also I'm looking at implementing a new block device, which just gets a
WWPN as input and manages all associated scsi (sg) devices.
But that still needs some more work to be done.

>>> "Towards virtio-fc?"
[ .. ]
>> I'm not convinced that moving to full virtio-fc is something we want or
>> even can do.
>> Neither qla2xxx nor lpfc allow for direct FC frame access; so one would
>> need to reformat the FC frames into something the driver understands,
>> just so that the hardware can transform it back into FC frames.
> 
> I thought of a more high level para-virtualized FCP HBA interface, than
> FC frames (which did exist in kernel v2.4 under drivers/fc4/ but no
> longer as it seems). Just like large parts of today's FCP LLDDs handle
> scatter gather lists and framing is done by the hardware.
> 
But that would amount to yet another abstraction layer; not sure if it's
worth the trouble...

>> Another thing is xid management; some drivers have to do their own xid
>> management, based on hardware capabilities etc.
>> So the FC frames would need to re-write the xids, making it hard if not
>> impossible to match things up when the response comes in.
> 
> For such things, where the hardware exposes more details (than, say,
> zfcp sees) I thought the LLDD on the KVM host would handle such details
> internally and only expose the higher level interface to virtio-fc.
> 
Yes; but that's what I'm aiming for with my virtio-vfc thingie already :-)

> Maybe something roughly like the basic transport protocol part of
> ibmvfc/ibmvscsi (not the other end in the firmware and not the cross
> partition DMA part), if I understood its overall design correctly by
> quickly looking at the code.
> I somewhat had the impression that zfcp isn't too far from the overall
> operations style. As seem qla2xxx or lpfc to me, they just see and need
> to handle some more low-level FC details.
> Conceptually replace CRQ/RDMA or QDIO with virtio.
> (And for ibmvscsi also: SRP => FCP because it uses a different SCSI
> transport.)
> 
This is actually a different