Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> On 07 Feb 2018, at 23:18, Jens Axboe wrote:
> 
> On 2/7/18 2:19 PM, Paolo Valente wrote:
>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>> be re-inserted into the active I/O scheduler for that device. As a
>> consequence, I/O schedulers may get the same request inserted again,
>> even several times, without a finish_request invoked on that request
>> before each re-insertion.
>> 
>> This fact is the cause of the failure reported in [1]. For an I/O
>> scheduler, every re-insertion of the same re-prepared request is
>> equivalent to the insertion of a new request. For schedulers like
>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>> request, until the finish_request hook is invoked on the request. In
>> particular, BFQ may get stuck, waiting forever for the number of
>> request dispatches, of the same request, to be balanced by an equal
>> number of request completions (while there will be one completion for
>> that request). In this state, BFQ may refuse to serve I/O requests
>> from other bfq_queues. The hang reported in [1] then follows.
>> 
>> However, the above re-prepared requests undergo a requeue, thus the
>> requeue_request hook of the active elevator is invoked for these
>> requests, if set. This commit then addresses the above issue by
>> properly implementing the hook requeue_request in BFQ.
> 
> Thanks, applied.
> 

Hi Jens,
I forgot to add
Tested-by: Oleksandr Natalenko 
in the patch.

Is it still possible to add it?

Thanks,
Paolo

> -- 
> Jens Axboe



Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-07 Thread Hannes Reinecke
On 02/07/2018 03:14 PM, Kashyap Desai wrote:
>> -Original Message-
>> From: Ming Lei [mailto:ming@redhat.com]
>> Sent: Wednesday, February 7, 2018 5:53 PM
>> To: Hannes Reinecke
>> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
>> Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
> Sandoval;
>> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
>> Rivera; Paolo Bonzini; Laurence Oberman
>> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
>> force_blk_mq
>>
>> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> [ .. ]
>
> Could you share us your patch for enabling global_tags/MQ on
 megaraid_sas
> so that I can reproduce your test?
>
>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
>
> Could you share us what the IOPS/CPU utilization effect is after
 applying the
> patch V2? And your test script?
 Regarding CPU utilization, I need to test one more time. Currently
 system is in use.

 I run below fio test on total 24 SSDs expander attached.

 numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
 --ioengine=libaio --rw=randread

 Performance dropped from 1.6 M IOPs to 770K IOPs.

>>> This is basically what we've seen with earlier iterations.
>>
>> Hi Hannes,
>>
>> As I mentioned in another mail[1], Kashyap's patch has a big issue,
> which
>> causes only reply queue 0 to be used.
>>
>> [1] https://marc.info/?l=linux-scsi=151793204014631=2
>>
>> So could you guys run your performance test again after fixing the
> patch?
> 
> Ming -
> 
> I tried after the change you requested.  The performance drop is still unresolved.
> From 1.6 M IOPS to 770K IOPS.
> 
> See the data below. All 24 reply queues are in use correctly.
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0   NODE1  NAME
>  360  16422  0   16422  IR-PCI-MSI 70254653-edge megasas
>  364  15980  0   15980  IR-PCI-MSI 70254657-edge megasas
>  362  15979  0   15979  IR-PCI-MSI 70254655-edge megasas
>  345  15696  0   15696  IR-PCI-MSI 70254638-edge megasas
>  341  15659  0   15659  IR-PCI-MSI 70254634-edge megasas
>  369  15656  0   15656  IR-PCI-MSI 70254662-edge megasas
>  359  15650  0   15650  IR-PCI-MSI 70254652-edge megasas
>  358  15596  0   15596  IR-PCI-MSI 70254651-edge megasas
>  350  15574  0   15574  IR-PCI-MSI 70254643-edge megasas
>  342  15532  0   15532  IR-PCI-MSI 70254635-edge megasas
>  344  15527  0   15527  IR-PCI-MSI 70254637-edge megasas
>  346  15485  0   15485  IR-PCI-MSI 70254639-edge megasas
>  361  15482  0   15482  IR-PCI-MSI 70254654-edge megasas
>  348  15467  0   15467  IR-PCI-MSI 70254641-edge megasas
>  368  15463  0   15463  IR-PCI-MSI 70254661-edge megasas
>  354  15420  0   15420  IR-PCI-MSI 70254647-edge megasas
>  351  15378  0   15378  IR-PCI-MSI 70254644-edge megasas
>  352  15377  0   15377  IR-PCI-MSI 70254645-edge megasas
>  356  15348  0   15348  IR-PCI-MSI 70254649-edge megasas
>  337  15344  0   15344  IR-PCI-MSI 70254630-edge megasas
>  343  15320  0   15320  IR-PCI-MSI 70254636-edge megasas
>  355  15266  0   15266  IR-PCI-MSI 70254648-edge megasas
>  335  15247  0   15247  IR-PCI-MSI 70254628-edge megasas
>  363  15233  0   15233  IR-PCI-MSI 70254656-edge megasas
> 
> 
> Average:    CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %gnice  %idle
> Average:     18   3.80   0.00  14.78    10.08    0.00   0.00   4.01    0.00    0.00  67.33
> Average:     19   3.26   0.00  15.35    10.62    0.00   0.00   4.03    0.00    0.00  66.74
> Average:     20   3.42   0.00  14.57    10.67    0.00   0.00   3.84    0.00    0.00  67.50
> Average:     21   3.19   0.00  15.60    10.75    0.00   0.00   4.16    0.00    0.00  66.30
> Average:     22   3.58   0.00  15.15    10.66    0.00   0.00   3.51    0.00    0.00  67.11
> Average:     23   3.34   0.00  15.36    10.63    0.00   0.00   4.17    0.00    0.00  66.50
> Average:     24   3.50   0.00  14.58    10.93    0.00   0.00   3.85    0.00    0.00  67.13
> Average:     25   3.20   0.00  14.68    10.86    0.00   0.00   4.31    0.00    0.00  66.95
> Average:     26   3.27   0.00  14.80    10.70    0.00   0.00   3.68    0.00    0.00  67.55
> Average:     27   3.58   0.00  15.36    10.80    0.00   0.00   3.79    0.00    0.00  66.48
> Average:     28   3.46   0.00  15.17    10.46    0.00   0.00   3.32    0.00    0.00  67.59
> Average:     29   3.34   0.00  14.42    10.72    0.00   0.00   3.34    0.00    0.00  68.18
> Average:   

Re: [PATCH RESEND] blk-throttle: dispatch more sync writes in block throttle layer

2018-02-07 Thread xuejiufei
Hi Tejun,

Could you please kindly review this patch or give some advice?

Thanks.
Jiufei Xue

On 2018/1/23 10:08 AM, xuejiufei wrote:
> Cgroup writeback has been supported since v4.2, but a problem exists
> in the following case.
> 
> A cgroup may issue both buffered and direct/sync IOs. The foreground
> thread is stalled when periodic writeback IOs are flushed, because the
> service queue in the block throttle layer already holds plenty of
> writeback IOs, so foreground IOs are enqueued behind them by the FIFO
> policy. The current policy dispatches 6 reads and 2 writes during each
> round, so sync writes can be significantly delayed.
> 
> This patch adds another queue in block throttle. Now there are 3 queues
> in a service queue: read, sync write and async write, and we can
> dispatch more sync writes than async writes.
> 
> Test:
> 1) set up a memcg and a blkcg with an upper-limit bandwidth of 20 MB/s;
> 2) start a script writing a 5G buffer to the page cache and start a sync;
> 3) start sync writes:
> dd if=/dev/zero of=test bs=4k count=12800 oflag=direct conv=notrunc
> 
> Unpatched:
> 52428800 bytes (52 MB) copied, 307.065 s, 171 kB/s
> 
> Patched:
> 52428800 bytes (52 MB) copied, 13.8192 s, 3.8 MB/s
> 
> Signed-off-by: Jiufei Xue 
> ---
>  block/blk-throttle.c | 195 
> ++-
>  1 file changed, 129 insertions(+), 66 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index d19f416..842257e 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -14,10 +14,10 @@
>  #include "blk.h"
>  
>  /* Max dispatch from a group in 1 round */
> -static int throtl_grp_quantum = 8;
> +static int throtl_grp_quantum = 32;
>  
>  /* Total max dispatch from all groups in one round */
> -static int throtl_quantum = 32;
> +static int throtl_quantum = 128;
>  
>  /* Throttling is performed over a slice and after that slice is renewed */
>  #define DFL_THROTL_SLICE_HD (HZ / 10)
> @@ -43,6 +43,12 @@
>  /* A workqueue to queue throttle related work */
>  static struct workqueue_struct *kthrotld_workqueue;
>  
> +enum wl_type {
> + READ_WORKLOAD = 0,
> + SYNC_WRITE_WORKLOAD = 1,
> + ASYNC_WRITE_WORKLOAD = 2
> +};
> +
>  /*
>   * To implement hierarchical throttling, throtl_grps form a tree and bios
>   * are dispatched upwards level by level until they reach the top and get
> @@ -79,8 +85,11 @@ struct throtl_service_queue {
>* Bios queued directly to this service_queue or dispatched from
>* children throtl_grp's.
>*/
> -	struct list_head	queued[2];	/* throtl_qnode [READ/WRITE] */
> -	unsigned int		nr_queued[2];	/* number of queued bios */
> +	/* throtl_qnode [READ/WRITE/ASYNC_WRITE] */
> +	struct list_head	queued[3];
> +
> +	unsigned int		nr_queued[3];	/* number of queued bios */
> +
>  
>   /*
>* RB tree of active children throtl_grp's, which are sorted by
> @@ -127,8 +136,8 @@ struct throtl_grp {
>* with the sibling qnode_on_parents and the parent's
>* qnode_on_self.
>*/
> - struct throtl_qnode qnode_on_self[2];
> - struct throtl_qnode qnode_on_parent[2];
> + struct throtl_qnode qnode_on_self[3];
> + struct throtl_qnode qnode_on_parent[3];
>  
>   /*
>* Dispatch time in jiffies. This is the estimated time when group
> @@ -202,7 +211,7 @@ struct throtl_data
>   struct request_queue *queue;
>  
>   /* Total Number of queued bios on READ and WRITE lists */
> - unsigned int nr_queued[2];
> + unsigned int nr_queued[3];
>  
>   unsigned int throtl_slice;
>  
> @@ -274,6 +283,18 @@ static struct throtl_data *sq_to_td(struct 
> throtl_service_queue *sq)
>   return container_of(sq, struct throtl_data, service_queue);
>  }
>  
> +static inline enum wl_type bio_workload_type(struct bio *bio)
> +{
> + return bio_data_dir(bio) ?
> +((bio->bi_opf & REQ_SYNC) ? SYNC_WRITE_WORKLOAD :
> +ASYNC_WRITE_WORKLOAD) : READ_WORKLOAD;
> +}
> +
> +static inline bool wl_to_rw(enum wl_type type)
> +{
> + return type >= SYNC_WRITE_WORKLOAD;
> +}
> +
>  /*
>   * cgroup's limit in LIMIT_MAX is scaled if low limit is set. This scale is 
> to
>   * make the IO dispatch more smooth.
> @@ -475,8 +496,9 @@ static struct bio *throtl_pop_queued(struct list_head 
> *queued,
>  /* init a service_queue, assumes the caller zeroed it */
>  static void throtl_service_queue_init(struct throtl_service_queue *sq)
>  {
> -	INIT_LIST_HEAD(&sq->queued[0]);
> -	INIT_LIST_HEAD(&sq->queued[1]);
> +	INIT_LIST_HEAD(&sq->queued[READ_WORKLOAD]);
> +	INIT_LIST_HEAD(&sq->queued[SYNC_WRITE_WORKLOAD]);
> +	INIT_LIST_HEAD(&sq->queued[ASYNC_WRITE_WORKLOAD]);
> 	sq->pending_tree = RB_ROOT;
> 	timer_setup(&sq->pending_timer, throtl_pending_timer_fn, 0);
>  }
> @@ -484,7 +506,7 @@ static void throtl_service_queue_init(struct 
> throtl_service_queue *sq)
>  static struct 
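The dispatch hunks of this patch are cut off by the archive. Purely as an illustration of the idea described in the commit message (the ratios below are made up, not taken from the patch), a per-round dispatch budget can be split so sync writes get a larger share than async writes, in the spirit of the old 6-read/2-write split of an 8-request quantum:

	/* Illustration only; wl_quantum() and the ratios are invented here. */
	enum wl_type { READ_WORKLOAD, SYNC_WRITE_WORKLOAD, ASYNC_WRITE_WORKLOAD };

	static unsigned int wl_quantum(enum wl_type type, unsigned int quantum)
	{
		switch (type) {
		case READ_WORKLOAD:
			return quantum * 3 / 4;		/* e.g. 24 of 32 */
		case SYNC_WRITE_WORKLOAD:
			return quantum * 3 / 16;	/* e.g.  6 of 32 */
		default:				/* ASYNC_WRITE_WORKLOAD */
			return quantum / 16;		/* e.g.  2 of 32 */
		}
	}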

Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-07 Thread Joseph Qi
Hi Tejun,
Thanks very much for reviewing this patch.

On 18/2/8 05:38, Tejun Heo wrote:
> Hello, Joseph.
> 
> On Wed, Feb 07, 2018 at 04:40:02PM +0800, Joseph Qi wrote:
>> writeback kworker
>>   blkcg_bio_issue_check
>> rcu_read_lock
>> blkg_lookup
>> <<< *race window*
>> blk_throtl_bio
>>   spin_lock_irq(q->queue_lock)
>>   spin_unlock_irq(q->queue_lock)
>> rcu_read_unlock
>>
>> cgroup_rmdir
>>   cgroup_destroy_locked
>> kill_css
>>   css_killed_ref_fn
>> css_killed_work_fn
>>   offline_css
>> blkcg_css_offline
>>   spin_trylock(q->queue_lock)
>>   blkg_destroy
>>   spin_unlock(q->queue_lock)
> 
> Ah, right.  Thanks for spotting the bug.
> 
>> Since rcu can only prevent blkg from being released while it is in use,
>> blkg->refcnt can drop to 0 during blkg_destroy, which schedules the
>> blkg release.
>> Then trying to blkg_get in blk_throtl_bio will complain with a WARNING,
>> and the corresponding blkg_put will schedule the blkg release again,
>> which results in a double free.
>> This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
>> creation in blkcg_bio_issue_check()"). Before this commit, the code would
>> look up first and then look up/create again with queue_lock held. So
>> revive this logic to fix the race.
> 
> The change seems a bit drastic to me.  Can't we do something like the
> following instead?
> 
> blk_throtl_bio()
> {
>   ... non throttled cases ...
> 
>   /* out-of-limit, queue to @tg */
> 
>   /*
>* We can look up and retry but the race window is tiny here.
>* Just letting it through should be good enough.
>*/
>   if (!css_tryget(blkcg->css))
>   goto out;
> 
>   ... actual queueing ...
>   css_put(blkcg->css);
>   ...
> }
So you mean checking css->refcnt to prevent further use of
blkg_get? I think that makes sense.
IMO, we should use css_tryget_online instead, and right after taking
queue_lock, because there may be more uses of blkg_get in blk_throtl_bio
in the future. Actually it already has two now: one in
blk_throtl_assoc_bio, and the other in throtl_qnode_add_bio.
What do you think of this?

Thanks,
Joseph
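For reference, the idea being discussed boils down to pinning the blkcg css around the out-of-limit queuing path so blkg_get()/blkg_put() inside blk_throtl_bio() cannot race with blkg_destroy(). A minimal sketch, not the final patch (the helper names are invented here; css_tryget_online()/css_put() are the real cgroup APIs):

	#include <linux/blk-cgroup.h>

	/* fails once the cgroup is being removed */
	static bool throtl_pin_blkcg(struct blkcg *blkcg)
	{
		return css_tryget_online(&blkcg->css);
	}

	static void throtl_unpin_blkcg(struct blkcg *blkcg)
	{
		css_put(&blkcg->css);
	}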


Re: [PATCH rfc 3/5] irq_poll: wire up irq_am

2018-02-07 Thread Bart Van Assche
On Tue, 2018-02-06 at 00:03 +0200, Sagi Grimberg wrote:
> +void irq_poll_init_am(struct irq_poll *iop, unsigned int nr_events,
> +unsigned short nr_levels, unsigned short start_level, irq_poll_am_fn 
> *amfn)
> +{
> + iop->amfn = amfn;
> + irq_am_init(&iop->am, nr_events, nr_levels, start_level, irq_poll_am);
> +}

This function has a large number of parameters and most of them are passed
verbatim to irq_am_init(). Please consider introducing a structure for these
parameters so that the number of function arguments stays reasonable.

Thanks,

Bart.
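One possible shape for the parameter structure suggested above; irq_poll_init_am()/irq_am_init()/irq_poll_am come from the RFC patches under review, while the struct and its field grouping are invented here for illustration:

	/* hypothetical grouping of the adaptive-moderation tuning knobs */
	struct irq_am_params {
		unsigned int	nr_events;
		unsigned short	nr_levels;
		unsigned short	start_level;
	};

	void irq_poll_init_am(struct irq_poll *iop, const struct irq_am_params *p,
			      irq_poll_am_fn *amfn)
	{
		iop->amfn = amfn;
		irq_am_init(&iop->am, p->nr_events, p->nr_levels, p->start_level,
			    irq_poll_am);
	}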





Re: [PATCH rfc 2/5] irq-am: add some debugfs exposure on tuning state

2018-02-07 Thread Bart Van Assche
On Tue, 2018-02-06 at 00:03 +0200, Sagi Grimberg wrote:
> +static int irq_am_register_debugfs(struct irq_am *am)
> +{
> + char name[20];
> +
> + snprintf(name, sizeof(name), "am%u", am->id);
> + am->debugfs_dir = debugfs_create_dir(name,
> + irq_am_debugfs_root);

How is a user expected to figure out which driver am%u is associated with?
Please make sure that these directories have a name that makes it easy for
users to figure out what driver the moderation settings apply to.

Thanks,

Bart.
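A sketch of the kind of naming asked for above: derive the debugfs directory name from the owning device rather than a bare "am%u" counter. Passing a struct device in, and the irq_am fields used here, are assumptions about the RFC code:

	#include <linux/debugfs.h>
	#include <linux/device.h>

	/* illustration only; assumes the caller can supply the owning device */
	static int irq_am_register_debugfs(struct irq_am *am, struct device *dev)
	{
		char name[64];

		snprintf(name, sizeof(name), "%s-am%u", dev_name(dev), am->id);
		am->debugfs_dir = debugfs_create_dir(name, irq_am_debugfs_root);
		return am->debugfs_dir ? 0 : -ENOMEM;
	}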




Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-07 Thread Ming Lei
Hi Kashyap,

On Wed, Feb 07, 2018 at 07:44:04PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Wednesday, February 7, 2018 5:53 PM
> > To: Hannes Reinecke
> > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > Hi all,
> > >
> > > [ .. ]
> > > >>
> > > >> Could you share us your patch for enabling global_tags/MQ on
> > > > megaraid_sas
> > > >> so that I can reproduce your test?
> > > >>
> > > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > > >>
> > > >> Could you share us what the IOPS/CPU utilization effect is after
> > > > applying the
> > > >> patch V2? And your test script?
> > > > Regarding CPU utilization, I need to test one more time. Currently
> > > > system is in use.
> > > >
> > > > I run below fio test on total 24 SSDs expander attached.
> > > >
> > > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > > --ioengine=libaio --rw=randread
> > > >
> > > > Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > >
> > > This is basically what we've seen with earlier iterations.
> >
> > Hi Hannes,
> >
> > As I mentioned in another mail[1], Kashyap's patch has a big issue,
> which
> > causes only reply queue 0 to be used.
> >
> > [1] https://marc.info/?l=linux-scsi=151793204014631=2
> >
> > So could you guys run your performance test again after fixing the
> patch?
> 
> Ming -
> 
> I tried after the change you requested.  The performance drop is still unresolved.
> From 1.6 M IOPS to 770K IOPS.
> 
> See the data below. All 24 reply queues are in use correctly.
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0   NODE1  NAME
>  360  16422  0   16422  IR-PCI-MSI 70254653-edge megasas
>  364  15980  0   15980  IR-PCI-MSI 70254657-edge megasas
>  362  15979  0   15979  IR-PCI-MSI 70254655-edge megasas
>  345  15696  0   15696  IR-PCI-MSI 70254638-edge megasas
>  341  15659  0   15659  IR-PCI-MSI 70254634-edge megasas
>  369  15656  0   15656  IR-PCI-MSI 70254662-edge megasas
>  359  15650  0   15650  IR-PCI-MSI 70254652-edge megasas
>  358  15596  0   15596  IR-PCI-MSI 70254651-edge megasas
>  350  15574  0   15574  IR-PCI-MSI 70254643-edge megasas
>  342  15532  0   15532  IR-PCI-MSI 70254635-edge megasas
>  344  15527  0   15527  IR-PCI-MSI 70254637-edge megasas
>  346  15485  0   15485  IR-PCI-MSI 70254639-edge megasas
>  361  15482  0   15482  IR-PCI-MSI 70254654-edge megasas
>  348  15467  0   15467  IR-PCI-MSI 70254641-edge megasas
>  368  15463  0   15463  IR-PCI-MSI 70254661-edge megasas
>  354  15420  0   15420  IR-PCI-MSI 70254647-edge megasas
>  351  15378  0   15378  IR-PCI-MSI 70254644-edge megasas
>  352  15377  0   15377  IR-PCI-MSI 70254645-edge megasas
>  356  15348  0   15348  IR-PCI-MSI 70254649-edge megasas
>  337  15344  0   15344  IR-PCI-MSI 70254630-edge megasas
>  343  15320  0   15320  IR-PCI-MSI 70254636-edge megasas
>  355  15266  0   15266  IR-PCI-MSI 70254648-edge megasas
>  335  15247  0   15247  IR-PCI-MSI 70254628-edge megasas
>  363  15233  0   15233  IR-PCI-MSI 70254656-edge megasas
> 
> 
> Average:    CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %gnice  %idle
> Average:     18   3.80   0.00  14.78    10.08    0.00   0.00   4.01    0.00    0.00  67.33
> Average:     19   3.26   0.00  15.35    10.62    0.00   0.00   4.03    0.00    0.00  66.74
> Average:     20   3.42   0.00  14.57    10.67    0.00   0.00   3.84    0.00    0.00  67.50
> Average:     21   3.19   0.00  15.60    10.75    0.00   0.00   4.16    0.00    0.00  66.30
> Average:     22   3.58   0.00  15.15    10.66    0.00   0.00   3.51    0.00    0.00  67.11
> Average:     23   3.34   0.00  15.36    10.63    0.00   0.00   4.17    0.00    0.00  66.50
> Average:     24   3.50   0.00  14.58    10.93    0.00   0.00   3.85    0.00    0.00  67.13
> Average:     25   3.20   0.00  14.68    10.86    0.00   0.00   4.31    0.00    0.00  66.95
> Average:     26   3.27   0.00  14.80    10.70    0.00   0.00   3.68    0.00    0.00  67.55
> Average:     27   3.58   0.00  15.36    10.80    0.00   0.00   3.79    0.00    0.00  66.48
> Average:     28   3.46   0.00  15.17    10.46    0.00   0.00   3.32    0.00    0.00  67.59
> Average:  

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 23:48 +, Bart Van Assche wrote:
> With this patch applied I see requests for which it seems like the timeout 
> handler
> did not get invoked: [ ... ]

I just noticed the following in the system log, which is probably the reason 
why some
requests got stuck:

Feb  7 15:16:26 ubuntu-vm kernel: BUG: unable to handle kernel NULL pointer 
dereference at   (null)
Feb  7 15:16:26 ubuntu-vm kernel: IP: scsi_times_out+0x17/0x2c0 [scsi_mod]
Feb  7 15:16:26 ubuntu-vm kernel: PGD 0 P4D 0  
Feb  7 15:16:26 ubuntu-vm kernel: Oops:  [#1] PREEMPT SMP
Feb  7 15:16:26 ubuntu-vm kernel: Modules linked in: dm_service_time ib_srp 
libcrc32c crc32c_generic scsi_transport_srp target_core_pscsi target_core_file 
ib_srpt target_core_iblock target_core_mod
rdma_rxe ip6_udp_tunnel udp_tunnel ib_umad ib_uverbs scsi_debug brd mq_deadline 
kyber_iosched deadline_iosched cfq_iosched bfq crct10dif_pclmul crc32_pclmul 
af_packet ghash_clmulni_intel pcbc
aesni_intel aes_x86_64 crypto_simd cryptd glue_helper serio_raw virtio_console 
virtio_balloon sg button i2c_piix4 dm_multipath dm_mod dax scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua ib_iser rdma_cm iw_cm
ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 
ip_tables x_tables ipv6 autofs4 ext4 crc16 mbcache jbd2 sd_mod virtio_net 
virtio_blk virtio_scsi sr_mod cdrom ata_generic
pata_acpi psmouse crc32c_intel i2c_core atkbd
Feb  7 15:16:26 ubuntu-vm kernel: uhci_hcd ehci_hcd intel_agp ata_piix 
intel_gtt agpgart libata virtio_pci usbcore virtio_ring virtio scsi_mod 
usb_common unix
Feb  7 15:16:26 ubuntu-vm kernel: CPU: 0 PID: 146 Comm: kworker/0:1H Not 
tainted 4.15.0-dbg+ #1
Feb  7 15:16:26 ubuntu-vm kernel: Hardware name: QEMU Standard PC (i440FX + 
PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Feb  7 15:16:26 ubuntu-vm kernel: Workqueue: kblockd blk_mq_timeout_work
Feb  7 15:16:26 ubuntu-vm kernel: RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
Feb  7 15:16:26 ubuntu-vm kernel: RSP: 0018:98f0c02abd58 EFLAGS: 00010246
Feb  7 15:16:26 ubuntu-vm kernel: RAX:  RBX: 965de2b3a2c0 
RCX: 
Feb  7 15:16:26 ubuntu-vm kernel: RDX: c0094740 RSI:  
RDI: 965de2b3a2c0
Feb  7 15:16:26 ubuntu-vm kernel: RBP: 965de2b3a438 R08: fffc 
R09: 0007
Feb  7 15:16:26 ubuntu-vm kernel: R10:  R11:  
R12: 0002
Feb  7 15:16:26 ubuntu-vm kernel: R13:  R14: 965de2a44218 
R15: 965de2a48728
Feb  7 15:16:26 ubuntu-vm kernel: FS:  () 
GS:965dffc0() knlGS:
Feb  7 15:16:26 ubuntu-vm kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Feb  7 15:16:26 ubuntu-vm kernel: CR2:  CR3: 3ce0f003 
CR4: 003606f0
Feb  7 15:16:26 ubuntu-vm kernel: DR0:  DR1:  
DR2: 
Feb  7 15:16:26 ubuntu-vm kernel: DR3:  DR6: fffe0ff0 
DR7: 0400
Feb  7 15:16:26 ubuntu-vm kernel: Call Trace:
Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_terminate_expired+0x42/0x80
Feb  7 15:16:26 ubuntu-vm kernel: bt_iter+0x3d/0x50
Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_queue_tag_busy_iter+0xe9/0x200
Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_timeout_work+0x181/0x2e0
Feb  7 15:16:26 ubuntu-vm kernel: process_one_work+0x21c/0x6d0
Feb  7 15:16:26 ubuntu-vm kernel: worker_thread+0x35/0x380
Feb  7 15:16:26 ubuntu-vm kernel: kthread+0x117/0x130
Feb  7 15:16:26 ubuntu-vm kernel: ret_from_fork+0x24/0x30
Feb  7 15:16:26 ubuntu-vm kernel: Code: ff ff 0f ff e9 fd fe ff ff 90 66 2e 0f 
1f 84 00 00 00 00 00 41 55 41 54 55 48 8d af 78 01 00 00 53 48 8b 87 b0 01 00 
00 48 89 fb <4c> 8b 28 0f 1f 44 00 00 65
8b 05 6a b5 f8 3f 83 f8 0f 0f 87 ed  
Feb  7 15:16:26 ubuntu-vm kernel: RIP: scsi_times_out+0x17/0x2c0 [scsi_mod] 
RSP: 98f0c02abd58
Feb  7 15:16:26 ubuntu-vm kernel: CR2: 
Feb  7 15:16:26 ubuntu-vm kernel: ---[ end trace ce6c20d95bab450e ]---

Bart.








Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 12:07 -0800, t...@kernel.org wrote:
> Ah, you're right.  u64_stat_sync doesn't imply barriers, so we want
> something like the following.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df93102..d6edf3b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -593,7 +593,7 @@ static void blk_mq_rq_update_aborted_gstate(struct 
> request *rq, u64 gstate)
>*/
>   local_irq_save(flags);
>   u64_stats_update_begin(&rq->aborted_gstate_sync);
> - rq->aborted_gstate = gstate;
> + smp_store_release(&rq->aborted_gstate, gstate);
>   u64_stats_update_end(&rq->aborted_gstate_sync);
>   local_irq_restore(flags);
>  }
> @@ -605,7 +605,7 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  
>   do {
>   start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
> - aborted_gstate = rq->aborted_gstate;
> + aborted_gstate = smp_load_acquire(&rq->aborted_gstate);
>   } while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
>  
>   return aborted_gstate;
> @@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
> reserved)
>* ->aborted_gstate is set, this may lead to ignored
>* completions and further spurious timeouts.
>*/
> - blk_mq_rq_update_aborted_gstate(req, 0);
>   blk_add_timer(req);
> + blk_mq_rq_update_aborted_gstate(req, 0);
>   break;
>   case BLK_EH_NOT_HANDLED:
>   break;

Hello Tejun,

With this patch applied I see requests for which it seems like the timeout 
handler
did not get invoked:

sdc/hctx0/busy:95e04b7c {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT|MQ_TIMEOUT_EXPIRED, .state=i
n_flight, .gstate=0xed/0xed, .tag=26, .internal_tag=-1, .cmd=Write(10) 2a 00 00 
 
00 60 ba 00 00 08 00, .retries=0, .result = 0x5, .flags=TAGGED|INITIALIZED, 
 
.timeout=1.000, allocated 1093.180 s ago}

sdc/hctx0/busy:65a64e9b {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT|MQ_TIMEOUT_EXPIRED, .state=i
n_flight, .gstate=0x5/0x5, .tag=27, .internal_tag=-1, .cmd=Write(10) 2a 00 00 00
 62 d2 00 00 08 00, .retries=0, .result = 0x5, .flags=TAGGED|INITIALIZED, .t
imeout=1.000, allocated 1093.180 s ago}

[ ... ]

sdc/hctx3/busy:479cc2a9 {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT, .state=in_flight, .gstate=0
x11/0x11, .tag=57, .internal_tag=-1, .cmd=Write(10) 2a 00 00 00 61 d2 00 00 08 0
0, .retries=0, .result = 0x0, .flags=TAGGED|INITIALIZED, .timeout=1.000, allocat
ed 1093.150 s ago}

sdc/hctx3/busy:8fd130d5 {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT, .state=in_flight, .gstate=0
xd/0xd, .tag=61, .internal_tag=-1, .cmd=Write(10) 2a 00 00 00 c3 94 00 00 08 00,
 .retries=0, .result = 0x0, .flags=TAGGED|INITIALIZED, .timeout=1.000, allocated
 1093.140 s ago}

As one can see, for some requests MQ_TIMEOUT_EXPIRED is set and .result = 0x5.
That value of .result means that the SCSI error handler has submitted an abort
(see also scmnd->result = DID_ABORT << 16 in drivers/infiniband/ulp/srp/ib_srp.c).
For the last two requests shown above, however, MQ_TIMEOUT_EXPIRED is not set and
the SCSI result has value 0. I think that means it can happen that a request
times out but the timeout handler does not get invoked ...

Thanks,

Bart.
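For reference, the .result decoding referred to above uses the SCSI host byte; a tiny sketch using the standard macros from <scsi/scsi.h> (the helper itself is made up):

	#include <scsi/scsi.h>

	/* scmnd->result = DID_ABORT << 16 puts DID_ABORT in the host byte */
	static bool result_is_abort(int result)
	{
		return host_byte(result) == DID_ABORT;
	}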





Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello,

On Wed, Feb 07, 2018 at 09:02:21PM +, Bart Van Assche wrote:
> The patch that I used in my test had an smp_wmb() call (see also below). 
> Anyway,
> I will see whether I can extract more state information through debugfs.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ef4f6df0f1df..8eb2105d82b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -827,13 +827,9 @@ static void blk_mq_rq_timed_out(struct request *req, 
> bool reserved)
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
> - /*
> -  * As nothing prevents from completion happening while
> -  * ->aborted_gstate is set, this may lead to ignored
> -  * completions and further spurious timeouts.
> -  */
> - blk_mq_rq_update_aborted_gstate(req, 0);
>   blk_add_timer(req);
> + smp_wmb();
> + blk_mq_rq_update_aborted_gstate(req, 0);

Without the matching rmb on the reader side, just adding the wmb won't do
much; but given the default strong ordering on x86 and the other operations
around, what you were seeing is probably not caused by a lack of barriers.

Thanks.

-- 
tejun
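To make the point about barrier pairing concrete, here is a toy sketch (not code from this thread; the variable names are placeholders standing in for the deadline and rq->aborted_gstate): a writer-side smp_wmb() only orders the writer's stores, and the reader needs a matching smp_rmb() to observe them in that order.

	#include <linux/types.h>
	#include <linux/compiler.h>
	#include <asm/barrier.h>

	static u64 deadline;		/* placeholder for the timer deadline */
	static u64 aborted_gstate;	/* placeholder for rq->aborted_gstate */

	/* timeout path: publish the new deadline before clearing the mark */
	static void writer(u64 new_deadline)
	{
		WRITE_ONCE(deadline, new_deadline);
		smp_wmb();			/* order the two stores */
		WRITE_ONCE(aborted_gstate, 0);
	}

	/* completion path: must pair with the writer's smp_wmb() */
	static bool reader(u64 *out)
	{
		if (READ_ONCE(aborted_gstate) != 0)
			return false;
		smp_rmb();			/* pairs with smp_wmb() above */
		*out = READ_ONCE(deadline);
		return true;
	}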


Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-07 Thread Tejun Heo
Hello, Joseph.

On Wed, Feb 07, 2018 at 04:40:02PM +0800, Joseph Qi wrote:
> writeback kworker
>   blkcg_bio_issue_check
> rcu_read_lock
> blkg_lookup
> <<< *race window*
> blk_throtl_bio
>   spin_lock_irq(q->queue_lock)
>   spin_unlock_irq(q->queue_lock)
> rcu_read_unlock
> 
> cgroup_rmdir
>   cgroup_destroy_locked
> kill_css
>   css_killed_ref_fn
> css_killed_work_fn
>   offline_css
> blkcg_css_offline
>   spin_trylock(q->queue_lock)
>   blkg_destroy
>   spin_unlock(q->queue_lock)

Ah, right.  Thanks for spotting the bug.

> Since rcu can only prevent blkg from being released while it is in use,
> blkg->refcnt can drop to 0 during blkg_destroy, which schedules the
> blkg release.
> Then trying to blkg_get in blk_throtl_bio will complain with a WARNING,
> and the corresponding blkg_put will schedule the blkg release again,
> which results in a double free.
> This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
> creation in blkcg_bio_issue_check()"). Before this commit, the code would
> look up first and then look up/create again with queue_lock held. So
> revive this logic to fix the race.

The change seems a bit drastic to me.  Can't we do something like the
following instead?

blk_throtl_bio()
{
... non throttled cases ...

/* out-of-limit, queue to @tg */

/*
 * We can look up and retry but the race window is tiny here.
 * Just letting it through should be good enough.
 */
if (!css_tryget(blkcg->css))
goto out;

... actual queueing ...
css_put(blkcg->css);
...
}

Thanks.

-- 
tejun


[PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
be re-inserted into the active I/O scheduler for that device. As a
consequence, I/O schedulers may get the same request inserted again,
even several times, without a finish_request invoked on that request
before each re-insertion.

This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this fact causes no harm. In contrast, it
confuses a stateful scheduler like BFQ, which keeps state for an I/O
request, until the finish_request hook is invoked on the request. In
particular, BFQ may get stuck, waiting forever for the number of
request dispatches, of the same request, to be balanced by an equal
number of request completions (while there will be one completion for
that request). In this state, BFQ may refuse to serve I/O requests
from other bfq_queues. The hang reported in [1] then follows.

However, the above re-prepared requests undergo a requeue, thus the
requeue_request hook of the active elevator is invoked for these
requests, if set. This commit then addresses the above issue by
properly implementing the hook requeue_request in BFQ.

[1] https://marc.info/?l=linux-block=15127608676

Reported-by: Ivan Kozik 
Reported-by: Alban Browaeys 
Tested-by: Mike Galbraith 
Signed-off-by: Paolo Valente 
Signed-off-by: Serena Ziviani 
---
V2: contains fix to bug reported in [2]
V3: implements the improvement suggested in [3]

[2] https://lkml.org/lkml/2018/2/5/599
[3] https://lkml.org/lkml/2018/2/7/532

 block/bfq-iosched.c | 107 
 1 file changed, 82 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..aeca22d91101 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
}

/*
-* We exploit the bfq_finish_request hook to decrement
-* rq_in_driver, but bfq_finish_request will not be
-* invoked on this request. So, to avoid unbalance,
-* just start this request, without incrementing
-* rq_in_driver. As a negative consequence,
-* rq_in_driver is deceptively lower than it should be
-* while this request is in service. This may cause
-* bfq_schedule_dispatch to be invoked uselessly.
+* We exploit the bfq_finish_requeue_request hook to
+* decrement rq_in_driver, but
+* bfq_finish_requeue_request will not be invoked on
+* this request. So, to avoid unbalance, just start
+* this request, without incrementing rq_in_driver. As
+* a negative consequence, rq_in_driver is deceptively
+* lower than it should be while this request is in
+* service. This may cause bfq_schedule_dispatch to be
+* invoked uselessly.
 *
 * As for implementing an exact solution, the
-* bfq_finish_request hook, if defined, is probably
-* invoked also on this request. So, by exploiting
-* this hook, we could 1) increment rq_in_driver here,
-* and 2) decrement it in bfq_finish_request. Such a
-* solution would let the value of the counter be
-* always accurate, but it would entail using an extra
-* interface function. This cost seems higher than the
-* benefit, being the frequency of non-elevator-private
+* bfq_finish_requeue_request hook, if defined, is
+* probably invoked also on this request. So, by
+* exploiting this hook, we could 1) increment
+* rq_in_driver here, and 2) decrement it in
+* bfq_finish_requeue_request. Such a solution would
+* let the value of the counter be always accurate,
+* but it would entail using an extra interface
+* function. This cost seems higher than the benefit,
+* being the frequency of non-elevator-private
 * requests very low.
 */
goto start_rq;
@@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct 
request_queue *q,
   unsigned int cmd_flags) {}
 #endif

+static void bfq_prepare_request(struct request *rq, struct bio *bio);
+
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request 
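The diff above is cut off by the archive. For orientation, the new hook ends up wired into the scheduler's ops table roughly as sketched below; treat this as a sketch of the shape rather than the applied patch (surrounding fields elided):

	static struct elevator_type iosched_bfq_mq = {
		.ops.mq = {
			.prepare_request	= bfq_prepare_request,
			.requeue_request	= bfq_finish_requeue_request,
			.finish_request		= bfq_finish_requeue_request,
			/* ... remaining callbacks unchanged ... */
		},
		.elevator_name		= "bfq",
		.elevator_owner		= THIS_MODULE,
	};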

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 12:09 -0800, t...@kernel.org wrote:
> Hello,
> 
> On Wed, Feb 07, 2018 at 07:03:56PM +, Bart Van Assche wrote:
> > I tried the above patch but already during the first iteration of the test I
> > noticed that the test hung, probably due to the following request that got 
> > stuck:
> > 
> > $ (cd /sys/kernel/debug/block && grep -aH . */*/*/rq_list)
> > a98cff60 {.op=SCSI_IN, .cmd_flags=, 
> > .rq_flags=MQ_INFLIGHT|PREEMPT|QUIET|IO_STAT|PM,
> >  .state=idle, .tag=22, .internal_tag=-1, .cmd=Synchronize Cache(10) 35 00 
> > 00 00 00 00, .retries=0,
> >  .result = 0x0, .flags=TAGGED, .timeout=60.000, allocated 872.690 s ago}
> 
> I wonder how this happened. We can lose a completion when it races
> against BLK_EH_RESET_TIMER; however, the command should time out later
> because the timer is running again now.  Maybe we actually had the
> memory barrier race that you pointed out in the other message?

Hello Tejun,

The patch that I used in my test had an smp_wmb() call (see also below). Anyway,
I will see whether I can extract more state information through debugfs.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ef4f6df0f1df..8eb2105d82b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -827,13 +827,9 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   smp_wmb();
+   blk_mq_rq_update_aborted_gstate(req, 0);
break;
case BLK_EH_NOT_HANDLED:
break;





Re: [PATCH BUGFIX V2 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> On 07 Feb 2018, at 19:06, Jens Axboe wrote:
> 
> On 2/7/18 11:00 AM, Paolo Valente wrote:
>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>> be re-inserted into the active I/O scheduler for that device. As a
>> consequence, I/O schedulers may get the same request inserted again,
>> even several times, without a finish_request invoked on that request
>> before each re-insertion.
>> 
>> This fact is the cause of the failure reported in [1]. For an I/O
>> scheduler, every re-insertion of the same re-prepared request is
>> equivalent to the insertion of a new request. For schedulers like
>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>> request, until the finish_request hook is invoked on the request. In
>> particular, BFQ may get stuck, waiting forever for the number of
>> request dispatches, of the same request, to be balanced by an equal
>> number of request completions (while there will be one completion for
>> that request). In this state, BFQ may refuse to serve I/O requests
>> from other bfq_queues. The hang reported in [1] then follows.
>> 
>> However, the above re-prepared requests undergo a requeue, thus the
>> requeue_request hook of the active elevator is invoked for these
>> requests, if set. This commit then addresses the above issue by
>> properly implementing the hook requeue_request in BFQ.
>> 
>> [1] https://marc.info/?l=linux-block=15127608676
>> 
>> Reported-by: Ivan Kozik 
>> Reported-by: Alban Browaeys 
>> Tested-by: Mike Galbraith 
>> Signed-off-by: Paolo Valente 
>> Signed-off-by: Serena Ziviani 
>> ---
>> block/bfq-iosched.c | 109 
>> 
>> 1 file changed, 84 insertions(+), 25 deletions(-)
>> 
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 47e6ec7427c4..21e6b9e45638 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
>> blk_mq_hw_ctx *hctx)
>>  }
>> 
>>  /*
>> - * We exploit the bfq_finish_request hook to decrement
>> - * rq_in_driver, but bfq_finish_request will not be
>> - * invoked on this request. So, to avoid unbalance,
>> - * just start this request, without incrementing
>> - * rq_in_driver. As a negative consequence,
>> - * rq_in_driver is deceptively lower than it should be
>> - * while this request is in service. This may cause
>> - * bfq_schedule_dispatch to be invoked uselessly.
>> + * We exploit the bfq_finish_requeue_request hook to
>> + * decrement rq_in_driver, but
>> + * bfq_finish_requeue_request will not be invoked on
>> + * this request. So, to avoid unbalance, just start
>> + * this request, without incrementing rq_in_driver. As
>> + * a negative consequence, rq_in_driver is deceptively
>> + * lower than it should be while this request is in
>> + * service. This may cause bfq_schedule_dispatch to be
>> + * invoked uselessly.
>>   *
>>   * As for implementing an exact solution, the
>> - * bfq_finish_request hook, if defined, is probably
>> - * invoked also on this request. So, by exploiting
>> - * this hook, we could 1) increment rq_in_driver here,
>> - * and 2) decrement it in bfq_finish_request. Such a
>> - * solution would let the value of the counter be
>> - * always accurate, but it would entail using an extra
>> - * interface function. This cost seems higher than the
>> - * benefit, being the frequency of non-elevator-private
>> + * bfq_finish_requeue_request hook, if defined, is
>> + * probably invoked also on this request. So, by
>> + * exploiting this hook, we could 1) increment
>> + * rq_in_driver here, and 2) decrement it in
>> + * bfq_finish_requeue_request. Such a solution would
>> + * let the value of the counter be always accurate,
>> + * but it would entail using an extra interface
>> + * function. This cost seems higher than the benefit,
>> + * being the frequency of non-elevator-private
>>   * requests very low.
>>   */
>>  goto start_rq;
>> @@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct 
>> request_queue *q,
>> unsigned int cmd_flags) {}
>> #endif
>> 
>> +static void bfq_prepare_request(struct 

Re: Bcache patches for inclusion in 4.16

2018-02-07 Thread Jens Axboe
On 2/7/18 12:41 PM, Michael Lyle wrote:
> Hi Jens---
> 
> Here are a few more changes for 4.16 from Coly Li and Tang Junhui.
> 
> Probably the most "functional" one is 2/8, which makes a change to address
> a performance problem under sustained small writes.  Still, it's a small,
> safe change that seems good to take.  The rest are small bugfixes.
> 
> All have seen some degree of test (which is, of course, always ongoing).
> 
> Note that there's one checkpatch error in this, but it was already present
> and is just triggered by altering an adjacent line.
> 
>  9 files changed, 111 insertions(+), 36 deletions(-)
> 
> Coly Li (3):
>   bcache: properly set task state in bch_writeback_thread()
>   bcache: set error_limit correctly
>   bcache: set writeback_rate_update_seconds in range [1, 60] seconds
> 
> Tang Junhui (5):
>   bcache: add journal statistic
>   bcache: fix high CPU occupancy during journal
>   bcache: fix for allocator and register thread race
>   bcache: return attach error when no cache set exist
>   bcache: fix for data collapse after re-attaching an attached device

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello,

On Wed, Feb 07, 2018 at 07:03:56PM +, Bart Van Assche wrote:
> I tried the above patch but already during the first iteration of the test I
> noticed that the test hung, probably due to the following request that got 
> stuck:
> 
> $ (cd /sys/kernel/debug/block && grep -aH . */*/*/rq_list)
> a98cff60 {.op=SCSI_IN, .cmd_flags=, 
> .rq_flags=MQ_INFLIGHT|PREEMPT|QUIET|IO_STAT|PM,
>  .state=idle, .tag=22, .internal_tag=-1, .cmd=Synchronize Cache(10) 35 00 00 
> 00 00 00, .retries=0,
>  .result = 0x0, .flags=TAGGED, .timeout=60.000, allocated 872.690 s ago}

I wonder how this happened. We can lose a completion when it races
against BLK_EH_RESET_TIMER; however, the command should time out later
because the timer is running again now.  Maybe we actually had the
memory barrier race that you pointed out in the other message?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, Bart.

On Wed, Feb 07, 2018 at 06:14:13PM +, Bart Van Assche wrote:
> When I wrote my comment I was not sure whether or not non-reentrancy is
> guaranteed for work queue items. However, according to what I found in the
> workqueue implementation I think that is guaranteed. So it shouldn't be
> possible that the timer activated by blk_add_timer() gets handled before
> aborted_gstate is reset. But since the timeout handler and completion

Yeah, we're basically single threaded in the timeout path.

> handlers can be executed by different CPUs, shouldn't a memory barrier be
> inserted between the blk_add_timer() call and resetting aborted_gstate to
> guarantee that a completion cannot occur before blk_add_timer() has reset
> RQF_MQ_TIMEOUT_EXPIRED?

Ah, you're right.  u64_stat_sync doesn't imply barriers, so we want
something like the following.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..d6edf3b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -593,7 +593,7 @@ static void blk_mq_rq_update_aborted_gstate(struct request 
*rq, u64 gstate)
 */
local_irq_save(flags);
u64_stats_update_begin(&rq->aborted_gstate_sync);
-   rq->aborted_gstate = gstate;
+   smp_store_release(&rq->aborted_gstate, gstate);
u64_stats_update_end(&rq->aborted_gstate_sync);
local_irq_restore(flags);
 }
@@ -605,7 +605,7 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 
do {
start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-   aborted_gstate = rq->aborted_gstate;
+   aborted_gstate = smp_load_acquire(&rq->aborted_gstate);
} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
 
return aborted_gstate;
@@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 * ->aborted_gstate is set, this may lead to ignored
 * completions and further spurious timeouts.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   blk_mq_rq_update_aborted_gstate(req, 0);
break;
case BLK_EH_NOT_HANDLED:
break;


[PATCH 7/8] bcache: return attach error when no cache set exist

2018-02-07 Thread Michael Lyle
From: Tang Junhui 

I attached a back-end device to a cache set while the cache set was not
registered yet; the back-end device did not attach successfully, but no
error was returned:
[root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
[root]#

In sysfs_attach(), the return value "v" is initialized to "size" at the
beginning, and if no cache set exists in bch_cache_sets, the "v" value
never changes and is returned to sysfs; sysfs regards it as success
since "size" is a positive number.

This patch fixes this issue by assigning "v" the value -ENOENT in its
initialization.

Signed-off-by: Tang Junhui 
Reviewed-by: Michael Lyle 
---
 drivers/md/bcache/sysfs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 4a6a697e1680..a7552f509f50 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -198,7 +198,7 @@ STORE(__cached_dev)
 {
struct cached_dev *dc = container_of(kobj, struct cached_dev,
 disk.kobj);
-   ssize_t v = size;
+   ssize_t v;
struct cache_set *c;
struct kobj_uevent_env *env;
 
@@ -275,6 +275,7 @@ STORE(__cached_dev)
if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
return -EINVAL;
 
+   v = -ENOENT;
list_for_each_entry(c, &bch_cache_sets, list) {
v = bch_cached_dev_attach(dc, c);
if (!v)
@@ -282,7 +283,7 @@ STORE(__cached_dev)
}
 
pr_err("Can't attach %s: cache set not found", buf);
-   size = v;
+   return v;
}
 
if (attr == &sysfs_detach && dc->disk.c)
-- 
2.14.1



[PATCH 5/8] bcache: fix for allocator and register thread race

2018-02-07 Thread Michael Lyle
From: Tang Junhui 

After a long run of random small-IO writes, I rebooted the machine,
and after it powered on I found bcache stuck; the stacks are:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[] closure_sync+0x25/0x90 [bcache]
[] bch_journal+0x118/0x2b0 [bcache]
[] bch_journal_meta+0x47/0x70 [bcache]
[] bch_prio_write+0x237/0x340 [bcache]
[] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[] kthread+0xcf/0xe0
[] ret_from_fork+0x58/0x90
[] 0x
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[] bch_btree_insert+0xf1/0x170 [bcache]
[] bch_journal_replay+0x13f/0x230 [bcache]
[] run_cache_set+0x79a/0x7c2 [bcache]
[] register_bcache+0xd48/0x1310 [bcache]
[] kobj_attr_store+0xf/0x20
[] sysfs_write_file+0xc6/0x140
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1
The stacks show that the register thread and the allocator thread
got stuck while registering the cache device.

I rebooted the machine several times; the issue was always present
on this machine.

I debugged the code and found the call trace below:
register_bcache()
  ==> run_cache_set()
    ==> bch_journal_replay()
      ==> bch_btree_insert()
        ==> __bch_btree_map_nodes()
          ==> btree_insert_fn()
            ==> btree_split()  // node needs split
              ==> btree_check_reserve()
In btree_check_reserve(), it checks whether there are enough buckets
of RESERVE_BTREE type. Since the allocator thread has not started
working yet, no RESERVE_BTREE buckets have been allocated, so the
register thread waits on c->btree_cache_wait and goes to sleep.

Then the allocator thread is initialized; its call trace is below:
bch_allocator_thread()
  ==> bch_prio_write()
    ==> bch_journal_meta()
      ==> bch_journal()
        ==> journal_wait_for_write()
In journal_wait_for_write(), it checks whether the journal is full via
journal_full(). The long run of random small-IO writes has exhausted
the journal buckets (journal.blocks_free=0), so in order to release
journal buckets the allocator calls btree_flush_write() to flush keys
to btree nodes, and waits on c->journal.wait until the btree node
writes finish or some journal bucket space becomes available; then the
allocator thread goes to sleep. But in btree_flush_write(), since
bch_journal_replay() has not finished, no btree node has a journal
entry attached (the condition "if (btree_current_write(b)->journal)"
is never satisfied), so there is no btree node to flush, no journal
bucket is released, and the allocator sleeps forever.

Through the above analysis, we can see that:
1) The register thread waits for the allocator thread to allocate buckets
   of RESERVE_BTREE type;
2) The allocator thread waits for the register thread to replay the
   journal, so it can flush btree nodes and reclaim a journal bucket.
So they both get stuck waiting for each other.

Hua Rui provided me a patch that allocates some RESERVE_BTREE buckets in
advance, so the register thread can get a bucket when a btree node splits
and does not need to wait for the allocator thread. I tested it; it
helped, and the register thread made a step forward, but eventually still
got stuck: only 8 RESERVE_BTREE buckets were allocated, and in
bch_journal_replay(), after 2 btree node splits only 4 RESERVE_BTREE
buckets were left, so btree_check_reserve() was no longer satisfied and
the register thread went to sleep again. At the same time the allocator
thread had not flushed enough btree nodes to release a journal bucket, so
they both got stuck again.

So we need to allocate more RESERVE_BTREE buckets in advance, but how
many are enough?  From experience and testing, I think it should be as
many as the journal buckets. I modified the code as in this patch, tested
it on the machine, and it works.

This patch is based on Hua Rui's patch, and allocates more RESERVE_BTREE
buckets in advance so that the register thread and the allocator thread
do not end up waiting for each other.

[patch v2] ca->sb.njournal_buckets is 0 the first time after cache
creation, and no journal exists yet, so just 8 btree buckets is OK.

Signed-off-by: Hua Rui 
Signed-off-by: Tang Junhui 
Reviewed-by: Michael Lyle 
---
 drivers/md/bcache/btree.c |  9 ++---
 drivers/md/bcache/super.c | 13 -
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index bf3a48aa9a9a..fad9fe8817eb 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1869,14 +1869,17 @@ void bch_initial_gc_finish(struct cache_set *c)
 */
for_each_cache(ca, c, i) {
for_each_bucket(b, ca) {
-   if (fifo_full(&ca->free[RESERVE_PRIO]))
+   if (fifo_full(&ca->free[RESERVE_PRIO]) &&
+       fifo_full(&ca->free[RESERVE_BTREE]))
break;
 

[PATCH 8/8] bcache: fix for data collapse after re-attaching an attached device

2018-02-07 Thread Michael Lyle
From: Tang Junhui 

Back-end device sdm has already been attached to a cache set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119; then we try to attach it to
another cache set, and it returns an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
-bash: echo: write error: Invalid argument

After that, execute a command to modify the label of the bcache
device:
[root]# echo data_disk1 > label

Then we reboot the system. When the system powers on, the back-end
device cannot attach to the cache set, and a message shows up in the log:
Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set

In sysfs_attach(), dc->sb.set_uuid was assigned the value passed in
through sysfs, regardless of whether bch_cached_dev_attach() succeeded.
For example, if the back-end device had already attached to a cache set,
bch_cached_dev_attach() would fail, but dc->sb.set_uuid had already been
changed. Modifying the label of the bcache device then calls
bch_write_bdev_super(), which writes dc->sb.set_uuid to the super block,
so a wrong cache set ID is recorded in the super block. After the system
reboots, the cache set cannot find the uuid of the back-end device, so
the bcache device no longer exists and cannot be used any more.

In this patch, we no longer assign the cache set ID to dc->sb.set_uuid
in sysfs_attach() directly; instead we pass it into bch_cached_dev_attach(),
and assign dc->sb.set_uuid to the cache set ID only after the back-end
device has attached to the cache set successfully.

Signed-off-by: Tang Junhui 
Reviewed-by: Michael Lyle 
---
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  | 10 ++
 drivers/md/bcache/sysfs.c  |  6 --
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index b8c2e1bef1f1..12e5197f186c 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -924,7 +924,7 @@ void bcache_write_super(struct cache_set *);
 
 int bch_flash_dev_create(struct cache_set *c, uint64_t size);
 
-int bch_cached_dev_attach(struct cached_dev *, struct cache_set *);
+int bch_cached_dev_attach(struct cached_dev *, struct cache_set *, uint8_t *);
 void bch_cached_dev_detach(struct cached_dev *);
 void bch_cached_dev_run(struct cached_dev *);
 void bcache_device_stop(struct bcache_device *);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a2ad37a8afc0..312895788036 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -957,7 +957,8 @@ void bch_cached_dev_detach(struct cached_dev *dc)
cached_dev_put(dc);
 }
 
-int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
+int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
+ uint8_t *set_uuid)
 {
uint32_t rtime = cpu_to_le32(get_seconds());
struct uuid_entry *u;
@@ -965,7 +966,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct 
cache_set *c)
 
bdevname(dc->bdev, buf);
 
-   if (memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16))
+   if ((set_uuid && memcmp(set_uuid, c->sb.set_uuid, 16)) ||
+   (!set_uuid && memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16)))
return -ENOENT;
 
if (dc->disk.c) {
@@ -1194,7 +1196,7 @@ static void register_bdev(struct cache_sb *sb, struct 
page *sb_page,
 
list_add(&dc->list, &uncached_devices);
list_for_each_entry(c, &bch_cache_sets, list)
-   bch_cached_dev_attach(dc, c);
+   bch_cached_dev_attach(dc, c, NULL);
 
if (BDEV_STATE(&dc->sb) == BDEV_STATE_NONE ||
    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
@@ -1716,7 +1718,7 @@ static void run_cache_set(struct cache_set *c)
bcache_write_super(c);
 
list_for_each_entry_safe(dc, t, &uncached_devices, list)
-   bch_cached_dev_attach(dc, c);
+   bch_cached_dev_attach(dc, c, NULL);
 
flash_devs_run(c);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index a7552f509f50..78cd7bd50fdd 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -272,12 +272,14 @@ STORE(__cached_dev)
}
 
if (attr == &sysfs_attach) {
-   if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
+   uint8_t set_uuid[16];
+
+   if (bch_parse_uuid(buf, set_uuid) < 16)
return -EINVAL;
 
v = -ENOENT;
list_for_each_entry(c, &bch_cache_sets, list) {
-   v = bch_cached_dev_attach(dc, c);
+   v = bch_cached_dev_attach(dc, c, set_uuid);
if (!v)
return size;
}
-- 
2.14.1



[PATCH 4/8] bcache: set error_limit correctly

2018-02-07 Thread Michael Lyle
From: Coly Li 

Struct cache uses io_errors for two purposes,
- Error decay: when cache set error_decay is set, io_errors is used to
  generate a small piece of delay when I/O error happens.
- I/O errors counter: in order to generate big enough value for error
  decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a
  IO_ERROR_SHIFT).

In function bch_count_io_errors(), if the I/O error counter reaches the
cache set error limit, bch_cache_set_error() will be called to retire the
whole cache set. But the current code is problematic when checking the
error limit; see the following piece of code from bch_count_io_errors(),

 90 if (error) {
 91 char buf[BDEVNAME_SIZE];
 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
 93 &ca->io_errors);
 94 errors >>= IO_ERROR_SHIFT;
 95
 96 if (errors < ca->set->error_limit)
 97 pr_err("%s: IO error on %s, recovering",
 98bdevname(ca->bdev, buf), m);
 99 else
100 bch_cache_set_error(ca->set,
101 "%s: too many IO errors %s",
102 bdevname(ca->bdev, buf), m);
103 }

At line 94, errors is shifted right by IO_ERROR_SHIFT bits, so it is the
real error counter that is compared at line 96. But ca->set->error_limit
is initialized with an amplified value in bch_cache_set_alloc(),
1545 c->error_limit  = 8 << IO_ERROR_SHIFT;

It means that by default, in bch_count_io_errors(), bch_cache_set_error()
won't be called to retire the problematic cache device until 8<<20 errors
have happened. If the average request size is 64KB, bcache won't handle a
failed device until 512GB of data has been requested. This is too large
to be an I/O threshold, so I believe the correct error limit should be
much lower.

This patch sets the default cache set error limit to 8; then in
bch_count_io_errors(), when the error counter reaches 8 (if the default
value is used), bch_cache_set_error() is called to retire the whole
cache set. This patch also removes the bit shifting when storing or
showing the io_error_limit value via the sysfs interface.

Nowadays most SSDs handle internal flash failures automatically by remapping
LBA addresses. If an I/O error can be observed by upper-layer code, it is a
notable error, because the SSD could not remap the problematic LBA address to
an available flash block. This situation indicates that the whole SSD will
fail very soon. Therefore setting 8 as the default I/O error limit makes
sense; it is enough for most cache devices.
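
In other words (illustrative only, condensed from the code piece above and
the diff below, not the literal kernel code), the retire check effectively
changes from

	/* before: the limit is pre-shifted, so ~8.4 million real errors
	 * are needed before the cache set is retired
	 */
	if (errors >= 8 << IO_ERROR_SHIFT)
		bch_cache_set_error(ca->set, "too many IO errors");

to

	/* after: the limit holds the real count, 8 real errors are enough */
	if (errors >= 8)
		bch_cache_set_error(ca->set, "too many IO errors");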

Changelog:
v2: add reviewed-by from Hannes.
v1: initial version for review.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Tang Junhui 
Reviewed-by: Michael Lyle 
Cc: Junhui Tang 
---
 drivers/md/bcache/bcache.h | 1 +
 drivers/md/bcache/super.c  | 2 +-
 drivers/md/bcache/sysfs.c  | 4 ++--
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index a857eb3c10de..b8c2e1bef1f1 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -666,6 +666,7 @@ struct cache_set {
ON_ERROR_UNREGISTER,
ON_ERROR_PANIC,
}   on_error;
+#define DEFAULT_IO_ERROR_LIMIT 8
	unsigned		error_limit;
	unsigned		error_decay;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 133b81225ea9..e29150ec17e7 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1553,7 +1553,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 
c->congested_read_threshold_us  = 2000;
c->congested_write_threshold_us = 20000;
-   c->error_limit  = 8 << IO_ERROR_SHIFT;
+   c->error_limit  = DEFAULT_IO_ERROR_LIMIT;
 
return c;
 err:
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 46e7a6b3e7c7..c524305cc9a7 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -568,7 +568,7 @@ SHOW(__bch_cache_set)
 
/* See count_io_errors for why 88 */
sysfs_print(io_error_halflife,  c->error_decay * 88);
-   sysfs_print(io_error_limit, c->error_limit >> IO_ERROR_SHIFT);
+   sysfs_print(io_error_limit, c->error_limit);
 
sysfs_hprint(congested,
 ((uint64_t) bch_get_congested(c)) << 9);
@@ -668,7 +668,7 @@ STORE(__bch_cache_set)
}
 
if (attr == &sysfs_io_error_limit)
-   c->error_limit = strtoul_or_return(buf) << IO_ERROR_SHIFT;
+   c->error_limit = strtoul_or_return(buf);
 
/* See count_io_errors() for why 88 */
if (attr == &sysfs_io_error_halflife)
-- 
2.14.1



[PATCH 3/8] bcache: properly set task state in bch_writeback_thread()

2018-02-07 Thread Michael Lyle
From: Coly Li 

Kernel thread routine bch_writeback_thread() has the following code block,

447 down_write(&dc->writeback_lock);
448~450 if (check conditions) {
451 up_write(&dc->writeback_lock);
452 set_current_state(TASK_INTERRUPTIBLE);
453
454 if (kthread_should_stop())
455 return 0;
456
457 schedule();
458 continue;
459 }

If the condition check is true, the task state is set to TASK_INTERRUPTIBLE
and schedule() is called to wait for another process to wake it up.

There are 2 issues in the current code:
1, The task state is set to TASK_INTERRUPTIBLE after the condition checks. If
   another process changes the condition and calls wake_up_process(dc->
   writeback_thread) before line 452, the task state is set back to
   TASK_INTERRUPTIBLE at line 452 and the writeback kernel thread loses that
   chance to be woken up.
2, At line 454, if kthread_should_stop() is true, the writeback kernel thread
   will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
   call do_exit(). It is not good to enter do_exit() with task state
   TASK_INTERRUPTIBLE; in the following code path might_sleep() is called and a
   warning message is reported by __might_sleep(): "WARNING: do not call
   blocking ops when !TASK_RUNNING; state=1 set at []".

For the first issue, the task state should be set before the condition checks.
Indeed, because dc->writeback_lock is required when modifying all the
conditions, calling set_current_state() inside the code block where
dc->writeback_lock is held is safe. But this is quite implicit, so I still
move set_current_state() before all the condition checks.

For the second issue, frankly speaking, it does not hurt when a kernel thread
exits in TASK_INTERRUPTIBLE state, but this warning message scares users and
makes them feel there might be something risky with bcache that could hurt
their data. Setting the task state to TASK_RUNNING before returning fixes this
problem.

In alloc.c:allocator_wait() there is a similar issue, and it is also fixed in
this patch.
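
For reference, the resulting wait pattern looks like this (an illustrative
sketch with hypothetical helpers, not the exact bcache code):

	while (!kthread_should_stop()) {
		/*
		 * Publish the task state *before* checking the sleep
		 * condition, so a concurrent wake_up_process() cannot
		 * be lost.
		 */
		set_current_state(TASK_INTERRUPTIBLE);

		if (!work_available()) {	/* hypothetical condition */
			if (kthread_should_stop()) {
				/* never exit the thread in TASK_INTERRUPTIBLE */
				set_current_state(TASK_RUNNING);
				return 0;
			}
			schedule();
			continue;
		}

		set_current_state(TASK_RUNNING);
		do_work();			/* hypothetical work */
	}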

Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Michael Lyle 
Cc: Michael Lyle 
Cc: Junhui Tang 
---
 drivers/md/bcache/alloc.c | 4 +++-
 drivers/md/bcache/writeback.c | 7 +--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 6cc6c0f9c3a9..458e1d38577d 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -287,8 +287,10 @@ do {   
\
break;  \
\
mutex_unlock(&(ca)->set->bucket_lock);  \
-   if (kthread_should_stop())  \
+   if (kthread_should_stop()) {\
+   set_current_state(TASK_RUNNING);\
return 0;   \
+   }   \
\
schedule(); \
mutex_lock(&(ca)->set->bucket_lock);\
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 51306a19ab03..58218f7e77c3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -564,18 +564,21 @@ static int bch_writeback_thread(void *arg)
 
while (!kthread_should_stop()) {
down_write(&dc->writeback_lock);
+   set_current_state(TASK_INTERRUPTIBLE);
if (!atomic_read(&dc->has_dirty) ||
    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
     !dc->writeback_running)) {
up_write(&dc->writeback_lock);
-   set_current_state(TASK_INTERRUPTIBLE);
 
-   if (kthread_should_stop())
+   if (kthread_should_stop()) {
+   set_current_state(TASK_RUNNING);
return 0;
+   }
 
schedule();
continue;
}
+   set_current_state(TASK_RUNNING);
 
searched_full_index = refill_dirty(dc);
 
-- 
2.14.1



[PATCH 1/8] bcache: add journal statistic

2018-02-07 Thread Michael Lyle
From: Tang Junhui 

Sometimes the journal takes up a lot of CPU, and we need statistics to know
what the journal is doing. So this patch provides some journal statistics:
1) reclaim: how many times the journal tries to reclaim resources, usually
   because the journal buckets and/or the pins are exhausted.
2) flush_write: how many times the journal tries to flush a btree node to the
   cache device, usually because the journal buckets are exhausted.
3) retry_flush_write: how many times the journal retries flushing the next
   btree node, usually because the previous btree node has already been
   flushed by another thread.
These statistics are exposed via the sysfs interface. Through them we can see
the status of the journal module when CPU usage is too high.

Signed-off-by: Tang Junhui 
Reviewed-by: Michael Lyle 
---
 drivers/md/bcache/bcache.h  |  4 
 drivers/md/bcache/journal.c |  5 +
 drivers/md/bcache/sysfs.c   | 15 +++
 3 files changed, 24 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5e2d4e80198e..b98d7705acb6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -658,6 +658,10 @@ struct cache_set {
atomic_long_t   writeback_keys_done;
atomic_long_t   writeback_keys_failed;
 
+   atomic_long_t   reclaim;
+   atomic_long_t   flush_write;
+   atomic_long_t   retry_flush_write;
+
enum{
ON_ERROR_UNREGISTER,
ON_ERROR_PANIC,
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a87165c1d8e5..f5296007a9d5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -377,6 +377,8 @@ static void btree_flush_write(struct cache_set *c)
 */
struct btree *b, *best;
unsigned i;
+
+   atomic_long_inc(&c->flush_write);
 retry:
best = NULL;
 
@@ -397,6 +399,7 @@ static void btree_flush_write(struct cache_set *c)
if (!btree_current_write(b)->journal) {
mutex_unlock(&b->write_lock);
/* We raced */
+   atomic_long_inc(&c->retry_flush_write);
goto retry;
}
 
@@ -476,6 +479,8 @@ static void journal_reclaim(struct cache_set *c)
unsigned iter, n = 0;
atomic_t p;
 
+   atomic_long_inc(&c->reclaim);
+
while (!atomic_read(&fifo_front(&c->journal.pin)))
fifo_pop(&c->journal.pin, p);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index b4184092c727..46e7a6b3e7c7 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -65,6 +65,9 @@ read_attribute(bset_tree_stats);
 
 read_attribute(state);
 read_attribute(cache_read_races);
+read_attribute(reclaim);
+read_attribute(flush_write);
+read_attribute(retry_flush_write);
 read_attribute(writeback_keys_done);
 read_attribute(writeback_keys_failed);
 read_attribute(io_errors);
@@ -545,6 +548,15 @@ SHOW(__bch_cache_set)
sysfs_print(cache_read_races,
atomic_long_read(&c->cache_read_races));
 
+   sysfs_print(reclaim,
+   atomic_long_read(&c->reclaim));
+
+   sysfs_print(flush_write,
+   atomic_long_read(&c->flush_write));
+
+   sysfs_print(retry_flush_write,
+   atomic_long_read(&c->retry_flush_write));
+
sysfs_print(writeback_keys_done,
atomic_long_read(&c->writeback_keys_done));
sysfs_print(writeback_keys_failed,
atomic_long_read(&c->writeback_keys_failed));
@@ -731,6 +743,9 @@ static struct attribute *bch_cache_set_internal_files[] = {
 
&sysfs_bset_tree_stats,
&sysfs_cache_read_races,
+   &sysfs_reclaim,
+   &sysfs_flush_write,
+   &sysfs_retry_flush_write,
&sysfs_writeback_keys_done,
&sysfs_writeback_keys_failed,
 
-- 
2.14.1



[PATCH 6/8] bcache: set writeback_rate_update_seconds in range [1, 60] seconds

2018-02-07 Thread Michael Lyle
From: Coly Li 

dc->writeback_rate_update_seconds can be set via sysfs and its value can be
set to anything in [1, ULONG_MAX]. It does not make sense to allow such a
large value; 60 seconds is long enough, considering that the default 5 seconds
has worked well for a long time.

Because dc->writeback_rate_update is a special delayed work, it re-arms itself
inside the delayed work routine update_writeback_rate(). When stopping it with
cancel_delayed_work_sync(), there should be a timeout to wait for, to make
sure the re-armed delayed work is stopped too. A small maximum value of
dc->writeback_rate_update_seconds also helps to pick a reasonably small
timeout.

This patch limits the sysfs interface to setting dc->writeback_rate_update_seconds
in the range of [1, 60] seconds, and replaces the hand-coded numbers with
macros.

Changelog:
v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
v1: initial version.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Michael Lyle 
---
 drivers/md/bcache/sysfs.c | 4 +++-
 drivers/md/bcache/writeback.c | 2 +-
 drivers/md/bcache/writeback.h | 3 +++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index c524305cc9a7..4a6a697e1680 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -218,7 +218,9 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_rate,
dc->writeback_rate.rate, 1, INT_MAX);
 
-   d_strtoul_nonzero(writeback_rate_update_seconds);
+   sysfs_strtoul_clamp(writeback_rate_update_seconds,
+   dc->writeback_rate_update_seconds,
+   1, WRITEBACK_RATE_UPDATE_SECS_MAX);
d_strtoul(writeback_rate_i_term_inverse);
d_strtoul_nonzero(writeback_rate_p_term_inverse);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 58218f7e77c3..f1d2fc15abcc 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -655,7 +655,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate.rate = 1024;
dc->writeback_rate_minimum  = 8;
 
-   dc->writeback_rate_update_seconds = 5;
+   dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
dc->writeback_rate_p_term_inverse = 40;
dc->writeback_rate_i_term_inverse = 10000;
 
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 66f1c527fa24..587b25599856 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -8,6 +8,9 @@
 #define MAX_WRITEBACKS_IN_PASS  5
 #define MAX_WRITESIZE_IN_PASS   5000   /* *512b */
 
+#define WRITEBACK_RATE_UPDATE_SECS_MAX 60
+#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT 5
+
 /*
  * 14 (16384ths) is chosen here as something that each backing device
  * should be a reasonable fraction of the share, and not to blow up
-- 
2.14.1



[PATCH 2/8] bcache: fix high CPU occupancy during journal

2018-02-07 Thread Michael Lyle
From: Tang Junhui 

After running small-write I/O for a long time, we found that CPU occupancy
became very high and I/O performance had been reduced by about half:
[root@ceph151 internal]# top
top - 15:51:05 up 1 day, 2:43,  4 users,  load average: 16.89, 15.15, 16.53
Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us, 17.1 sy,  0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free,0 used. 25136812 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2023 root 20  0   0  0  0 S 55.1  0.0   0:04.42 kworker/11:191
14126 root 20  0   0  0  0 S 42.9  0.0   0:08.72 kworker/10:3
 9292 root 20  0   0  0  0 S 30.4  0.0   1:10.99 kworker/6:1
 8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
12287 root 20  0   0  0  0 S 26.7  0.0   0:28.13 kworker/7:85
31019 root 20  0   0  0  0 S 26.1  0.0   1:30.79 kworker/22:1
 1787 root 20  0   0  0  0 R 25.7  0.0   5:18.45 kworker/8:7
32169 root 20  0   0  0  0 S 14.5  0.0   1:01.92 kworker/23:1
21476 root 20  0   0  0  0 S 13.9  0.0   0:05.09 kworker/1:54
 2204 root 20  0   0  0  0 S 12.5  0.0   1:25.17 kworker/9:10
16994 root 20  0   0  0  0 S 12.2  0.0   0:06.27 kworker/5:106
15714 root 20  0   0  0  0 R 10.9  0.0   0:01.85 kworker/19:2
 9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
 9960 root 20  0   0  0  0 S 10.2  0.0   0:02.75 kworker/2:139
11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
 6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd

The kernel workers consumed a lot of CPU, and I found they were running
journal work: the journal was reclaiming resources and flushing btree nodes
with surprising frequency.

Through further analysis, we found that in btree_flush_write() we try to find
the btree node with the smallest fifo index to flush by traversing all the
btree nodes in c->bucket_hash. After we get it, since no lock protects it,
this btree node may already have been written to the cache device by other
work; when this happens, we traverse c->bucket_hash again and pick another
btree node. When the problem occurred, the retry count was very high, and we
spent a lot of CPU looking for an appropriate btree node.

In this patch we record the 128 btree nodes with the smallest fifo index in a
heap, and pop them one by one when we need to flush a btree node. This greatly
reduces the time the loop spends finding an appropriate btree node, and also
reduces CPU occupancy.
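
The idea is the classic bounded-selection trick; the actual patch uses
bcache's heap helpers declared via DECLARE_HEAP(), but a standalone sketch
(illustrative userspace C only, not the bcache code) of keeping the N
smallest keys in a max-heap looks like this:

	#define NKEEP 128

	static unsigned heap[NKEEP];	/* max-heap of the NKEEP smallest keys */
	static int heap_used;

	static void swap_u(unsigned *a, unsigned *b)
	{
		unsigned t = *a; *a = *b; *b = t;
	}

	/* offer one key; keep only the NKEEP smallest keys seen so far */
	static void heap_offer(unsigned key)
	{
		int i;

		if (heap_used < NKEEP) {
			/* heap not full yet: append and sift up */
			i = heap_used++;
			heap[i] = key;
			while (i && heap[(i - 1) / 2] < heap[i]) {
				swap_u(&heap[i], &heap[(i - 1) / 2]);
				i = (i - 1) / 2;
			}
		} else if (key < heap[0]) {
			/* smaller than the largest kept key: replace the
			 * root and sift down
			 */
			heap[0] = key;
			i = 0;
			for (;;) {
				int l = 2 * i + 1, r = l + 1, big = i;

				if (l < heap_used && heap[l] > heap[big])
					big = l;
				if (r < heap_used && heap[r] > heap[big])
					big = r;
				if (big == i)
					break;
				swap_u(&heap[i], &heap[big]);
				i = big;
			}
		}
	}

One pass of heap_offer() over all btree nodes is O(n log N), after which the
flush code can simply pop candidates instead of rescanning c->bucket_hash on
every retry.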

[note by mpl: this triggers a checkpatch error because of adjacent,
pre-existing style violations]

Signed-off-by: Tang Junhui 
Reviewed-by: Michael Lyle 
---
 drivers/md/bcache/bcache.h  |  2 ++
 drivers/md/bcache/journal.c | 47 ++---
 drivers/md/bcache/util.h|  2 ++
 3 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index b98d7705acb6..a857eb3c10de 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -679,6 +679,8 @@ struct cache_set {
 
 #define BUCKET_HASH_BITS   12
struct hlist_head   bucket_hash[1 << BUCKET_HASH_BITS];
+
+   DECLARE_HEAP(struct btree *, flush_btree);
 };
 
 struct bbio {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index f5296007a9d5..1b736b860739 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -368,6 +368,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 }
 
 /* Journalling */
+#define journal_max_cmp(l, r) \
+   (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
+fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
+#define journal_min_cmp(l, r) \
+   (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
+fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
 
 static void btree_flush_write(struct cache_set *c)
 {
@@ -375,25 +381,35 @@ static void btree_flush_write(struct cache_set *c)
 * Try to find the btree node with that references the oldest journal
 * entry, best is our current candidate and is locked if non NULL:
 */
-   struct btree *b, *best;
-   unsigned i;
+   struct btree *b;
+   int i;
 
atomic_long_inc(&c->flush_write);
+
 retry:
-   best = NULL;
-
-   for_each_cached_btree(b, c, i)
-   if (btree_current_write(b)->journal) {
-   if (!best)
-   best = b;
-   else if (journal_pin_cmp(c,
-   

Bcache patches for inclusion in 4.16

2018-02-07 Thread Michael Lyle
Hi Jens---

Here are a few more changes for 4.16 from Coly Li and Tang Junhui.

Probably the most "functional" one is 2/8, which makes a change to fix a
performance problem under sustained small writes.  Still, it's a small, safe
change that seems good to take.  The rest are small bugfixes.

All have seen some degree of test (which is, of course, always ongoing).

Note that there's one checkpatch error in this, but it was already present
and is just triggered by altering an adjacent line.

 9 files changed, 111 insertions(+), 36 deletions(-)

Coly Li (3):
  bcache: properly set task state in bch_writeback_thread()
  bcache: set error_limit correctly
  bcache: set writeback_rate_update_seconds in range [1, 60] seconds

Tang Junhui (5):
  bcache: add journal statistic
  bcache: fix high CPU occupancy during journal
  bcache: fix for allocator and register thread race
  bcache: return attach error when no cache set exist
  bcache: fix for data collapse after re-attaching an attached device

Thanks, as always,

Mike


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 09:06 -0800, Tejun Heo wrote:
> Can you see whether by any chance the following patch fixes the issue?
> If not, can you share the repro case?
> 
> Thanks.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df93102..651d18c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
> reserved)
>* ->aborted_gstate is set, this may lead to ignored
>* completions and further spurious timeouts.
>*/
> - blk_mq_rq_update_aborted_gstate(req, 0);
>   blk_add_timer(req);
> + blk_mq_rq_update_aborted_gstate(req, 0);
>   break;
>   case BLK_EH_NOT_HANDLED:
>   break;

Hello Tejun,

I tried the above patch but already during the first iteration of the test I
noticed that the test hung, probably due to the following request that got 
stuck:

$ (cd /sys/kernel/debug/block && grep -aH . */*/*/rq_list)
a98cff60 {.op=SCSI_IN, .cmd_flags=, 
.rq_flags=MQ_INFLIGHT|PREEMPT|QUIET|IO_STAT|PM,
 .state=idle, .tag=22, .internal_tag=-1, .cmd=Synchronize Cache(10) 35 00 00 00 
00 00, .retries=0,
 .result = 0x0, .flags=TAGGED, .timeout=60.000, allocated 872.690 s ago}

Thanks,

Bart.




Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 09:35 -0800, t...@kernel.org wrote:
> On Wed, Feb 07, 2018 at 05:27:10PM +, Bart Van Assche wrote:
> > Even with the above change I think that there is still a race between the
> > code that handles timer resets and the completion handler.
> 
> Can you elaborate the scenario a bit further?  If you're referring to
> lost completions, we've always had that and while we can try to close
> that hole too, I don't think it's a critical issue.

Hello Tejun,

When I wrote my comment I was not sure whether or not non-reentrancy is
guaranteed for workqueue items. However, according to what I found in the
workqueue implementation, I think that is guaranteed. So it shouldn't be
possible that the timer activated by blk_add_timer() gets handled before
aborted_gstate is reset. But since the timeout handler and completion
handlers can be executed by different CPUs, shouldn't a memory barrier be
inserted between the blk_add_timer() call and the resetting of aborted_gstate
to guarantee that a completion cannot occur before blk_add_timer() has reset
RQF_MQ_TIMEOUT_EXPIRED?
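
Concretely, something like the following is what I have in mind (just an
untested sketch on top of your patch above; the barrier type and placement
are an assumption on my side, not a verified fix):

	case BLK_EH_RESET_TIMER:
		blk_add_timer(req);
		/*
		 * Hypothetical: make the new deadline written by
		 * blk_add_timer() visible to other CPUs before clearing
		 * ->aborted_gstate allows a completion to be processed.
		 */
		smp_wmb();
		blk_mq_rq_update_aborted_gstate(req, 0);
		break;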

Thanks,

Bart.




Re: [PATCH BUGFIX V2 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Jens Axboe
On 2/7/18 11:00 AM, Paolo Valente wrote:
> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
> be re-inserted into the active I/O scheduler for that device. As a
> consequence, I/O schedulers may get the same request inserted again,
> even several times, without a finish_request invoked on that request
> before each re-insertion.
> 
> This fact is the cause of the failure reported in [1]. For an I/O
> scheduler, every re-insertion of the same re-prepared request is
> equivalent to the insertion of a new request. For schedulers like
> mq-deadline or kyber, this fact causes no harm. In contrast, it
> confuses a stateful scheduler like BFQ, which keeps state for an I/O
> request, until the finish_request hook is invoked on the request. In
> particular, BFQ may get stuck, waiting forever for the number of
> request dispatches, of the same request, to be balanced by an equal
> number of request completions (while there will be one completion for
> that request). In this state, BFQ may refuse to serve I/O requests
> from other bfq_queues. The hang reported in [1] then follows.
> 
> However, the above re-prepared requests undergo a requeue, thus the
> requeue_request hook of the active elevator is invoked for these
> requests, if set. This commit then addresses the above issue by
> properly implementing the hook requeue_request in BFQ.
> 
> [1] https://marc.info/?l=linux-block=15127608676
> 
> Reported-by: Ivan Kozik 
> Reported-by: Alban Browaeys 
> Tested-by: Mike Galbraith 
> Signed-off-by: Paolo Valente 
> Signed-off-by: Serena Ziviani 
> ---
>  block/bfq-iosched.c | 109 
> 
>  1 file changed, 84 insertions(+), 25 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 47e6ec7427c4..21e6b9e45638 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
> blk_mq_hw_ctx *hctx)
>   }
>  
>   /*
> -  * We exploit the bfq_finish_request hook to decrement
> -  * rq_in_driver, but bfq_finish_request will not be
> -  * invoked on this request. So, to avoid unbalance,
> -  * just start this request, without incrementing
> -  * rq_in_driver. As a negative consequence,
> -  * rq_in_driver is deceptively lower than it should be
> -  * while this request is in service. This may cause
> -  * bfq_schedule_dispatch to be invoked uselessly.
> +  * We exploit the bfq_finish_requeue_request hook to
> +  * decrement rq_in_driver, but
> +  * bfq_finish_requeue_request will not be invoked on
> +  * this request. So, to avoid unbalance, just start
> +  * this request, without incrementing rq_in_driver. As
> +  * a negative consequence, rq_in_driver is deceptively
> +  * lower than it should be while this request is in
> +  * service. This may cause bfq_schedule_dispatch to be
> +  * invoked uselessly.
>*
>* As for implementing an exact solution, the
> -  * bfq_finish_request hook, if defined, is probably
> -  * invoked also on this request. So, by exploiting
> -  * this hook, we could 1) increment rq_in_driver here,
> -  * and 2) decrement it in bfq_finish_request. Such a
> -  * solution would let the value of the counter be
> -  * always accurate, but it would entail using an extra
> -  * interface function. This cost seems higher than the
> -  * benefit, being the frequency of non-elevator-private
> +  * bfq_finish_requeue_request hook, if defined, is
> +  * probably invoked also on this request. So, by
> +  * exploiting this hook, we could 1) increment
> +  * rq_in_driver here, and 2) decrement it in
> +  * bfq_finish_requeue_request. Such a solution would
> +  * let the value of the counter be always accurate,
> +  * but it would entail using an extra interface
> +  * function. This cost seems higher than the benefit,
> +  * being the frequency of non-elevator-private
>* requests very low.
>*/
>   goto start_rq;
> @@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct 
> request_queue *q,
>  unsigned int cmd_flags) {}
>  #endif
>  
> +static void bfq_prepare_request(struct request *rq, struct bio *bio);
> +
>  static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request 
> *rq,
>   

[PATCH BUGFIX V2 0/1] block, bfq: handle requeues of I/O requests

2018-02-07 Thread Paolo Valente
Hi,
fixed version, after bug reports and fixes in [1].

Thanks,
Paolo

[1] https://lkml.org/lkml/2018/2/5/599


Paolo Valente (1):
  block, bfq: add requeue-request hook

 block/bfq-iosched.c | 109 
 1 file changed, 84 insertions(+), 25 deletions(-)

--
2.15.1


Re: [PATCH v5 07/10] bcache: fix inaccurate io state for detached bcache devices

2018-02-07 Thread Michael Lyle
detatched should be detached, otherwise lgtm.

Reviewed-by: Michael Lyle 

On 02/05/2018 08:20 PM, Coly Li wrote:
> From: Tang Junhui 
> 
> When we run IO on a detached device and run iostat to show the IO status,
> normally it will show something like below (some fields omitted):
> Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdd     ...    15.89     0.53    1.82    0.20    2.23   1.81  52.30
> bcache0 ...    15.89   115.42    0.00    0.00    0.00   2.40  69.60
> but after IO stopped, there are still very big avgqu-sz and %util
> values as bellow:
> Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> bcache0 ...        0  5326.32    0.00    0.00    0.00   0.00 100.10
> 
> The reason for this issue is that only generic_start_io_acct() is called,
> and no generic_end_io_acct() is called, for a detached device in
> cached_dev_make_request(). See the code:
> //start generic_start_io_acct()
> generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
> if (cached_dev_get(dc)) {
>   //will callback generic_end_io_acct()
> }
> else {
>   //will not call generic_end_io_acct()
> }
> 
> This patch calls generic_end_io_acct() at the end of IO for detached
> devices, so we can show the IO state correctly.
> 
> (Modified to use GFP_NOIO in kzalloc() by Coly Li)
> 
> Signed-off-by: Tang Junhui 
> Reviewed-by: Coly Li 
> Reviewed-by: Hannes Reinecke 
> ---
>  drivers/md/bcache/request.c | 58 
> +++--
>  1 file changed, 51 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 02296bda6384..e09c5ae745be 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -986,6 +986,55 @@ static void cached_dev_nodata(struct closure *cl)
>   continue_at(cl, cached_dev_bio_complete, NULL);
>  }
>  
> +struct detached_dev_io_private {
> + struct bcache_device*d;
> + unsigned long   start_time;
> + bio_end_io_t*bi_end_io;
> + void*bi_private;
> +};
> +
> +static void detatched_dev_end_io(struct bio *bio)
> +{
> + struct detached_dev_io_private *ddip;
> +
> + ddip = bio->bi_private;
> + bio->bi_end_io = ddip->bi_end_io;
> + bio->bi_private = ddip->bi_private;
> +
> + generic_end_io_acct(ddip->d->disk->queue,
> + bio_data_dir(bio),
> + &ddip->d->disk->part0, ddip->start_time);
> +
> + kfree(ddip);
> +
> + bio->bi_end_io(bio);
> +}
> +
> +static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
> +{
> + struct detached_dev_io_private *ddip;
> + struct cached_dev *dc = container_of(d, struct cached_dev, disk);
> +
> + /*
> +  * no need to call closure_get(>disk.cl),
> +  * because upper layer had already opened bcache device,
> +  * which would call closure_get(>disk.cl)
> +  */
> + ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
> + ddip->d = d;
> + ddip->start_time = jiffies;
> + ddip->bi_end_io = bio->bi_end_io;
> + ddip->bi_private = bio->bi_private;
> + bio->bi_end_io = detatched_dev_end_io;
> + bio->bi_private = ddip;
> +
> + if ((bio_op(bio) == REQ_OP_DISCARD) &&
> + !blk_queue_discard(bdev_get_queue(dc->bdev)))
> + bio->bi_end_io(bio);
> + else
> + generic_make_request(bio);
> +}
> +
>  /* Cached devices - read & write stuff */
>  
>  static blk_qc_t cached_dev_make_request(struct request_queue *q,
> @@ -1028,13 +1077,8 @@ static blk_qc_t cached_dev_make_request(struct 
> request_queue *q,
>   else
>   cached_dev_read(dc, s);
>   }
> - } else {
> - if ((bio_op(bio) == REQ_OP_DISCARD) &&
> - !blk_queue_discard(bdev_get_queue(dc->bdev)))
> - bio_endio(bio);
> - else
> - generic_make_request(bio);
> - }
> + } else
> + detached_dev_do_request(d, bio);
>  
>   return BLK_QC_T_NONE;
>  }
> 



Re: [PATCH v5 06/10] bcache: add stop_when_cache_set_failed option to backing device

2018-02-07 Thread Michael Lyle
I agree about the typos-- after they're fixed,

Reviewed-by: Michael Lyle 

On 02/05/2018 08:20 PM, Coly Li wrote:
> When there are too many I/O errors on cache device, current bcache code
> will retire the whole cache set, and detach all bcache devices. But the
> detached bcache devices are not stopped, which is problematic when bcache
> is in writeback mode.
> 
> If the retired cache set has dirty data of backing devices, continue
> writing to bcache device will write to backing device directly. If the
> LBA of write request has a dirty version cached on cache device, next time
> when the cache device is re-registered and backing device re-attached to
> it again, the stale dirty data on cache device will be written to backing
> device, and overwrite latest directly written data. This situation causes
> a quite data corruption.
> 
> But we cannot simply stop all attached bcache devices when the cache set is
> broken or disconnected. For example, use bcache to accelerate performance
> of an email service. In such workload, if cache device is broken but no
> dirty data lost, keep the bcache device alive and permit email service
> continue to access user data might be a better solution for the cache
> device failure.
> 
> Nix  points out the issue and provides the above example
> to explain why it might be necessary to not stop bcache device for broken
> cache device. Pavel Goran  provides a brilliant
> suggestion to provide "always" and "auto" options to per-cached device
> sysfs file stop_when_cache_set_failed. If cache set is retiring and the
> backing device has no dirty data on cache, it should be safe to keep the
> bcache device alive. In this case, if stop_when_cache_set_failed is set to
> "auto", the device failure handling code will not stop this bcache device
> and permit application to access the backing device with a unattached
> bcache device.
> 
> Changelog:
> v2: change option values of stop_when_cache_set_failed from 1/0 to
> "auto"/"always".
> v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1
> (always stop).
> 
> Signed-off-by: Coly Li 
> Cc: Nix 
> Cc: Pavel Goran 
> Cc: Michael Lyle 
> Cc: Junhui Tang 
> Cc: Hannes Reinecke 
> ---
>  drivers/md/bcache/bcache.h |  9 +
>  drivers/md/bcache/super.c  | 82 
> --
>  drivers/md/bcache/sysfs.c  | 17 ++
>  3 files changed, 98 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 7917b3820dd5..263164490833 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -287,6 +287,12 @@ struct io {
>   sector_t    last;
>  };
>  
> +enum stop_on_faliure {
> + BCH_CACHED_DEV_STOP_ATUO = 0,
> + BCH_CACHED_DEV_STOP_ALWAYS,
> + BCH_CACHED_DEV_STOP_MODE_MAX,
> +};
> +
>  struct cached_dev {
>   struct list_head    list;
>   struct bcache_device    disk;
> @@ -379,6 +385,8 @@ struct cached_dev {
>   unsigned    writeback_rate_i_term_inverse;
>   unsigned    writeback_rate_p_term_inverse;
>   unsigned    writeback_rate_minimum;
> +
> + enum stop_on_faliure    stop_when_cache_set_failed;
>  };
>  
>  enum alloc_reserve {
> @@ -924,6 +932,7 @@ void bch_write_bdev_super(struct cached_dev *, struct 
> closure *);
>  
>  extern struct workqueue_struct *bcache_wq;
>  extern const char * const bch_cache_modes[];
> +extern const char * const bch_stop_on_failure_modes[];
>  extern struct mutex bch_register_lock;
>  extern struct list_head bch_cache_sets;
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index f8b0d1196c12..e335433bdfb7 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -47,6 +47,14 @@ const char * const bch_cache_modes[] = {
>   NULL
>  };
>  
> +/* Default is -1; we skip past it for stop_when_cache_set_failed */
> +const char * const bch_stop_on_failure_modes[] = {
> + "default",
> + "auto",
> + "always",
> + NULL
> +};
> +
>  static struct kobject *bcache_kobj;
>  struct mutex bch_register_lock;
>  LIST_HEAD(bch_cache_sets);
> @@ -1187,6 +1195,9 @@ static int cached_dev_init(struct cached_dev *dc, 
> unsigned block_size)
>   max(dc->disk.disk->queue->backing_dev_info->ra_pages,
>   q->backing_dev_info->ra_pages);
>  
> + /* default to auto */
> + dc->stop_when_cache_set_failed = BCH_CACHED_DEV_STOP_ATUO;
> +
>   bch_cached_dev_request_init(dc);
>   bch_cached_dev_writeback_init(dc);
>   return 0;
> @@ -1463,25 +1474,76 @@ static void cache_set_flush(struct closure *cl)
>   closure_return(cl);
>  }
>  
> +/*
> + * This function is only called when CACHE_SET_IO_DISABLE is set, which means
> + * 

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, Bart.

On Wed, Feb 07, 2018 at 05:27:10PM +, Bart Van Assche wrote:
> Even with the above change I think that there is still a race between the
> code that handles timer resets and the completion handler. Anyway, the test

Can you elaborate the scenario a bit further?  If you're referring to
lost completions, we've always had that and while we can try to close
that hole too, I don't think it's a critical issue.

> with which I triggered these races is as follows:
> - Start from what will become kernel v4.16-rc1 and apply the patch that adds
>   SRP over RoCE support to the ib_srpt driver. See also the "[PATCH v2 00/14]
>   IB/srpt: Add RDMA/CM support" patch series
>   (https://www.spinics.net/lists/linux-rdma/msg59589.html).
> - Apply my patch series that fixes a race between the SCSI error handler and
>   SCSI transport recovery.
> - Apply my patch series that improves the stability of the SCSI target core
>   (LIO).
> - Build and install that kernel.
> - Clone the following repository: https://github.com/bvanassche/srp-test.
> - Run the following test:
>   while true; do srp-test/run_tests -c -t 02-mq; done
> - While the test is running, check whether or not something weird happens.
>   Sometimes I see that scsi_times_out() crashes. Sometimes I see while running
>   this test that a soft lockup is reported inside blk_mq_do_dispatch_ctx().
> 
> If you want I can share the tree on github that I use myself for my tests.

Heh, that's quite a bit.  I'll take you up on that git tree later, but for
now I'd really appreciate it if you can test the patch.

Thank you very much.

-- 
tejun


Re: [PATCH v5 03/10] bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set

2018-02-07 Thread Michael Lyle
Reviewed-by: Michael Lyle 

On 02/05/2018 08:20 PM, Coly Li wrote:
> In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
> cached_dev_get() is called when creating dc->writeback_thread, and
> cached_dev_put() is called when exiting dc->writeback_thread. This
> modification works well unless people detach the bcache device manually by
> 'echo 1 > /sys/block/bcache/bcache/detach'
> Because this sysfs interface only calls bch_cached_dev_detach() which wakes
> up dc->writeback_thread but does not stop it. The reason is, before patch
> "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
> bch_writeback_thread(), if cache is not dirty after writeback,
> cached_dev_put() will be called here. And in cached_dev_make_request() when
> a new write request makes cache from clean to dirty, cached_dev_get() will
> be called there. Since we don't operate dc->count in these locations,
> refcount d->count cannot be dropped after cache becomes clean, and
> cached_dev_detach_finish() won't be called to detach bcache device.
> 
> This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
> set inside bch_writeback_thread(). If this bit is set and cache is clean
> (no existing writeback_keys), break the while-loop, call cached_dev_put()
> and quit the writeback thread.
> 
> Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
> writeback thread should continue to perform writeback, this is the original
> design of manually detach.
> 
> It is safe to do the following check without locking, let me explain why,
> + if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
> + (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {
> 
> If the kernel thread does not sleep and continues to run because conditions
> are not updated in time on the running CPU core, it just consumes more CPU
> cycles and has no hurt. This should-sleep-but-run is safe here. We just
> focus on the should-run-but-sleep condition, which means the writeback
> thread goes to sleep in mistake while it should continue to run.
> 1, First of all, no matter the writeback thread is hung or not, 
> kthread_stop() from
>cached_dev_detach_finish() will wake up it and terminate by making
>kthread_should_stop() return true. And in normal run time, bit on index
>BCACHE_DEV_DETACHING is always cleared, the condition
>   !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
>is always true and can be ignored as constant value.
> 2, If one of the following conditions is true, the writeback thread should
>go to sleep,
>"!atomic_read(>has_dirty)" or "!dc->writeback_running)"
>each of them independently controls the writeback thread should sleep or
>not, let's analyse them one by one.
> 2.1 condition "!atomic_read(&dc->has_dirty)"
>If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
>call bch_writeback_queue() immediately or call bch_writeback_add() which
>indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
>wake_up_process(dc->writeback_thread) is called. It sets writeback
>thread's task state to TASK_RUNNING and following an implicit memory
>barrier, then tries to wake up the writeback thread.
>In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
>doing the condition check. If other CPU core sets the TASK_RUNNING state
>after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
>will be scheduled to run very soon because its state is not
>TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
>writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier
>of wake_up_process() will make sure modification of dc->has_dirty on
>other CPU core is updated and observed on the CPU core of writeback
>thread. Therefore the condition check will correctly be false, and
>continue writeback code without sleeping.
> 2.2 condition "!dc->writeback_running)"
>dc->writeback_running can be changed via sysfs file, every time it is
>modified, a following bch_writeback_queue() is always called. So the
>change is always observed on the CPU core of writeback thread. If
>dc->writeback_running is changed from 0 to 1 on other CPU core, this
>condition check will observe the modification and allow writeback
>thread to continue to run without sleeping.
> Now we can see, even without a locking protection, multiple conditions
> check is safe here, no deadlock or process hang up will happen.
> 
> I compose a separate patch because the patch "bcache: fix cached_dev->count
> usage for bch_cache_set_error()" already has a "Reviewed-by:" from Hannes
> Reinecke. Also, this fix is not trivial and deserves a separate patch.
> 
> Signed-off-by: Coly Li 
> Cc: Michael Lyle 
> Cc: Hannes Reinecke 
> Cc: Huijun Tang 
> ---
>  

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 07 feb 2018, alle ore 16:50, Oleksandr Natalenko 
>  ha scritto:
> 
> Hi.
> 
> 07.02.2018 12:27, Paolo Valente wrote:
>> Hi Oleksandr, Holger,
>> before I prepare a V2 candidate patch, could you please test my
>> instrumentation patch too, with the above change made.  For your
>> convenience, I have attached a compressed archive with both the
>> instrumentation patch and a patch making the above change.
>> Crossing my fingers,
>> Paolo
> 
> With all three patches applied I can reproduce neither the cfdisk hang nor
> any kind of boom.
> 
> Thank you.
> 

Thank you, Oleksandr. I'm now ready to prepare a new version of this patch.

Paolo

> Oleksandr



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 18:18 +0100, Roman Penyaev wrote:
> So the question is: are there real life setups where
> some of the local IB network members can be untrusted?

Hello Roman,

You may want to read more about the latest evolutions with regard to network
security. An article that I can recommend is the following: "Google reveals
own security regime policy trusts no network, anywhere, ever"
(https://www.theregister.co.uk/2016/04/06/googles_beyondcorp_security_policy/).

If data centers start deploying RDMA throughout the entire data center
(maybe they are already doing this), then I think they will want to restrict
access to block devices to only those initiator systems that need it.

Thanks,

Bart.




Re: [PATCH v5 01/10] bcache: set writeback_rate_update_seconds in range [1, 60] seconds

2018-02-07 Thread Michael Lyle
LGTM-- in my staging branch.

Reviewed-by: Michael Lyle 

On 02/05/2018 08:20 PM, Coly Li wrote:
> dc->writeback_rate_update_seconds can be set via sysfs and its value can
> be set to [1, ULONG_MAX].  It does not make sense to set such a large
> value, 60 seconds is long enough value considering the default 5 seconds
> works well for long time.
> 
> Because dc->writeback_rate_update is a special delayed work, it re-arms
> itself inside the delayed work routine update_writeback_rate(). When
> stopping it by cancel_delayed_work_sync(), there should be a timeout to
> wait and make sure the re-armed delayed work is stopped too. A small max
> value of dc->writeback_rate_update_seconds is also helpful to decide a
> reasonable small timeout.
> 
> This patch limits sysfs interface to set dc->writeback_rate_update_seconds
> in range of [1, 60] seconds, and replaces the hand-coded number by macros.
> 
> Changelog:
> v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
> v1: initial version.
> 
> Signed-off-by: Coly Li 
> Reviewed-by: Hannes Reinecke 
> Cc: Michael Lyle 
> ---
>  drivers/md/bcache/sysfs.c | 4 +++-
>  drivers/md/bcache/writeback.c | 2 +-
>  drivers/md/bcache/writeback.h | 3 +++
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index c524305cc9a7..4a6a697e1680 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -218,7 +218,9 @@ STORE(__cached_dev)
>   sysfs_strtoul_clamp(writeback_rate,
>   dc->writeback_rate.rate, 1, INT_MAX);
>  
> - d_strtoul_nonzero(writeback_rate_update_seconds);
> + sysfs_strtoul_clamp(writeback_rate_update_seconds,
> + dc->writeback_rate_update_seconds,
> + 1, WRITEBACK_RATE_UPDATE_SECS_MAX);
>   d_strtoul(writeback_rate_i_term_inverse);
>   d_strtoul_nonzero(writeback_rate_p_term_inverse);
>  
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 58218f7e77c3..f1d2fc15abcc 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -655,7 +655,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>   dc->writeback_rate.rate = 1024;
>   dc->writeback_rate_minimum  = 8;
>  
> - dc->writeback_rate_update_seconds = 5;
> + dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
>   dc->writeback_rate_p_term_inverse = 40;
>   dc->writeback_rate_i_term_inverse = 10000;
>  
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index 66f1c527fa24..587b25599856 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -8,6 +8,9 @@
>  #define MAX_WRITEBACKS_IN_PASS  5
>  #define MAX_WRITESIZE_IN_PASS   5000 /* *512b */
>  
> +#define WRITEBACK_RATE_UPDATE_SECS_MAX   60
> +#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT   5
> +
>  /*
>   * 14 (16384ths) is chosen here as something that each backing device
>   * should be a reasonable fraction of the share, and not to blow up
> 



Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 09:06 -0800, Tejun Heo wrote:
> On Tue, Feb 06, 2018 at 05:11:33PM -0800, Bart Van Assche wrote:
> > The following race can occur between the code that resets the timer
> > and completion handling:
> > - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> > - A completion occurs and blk_mq_complete_request() calls
> >   __blk_mq_complete_request().
> > - The timeout code calls blk_add_timer() and that function sets the
> >   request deadline and adjusts the timer.
> > - __blk_mq_complete_request() frees the request tag.
> > - The timer fires and the timeout handler gets called for a freed
> >   request.
> 
> Can you see whether by any chance the following patch fixes the issue?
> If not, can you share the repro case?
> 
> Thanks.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df93102..651d18c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
> reserved)
>* ->aborted_gstate is set, this may lead to ignored
>* completions and further spurious timeouts.
>*/
> - blk_mq_rq_update_aborted_gstate(req, 0);
>   blk_add_timer(req);
> + blk_mq_rq_update_aborted_gstate(req, 0);
>   break;
>   case BLK_EH_NOT_HANDLED:
>   break;

Hello Tejun,

Even with the above change I think that there is still a race between the
code that handles timer resets and the completion handler. Anyway, the test
with which I triggered these races is as follows:
- Start from what will become kernel v4.16-rc1 and apply the patch that adds
  SRP over RoCE support to the ib_srpt driver. See also the "[PATCH v2 00/14]
  IB/srpt: Add RDMA/CM support" patch series
  (https://www.spinics.net/lists/linux-rdma/msg59589.html).
- Apply my patch series that fixes a race between the SCSI error handler and
  SCSI transport recovery.
- Apply my patch series that improves the stability of the SCSI target core
  (LIO).
- Build and install that kernel.
- Clone the following repository: https://github.com/bvanassche/srp-test.
- Run the following test:
  while true; do srp-test/run_tests -c -t 02-mq; done
- While the test is running, check whether or not something weird happens.
  Sometimes I see that scsi_times_out() crashes. Sometimes I see while running
  this test that a soft lockup is reported inside blk_mq_do_dispatch_ctx().

If you want I can share the tree on github that I use myself for my tests.

Thanks,

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Roman Penyaev
On Wed, Feb 7, 2018 at 5:35 PM, Christopher Lameter  wrote:
> On Mon, 5 Feb 2018, Bart Van Assche wrote:
>
>> That approach may work well for your employer but sorry I don't think this is
>> sufficient for an upstream driver. I think that most users who configure a
>> network storage target expect full control over which storage devices are 
>> exported
>> and also over which clients do have and do not have access.
>
> Well is that actually true for IPoIB? It seems that I can arbitrarily
> attach to any partition I want without access control. In many ways some
> of the RDMA layers and modules are loose with security since performance
> is what matters mostly and deployments occur in separate production
> environments.
>
> We have had security issues (that not fully resolved yet) with the RDMA
> RPC API for years.. So maybe lets relax on the security requirements a
> bit?
>

Frankly speaking, I do not understand the "security" concern about this kind
of block device, and RDMA in particular.  I admit that I personally do not
see the whole picture, so can someone provide a real use case/scenario?
What we have in our data centers is a trusted environment (do others exist?).
You need a volume, you create it.  You need to map a volume remotely -
you map it.  Of course there are provisioning checks, rw/ro checks, etc.
But in general any IP/key checks (is that client really a "good" guy or not?)
are simply useless.  So the question is: are there real-life setups where
some of the local IB network members can be untrusted?

--
Roman


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Tejun Heo
Hello, Bart.

On Tue, Feb 06, 2018 at 05:11:33PM -0800, Bart Van Assche wrote:
> The following race can occur between the code that resets the timer
> and completion handling:
> - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> - A completion occurs and blk_mq_complete_request() calls
>   __blk_mq_complete_request().
> - The timeout code calls blk_add_timer() and that function sets the
>   request deadline and adjusts the timer.
> - __blk_mq_complete_request() frees the request tag.
> - The timer fires and the timeout handler gets called for a freed
>   request.

Can you see whether by any chance the following patch fixes the issue?
If not, can you share the repro case?

Thanks.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..651d18c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 * ->aborted_gstate is set, this may lead to ignored
 * completions and further spurious timeouts.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   blk_mq_rq_update_aborted_gstate(req, 0);
break;
case BLK_EH_NOT_HANDLED:
break;


Re: [PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-07 Thread Bart Van Assche
On Mon, 2018-02-05 at 23:20 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55c0a745b427..385bbec73804 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -81,6 +81,17 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
> *hctx)
>   } else
>   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>  
> + /* need to restart all hw queues for global tags */
> + if (hctx->flags & BLK_MQ_F_GLOBAL_TAGS) {
> + struct blk_mq_hw_ctx *hctx2;
> + int i;
> +
> + queue_for_each_hw_ctx(hctx->queue, hctx2, i)
> + if (blk_mq_run_hw_queue(hctx2, true))
> + return true;
> + return false;
> + }
> +
>   return blk_mq_run_hw_queue(hctx, true);
>  }

This new loop looks misplaced to me. If both the BLK_MQ_F_GLOBAL_TAGS and the
BLK_MQ_F_TAG_SHARED flags are set then the outer loop in blk_mq_sched_restart()
and the inner loop in blk_mq_sched_restart_hctx() will cause more calls of
blk_mq_run_hw_queue() than necessary. Have you considered to merge the above
loop into blk_mq_sched_restart()?

Thanks,

Bart.




Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Christopher Lameter
On Mon, 5 Feb 2018, Bart Van Assche wrote:

> That approach may work well for your employer but sorry I don't think this is
> sufficient for an upstream driver. I think that most users who configure a
> network storage target expect full control over which storage devices are 
> exported
> and also over which clients do have and do not have access.

Well is that actually true for IPoIB? It seems that I can arbitrarily
attach to any partition I want without access control. In many ways some
of the RDMA layers and modules are loose with security since performance
is what matters mostly and deployments occur in separate production
environments.

We have had security issues (that are not fully resolved yet) with the RDMA
RPC API for years. So maybe let's relax the security requirements a
bit?



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 13:57 +0100, Roman Penyaev wrote:
> On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche  
> wrote:
> > On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
> > Something else I would like to understand better is how much of the latency
> > gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
> > protocol. Was e.g. support for immediate data present in the NVMeOF and/or
> > SRP drivers used on your test setup?
> 
> I did not get the question. IBTRS uses empty messages with only imm_data
> field set to respond on IO. This is a part of the IBTRS protocol.  I do
> not understand how can immediate data be present in other drivers, if
> those do not use it in their protocols.  I am lost here.

With "immediate data" I was referring to including the entire write buffer
in the write PDU itself. See e.g. the enable_imm_data kernel module parameter
of the ib_srp-backport driver. See also the use of SRP_DATA_DESC_IMM in the
SCST ib_srpt target driver. Neither the upstream SRP initiator nor the upstream
SRP target support immediate data today. However, sending that code upstream
is on my to-do list.

For the upstream NVMeOF initiator and target drivers, see also the call of
nvme_rdma_map_sg_inline() in nvme_rdma_map_data().

> > Are you aware that the NVMeOF target driver calls page_alloc() from the hot 
> > path but that there are plans to
> > avoid these calls in the hot path by using a caching mechanism similar to
> > the SGV cache in SCST? Are you aware that a significant latency reduction
> > can be achieved by changing the SCST SGV cache from a global into a per-CPU
> > cache?
> 
> No, I am not aware. That is nice, that there is a lot of room for performance
> tweaks. I will definitely retest on fresh kernel once everything is done on
> nvme, scst or ibtrs (especially when we get rid of fmrs and UNSAFE rkeys).

Recently the functions sgl_alloc() and sgl_free() were introduced in the 
upstream
kernel (these will be included in kernel v4.16). The NVMe target driver, LIO and
several other drivers have been modified to use these functions instead of their
own copy of that function. The next step is to replace these function calls by
calls to functions that perform cached allocations.

> > Regarding the SRP measurements: have you tried to set the
> > never_register kernel module parameter to true? I'm asking this because I
> > think that mode is most similar to how the IBNBD initiator driver works.
> 
> yes, according to my notes from that link (frankly, I do not remember,
> but that is what I wrote 1 year ago):
> 
> * Where suffixes mean:
> 
>  _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded
>   with 'register_always=N' param
> 
> That what you are asking, right?

Not really. With register_always=Y memory registration is always used by the
SRP initiator, even if the data can be coalesced into a single sg entry. With
register_always=N memory registration is only performed if multiple sg entries
are needed to describe the data. And with never_register=Y memory registration
is not used even if multiple sg entries are needed to describe the data buffer.

Thanks,

Bart.






Re: [PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-07 Thread Bart Van Assche

On 02/06/18 15:18, Jens Axboe wrote:

GLOBAL implies that it's, strangely enough, global. That isn't really the
case. Why not call this BLK_MQ_F_HOST_TAGS or something like that? I'd
welcome better names, but global doesn't seem to be a great choice.

BLK_MQ_F_SET_TAGS?


I like the name BLK_MQ_F_HOST_TAGS because it refers how these tags will 
be used by SCSI LLDs, namely as a host-wide tag set.


Thanks,

Bart.


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Oleksandr Natalenko

Hi.

07.02.2018 12:27, Paolo Valente wrote:

Hi Oleksandr, Holger,
before I prepare a V2 candidate patch, could you please test my
instrumentation patch too, with the above change made.  For your
convenience, I have attached a compressed archive with both the
instrumentation patch and a patch making the above change.

Crossing my fingers,
Paolo


With all three patches applied I can reproduce neither the cfdisk hang
nor any kind of boom.


Thank you.

Oleksandr


RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-07 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Wednesday, February 7, 2018 5:53 PM
> To: Hannes Reinecke
> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > Hi all,
> >
> > [ .. ]
> > >>
> > >> Could you share us your patch for enabling global_tags/MQ on
> > > megaraid_sas
> > >> so that I can reproduce your test?
> > >>
> > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > >>
> > >> Could you share us what the IOPS/CPU utilization effect is after
> > > applying the
> > >> patch V2? And your test script?
> > > Regarding CPU utilization, I need to test one more time. Currently
> > > system is in used.
> > >
> > > I run below fio test on total 24 SSDs expander attached.
> > >
> > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > --ioengine=libaio --rw=randread
> > >
> > > Performance dropped from 1.6 M IOPs to 770K IOPs.
> > >
> > This is basically what we've seen with earlier iterations.
>
> Hi Hannes,
>
> As I mentioned in another mail[1], Kashyap's patch has a big issue,
which
> causes only reply queue 0 used.
>
> [1] https://marc.info/?l=linux-scsi=151793204014631=2
>
> So could you guys run your performance test again after fixing the
patch?

Ming -

I tried after the change you requested.  The performance drop is still
unresolved: from 1.6 M IOPS to 770K IOPS.

See the data below. All 24 reply queues are in use correctly.

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1  NAME
 360  16422  0   16422  IR-PCI-MSI 70254653-edge megasas
 364  15980  0   15980  IR-PCI-MSI 70254657-edge megasas
 362  15979  0   15979  IR-PCI-MSI 70254655-edge megasas
 345  15696  0   15696  IR-PCI-MSI 70254638-edge megasas
 341  15659  0   15659  IR-PCI-MSI 70254634-edge megasas
 369  15656  0   15656  IR-PCI-MSI 70254662-edge megasas
 359  15650  0   15650  IR-PCI-MSI 70254652-edge megasas
 358  15596  0   15596  IR-PCI-MSI 70254651-edge megasas
 350  15574  0   15574  IR-PCI-MSI 70254643-edge megasas
 342  15532  0   15532  IR-PCI-MSI 70254635-edge megasas
 344  15527  0   15527  IR-PCI-MSI 70254637-edge megasas
 346  15485  0   15485  IR-PCI-MSI 70254639-edge megasas
 361  15482  0   15482  IR-PCI-MSI 70254654-edge megasas
 348  15467  0   15467  IR-PCI-MSI 70254641-edge megasas
 368  15463  0   15463  IR-PCI-MSI 70254661-edge megasas
 354  15420  0   15420  IR-PCI-MSI 70254647-edge megasas
 351  15378  0   15378  IR-PCI-MSI 70254644-edge megasas
 352  15377  0   15377  IR-PCI-MSI 70254645-edge megasas
 356  15348  0   15348  IR-PCI-MSI 70254649-edge megasas
 337  15344  0   15344  IR-PCI-MSI 70254630-edge megasas
 343  15320  0   15320  IR-PCI-MSI 70254636-edge megasas
 355  15266  0   15266  IR-PCI-MSI 70254648-edge megasas
 335  15247  0   15247  IR-PCI-MSI 70254628-edge megasas
 363  15233  0   15233  IR-PCI-MSI 70254656-edge megasas


Average:    CPU    %usr   %nice    %sys %iowait  %steal    %irq   %soft  %guest  %gnice   %idle
Average:     18    3.80    0.00   14.78   10.08    0.00    0.00    4.01    0.00    0.00   67.33
Average:     19    3.26    0.00   15.35   10.62    0.00    0.00    4.03    0.00    0.00   66.74
Average:     20    3.42    0.00   14.57   10.67    0.00    0.00    3.84    0.00    0.00   67.50
Average:     21    3.19    0.00   15.60   10.75    0.00    0.00    4.16    0.00    0.00   66.30
Average:     22    3.58    0.00   15.15   10.66    0.00    0.00    3.51    0.00    0.00   67.11
Average:     23    3.34    0.00   15.36   10.63    0.00    0.00    4.17    0.00    0.00   66.50
Average:     24    3.50    0.00   14.58   10.93    0.00    0.00    3.85    0.00    0.00   67.13
Average:     25    3.20    0.00   14.68   10.86    0.00    0.00    4.31    0.00    0.00   66.95
Average:     26    3.27    0.00   14.80   10.70    0.00    0.00    3.68    0.00    0.00   67.55
Average:     27    3.58    0.00   15.36   10.80    0.00    0.00    3.79    0.00    0.00   66.48
Average:     28    3.46    0.00   15.17   10.46    0.00    0.00    3.32    0.00    0.00   67.59
Average:     29    3.34    0.00   14.42   10.72    0.00    0.00    3.34    0.00    0.00   68.18
Average:     30    3.34    0.00   15.08   10.70    0.00    0.00    3.89    0.00    0.00   66.99
Average:     31    3.26    0.00   15.33   10.47    0.00    0.00

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Roman Penyaev
Hi Sagi and all,

On Mon, Feb 5, 2018 at 1:30 PM, Sagi Grimberg  wrote:
> Hi Roman and the team (again), replying to my own email :)
>
> I forgot to mention that first of all thank you for upstreaming
> your work! I fully support your goal to have your production driver
> upstream to minimize your maintenance efforts. I hope that my
> feedback didn't came across with a different impression, that was
> certainly not my intent.

Well, I've just recovered from two heart attacks, which I got
while reading your replies, but now I am fine, thanks :)

> It would be great if you can address and/or reply to my feedback
> (as well as others) and re-spin it again.

Jokes aside, we would like to thank you all, guys, for the valuable
feedback. I got a lot of useful remarks from you, Sagi, and from you, Bart.
We will try to cover them in the next version and will provide up-to-date
comparison results.

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-07 Thread Roman Penyaev
On Tue, Feb 6, 2018 at 5:01 PM, Bart Van Assche  wrote:
> On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
>> On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg  wrote:
>> > [ ... ]
>> > - srp/scst comparison is really not fair having it in legacy request
>> >   mode. Can you please repeat it and report a bug to either linux-rdma
>> >   or to the scst mailing list?
>>
>> Yep, I can retest with mq.
>>
>> > - Your latency measurements are surprisingly high for a null target
>> >   device (even for low end nvme device actually) regardless of the
>> >   transport implementation.
>>
>> Hm, network configuration?  These are results on machines dedicated
>> to our team for testing in one of our datacenters. Nothing special
>> in configuration.
>

Hello Bart,

> I agree that the latency numbers are way too high for a null target device.
> Last time I measured latency for the SRP protocol against an SCST target
> + null block driver at the target side and ConnectX-3 adapters I measured a
> latency of about 14 microseconds. That's almost 100 times less than the
> measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html.

Here is the configuration of the setup:

Initiator and target HW configuration:
AMD Opteron 6386 SE, 64CPU, 128Gb
InfiniBand: Mellanox Technologies MT26428
[ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]

Also, I remember that between initiator and target there were two IB switches.
Unfortunately, I can't repeat the same configuration, but will retest as
soon as we get new HW.

> Something else I would like to understand better is how much of the latency
> gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
> protocol. Was e.g. support for immediate data present in the NVMeOF and/or
> SRP drivers used on your test setup?

I did not get the question. IBTRS uses empty messages, with only the imm_data
field set, to respond to I/O. This is part of the IBTRS protocol.  I do
not understand how immediate data can be present in other drivers if
they do not use it in their protocols.  I am lost here.
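
Roughly, such a response boils down to posting a work request that carries no
payload at all, only the immediate field. A simplified sketch (this is not the
actual IBTRS code):

	static int post_imm_response(struct ib_qp *qp, u32 imm)
	{
		struct ib_send_wr wr = {}, *bad_wr;

		wr.opcode      = IB_WR_SEND_WITH_IMM;	/* zero-length message */
		wr.send_flags  = IB_SEND_SIGNALED;
		wr.ex.imm_data = cpu_to_be32(imm);	/* I/O id / status travels here */
		wr.num_sge     = 0;			/* no data payload */

		return ib_post_send(qp, &wr, &bad_wr);
	}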

> Are you aware that the NVMeOF target driver calls page_alloc() from the hot 
> path but that there are plans to
> avoid these calls in the hot path by using a caching mechanism similar to
> the SGV cache in SCST? Are you aware that a significant latency reduction
> can be achieved by changing the SCST SGV cache from a global into a per-CPU
> cache?

No, I am not aware. That is nice, that there is a lot of room for performance
tweaks. I will definitely retest on fresh kernel once everything is done on
nvme, scst or ibtrs (especially when we get rid of fmrs and UNSAFE rkeys).
Maybe there are some other parameters which can be also tweaked?

> Regarding the SRP measurements: have you tried to set the
> never_register kernel module parameter to true? I'm asking this because I
> think that mode is most similar to how the IBNBD initiator driver works.

yes, according to my notes from that link (frankly, I do not remember,
but that is what I wrote 1 year ago):

* Where suffixes mean:

 _noreg - modules on initiator side (ib_srp, nvme_rdma) were loaded
  with 'register_always=N' param

That's what you are asking, right?

--
Roman


Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-07 Thread Ming Lei
On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> [ .. ]
> >>
> >> Could you share us your patch for enabling global_tags/MQ on
> > megaraid_sas
> >> so that I can reproduce your test?
> >>
> >>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> >>
> >> Could you share us what the IOPS/CPU utilization effect is after
> > applying the
> >> patch V2? And your test script?
> > Regarding CPU utilization, I need to test one more time. Currently system
> > is in used.
> > 
> > I run below fio test on total 24 SSDs expander attached.
> > 
> > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > --ioengine=libaio --rw=randread
> > 
> > Performance dropped from 1.6 M IOPs to 770K IOPs.
> > 
> This is basically what we've seen with earlier iterations.

Hi Hannes,

As I mentioned in another mail[1], Kashyap's patch has a big issue, which
causes only reply queue 0 used.

[1] https://marc.info/?l=linux-scsi=151793204014631=2

So could you guys run your performance test again after fixing the patch?


Thanks,
Ming


Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-07 Thread Jan Kara
On Mon 05-02-18 17:58:12, Bart Van Assche wrote:
> On Sat, 2018-02-03 at 10:51 +0800, Joseph Qi wrote:
> > Hi Bart,
> > 
> > On 18/2/3 00:21, Bart Van Assche wrote:
> > > On Fri, 2018-02-02 at 09:02 +0800, Joseph Qi wrote:
> > > > We triggered this race when using single queue. I'm not sure if it
> > > > exists in multi-queue.
> > > 
> > > Regarding the races between modifying the queue_lock pointer and the code 
> > > that
> > > uses that pointer, I think the following construct in blk_cleanup_queue() 
> > > is
> > > sufficient to avoid races between the queue_lock pointer assignment and 
> > > the code
> > > that executes concurrently with blk_cleanup_queue():
> > > 
> > >   spin_lock_irq(lock);
> > >   if (q->queue_lock != &q->__queue_lock)
> > >           q->queue_lock = &q->__queue_lock;
> > >   spin_unlock_irq(lock);
> > > 
> > 
> > IMO, the race also exists.
> > 
> > blk_cleanup_queue   blkcg_print_blkgs
> >   spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
> > q->queue_lock = >__queue_lock (3)
> >   spin_unlock_irq(lock) (4)
> > spin_unlock_irq(blkg->q->queue_lock) (6)
> > 
> > (1) take driver lock;
> > (2) busy loop for driver lock;
> > (3) override driver lock with internal lock;
> > (4) unlock driver lock; 
> > (5) can take driver lock now;
> > (6) but unlock internal lock.
> > 
> > If we get blkg->q->queue_lock to local first like blk_cleanup_queue,
> > it indeed can fix the different lock use in lock/unlock. But since
> > blk_cleanup_queue has overridden queue lock to internal lock now, I'm
> > afraid we couldn't still use driver lock in blkcg_print_blkgs.
> 
> (+ Jan Kara)
> 
> Hello Joseph,
> 
> That's a good catch. Since modifying all code that accesses the queue_lock
> pointer and that can race with blk_cleanup_queue() would be too cumbersome I
> see only one solution, namely making the request queue cgroup and sysfs
> attributes invisible before the queue_lock pointer is restored. Leaving the
> debugfs attributes visible while blk_cleanup_queue() is in progress should
> be fine if the request queue initialization code is modified such that it
> only modifies the queue_lock pointer for legacy queues. Jan, I think some
> time ago you objected when I proposed to move code from __blk_release_queue()
> into blk_cleanup_queue(). Would you be fine with a slightly different
> approach, namely making block cgroup and sysfs attributes invisible earlier,
> namely from inside blk_cleanup_queue() instead of from inside
> __blk_release_queue()?

Making attributes invisible earlier should be fine. But this whole
switching of queue_lock in blk_cleanup_queue() looks error-prone to me.
Generally, anyone having access to the request_queue can have the old value of
q->queue_lock in its CPU caches and can happily use that value after
blk_cleanup_queue() finishes and the driver-specific structure storing the
lock is freed. blkcg_print_blkgs() is one such example, but I wouldn't bet a
penny that there are no other paths with a similar problem.

Logically, the lifetime of the storage for q->queue_lock should be at least as
long as that of the request_queue itself - i.e., released only after
__blk_release_queue(). Otherwise all users of q->queue_lock need proper
synchronization against the lock switch in blk_cleanup_queue(). Either of these
looks like a lot of work. I guess since this involves only a legacy path,
your approach of removing the sysfs attributes earlier might be a
reasonable band-aid.
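
To illustrate the problem (a sketch, not code from the tree):

	static void example_walk_blkgs(struct request_queue *q)
	{
		spinlock_t *lock = q->queue_lock;  /* may still point into the driver */

		spin_lock_irq(lock);               /* (A) */
		/* ... walk q->blkg_list, print stats, ... */
		spin_unlock_irq(lock);             /* (B) */

		/*
		 * Saving the pointer keeps (A) and (B) paired even if
		 * blk_cleanup_queue() switches q->queue_lock to &q->__queue_lock
		 * in between, but it does not help once the driver structure
		 * containing 'lock' has been freed: then (A)/(B) become a
		 * use-after-free.
		 */
	}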

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Mike Galbraith
On Wed, 2018-02-07 at 12:12 +0100, Paolo Valente wrote:

> Just to be certain, before submitting a new patch: you changed *only*
> the BUG_ON at line 4742, on top of my instrumentation patch.

Nah, I completely rewrote it with only a little help from an ouija
board to compensate for missing (all) knowledge wrt BFQ internals.

(iow yeah, I did exactly and only what I was asked to do:)


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 07 feb 2018, alle ore 12:03, Mike Galbraith  ha 
> scritto:
> 
> On Wed, 2018-02-07 at 11:27 +0100, Paolo Valente wrote:
>> 
>> 2. Could you please turn that BUG_ON into:
>> if (!(rq->rq_flags & RQF_ELVPRIV))
>>  return;
>> and see what happens?
> 
> That seems to make it forget how to make boom.
> 

Great! This deserves to be celebrated with a packet of Ikea chips.

Just to be certain, before submitting a new patch: you changed *only*
the BUG_ON at line 4742, on top of my instrumentation patch.

Thanks,
Paolo


>   -Mike



Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Mike Galbraith
On Wed, 2018-02-07 at 11:27 +0100, Paolo Valente wrote:
> 
> 2. Could you please turn that BUG_ON into:
> if (!(rq->rq_flags & RQF_ELVPRIV))
>   return;
> and see what happens?

That seems to make it forget how to make boom.

-Mike


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-02-07 Thread John Garry

On 30/01/2018 10:33, John Garry wrote:

On 30/01/2018 01:24, Ming Lei wrote:

On Mon, Jan 29, 2018 at 12:56:30PM -0800, James Bottomley wrote:

On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
[...]

2. When to enable SCSI_MQ at default again?


I'm not sure there's much to discuss ... I think the basic answer is as
soon as Christoph wants to try it again.


I guess Christoph still needs to evaluate whether there are existing issues or
blockers before trying it again. And more input may be gathered from F2F
discussion, IMHO.




SCSI_MQ was first introduced in v3.17, but disabled by default. In
v4.13-rc1 it was enabled by default, but the patch was later reverted
in v4.13-rc7, so it is disabled by default again.

Now both the originally reported PM issue (actually SCSI quiesce) and
the sequential I/O performance issue have been addressed.


Is the blocker bug just not closed because no-one thought to do it:

https://bugzilla.kernel.org/show_bug.cgi?id=178381

(we have confirmed that this issue is now fixed with the original
reporter?)



From a developer's view, this issue is fixed by the following commit:

3a0a52997 ("block, scsi: Make SCSI quiesce and resume work reliably"),
and it has been verified by the kernel list reporter.



And did the Huawei guy (Jonathan Cameron) confirm his performance issue
was fixed (I don't think I saw email that he did)?


Last time I talked with John Garry about the issue, the merged
.get_budget based patch improved the I/O performance a lot, but there was
still a bit of a gap compared with the legacy path. It seemed to be a
driver-specific issue; I remember that removing a driver's lock could
improve performance a lot.

Garry, could you provide a further update on this issue?


Hi Ming,

From our testing with experimental changes to our driver to support SCSI
MQ, we were getting almost on-par performance with the legacy path. But
without those changes, MQ was hurting performance (and I would not
necessarily say it was a driver issue).

We can retest from today's mainline and see where we are.

BTW, have you got performance figures for many other single-queue HBAs
with and without CONFIG_SCSI_MQ_DEFAULT=Y?


We finally got around to retesting this (on the hisi_sas controller). The
results are generally OK, in that we are no longer seeing such big
performance drops on our hardware from enabling SCSI MQ - in some
scenarios the performance is better. Generally the fio rw mode is better.

Anyway, for what it's worth, it's a green light from us to enable SCSI MQ
by default.


John



Thanks,
John



Thanks,
Ming











Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules

2018-02-07 Thread Roman Penyaev
On Tue, Feb 6, 2018 at 5:10 PM, Jason Gunthorpe  wrote:
> On Tue, Feb 06, 2018 at 01:01:23PM +0100, Roman Penyaev wrote:
>
>> >> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device
>> >> *dev)
>> >> +{
>> >> +   int err;
>> >> +
>> >> +   d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
>> >> +   if (IS_ERR(d->pd))
>> >> +   return PTR_ERR(d->pd);
>> >> +   d->dev = dev;
>> >> +   d->lkey = d->pd->local_dma_lkey;
>> >> +   d->rkey = d->pd->unsafe_global_rkey;
>> >> +
>> >> +   err = ibtrs_query_device(d);
>> >> +   if (unlikely(err))
>> >> +   ib_dealloc_pd(d->pd);
>> >> +
>> >> +   return err;
>> >> +}
>> >
>> >
>> > I must say that this makes me frustrated.. We stopped doing these
>> > sort of things long time ago. No way we can even consider accepting
>> > the unsafe use of the global rkey exposing the entire memory space for
>> > remote access permissions.
>> >
>> > Sorry for being blunt, but this protocol design which makes a concious
>> > decision to expose unconditionally is broken by definition.
>>
>> I suppose we can also afford the same trick which nvme does: provide
>> register_always module argument, can we?  That can be also interesting
>> to measure the performance difference.
>
> I can be firmer than Sagi - new code that has IB_PD_UNSAFE_GLOBAL_RKEY
> at can not be accepted upstream.
>
> Once you get that fixed, you should go and read my past comments on
> how to properly order dma mapping and completions and fix that
> too. Then redo your performance..

Clear.
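
(For reference, the requested change on the PD side boils down to something
like the sketch below, illustrative only and not the final IBTRS code, plus
per-request memory registration instead of the global rkey:)

	d->pd = ib_alloc_pd(dev, 0);	/* no IB_PD_UNSAFE_GLOBAL_RKEY */
	if (IS_ERR(d->pd))
		return PTR_ERR(d->pd);
	d->dev = dev;
	d->lkey = d->pd->local_dma_lkey;
	/* no d->rkey any more: remote access goes through per-request MRs */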

--
Roman


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Mike Galbraith
On Wed, 2018-02-07 at 11:27 +0100, Paolo Valente wrote:
> 
> 1. Could you paste a stack trace for this OOPS, just to understand how we
> get there?

[  442.421058] kernel BUG at block/bfq-iosched.c:4742!
[  442.421762] invalid opcode:  [#1] SMP PTI
[  442.422436] Dumping ftrace buffer:
[  442.423116](ftrace buffer empty)
[  442.423785] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
iptable_filter(E) ip6table_>
[  442.426685]  pcbc(E) iTCO_wdt(E) aesni_intel(E) aes_x86_64(E) 
iTCO_vendor_support(E) crypto_simd(E) glue_helper(E) mei_me(E) i2c_i801(E) 
r8169(E) cryptd(E) lpc_ich(E) mfd_core(E) mii(E) mei(E) soundcore(E) pcspkr(E) 
shpchp(E) thermal>
[  442.429634]  dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) 
scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
[  442.430166] CPU: 2 PID: 489 Comm: kworker/2:1H Tainted: GE
4.15.0.ga2e5790-master #612
[  442.430675] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[  442.431195] Workqueue: kblockd blk_mq_requeue_work
[  442.431706] RIP: 0010:bfq_only_requeue_request+0xd1/0xe0
[  442.432221] RSP: 0018:8803f7b8fcc0 EFLAGS: 00010246
[  442.432733] RAX: 2090 RBX: 8803dbda2580 RCX: 
[  442.433250] RDX: 8803fa3c8000 RSI: 0004 RDI: 8803dbda2580
[  442.433778] RBP: 8803f9ec85d0 R08: 8803fa3c81c0 R09: 
[  442.434296] R10: 8803fa3c81d0 R11: 1000 R12: 8803dbda2600
[  442.434815] R13: 000d R14:  R15: 8803f7b8fd88
[  442.435334] FS:  () GS:88041ec8() 
knlGS:
[  442.435863] CS:  0010 DS:  ES:  CR0: 80050033
[  442.436381] CR2: 7f00e001e198 CR3: 01e0a002 CR4: 001606e0
[  442.436909] Call Trace:
[  442.437445]  __blk_mq_requeue_request+0x8f/0x120
[  442.437980]  blk_mq_dispatch_rq_list+0x342/0x550
[  442.438506]  ? kyber_dispatch_request+0xd0/0xd0
[  442.439041]  blk_mq_sched_dispatch_requests+0xf7/0x180
[  442.439568]  __blk_mq_run_hw_queue+0x58/0xd0
[  442.440103]  __blk_mq_delay_run_hw_queue+0x99/0xa0
[  442.440628]  blk_mq_run_hw_queue+0x54/0xf0
[  442.441157]  blk_mq_run_hw_queues+0x4b/0x60
[  442.441822]  blk_mq_requeue_work+0x13a/0x150
[  442.442518]  process_one_work+0x147/0x350
[  442.443205]  worker_thread+0x47/0x3e0
[  442.443887]  kthread+0xf8/0x130
[  442.444579]  ? rescuer_thread+0x360/0x360
[  442.445264]  ? kthread_stop+0x120/0x120
[  442.445965]  ? do_syscall_64+0x75/0x1a0
[  442.446651]  ? SyS_exit_group+0x10/0x10
[  442.447340]  ret_from_fork+0x35/0x40
[  442.448023] Code: ff 4c 89 f6 4c 89 ef e8 be ec 2e 00 48 c7 83 80 00 00 00 
00 00 00 00 48 c7 83 88 00 00 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b 
0f 0b 0f 0b 0f 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41
[  442.448668] RIP: bfq_only_requeue_request+0xd1/0xe0 RSP: 8803f7b8fcc0



Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 07 feb 2018, alle ore 11:15, Mike Galbraith  ha 
> scritto:
> 
> On Wed, 2018-02-07 at 10:45 +0100, Paolo Valente wrote:
>> 
>>> Il giorno 07 feb 2018, alle ore 10:23, Mike Galbraith  ha 
>>> scritto:
>>> 
>>> On Wed, 2018-02-07 at 10:08 +0100, Paolo Valente wrote:
 
 The first piece of information I need is whether this failure happens
 even without "BFQ hierarchical scheduling support".
>>> 
>>> I presume you mean BFQ_GROUP_IOSCHED, which I do not have enabled.
>>> 
>> 
>> Great (so to speak), this saves us one step.
>> 
>> So, here's my next request for help: please apply the attached patch
>> (compressed to preserve it from my email client) and retry. It adds
>> several anomaly checks. I hope I have not added any false-positive
>> check.
> 
> kernel BUG at block/bfq-iosched.c:4742!
> 
> 4742 BUG_ON(!(rq->rq_flags & RQF_ELVPRIV));

Oh my, this is as crazy as it is, fortunately, easy to fix.  The problem is
that, while this is easy to fix in bfq, it increases the doubts I have
expressed in my cover letter: is it ok that, in blk-mq, the functions
of an elevator may get invoked, without control, on requests that do
not belong to that elevator?

Anyway, two requests, Mike, if you haven't had enough already:

1. Could you paste a stack trace for this OOPS, just to understand how we
get there?

2. Could you please turn that BUG_ON into:
if (!(rq->rq_flags & RQF_ELVPRIV))
return;
and see what happens?

Thanks a lot,
Paolo

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Mike Galbraith
On Wed, 2018-02-07 at 10:45 +0100, Paolo Valente wrote:
> 
> > Il giorno 07 feb 2018, alle ore 10:23, Mike Galbraith  ha 
> > scritto:
> > 
> > On Wed, 2018-02-07 at 10:08 +0100, Paolo Valente wrote:
> >> 
> >> The first piece of information I need is whether this failure happens
> >> even without "BFQ hierarchical scheduling support".
> > 
> > I presume you mean BFQ_GROUP_IOSCHED, which I do not have enabled.
> > 
> 
> Great (so to speak), this saves us one step.
> 
> So, here's my next request for help: please apply the attached patch
> (compressed to preserve it from my email client) and retry. It adds
> several anomaly checks. I hope I have not added any false-positive
> check.

kernel BUG at block/bfq-iosched.c:4742!

4742 BUG_ON(!(rq->rq_flags & RQF_ELVPRIV));




Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Mike Galbraith
On Wed, 2018-02-07 at 10:08 +0100, Paolo Valente wrote:
> 
> The first piece of information I need is whether this failure happens
> even without "BFQ hierarchical scheduling support".

I presume you mean BFQ_GROUP_IOSCHED, which I do not have enabled.

-Mike 


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 19:35, Oleksandr Natalenko 
>  ha scritto:
> 
> Hi.
> 
> 06.02.2018 15:50, Paolo Valente wrote:
>> Could you please do a
>> gdb /block/bfq-iosched.o # or vmlinux.o if bfq is builtin
>> list *(bfq_finish_requeue_request+0x54)
>> list *(bfq_put_queue+0x10b)
>> for me?
> 
> Fresh crashes and gdb output are given below. A side note: it is harder to 
> trigger things on a slower machine, so clearly some timing-bounded race 
> condition there.
> 
> [  134.276548] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [  134.283699] IP: blk_flush_complete_seq+0x20a/0x300
> [  134.288163] PGD 0 P4D 0
> [  134.291284] Oops: 0002 [#1] PREEMPT SMP PTI
> [  134.293842] Modules linked in: bochs_drm ttm nls_iso8859_1 kvm_intel 
> nls_cp437 vfat fat drm_kms_helper kvm drm irqbypass psmouse iTCO_wdt ppdev 
> iTCO_vendor_support input_leds led_class i2c_i801 parport_pc joydev intel_agp 
> parport intel_gtt mousedev lpc_ich rtc_cmos syscopyarea evdev sysfillrect 
> agpgart qemu_fw_cfg mac_hid sysimgblt fb_sys_fops sch_fq_codel ip_tables 
> x_tables xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c 
> crc32c_generic dm_crypt algif_skcipher af_alg dm_mod hid_generic usbhid 
> raid10 hid md_mod sr_mod sd_mod cdrom uhci_hcd ehci_pci serio_raw 
> crct10dif_pclmul crc32_pclmul atkbd crc32c_intel libps2 ghash_clmulni_intel 
> pcbc xhci_pci xhci_hcd ehci_hcd aesni_intel aes_x86_64 crypto_simd 
> glue_helper cryptd ahci libahci libata usbcore usb_common i8042 serio 
> virtio_scsi
> [  134.340606]  scsi_mod virtio_blk virtio_net virtio_pci virtio_ring virtio
> [  134.345803] CPU: 0 PID: 178 Comm: kworker/0:1H Not tainted 4.15.0-pf2 #1
> [  134.350309] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> 02/06/2015
> [  134.355106] Workqueue: kblockd blk_mq_run_work_fn
> [  134.359034] RIP: 0010:blk_flush_complete_seq+0x20a/0x300
> [  134.367647] RSP: :88000f803ce8 EFLAGS: 00010082
> [  134.371632] RAX: 88000d9755c0 RBX: 88000d9755a0 RCX: 
> 88000c9b39a8
> [  134.375675] RDX:  RSI: 88000d9755d0 RDI: 
> 88000c9b3900
> [  134.381068] RBP: 88000d21a990 R08: 88000d9755b0 R09: 
> 
> [  134.386302] R10: 8800058ff100 R11: 0002000b R12: 
> 
> [  134.396915] R13: 88000d9755f0 R14: 0046 R15: 
> 88000d9755a0
> [  134.401140] FS:  () GS:88000f80() 
> knlGS:
> [  134.407361] CS:  0010 DS:  ES:  CR0: 80050033
> [  134.412384] CR2:  CR3: 04008006 CR4: 
> 001606f0
> [  134.416913] Call Trace:
> [  134.420251]  
> [  134.427731]  mq_flush_data_end_io+0xb3/0xf0
> [  134.431848]  scsi_end_request+0x90/0x1e0 [scsi_mod]
> [  134.436424]  scsi_io_completion+0x237/0x650 [scsi_mod]
> [  134.440109]  __blk_mq_complete_request+0xc4/0x150
> [  134.444517]  ? scsi_mq_get_budget+0x110/0x110 [scsi_mod]
> [  134.449603]  ata_scsi_qc_complete+0x8d/0x430 [libata]
> [  134.458487]  ata_qc_complete_multiple+0x8d/0xe0 [libata]
> [  134.461726]  ahci_handle_port_interrupt+0xc9/0x5b0 [libahci]
> [  134.466420]  ahci_handle_port_intr+0x54/0xb0 [libahci]
> [  134.470128]  ahci_single_level_irq_intr+0x3b/0x60 [libahci]
> [  134.473327]  __handle_irq_event_percpu+0x44/0x1e0
> [  134.476700]  handle_irq_event_percpu+0x30/0x70
> [  134.480227]  handle_irq_event+0x37/0x60
> [  134.490341]  handle_edge_irq+0x107/0x1c0
> [  134.492876]  handle_irq+0x1f/0x30
> [  134.495497]  do_IRQ+0x4d/0xe0
> [  134.497963]  common_interrupt+0xa2/0xa2
> [  134.500877]  
> [  134.503129] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x40
> [  134.506782] RSP: :c9307d30 EFLAGS: 0293 ORIG_RAX: 
> ffdb
> [  134.511845] RAX: 0001 RBX: 88000db04000 RCX: 
> 0008
> [  134.523019] RDX: 0100 RSI: 0293 RDI: 
> 0293
> [  134.527968] RBP: 0293 R08:  R09: 
> 0040
> [  134.532289] R10: 008e66bf R11: 0002000b R12: 
> 
> [  134.536376] R13: 88000d26a000 R14: 88000b99ac48 R15: 
> 88000d26a000
> [  134.541046]  ata_scsi_queuecmd+0xa0/0x210 [libata]
> [  134.544363]  scsi_dispatch_cmd+0xe8/0x260 [scsi_mod]
> [  134.552883]  scsi_queue_rq+0x4cf/0x560 [scsi_mod]
> [  134.556811]  blk_mq_dispatch_rq_list+0x8f/0x4c0
> [  134.559741]  blk_mq_sched_dispatch_requests+0x105/0x190
> [  134.563253]  __blk_mq_run_hw_queue+0x80/0x90
> [  134.565540]  process_one_work+0x1df/0x420
> [  134.568041]  worker_thread+0x2b/0x3d0
> [  134.571032]  ? process_one_work+0x420/0x420
> [  134.573964]  kthread+0x113/0x130
> [  134.584370]  ? kthread_create_on_node+0x70/0x70
> [  134.587355]  ret_from_fork+0x35/0x40
> [  134.589796] Code: 39 d0 0f 84 8f 00 00 00 48 8b 97 b0 00 00 00 49 c1 e0 04 
> 45 31 e4 48 8b b7 a8 00 00 00 49 01 d8 48 8d 8f a8 00 

[PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-07 Thread Joseph Qi
We've triggered a WARNING in blk_throtl_bio when throttling writeback
I/O: it complains that blkg->refcnt is already 0 when calling blkg_get, and
then the kernel crashes with an invalid page request.
After investigating this issue, we found there is a race between
blkcg_bio_issue_check and cgroup_rmdir. The race is described below.

writeback kworker
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
<<< *race window*
blk_throtl_bio
  spin_lock_irq(q->queue_lock)
  spin_unlock_irq(q->queue_lock)
rcu_read_unlock

cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)

Since rcu can only prevent the blkg from being released while it is in use,
blkg->refcnt can be decreased to 0 during blkg_destroy, which schedules the
blkg release.
Then trying to blkg_get in blk_throtl_bio triggers the WARNING.
And then the corresponding blkg_put will schedule the blkg release again,
which results in a double free.
This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, the code would look
up the blkg first and then look it up/create it again with queue_lock held. So
revive this logic to fix the race.

v2: fix a potential NULL pointer dereference when updating stats

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
Reported-by: Jiufei Xue 
Signed-off-by: Joseph Qi 
Cc: sta...@vger.kernel.org
---
 block/blk-cgroup.c |  2 +-
 block/blk-throttle.c   | 30 ++
 block/cfq-iosched.c| 33 +++--
 include/linux/blk-cgroup.h | 27 +--
 4 files changed, 55 insertions(+), 37 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524..b1d22e5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -162,7 +162,6 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 
return NULL;
 }
-EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);
 
 /*
  * If @new_blkg is %NULL, this function tries to allocate a new one as
@@ -306,6 +305,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
return blkg;
}
 }
+EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c5a1316..decdd76 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2143,26 +2143,41 @@ static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
 #endif
 }
 
-bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
+bool blk_throtl_bio(struct request_queue *q, struct blkcg *blkcg,
struct bio *bio)
 {
+   struct throtl_data *td = q->td;
struct throtl_qnode *qn = NULL;
-   struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
+   struct throtl_grp *tg;
+   struct blkcg_gq *blkg;
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
-   struct throtl_data *td = tg->td;
 
WARN_ON_ONCE(!rcu_read_lock_held());
 
+   /*
+* If a group has no rules, just update the dispatch stats in lockless
+* manner and return.
+*/
+   blkg = blkg_lookup(blkcg, q);
+   tg = blkg_to_tg(blkg);
+   if (tg && !tg->has_rules[rw])
+   goto out;
+
/* see throtl_charge_bio() */
-   if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
+   if (bio_flagged(bio, BIO_THROTTLED))
goto out;
 
spin_lock_irq(q->queue_lock);
 
throtl_update_latency_buckets(td);
 
+   blkg = blkg_lookup_create(blkcg, q);
+   if (IS_ERR(blkg))
+   blkg = q->root_blkg;
+   tg = blkg_to_tg(blkg);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;
 
@@ -2253,6 +2268,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
if (throttled || !td->track_bio_latency)
bio->bi_issue_stat.stat |= SKIP_LATENCY;
 #endif
+   if (!throttled) {
+   blkg = blkg ?: q->root_blkg;
+   blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
+   bio->bi_iter.bi_size);
+   blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
+   }
+
return throttled;
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef..60f53b5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1674,15 +1674,28 @@ static void cfq_pd_reset_stats(struct blkg_policy_data *pd)
	cfqg_stats_reset(&cfqg->stats);
 }
 
-static struct cfq_group *cfq_lookup_cfqg(struct cfq_data *cfqd,
-struct blkcg *blkcg)
+/*
+ * Search