Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments

2016-11-02 Thread Ming Lei
On Wed, Nov 2, 2016 at 10:24 PM, Mike Snitzer  wrote:
> On Wed, Nov 02 2016 at  3:56am -0400,
> Ming Lei  wrote:
>
>> On Wed, Nov 2, 2016 at 11:09 AM, Kent Overstreet
>>  wrote:
>> > On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
>> >> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
>> >> > Avoid accessing .bi_vcnt directly, because it may no longer be what
>> >> > the driver expects once multipage bvecs are supported.
>> >> >
>> >> > Signed-off-by: Ming Lei 
>> >>
>> >> It would be really nice to have a comment in the code why it's
>> >> even checking for multiple segments.
>> >
>> > Or ideally refactor the code to not care about multiple segments at all.
>>
>> The check on 'bio->bi_vcnt == 1' was introduced in commit de3ec86dff160 ("dm:
>> don't start current request if it would've merged with the previous"), which
>> fixed a performance issue. [1]
>>
>> It looks like the idea of the patch is to delay dispatching the rq if it
>> would've merged with the previous request and the rq is small (single bvec).
>> I guess the motivation is that the delay increases the chance of merging.
>>
>> But why does the code check 'bio->bi_vcnt == 1'? Once the bio is
>> submitted, .bi_vcnt isn't changed any more, and merging doesn't change
>> it either. So should the check have been on blk_rq_bytes(rq) instead?
>>
>> Mike, please correct me if my understanding is wrong.
>>
>>
>> [1] https://www.redhat.com/archives/dm-devel/2015-March/msg00014.html
>
> The patch was labored over for quite a while and is based on suggestions I
> got from Jens when discussing a very problematic aspect of old
> .request_fn request-based DM performance for a multi-threaded (64
> threads) sequential IO benchmark (vdbench IIRC).  The issue was reported
> by NetApp.
>
> The patch in question fixed the lack of merging that was seen with this
> interleaved sequential IO benchmark.  The lack of merging was made worse
> if a DM multipath device had more underlying paths (e.g. 4 instead of 2).
>
> As for your question about using blk_rq_bytes(rq) vs 'bio->bi_vcnt == 1'
> ... I'm not sure how that would be a suitable replacement.  But it has been a
> while since I've delved into these block core merge details of the old

That was just last year; it doesn't seem that long ago, :-)

> .request_fn path, but _please_ don't change the logic of this code simply

As I explained before, .bi_vcnt is neither changed after submission nor
changed during merging, so I think the check is wrong. Could you explain
your original motivation for checking 'bio->bi_vcnt == 1'?
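
To make the two options concrete, here is a sketch of the dispatch-delay
heuristic in both forms (not the verbatim dm.c hunk; would_merge_with_previous()
stands in for whatever position/direction comparison the real code does against
the last dispatched request, and the 4k threshold is purely illustrative):

	/* current form: treat "one bvec" as the sign that the rq is small */
	if (would_merge_with_previous(md, rq) &&
	    rq->bio && rq->bio->bi_vcnt == 1)
		goto delay_and_out;

	/* suggested form: judge "small" by payload size, which (unlike
	 * .bi_vcnt) is what merging actually grows */
	if (would_merge_with_previous(md, rq) &&
	    blk_rq_bytes(rq) <= 4096)
		goto delay_and_out;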

> because it is proving itself to be problematic for your current
> patchset's cleanliness.

Could you explain what exactly is problematic for the patchset's cleanliness?

Thanks,
Ming Lei


Re: [PATCH 5/8] block: add code to track actual device queue depth

2016-11-02 Thread Johannes Thumshirn
On Wed, Nov 02, 2016 at 09:02:08AM -0600, Jens Axboe wrote:
> On 11/02/2016 08:59 AM, Christoph Hellwig wrote:
> > On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
> > > For blk-mq, ->nr_requests does track queue depth, at least at init
> > > time. But for the older queue paths, it's simply a soft setting.
> > > On top of that, it's generally larger than the hardware setting
> > > on purpose, to allow backup of requests for merging.
> > > 
> > > Fill a hole in struct request with a 'queue_depth' member, that
> > 
> > That would be struct request_queue..
> 
> Good catch, will fix.
> 
> > >  /**
> > > + * blk_set_queue_depth - tell the block layer about the device queue depth
> > > + * @q:   the request queue for the device
> > > + * @depth:   queue depth
> > > + *
> > > + */
> > > +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> > > +{
> > > + q->queue_depth = depth;
> > > +}
> > > +EXPORT_SYMBOL(blk_set_queue_depth);
> > 
> > Do we really need this wrapper?
> 
> Not necessarily, just seems like a nicer API than manually setting the
> field. Not a big deal to me, though.

A lot of block code uses this kind of setter, so I _think_ it complies with
the overall style. But I have no strong opinion on this either...
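
To illustrate the two call-site styles being weighed here (the driver-side
values are hypothetical):

	/* with the wrapper from the patch */
	blk_set_queue_depth(sdev->request_queue, depth);

	/* without it, each driver would poke the field directly */
	sdev->request_queue->queue_depth = depth;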

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/8] block: add WRITE_BACKGROUND

2016-11-02 Thread Jens Axboe
On Wed, Nov 02 2016, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:08:44PM -0600, Jens Axboe wrote:
> > This adds a new request flag, REQ_BACKGROUND, that callers can use to
> > tell the block layer that this is background (non-urgent) IO.
> 
> The subject still says WRITE_BACKGROUND :)

Gah - will fix that up.

> Otherwise looks fine:
> 
> Reviewed-by: Christoph Hellwig 

Added, thanks.
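
For readers following the series: once the flag exists, a submission path can
mark non-urgent IO by OR-ing it into bi_opf, roughly like this (an illustrative
sketch, not a hunk from this patch):

	if (wbc->for_background)
		bio->bi_opf |= REQ_BACKGROUND;	/* non-urgent, background IO */
	submit_bio(bio);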

-- 
Jens Axboe



Re: [PATCH 0/7] LightNVM patchset V2

2016-11-02 Thread Matias Bjørling
On 10/27/2016 08:01 PM, Javier González wrote:
> V2: Rebase patches on Matias' for-next
> 
> Javier González (7):
>   lightnvm: enable to send hint to erase command
>   lightnvm: do not decide on device blocks
>   lightnvm: manage block list on LUN owner
>   lightnvm: drop reserve and release LUN callbacks
>   lightnvm: export set bad block table
>   lightnvm: add ECC error codes
>   lightnvm: rrpc: split bios of size > 256kb
> 
>  drivers/lightnvm/core.c  |  41 --
>  drivers/lightnvm/gennvm.c| 180 +--
>  drivers/lightnvm/gennvm.h|  19 ++---
>  drivers/lightnvm/rrpc.c  | 128 --
>  drivers/lightnvm/rrpc.h  |   4 +
>  drivers/lightnvm/sysblk.c|  33 ++--
>  drivers/nvme/host/lightnvm.c |   1 +
>  include/linux/lightnvm.h |  52 +
>  8 files changed, 305 insertions(+), 153 deletions(-)
> 

Thanks Javier. I've picked up 1-2 and 5-7 into my for-next.


Re: [PATCH v5 0/14] Fix race conditions related to stopping block layer queues

2016-11-02 Thread Christoph Hellwig
Hi Bart,

this whole series looks great to me:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 2/4] block: move poll code to blk-mq

2016-11-02 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 5/8] block: add code to track actual device queue depth

2016-11-02 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
> For blk-mq, ->nr_requests does track queue depth, at least at init
> time. But for the older queue paths, it's simply a soft setting.
> On top of that, it's generally larger than the hardware setting
> on purpose, to allow backup of requests for merging.
> 
> Fill a hole in struct request with a 'queue_depth' member, that

That would be struct request_queue..

>  /**
> + * blk_set_queue_depth - tell the block layer about the device queue depth
> + * @q:   the request queue for the device
> + * @depth:   queue depth
> + *
> + */
> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> +{
> + q->queue_depth = depth;
> +}
> +EXPORT_SYMBOL(blk_set_queue_depth);

Do we really need this wrapper?

> +++ b/drivers/scsi/scsi.c
> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
>   wmb();
>   }
>  
> + if (sdev->request_queue)
> + blk_set_queue_depth(sdev->request_queue, depth);
> +
>   return sdev->queue_depth;

Can we kill the scsi_device queue_depth member and just always use
the request_queue one?


Re: [PATCH 5/8] block: add code to track actual device queue depth

2016-11-02 Thread Jens Axboe

On 11/02/2016 08:59 AM, Christoph Hellwig wrote:

On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:

For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.

Fill a hole in struct request with a 'queue_depth' member, that


That would be struct request_queue..


Good catch, will fix.


 /**
+ * blk_set_queue_depth - tell the block layer about the device queue depth
+ * @q: the request queue for the device
+ * @depth: queue depth
+ *
+ */
+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+   q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);


Do we really need this wrapper?


Not necessarily, just seems like a nicer API than manually setting the
field. Not a big deal to me, though.


+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
wmb();
}

+   if (sdev->request_queue)
+   blk_set_queue_depth(sdev->request_queue, depth);
+
return sdev->queue_depth;


Can we kill the scsi_device queue_depth member and just always use
the request_queue one?


Yes. I'd prefer that we do that through the SCSI tree, once this is in
though.
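
A rough sketch of what that SCSI-side follow-up could look like once the
request_queue copy is authoritative (not an actual patch; the locking and
error paths of the real scsi_change_queue_depth() are omitted):

	int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
	{
		if (depth > 0) {
			/* store the depth only in the request_queue ... */
			blk_set_queue_depth(sdev->request_queue, depth);
			wmb();
		}

		/* ... and readers use the queue's copy instead of a cached
		 * sdev->queue_depth */
		return sdev->request_queue->queue_depth;
	}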


--
Jens Axboe



Re: [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()

2016-11-02 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 03:08:47PM -0600, Jens Axboe wrote:
> Note in the bdi_writeback structure whenever a task ends up sleeping
> waiting for progress. We can use that information in the lower layers
> to increase the priority of writes.

Do we need to care about atomicity of multiple threads updating the value?

Otherwise this looks fine.
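
For context, the update in question is a single word store around the
throttling sleep, roughly like the sketch below (the field name is an
assumption, not taken from the patch). Since it is only a hint, concurrent
writers merely race to store the same values; WRITE_ONCE()/READ_ONCE() would
document that the race is intentional:

	/* in balance_dirty_pages(), around the throttling sleep */
	WRITE_ONCE(wb->dirty_sleeping, 1);
	io_schedule_timeout(pause);
	WRITE_ONCE(wb->dirty_sleeping, 0);

	/* lower layers treat it purely as a hint */
	if (READ_ONCE(wb->dirty_sleeping))
		/* prioritize writes to relieve the waiter */;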


Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-02 Thread Jens Axboe

On 11/02/2016 08:52 AM, Christoph Hellwig wrote:

On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.


What is the use case for the legacy request tracking?


Buffered writeback code uses the same base. Additionally, it could
replace some user space tracking for latency outliers that people are
currently running.

--
Jens Axboe



Re: [PATCH 3/8] writeback: mark background writeback as such

2016-11-02 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 2/8] writeback: add wbc_to_write_flags()

2016-11-02 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 03:08:45PM -0600, Jens Axboe wrote:
> Add wbc_to_write_flags(), which returns the write modifier flags to use,
> based on a struct writeback_control. No functional changes in this
> patch, but it prepares us for factoring other wbc fields for write type.
> 
> Signed-off-by: Jens Axboe 
> Reviewed-by: Jan Kara 

Looks fine,

Reviewed-by: Christoph Hellwig 
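
For reference, given the description above and the REQ_BACKGROUND flag from
patch 1, the helper plausibly looks something like this (a sketch, not the
exact hunk from the patch):

static inline int wbc_to_write_flags(struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_ALL)
		return REQ_SYNC;		/* urgent, integrity writeback */
	else if (wbc->for_kupdate || wbc->for_background)
		return REQ_BACKGROUND;		/* non-urgent background writeback */

	return 0;
}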


Re: [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT

2016-11-02 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 03:05:24PM -0600, Jens Axboe wrote:
> This patch enables a hybrid polling mode. Instead of polling after IO
> submission, we can induce an artificial delay, and then poll after that.
> For example, if the IO is presumed to complete in 8 usecs from now, we
> can sleep for 4 usecs, wake up, and then do our polling. This still puts
> a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
> after the IO has completed, it'll happen before. With this hybrid
> scheme, we can achieve big latency reductions while still using the same
> (or less) amount of CPU.
> 
> Signed-off-by: Jens Axboe 

This looks very nice.  I'll need to run some tests before giving the
formal ACK, though.
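
A minimal sketch of the hybrid scheme described above (the constants and the
sleep/poll split are illustrative; the real patch presumably derives the sleep
time from completion-time statistics rather than hard-coding it):

	blk_qc_t cookie = submit_bio(bio);

	/* completion expected ~8 usecs out: sleep for roughly half of it,
	 * so the wakeup happens before, not after, the IO completes */
	usleep_range(4, 4);

	/* then finish with classic polling for the remainder */
	while (!blk_poll(q, cookie))
		cpu_relax();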


Re: [PATCH 1/8] block: add WRITE_BACKGROUND

2016-11-02 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 03:08:44PM -0600, Jens Axboe wrote:
> This adds a new request flag, REQ_BACKGROUND, that callers can use to
> tell the block layer that this is background (non-urgent) IO.

The subject still says WRITE_BACKGROUND :)

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCHSET] block: IO polling improvements

2016-11-02 Thread Jens Axboe

On 11/02/2016 08:51 AM, Christoph Hellwig wrote:

On Tue, Nov 01, 2016 at 03:05:21PM -0600, Jens Axboe wrote:

This builds on top of Christoph's simplified bdev O_DIRECT code,
posted earlier today [1].


Btw, can you create a stable branch just for the
"block: add bio_iov_iter_get_pages()" so that we can use it as
a baseline for the iomap dio work?


Sure

--
Jens Axboe



Re: [PATCH 1/4] block: add scalable completion tracking of requests

2016-11-02 Thread Christoph Hellwig
On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.

What is the use case for the legacy request tracking?


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-02 Thread Christoph Hellwig
On Wed, Nov 02, 2016 at 11:32:04AM +0300, Kirill A. Shutemov wrote:
> Yes, the buffer_head list doesn't scale. That's the main reason (along with 4)
> why syscall-based IO sucks. We spend a lot of time looking for the desired
> block.
> 
> We need to switch to some other data structure for storing buffer_heads.
> Is there a reason why we have a list there in the first place?
> Why not just an array?
> 
> I will look into it, but this sounds like a separate infrastructure change
> project.

We're working on it with the iomap code.  And yes, it's really something
that needs to be done before we can consider the THP patches.  Same for
the biovec thing where we really need the > PAGE_SIZE bio_vecs first.


Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments

2016-11-02 Thread Mike Snitzer
On Wed, Nov 02 2016 at  3:56am -0400,
Ming Lei  wrote:

> On Wed, Nov 2, 2016 at 11:09 AM, Kent Overstreet
>  wrote:
> > On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
> >> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
> >> > Avoid accessing .bi_vcnt directly, because it may no longer be what
> >> > the driver expects once multipage bvecs are supported.
> >> >
> >> > Signed-off-by: Ming Lei 
> >>
> >> It would be really nice to have a comment in the code why it's
> >> even checking for multiple segments.
> >
> > Or ideally refactor the code to not care about multiple segments at all.
> 
> The check on 'bio->bi_vcnt == 1' was introduced in commit de3ec86dff160 ("dm:
> don't start current request if it would've merged with the previous"), which
> fixed a performance issue. [1]
> 
> It looks like the idea of the patch is to delay dispatching the rq if it
> would've merged with the previous request and the rq is small (single bvec).
> I guess the motivation is that the delay increases the chance of merging.
> 
> But why does the code check 'bio->bi_vcnt == 1'? Once the bio is
> submitted, .bi_vcnt isn't changed any more, and merging doesn't change
> it either. So should the check have been on blk_rq_bytes(rq) instead?
> 
> Mike, please correct me if my understanding is wrong.
> 
> 
> [1] https://www.redhat.com/archives/dm-devel/2015-March/msg00014.html

The patch was labored over for quite a while and is based on suggestions I
got from Jens when discussing a very problematic aspect of old
.request_fn request-based DM performance for a multi-threaded (64
threads) sequential IO benchmark (vdbench IIRC).  The issue was reported
by NetApp.

The patch in question fixed the lack of merging that was seen with this
interleaved sequential IO benchmark.  The lack of merging was made worse
if a DM multipath device had more underlying paths (e.g. 4 instead of 2).

As for your question about using blk_rq_bytes(rq) vs 'bio->bi_vcnt == 1'
... I'm not sure how that would be a suitable replacement.  But it has been a
while since I've delved into these block core merge details of the old
.request_fn path, but _please_ don't change the logic of this code simply
because it is proving itself to be problematic for your current
patchset's cleanliness.

Mike


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-02 Thread Kirill A. Shutemov
On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > If I understand the motivation right, it is mostly about being able to mmap
> > > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > page cache, we don't even have to read them all unless we come from PMD
> > > fault path.
> > 
> > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > per-hugepage, one common list of buffer heads...
> > 
> > PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
> > it otherwise doesn't make sense), and handling it differently for file-THP
> > is a nightmare from a maintenance POV.
> 
> But the complexity of two different page sizes for page cache and *each*
> filesystem that wants to support it does not make the maintenance easy
> either.

I think with time we can make small pages just a subcase of huge pages.
And some generalization can be made once more than one filesystem with
backing storage adopts huge pages.

> So I'm not convinced that using the same rules for anon-THP and
> file-THP is a clear win.

We already have file-THP with the same rules: tmpfs. Backing storage is
what changes the picture.

> And if we have these two options neither of which has negligible
> maintenance cost, I'd also like to see more justification for why it is
> a good idea to have file-THP for normal filesystems. Do you have any
> performance numbers that show it is a win under some realistic workload?

See below. As usual with huge pages, they make sense when you have plenty of
memory.

> I'd also note that having PMD-sized pages has some obvious disadvantages as
> well:
>
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck.

Yes, the buffer_head list doesn't scale. That's the main reason (along with 4)
why syscall-based IO sucks. We spend a lot of time looking for the desired
block.

We need to switch to some other data structure for storing buffer_heads.
Is there a reason why we have a list there in the first place?
Why not just an array?

I will look into it, but this sounds like a separate infrastructure change
project.
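
To illustrate the lookup cost: with the per-page list, finding the buffer_head
that covers a given block means walking b_this_page, which for a PMD-sized page
with 1k blocks is up to 2048 hops (a sketch only; page_size here means the
size of the possibly-huge page):

	/* block index within the (possibly huge) page, from the file position */
	unsigned int block_in_page = (pos & (page_size - 1)) >> inode->i_blkbits;
	struct buffer_head *bh = page_buffers(page);

	/* O(blocks-per-page) walk of the circular b_this_page list */
	while (block_in_page--)
		bh = bh->b_this_page;

	/* an array of buffer_heads would make this a single index instead:
	 * bh = &bhs[block_in_page]; */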

> 2) PMD-sized pages result in increased space & memory usage.

Space? Do you mean disk space? Not really: we still don't write beyond
i_size or into holes.

Behaviour wrt holes may change with mmap() IO as we have less
granularity, but the same difference can already be seen between different
architectures: 4k vs. 64k base page size.

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

I didn't see this in profiles. And xfstests looks fine. I probably need to
run them with 1k blocks once again.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Yes, this needs to be addressed to restore syscall IO performance and take
advantage of huge pages.

But again, it's an infrastructure change that would likely affect
interface between VFS and filesystems. It deserves a separate patchset.

> All this would be a non-issue (well, except 2 I guess) if we just didn't
> expose filesystems to the fact that something like file-THP exists.

The numbers below were generated with fio. The working set is relatively small,
so it fits into the page cache, and the write set doesn't hit dirty_ratio.

I think the mmap performance should be enough to justify initial inclusion
of an experimental feature: it is useful for workloads that target mmap() IO.
It will take time for the feature to mature anyway.

Configuration:
 - 2x E5-2697v2, 64G RAM;
 - INTEL SSDSC2CW24;
 - IO request size is 4k;
 - 8 processes, 512MB data set each;

Workload        read/write   baseline   stddev   huge=always   stddev    change

sync-read       read         21439.00   348.14      20297.33   259.62    -5.33%
sync-write      write         6833.20   147.08       3630.13    52.86   -46.88%
sync-readwrite  read          4377.17    17.53       2366.33    19.52   -45.94%
                write         4378.50    17.83       2365.80    19.94   -45.97%
sync-randread   read          5491.20    66.66      14664.00   288.29   167.05%
sync-randwrite  write         6396.13    98.79       2035.80
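
For reference, a fio job along these lines would approximate the setup above
(reconstructed from the configuration bullets, not the job file actually used;
the target directory is a placeholder and the other workloads just vary 'rw='):

[global]
# syscall read/write matching the sync-* rows: 4k requests, 512MB per process, 8 processes
ioengine=sync
bs=4k
size=512m
numjobs=8
# placeholder path
directory=/mnt/test

[sync-randread]
rw=randread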

Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments

2016-11-02 Thread Ming Lei
On Wed, Nov 2, 2016 at 11:09 AM, Kent Overstreet
 wrote:
> On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
>> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
>> > Avoid accessing .bi_vcnt directly, because it may no longer be what
>> > the driver expects once multipage bvecs are supported.
>> >
>> > Signed-off-by: Ming Lei 
>>
>> It would be really nice to have a comment in the code why it's
>> even checking for multiple segments.
>
> Or ideally refactor the code to not care about multiple segments at all.

The check on 'bio->bi_vcnt == 1' was introduced in commit de3ec86dff160 ("dm:
don't start current request if it would've merged with the previous"), which
fixed a performance issue. [1]

It looks like the idea of the patch is to delay dispatching the rq if it
would've merged with the previous request and the rq is small (single bvec).
I guess the motivation is that the delay increases the chance of merging.

But why does the code check 'bio->bi_vcnt == 1'? Once the bio is
submitted, .bi_vcnt isn't changed any more, and merging doesn't change
it either. So should the check have been on blk_rq_bytes(rq) instead?

Mike, please correct me if my understanding is wrong.


[1] https://www.redhat.com/archives/dm-devel/2015-March/msg00014.html


thanks,
Ming Lei