Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
On Wed, Nov 2, 2016 at 10:24 PM, Mike Snitzer wrote:
> On Wed, Nov 02 2016 at 3:56am -0400,
> Ming Lei wrote:
>
>> On Wed, Nov 2, 2016 at 11:09 AM, Kent Overstreet wrote:
>> > On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
>> >> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
>> >> > Avoid to access .bi_vcnt directly, because it may be not what
>> >> > the driver expected any more after supporting multipage bvec.
>> >> >
>> >> > Signed-off-by: Ming Lei
>> >>
>> >> It would be really nice to have a comment in the code why it's
>> >> even checking for multiple segments.
>> >
>> > Or ideally refactor the code to not care about multiple segments at all.
>>
>> The check on 'bio->bi_vcnt == 1' is introduced in commit de3ec86dff160(dm:
>> don't start current request if it would've merged with the previous), which
>> fixed one performance issue.[1]
>>
>> Looks the idea of the patch is to delay dispatching the rq if it
>> would've merged with previous request and the rq is small(single bvec).
>> I guess the motivation is to try to increase chance of merging with the
>> delay.
>>
>> But why does the code check on 'bio->bi_vcnt == 1'? Once the bio is
>> submitted, .bi_vcnt isn't changed any more and merging doesn't change
>> it too. So should the check have been on blk_rq_bytes(rq)?
>>
>> Mike, please correct me if my understanding is wrong.
>>
>> [1] https://www.redhat.com/archives/dm-devel/2015-March/msg00014.html
>
> The patch was labored over for quite a while and is based on suggestions I
> got from Jens when discussing a very problematic aspect of old
> .request_fn request-based DM performance for a multi-threaded (64
> threads) sequential IO benchmark (vdbench IIRC). The issue was reported
> by NetApp.
>
> The patch in question fixed the lack of merging that was seen with this
> interleaved sequential IO benchmark. The lack of merging was made worse
> if a DM multipath device had more underlying paths (e.g. 4 instead of 2).
>
> As for your question, about using blk_rq_bytes(rq) vs 'bio->bi_vcnt == 1'
> .. not sure how that would be a suitable replacement. But it has been a
> while since I've delved into these block core merge details of old

Just last year, which doesn't seem long enough, :-)

> .request_fn but _please_ don't change the logic of this code simply

As I explained before, .bi_vcnt is not changed after the bio is submitted,
nor is it changed by merging, so I think the check is wrong. Could you
explain your original motivation for checking 'bio->bi_vcnt == 1'?

> because it is proving itself to be problematic for your current
> patchset's cleanliness.

Could you explain what is problematic for the cleanliness?

Thanks,
Ming Lei
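For readers following the exchange above, the disagreement is about how a
"small request" should be spelled in the delayed-dispatch check. A purely
illustrative sketch of the two forms (the helper names and the byte
threshold are made up here; this is not the actual drivers/md/dm.c code):

	/* Existing form: "small" means the request's first bio has a single bvec. */
	static bool rq_is_small_current(struct request *rq)
	{
		return rq->bio && rq->bio->bi_vcnt == 1;
	}

	/* Form Ming suggests: "small" means the request carries few bytes. */
	static bool rq_is_small_suggested(struct request *rq)
	{
		return blk_rq_bytes(rq) <= PAGE_SIZE;	/* threshold is illustrative */
	}

The rest of the condition (whether the request would have merged with the
previously dispatched one) is unaffected by which of these two tests is used.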
Re: [PATCH 5/8] block: add code to track actual device queue depth
On Wed, Nov 02, 2016 at 09:02:08AM -0600, Jens Axboe wrote:
> On 11/02/2016 08:59 AM, Christoph Hellwig wrote:
> > On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
> > > For blk-mq, ->nr_requests does track queue depth, at least at init
> > > time. But for the older queue paths, it's simply a soft setting.
> > > On top of that, it's generally larger than the hardware setting
> > > on purpose, to allow backup of requests for merging.
> > >
> > > Fill a hole in struct request with a 'queue_depth' member, that
> >
> > That would be struct request_queue..
>
> Good catch, will fix.
>
> > > /**
> > > + * blk_set_queue_depth - tell the block layer about the device queue depth
> > > + * @q:		the request queue for the device
> > > + * @depth:	queue depth
> > > + *
> > > + */
> > > +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> > > +{
> > > +	q->queue_depth = depth;
> > > +}
> > > +EXPORT_SYMBOL(blk_set_queue_depth);
> >
> > Do we really need this wrapper?
>
> Not necessarily, just seems like a nicer API than manually setting the
> field. Not a big deal to me, though.

A lot of block code uses this kind of setter, so I _think_ it complies with
the overall style. But I have no strong opinion on this either...

Byte,
	Johannes

--
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
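The style question being debated here comes down to which of these two
spellings a driver writes (hypothetical call site; 'q' and 'hw_depth' are
stand-ins for whatever the driver has at hand):

	/* with the wrapper from the quoted patch: */
	blk_set_queue_depth(q, hw_depth);

	/* without it, poking the field directly: */
	q->queue_depth = hw_depth;

Both end up doing the same store; the wrapper only gives the block layer a
named entry point, consistent with the other blk_queue_* setters.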
Re: [PATCH 1/8] block: add WRITE_BACKGROUND
On Wed, Nov 02 2016, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:08:44PM -0600, Jens Axboe wrote:
> > This adds a new request flag, REQ_BACKGROUND, that callers can use to
> > tell the block layer that this is background (non-urgent) IO.
>
> The subject still says WRITE_BACKGROUND :)

Gah - will fix that up.

> Otherwise looks fine:
>
> Reviewed-by: Christoph Hellwig

Added, thanks.

--
Jens Axboe
Re: [PATCH 0/7] LightNVM patchset V2
On 10/27/2016 08:01 PM, Javier González wrote:
> V2: Rebase patches on Matias' for-next
>
> Javier González (7):
>   lightnvm: enable to send hint to erase command
>   lightnvm: do not decide on device blocks
>   lightnvm: manage block list on LUN owner
>   lightnvm: drop reserve and release LUN callbacks
>   lightnvm: export set bad block table
>   lightnvm: add ECC error codes
>   lightnvm: rrpc: split bios of size > 256kb
>
>  drivers/lightnvm/core.c      |  41 --
>  drivers/lightnvm/gennvm.c    | 180 +--
>  drivers/lightnvm/gennvm.h    |  19 ++---
>  drivers/lightnvm/rrpc.c      | 128 --
>  drivers/lightnvm/rrpc.h      |   4 +
>  drivers/lightnvm/sysblk.c    |  33 ++--
>  drivers/nvme/host/lightnvm.c |   1 +
>  include/linux/lightnvm.h     |  52 +
>  8 files changed, 305 insertions(+), 153 deletions(-)
>

Thanks Javier. I've picked 1-2 and 5-7 to my for-next.
Re: [PATCH v5 0/14] Fix race conditions related to stopping block layer queues
Hi Bart,

this whole series looks great to me:

Reviewed-by: Christoph Hellwig
Re: [PATCH 2/4] block: move poll code to blk-mq
Looks fine,

Reviewed-by: Christoph Hellwig
Re: [PATCH 5/8] block: add code to track actual device queue depth
On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
> For blk-mq, ->nr_requests does track queue depth, at least at init
> time. But for the older queue paths, it's simply a soft setting.
> On top of that, it's generally larger than the hardware setting
> on purpose, to allow backup of requests for merging.
>
> Fill a hole in struct request with a 'queue_depth' member, that

That would be struct request_queue..

> /**
> + * blk_set_queue_depth - tell the block layer about the device queue depth
> + * @q:		the request queue for the device
> + * @depth:	queue depth
> + *
> + */
> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> +{
> +	q->queue_depth = depth;
> +}
> +EXPORT_SYMBOL(blk_set_queue_depth);

Do we really need this wrapper?

> +++ b/drivers/scsi/scsi.c
> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
> 		wmb();
> 	}
>
> +	if (sdev->request_queue)
> +		blk_set_queue_depth(sdev->request_queue, depth);
> +
> 	return sdev->queue_depth;

Can we kill the scsi_device queue_depth member and just always use the
request_queue one?
Re: [PATCH 5/8] block: add code to track actual device queue depth
On 11/02/2016 08:59 AM, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
>> For blk-mq, ->nr_requests does track queue depth, at least at init
>> time. But for the older queue paths, it's simply a soft setting.
>> On top of that, it's generally larger than the hardware setting
>> on purpose, to allow backup of requests for merging.
>>
>> Fill a hole in struct request with a 'queue_depth' member, that
>
> That would be struct request_queue..

Good catch, will fix.

>> /**
>> + * blk_set_queue_depth - tell the block layer about the device queue depth
>> + * @q:		the request queue for the device
>> + * @depth:	queue depth
>> + *
>> + */
>> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
>> +{
>> +	q->queue_depth = depth;
>> +}
>> +EXPORT_SYMBOL(blk_set_queue_depth);
>
> Do we really need this wrapper?

Not necessarily, just seems like a nicer API than manually setting the
field. Not a big deal to me, though.

>> +++ b/drivers/scsi/scsi.c
>> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
>> 		wmb();
>> 	}
>>
>> +	if (sdev->request_queue)
>> +		blk_set_queue_depth(sdev->request_queue, depth);
>> +
>> 	return sdev->queue_depth;
>
> Can we kill the scsi_device queue_depth member and just always use the
> request_queue one?

Yes. I'd prefer that we do that through the SCSI tree, once this is in
though.

--
Jens Axboe
Re: [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
On Tue, Nov 01, 2016 at 03:08:47PM -0600, Jens Axboe wrote:
> Note in the bdi_writeback structure whenever a task ends up sleeping
> waiting for progress. We can use that information in the lower layers
> to increase the priority of writes.

Do we need to care about atomicity of multiple threads updating the value?

Otherwise this looks fine.
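The idea from the changelog could look roughly like the fragment below,
sitting around the sleep in balance_dirty_pages(). This is a hedged sketch
only; the field name 'dirty_sleeping' and the pause value are illustrative,
not the actual patch:

	long pause = HZ / 100;		/* illustrative; computed by balance_dirty_pages() */

	wb->dirty_sleeping = 1;		/* assumed field in struct bdi_writeback */
	__set_current_state(TASK_KILLABLE);
	io_schedule_timeout(pause);
	wb->dirty_sleeping = 0;

Christoph's question is about exactly these plain stores: several dirtying
tasks can block on the same bdi_writeback at once, so the updates can race,
although a coarse "someone is waiting" hint may not need to be exact.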
Re: [PATCH 1/4] block: add scalable completion tracking of requests
On 11/02/2016 08:52 AM, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
>> For legacy block, we simply track them in the request queue. For
>> blk-mq, we track them on a per-sw queue basis, which we can then
>> sum up through the hardware queues and finally to a per device
>> state.
>
> what is the use case for the legacy request tracking?

Buffered writeback code uses the same base. Additionally, it could replace
some user space tracking for latency outliers that people are currently
running.

--
Jens Axboe
Re: [PATCH 3/8] writeback: mark background writeback as such
Looks fine,

Reviewed-by: Christoph Hellwig
Re: [PATCH 2/8] writeback: add wbc_to_write_flags()
On Tue, Nov 01, 2016 at 03:08:45PM -0600, Jens Axboe wrote:
> Add wbc_to_write_flags(), which returns the write modifier flags to use,
> based on a struct writeback_control. No functional changes in this
> patch, but it prepares us for factoring other wbc fields for write type.
>
> Signed-off-by: Jens Axboe
> Reviewed-by: Jan Kara

Looks fine,

Reviewed-by: Christoph Hellwig
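Going only by the description above, such a helper could take roughly the
following shape. This is a minimal sketch, not the exact patch; the real
helper may map more writeback_control fields onto flags:

	static inline int wbc_to_write_flags(struct writeback_control *wbc)
	{
		/* Integrity writeback wants synchronous writes; everything
		 * else gets no extra modifier in this simplified sketch. */
		if (wbc->sync_mode == WB_SYNC_ALL)
			return WRITE_SYNC;

		return 0;
	}

The point of factoring it out is that later patches can derive additional
modifiers (e.g. a background hint) from the same wbc in one place.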
Re: [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
On Tue, Nov 01, 2016 at 03:05:24PM -0600, Jens Axboe wrote:
> This patch enables a hybrid polling mode. Instead of polling after IO
> submission, we can induce an artificial delay, and then poll after that.
> For example, if the IO is presumed to complete in 8 usecs from now, we
> can sleep for 4 usecs, wake up, and then do our polling. This still puts
> a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
> after the IO has completed, it'll happen before. With this hybrid
> scheme, we can achieve big latency reductions while still using the same
> (or less) amount of CPU.
>
> Signed-off-by: Jens Axboe

This looks very nice. I'll need to run some tests before giving the formal
ACK, though.
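The scheme in the changelog reads roughly like the sketch below. It is
purely illustrative, not the blk-mq implementation: the 8/4 usec numbers
are the example from the changelog, and rq_completed() is a hypothetical
completion check:

	static void hybrid_poll_sketch(struct request *rq)
	{
		ktime_t delay = ns_to_ktime(4000);	/* half of the presumed 8 usec */

		/* Phase one: give up the CPU while the IO is still in flight. */
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_hrtimeout(&delay, HRTIMER_MODE_REL);

		/* Phase two: spin-poll for the (hopefully imminent) completion. */
		while (!rq_completed(rq))		/* rq_completed() is hypothetical */
			cpu_relax();
	}

The win comes from spending most of the wait asleep and only burning CPU in
the short window right before the completion is expected.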
Re: [PATCH 1/8] block: add WRITE_BACKGROUND
On Tue, Nov 01, 2016 at 03:08:44PM -0600, Jens Axboe wrote:
> This adds a new request flag, REQ_BACKGROUND, that callers can use to
> tell the block layer that this is background (non-urgent) IO.

The subject still says WRITE_BACKGROUND :)

Otherwise looks fine:

Reviewed-by: Christoph Hellwig
Re: [PATCHSET] block: IO polling improvements
On 11/02/2016 08:51 AM, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:05:21PM -0600, Jens Axboe wrote:
>> This builds on top of Christophs simplified bdev O_DIRECT code, posted
>> earlier today [1].
>
> Btw, can you create a stable branch just for the "block: add
> bio_iov_iter_get_pages()" so that we can use it as a baseline for the
> iomap dio work?

Sure

--
Jens Axboe
Re: [PATCH 1/4] block: add scalable completion tracking of requests
On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.

what is the use case for the legacy request tracking?
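As a rough mental model of the per-sw-queue scheme in the quoted changelog
(illustrative only; the struct layout and names are invented here, and the
real patch tracks more than a running sum):

	struct latency_stat {
		u64 nr_samples;
		u64 total_ns;	/* enough for a mean; the real code keeps more */
	};

	/* Each software queue updates only its own bucket on completion. */
	static void stat_add(struct latency_stat *s, u64 ns)
	{
		s->nr_samples++;
		s->total_ns += ns;
	}

	/* A device-wide view is produced by folding the per-queue buckets. */
	static void stat_sum(struct latency_stat *dst, const struct latency_stat *src)
	{
		dst->nr_samples += src->nr_samples;
		dst->total_ns += src->total_ns;
	}

Each completion touches only its local bucket, so there is no shared state
to contend on; the device-wide numbers are assembled lazily when read.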
Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()
On Wed, Nov 02, 2016 at 11:32:04AM +0300, Kirill A. Shutemov wrote:
> Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> why syscall-based IO sucks. We spend a lot of time looking for desired
> block.
>
> We need to switch to some other data structure for storing buffer_heads.
> Is there a reason why we have list there in first place?
> Why not just array?
>
> I will look into it, but this sounds like a separate infrastructure change
> project.

We're working on it with the iomap code. And yes, it's really something
that needs to be done before we can consider the THP patches. Same for the
biovec thing where we really need the > PAGE_SIZE bio_vecs first.
Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
On Wed, Nov 02 2016 at 3:56am -0400,
Ming Lei wrote:

> On Wed, Nov 2, 2016 at 11:09 AM, Kent Overstreet wrote:
> > On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
> >> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
> >> > Avoid to access .bi_vcnt directly, because it may be not what
> >> > the driver expected any more after supporting multipage bvec.
> >> >
> >> > Signed-off-by: Ming Lei
> >>
> >> It would be really nice to have a comment in the code why it's
> >> even checking for multiple segments.
> >
> > Or ideally refactor the code to not care about multiple segments at all.
>
> The check on 'bio->bi_vcnt == 1' is introduced in commit de3ec86dff160(dm:
> don't start current request if it would've merged with the previous), which
> fixed one performance issue.[1]
>
> Looks the idea of the patch is to delay dispatching the rq if it
> would've merged with previous request and the rq is small(single bvec).
> I guess the motivation is to try to increase chance of merging with the
> delay.
>
> But why does the code check on 'bio->bi_vcnt == 1'? Once the bio is
> submitted, .bi_vcnt isn't changed any more and merging doesn't change
> it too. So should the check have been on blk_rq_bytes(rq)?
>
> Mike, please correct me if my understanding is wrong.
>
> [1] https://www.redhat.com/archives/dm-devel/2015-March/msg00014.html

The patch was labored over for quite a while and is based on suggestions I
got from Jens when discussing a very problematic aspect of old
.request_fn request-based DM performance for a multi-threaded (64
threads) sequential IO benchmark (vdbench IIRC). The issue was reported
by NetApp.

The patch in question fixed the lack of merging that was seen with this
interleaved sequential IO benchmark. The lack of merging was made worse
if a DM multipath device had more underlying paths (e.g. 4 instead of 2).

As for your question, about using blk_rq_bytes(rq) vs 'bio->bi_vcnt == 1'
.. not sure how that would be a suitable replacement. But it has been a
while since I've delved into these block core merge details of old
.request_fn but _please_ don't change the logic of this code simply
because it is proving itself to be problematic for your current
patchset's cleanliness.

Mike
Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()
On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > If I understand the motivation right, it is mostly about being able to
> > > mmap PMD-sized chunks to userspace. So my naive idea would be that we
> > > could just implement it by allocating PMD sized chunks of pages when
> > > adding pages to page cache, we don't even have to read them all unless
> > > we come from PMD fault path.
> >
> > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > per-hugepage, one common list of buffer heads...
> >
> > PG_dirty and PG_uptodate behaviour inherited from anon-THP (where handling
> > it otherwise doesn't make sense) and handling it differently for file-THP
> > is a nightmare from a maintenance POV.
>
> But the complexity of two different page sizes for page cache and *each*
> filesystem that wants to support it does not make the maintenance easy
> either.

I think with time we can make small pages just a subcase of huge pages.
And some generalization can be made once more than one filesystem with
backing storage adopts huge pages.

> So I'm not convinced that using the same rules for anon-THP and
> file-THP is a clear win.

We already have file-THP with the same rules: tmpfs. Backing storage is
what changes the picture.

> And if we have these two options neither of which has negligible
> maintenance cost, I'd also like to see more justification for why it is
> a good idea to have file-THP for normal filesystems. Do you have any
> performance numbers that show it is a win under some realistic workload?

See below. As usual with huge pages, they make sense when you have plenty
of memory.

> I'd also note that having PMD-sized pages has some obvious disadvantages
> as well:
>
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck.

Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
why syscall-based IO sucks. We spend a lot of time looking for desired
block.

We need to switch to some other data structure for storing buffer_heads.
Is there a reason why we have list there in first place?
Why not just array?

I will look into it, but this sounds like a separate infrastructure change
project.

> 2) PMD-sized pages result in increased space & memory usage.

Space? Do you mean disk space? Not really: we still don't write beyond
i_size or into holes.

Behaviour wrt holes may change with mmap()-IO as we have less granularity,
but the same can be seen just between different architectures: 4k vs. 64k
base page size.

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

I didn't see this in profiles. And xfstests looks fine. I probably need to
run them with 1k blocks once again.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Yes, this needs to be addressed to restore syscall-IO performance and take
advantage of huge pages.
But again, it's an infrastructure change that would likely affect the
interface between VFS and filesystems. It deserves a separate patchset.

> All this would be a non-issue (well, except 2 I guess) if we just didn't
> expose filesystems to the fact that something like file-THP exists.

The numbers below were generated with fio. The working set is relatively
small, so it fits into page cache and the write set doesn't hit
dirty_ratio.

I think the mmap performance should be enough to justify initial inclusion
of an experimental feature: it is useful for workloads that target
mmap()-IO. It will take time to get the feature mature anyway.

Configuration:
 - 2x E5-2697v2, 64G RAM;
 - INTEL SSDSC2CW24;
 - IO request size is 4k;
 - 8 processes, 512MB data set each;

Workload          read/write   baseline   stddev   huge=always   stddev    change
sync-read         read         21439.00   348.14      20297.33   259.62    -5.33%
sync-write        write         6833.20   147.08       3630.13    52.86   -46.88%
sync-readwrite    read          4377.17    17.53       2366.33    19.52   -45.94%
                  write         4378.50    17.83       2365.80    19.94   -45.97%
sync-randread     read          5491.20    66.66      14664.00   288.29   167.05%
sync-randwrite    write         6396.13    98.79       2035.80
Re: [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
On Wed, Nov 2, 2016 at 11:09 AM, Kent Overstreet wrote:
> On Mon, Oct 31, 2016 at 08:29:01AM -0700, Christoph Hellwig wrote:
>> On Sat, Oct 29, 2016 at 04:08:08PM +0800, Ming Lei wrote:
>> > Avoid to access .bi_vcnt directly, because it may be not what
>> > the driver expected any more after supporting multipage bvec.
>> >
>> > Signed-off-by: Ming Lei
>>
>> It would be really nice to have a comment in the code why it's
>> even checking for multiple segments.
>
> Or ideally refactor the code to not care about multiple segments at all.

The check on 'bio->bi_vcnt == 1' is introduced in commit de3ec86dff160(dm:
don't start current request if it would've merged with the previous), which
fixed one performance issue.[1]

Looks the idea of the patch is to delay dispatching the rq if it
would've merged with previous request and the rq is small(single bvec).
I guess the motivation is to try to increase chance of merging with the
delay.

But why does the code check on 'bio->bi_vcnt == 1'? Once the bio is
submitted, .bi_vcnt isn't changed any more and merging doesn't change
it too. So should the check have been on blk_rq_bytes(rq)?

Mike, please correct me if my understanding is wrong.

[1] https://www.redhat.com/archives/dm-devel/2015-March/msg00014.html

thanks,
Ming Lei