Re: nvme multipath support V2

2017-09-20 Thread Johannes Thumshirn
On Wed, Sep 20, 2017 at 04:55:46PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 20, 2017 at 01:09:59PM +0200, Johannes Thumshirn wrote:
> > Using your patchset and doing the same double connect trick I get the same
> > two block devices of course.
> > 
> > How do I connect using both paths?
> 
> Same as above.  But now in addition to the /dev/nvmeXnY devices you
> should also see a /dev/nvme/nsZ device.  The sysfs entry for it shows
> which paths it belongs to.

Ah OK, maybe we should update nvme-cli to recognize it as well then. I just
looked at the output of nvme list and obviously didn't find it.

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 8/9] nvme: track shared namespaces

2017-09-20 Thread Johannes Thumshirn
On Wed, Sep 20, 2017 at 04:54:36PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 20, 2017 at 10:36:43AM +0200, Johannes Thumshirn wrote:
> > Being one of the persons who has to backport a lot of NVMe code to older
> > kernels I'm not a huge fan of renaming nvme_ns.
> 
> The churn is my main worry.  Well, and that I don't have a really good
> name for what currently is nvme_ns either :)
> 
> > That said, I don't have a better name for nvme_ns_head (yet) but I'll try to
> > think of one as well. OTOH looking at nvme_ns_head it actually is the list
> > head of the nvme_ns list.
> 
> But head also has connotations in the SAN world.  Maybe nvme_ns_chain?

I know, that's why I didn't really like it all too much in the first place
either. As for nvme_ns_chain, it's not really a chain (the list itself is the
chain; the structure really is the list head...), but I suck at naming things.

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()

2017-09-20 Thread Martin K. Petersen

Christoph,

> I'm really not sure we should check for -EREMOTEIO specifically, but
> Martin, who is more familiar with the SCSI code, might be able to
> correct me. I'd feel safer about checking for any error, which is
> what the old code did.
>
> Except for that the patch looks fine to me.

We explicitly return -EREMOTEIO when the device reports ILLEGAL
REQUEST. But I agree that we should fall back to manually zeroing for
any error.
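
A minimal sketch of the fallback shape being agreed on here, with
hypothetical helper names (zeroout_try_offload() and
zeroout_write_zero_pages() are illustrative stand-ins, not the actual
block-layer functions):

	int blkdev_zeroout_sketch(struct block_device *bdev, sector_t sector,
				  sector_t nr_sects, gfp_t gfp, unsigned flags)
	{
		/* first try the offloaded path (WRITE SAME / WRITE ZEROES) */
		int ret = zeroout_try_offload(bdev, sector, nr_sects, gfp);

		/* fall back to explicit zero writes on *any* error, not only
		 * -EREMOTEIO, unless the caller opted out of the fallback */
		if (ret && !(flags & BLKDEV_ZERO_NOFALLBACK))
			ret = zeroout_write_zero_pages(bdev, sector, nr_sects, gfp);
		return ret;
	}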

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] block: consider merge of segments when merge bio into rq

2017-09-20 Thread jianchao.wang


On 09/21/2017 09:29 AM, Christoph Hellwig wrote:
> So the check change here looks good to me.
> 
> I don't like the duplicate code, can you look into sharing
> the new segment checks between the two functions and the existing
> instance in ll_merge_requests_fn by passing say two struct bio *bio1
> and struct bio *bio2 pointer instead of using req->bio and req->biotail?
> 
> Also please include the information that you posted in the reply to
> the other patch in the description for this one.
> 
I have looked into ll_back/front_merge_fn() and ll_merge_requests_fn().
The segment-check code fragments look similar but are not easy to unify.
We could unify the calculation of the seg_size and contig variables, but
not the places where these two are used. In particular, the calculation of
total_phys_segments and the 'nr_phys_segments == 1' check cannot avoid the
rq variable, and the update of bi_seg_front/back_size is also very tricky.
If these code fragments were forced together, the code could become hard
to read.

So please refer to the V2 patch, which contains more detail about the issue
and the result.
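
A rough sketch of the shared check Christoph asks for, assuming only the
boundary-bio part can be unified (the helper name is illustrative, not
from the patch):

	/*
	 * Shared between ll_back_merge_fn(), ll_front_merge_fn() and
	 * ll_merge_requests_fn(): look only at the two bios that meet at
	 * the merge boundary instead of at req->bio/req->biotail.
	 */
	static bool ll_boundary_segment_mergeable(struct request_queue *q,
						  struct bio *prev,
						  struct bio *next,
						  unsigned int *seg_size)
	{
		*seg_size = prev->bi_seg_back_size + next->bi_seg_front_size;
		return blk_phys_contig_segment(q, prev, next);
	}

The rq-only parts (the total_nr_phys_segs accounting and the
nr_phys_segments == 1 front-size update) would stay in the callers, which
is exactly the duplication that is hard to remove.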

Thanks
Jianchao


Re: [PATCH] block: consider merge of segments when merge bio into rq

2017-09-20 Thread jianchao.wang


On 09/21/2017 09:29 AM, Christoph Hellwig wrote:
> So the check change here looks good to me.
> 
> I don't like the duplicate code, can you look into sharing
> the new segment checks between the two functions and the existing
> instance in ll_merge_requests_fn by passing say two struct bio *bio1
> and struct bio *bio2 pointer instead of using req->bio and req->biotail?
> 
> Also please include the information that you posted in the reply to
> the other patch in the description for this one.
> 

I had just sent a new version containing more detail before I saw this
mail. I will send out another new version that applies your advice later.
Thanks for your kind and fair comments.

Jianchao


[V2] block: consider merge of segments when merge bio into rq

2017-09-20 Thread Jianchao Wang
When accounting nr_phys_segments while merging bios into a rq, only
segment merging within an individual bio is considered, not merging
across all the bios in the rq. This makes the nr_phys_segments of the
rq bigger than the real value when the segments of the bios in the rq
are contiguous and mergeable. The nr_phys_segments of the rq will then
exceed the max_segments of the queue and stop merging, while the
sectors of the rq may still be far below the queue's max_sectors.

In practice, merging stops due to the max_segments limit even though
the segments in the rq are contiguous and mergeable, as seen with an
mkfs.ext4 workload on my machine. This can hurt the performance of
sequential operations.

To fix it, consider segment merging when accounting the
nr_phys_segments of the rq while merging a bio into it: decrease the
nr_phys_segments of the rq by 1 when the adjacent segments in the bio
and the rq are contiguous and mergeable. This yields more complete
merging and better performance for sequential operations. In addition,
it avoids wasting scatterlist entries.

With my local mkfs.ext4 workload, the final size of the issued rqs
rises from 168 sectors (max_segments is 168) to 2560 sectors
(max_sectors_kb is 1280).

V2 : Add more detail to elaborate how this issue was found and the
result after applying the patch.

Signed-off-by: Jianchao Wang 
---
 block/blk-merge.c | 98 ---
 1 file changed, 64 insertions(+), 34 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 14b6e37..b2f54fd 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -472,54 +472,60 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
-static inline int ll_new_hw_segment(struct request_queue *q,
-   struct request *req,
-   struct bio *bio)
-{
-   int nr_phys_segs = bio_phys_segments(q, bio);
-
-   if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
-   goto no_merge;
-
-   if (blk_integrity_merge_bio(q, req, bio) == false)
-   goto no_merge;
-
-   /*
-* This will form the start of a new hw segment.  Bump both
-* counters.
-*/
-   req->nr_phys_segments += nr_phys_segs;
-   return 1;
-
-no_merge:
-   req_set_nomerge(q, req);
-   return 0;
-}
-
 int ll_back_merge_fn(struct request_queue *q, struct request *req,
 struct bio *bio)
 {
+   unsigned int seg_size;
+   int total_nr_phys_segs;
+   bool contig;
+
if (req_gap_back_merge(req, bio))
return 0;
if (blk_integrity_rq(req) &&
integrity_req_gap_back_merge(req, bio))
return 0;
if (blk_rq_sectors(req) + bio_sectors(bio) >
-   blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
-   req_set_nomerge(q, req);
-   return 0;
-   }
+   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
+   goto no_merge;
+
if (!bio_flagged(req->biotail, BIO_SEG_VALID))
blk_recount_segments(q, req->biotail);
if (!bio_flagged(bio, BIO_SEG_VALID))
blk_recount_segments(q, bio);
 
-   return ll_new_hw_segment(q, req, bio);
+   if (blk_integrity_merge_bio(q, req, bio) == false)
+   goto no_merge;
+
+   seg_size = req->biotail->bi_seg_back_size + bio->bi_seg_front_size;
+   total_nr_phys_segs = req->nr_phys_segments + bio_phys_segments(q, bio);
+
+   contig = blk_phys_contig_segment(q, req->biotail, bio);
+   if (contig)
+   total_nr_phys_segs--;
+
+   if (unlikely(total_nr_phys_segs > queue_max_segments(q)))
+   goto no_merge;
+
+   if (contig) {
+   if (req->nr_phys_segments == 1)
+   req->bio->bi_seg_front_size = seg_size;
+   if (bio->bi_phys_segments == 1)
+   bio->bi_seg_back_size = seg_size;
+   }
+   req->nr_phys_segments = total_nr_phys_segs;
+   return 1;
+
+no_merge:
+   req_set_nomerge(q, req);
+   return 0;
 }
 
 int ll_front_merge_fn(struct request_queue *q, struct request *req,
  struct bio *bio)
 {
+   unsigned int seg_size;
+   int total_nr_phys_segs;
+   bool contig;
 
if (req_gap_front_merge(req, bio))
return 0;
@@ -527,16 +533,40 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
integrity_req_gap_front_merge(req, bio))
return 0;
if (blk_rq_sectors(req) + bio_sectors(bio) >
-   blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector)) {
-   req_set_nomerge(q, req);
-   return 0;
-   }
+   blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector))
+   goto no_merge;
+
if (!bio_flagged(bio, BIO_SEG_VALID))
blk_recount_segments(q, bio);
if (!bio_flagged(req->bio, 

Re: [PATCH] block: consider merge of segments when merge bio into rq

2017-09-20 Thread Christoph Hellwig
So the check change here looks good to me.

I don't like the duplicate code, can you look into sharing
the new segment checks between the two functions and the existing
instance in ll_merge_requests_fn by passing say two struct bio *bio1
and struct bio *bio2 pointer instead of using req->bio and req->biotail?

Also please include the information that you posted in the reply to
the other patch in the description for this one.


Re: [PATCH 9/9] nvme: implement multipath access to nvme subsystems

2017-09-20 Thread Christoph Hellwig
On Wed, Sep 20, 2017 at 06:58:22PM -0400, Keith Busch wrote:
> > +   sprintf(head->disk->disk_name, "nvme/ns%d", head->instance);
> 
> Naming it 'nvme/ns<#>', kobject_set_name_vargs is going to change that
> '/' into a '!', so the sysfs entry is named 'nvme!ns<#>'. Not a big
> deal I suppose, but I just thought it looked odd since '!' has special
> meaning in shells.

I noticed the odd renaming in sysfs and thought about getting rid
of the /dev/nvme/ directory.  I just need to come up with a good
name for the device nodes - the name can't contain /dev/nvme* as
nvme-cli would break if it sees files that start with nvme in /dev/.

/dev/nvmN ?  
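
A small self-contained illustration of the renaming Keith points out
(kobject_name_fixup() below only mimics what kobject_set_name_vargs()
does to '/'; it is not the kernel function):

	#include <stdio.h>

	/* stand-in: '/' is not allowed in a kobject name, so the kernel
	 * substitutes '!' when it builds the sysfs entry */
	static void kobject_name_fixup(char *name)
	{
		for (char *s = name; *s; s++)
			if (*s == '/')
				*s = '!';
	}

	int main(void)
	{
		char disk_name[32];

		snprintf(disk_name, sizeof(disk_name), "nvme/ns%d", 0);
		printf("device node: /dev/%s\n", disk_name); /* /dev/nvme/ns0 */
		kobject_name_fixup(disk_name);
		printf("sysfs name:  %s\n", disk_name);      /* nvme!ns0 */
		return 0;
	}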


Re: [PATCH 9/9] nvme: implement multipath access to nvme subsystems

2017-09-20 Thread Keith Busch
On Mon, Sep 18, 2017 at 04:14:53PM -0700, Christoph Hellwig wrote:

This is awesome! Looks great, just a minor comment:

> + sprintf(head->disk->disk_name, "nvme/ns%d", head->instance);

Naming it 'nvme/ns<#>', kobject_set_name_vargs is going to change that
'/' into a '!', so the sysfs entry is named 'nvme!ns<#>'. Not a big
deal I suppose, but I just thought it looked odd since '!' has special
meaning in shells.

Otherwise, this is looking really solid, and tests well on my
single-ported NVMe. I had some trouble getting dual-ported ones, but I've
got some now and will run tests on those tomorrow with some failure injection.


Re: I/O hangs after resuming from suspend-to-ram

2017-09-20 Thread Ming Lei
On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote:
> Ming Lei - 28.08.17, 20:58:
> > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > > Hi.
> > > 
> > > Here is disk setup for QEMU VM:
> > > 
> > > ===
> > > [root@archmq ~]# smartctl -i /dev/sda
> > > …
> > > Device Model:     QEMU HARDDISK
> > > Serial Number:    QM1
> > > Firmware Version: 2.5+
> > > User Capacity:    4,294,967,296 bytes [4.29 GB]
> > > Sector Size:      512 bytes logical/physical
> > > Device is:        Not in smartctl database [for details use: -P showall]
> > > ATA Version is:   ATA/ATAPI-7, ATA/ATAPI-5 published, ANSI NCITS 340-2000
> > > Local Time is:    Sun Aug 27 09:31:54 2017 CEST
> > > SMART support is: Available - device has SMART capability.
> > > SMART support is: Enabled
> > > 
> > > [root@archmq ~]# lsblk
> > > NAME              MAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
> > > sda                 8:0    0    4G  0 disk
> > > `-sda1              8:1    0    4G  0 part
> > >   `-md0             9:0    0    4G  0 raid10
> > >     `-system      253:0    0    4G  0 crypt
> > >       |-system-boot 253:1  0  512M  0 lvm    /boot
> > >       |-system-swap 253:2  0  512M  0 lvm    [SWAP]
> > >       `-system-root 253:3  0    3G  0 lvm    /
> > > sdb                 8:16   0    4G  0 disk
> > > `-sdb1              8:17   0    4G  0 part
> > >   `-md0             9:0    0    4G  0 raid10
> > >     `-system      253:0    0    4G  0 crypt
> > >       |-system-boot 253:1  0  512M  0 lvm    /boot
> > >       |-system-swap 253:2  0  512M  0 lvm    [SWAP]
> > >       `-system-root 253:3  0    3G  0 lvm    /
> > > sr0                11:0    1 1024M  0 rom
> > > 
> > > [root@archmq ~]# mdadm --misc --detail /dev/md0
> > > /dev/md0:
> > >         Version : 1.2
> > >   Creation Time : Sat Jul 29 16:37:05 2017
> > >      Raid Level : raid10
> > >      Array Size : 4191232 (4.00 GiB 4.29 GB)
> > >   Used Dev Size : 4191232 (4.00 GiB 4.29 GB)
> > >    Raid Devices : 2
> > >   Total Devices : 2
> > >     Persistence : Superblock is persistent
> > > 
> > >     Update Time : Sun Aug 27 09:30:33 2017
> > >           State : clean
> > >  Active Devices : 2
> > > Working Devices : 2
> > >  Failed Devices : 0
> > >   Spare Devices : 0
> > > 
> > >          Layout : far=2
> > >      Chunk Size : 512K
> > > 
> > >            Name : archiso:0
> > >            UUID : 43f4be59:c8d2fa0a:a94acdff:1c7f2f4e
> > >          Events : 485
> > > 
> > >     Number   Major   Minor   RaidDevice State
> > >        0       8        1        0      active sync   /dev/sda1
> > >        1       8       17        1      active sync   /dev/sdb1
> > > 
> > > ===
> > > 
> > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it,
> > > then
> > > LVM, then ext4 for boot, swap and btrfs for /.
> > > 
> > > I couldn't reproduce the issue with single disk without RAID.
> > 
> > Could you verify if the following patch fixes your issue?

Yes, the patch should address this kind of issue; it is not specifically
related to RAID. The latest version can be found at the following link:

https://marc.info/?l=linux-block&m=150579298505484&w=2

-- 
Ming


Re: I/O hangs after resuming from suspend-to-ram

2017-09-20 Thread Ming Lei
On Wed, Sep 20, 2017 at 07:25:02PM +0200, Martin Steigerwald wrote:
> Ming Lei - 28.08.17, 21:32:
> > On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote:
> > > Ming Lei - 28.08.17, 20:58:
> > > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > > > > Hi.
> > > > > 
> > > > > Here is disk setup for QEMU VM:
> […]
> > > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it,
> > > > > then
> > > > > LVM, then ext4 for boot, swap and btrfs for /.
> > > > > 
> > > > > I couldn't reproduce the issue with single disk without RAID.
> > > > 
> > > > Could you verify if the following patch fixes your issue?
> > > 
> > > Could this also apply to non MD RAID systems? I am using BTRFS RAID
> > > 1 with two SSDs. So far with CFQ it runs stable.
> > 
> > It is for fixing Oleksandr's issue wrt. blk-mq, and does not look like it is for you.
> 
> My findings are different:
> 
> On 4.12.10 with CONFIG_HZ=1000, CONFIG_PREEMPT=y and optimizations for Intel 
> Core/newer Xeon I see this:
> 
> 1) Running with CFQ: No hang after resume
> 
> 2) Running with scsi_mod.use_blk_mq=1 + BFQ: Hang after resume within the
> first 1-2 days.

Hi Martin,

Thanks for your report!

Could you test the following patchset to see if it fixes your issue?

https://marc.info/?l=linux-block&m=150579298505484&w=2


-- 
Ming


Re: [GIT PULL] nvme fixes for 4.14-rc2

2017-09-20 Thread Jens Axboe
On 09/20/2017 03:29 PM, Christoph Hellwig wrote:
> Hi Jens,
> 
> a couple nvme fixes for -rc2 are below:
> 
>  - fixes for the Fibre Channel host/target to fix spec compliance
>  - allow a zero keep alive timeout
>  - make the debug printk for broken SGLs work better
>  - fix queue zeroing during initialization

Pulled, thanks.

-- 
Jens Axboe



[GIT PULL] nvme fixes for 4.14-rc2

2017-09-20 Thread Christoph Hellwig
Hi Jens,

a couple nvme fixes for -rc2 are below:

 - fixes for the Fibre Channel host/target to fix spec compliance
 - allow a zero keep alive timeout
 - make the debug printk for broken SGLs work better
 - fix queue zeroing during initialization

The following changes since commit da357321837921f52c386f2123c6718443fbb844:

  bsg-lib: don't free job in bsg_prepare_job (2017-09-14 22:32:03 -0600)

are available in the git repository at:

  ssh://git.infradead.org/srv/git/nvme.git nvme-4.14

for you to fetch changes up to f48b6b902ead0f3a1b5cfb0d157cc022f676de94:

  nvmet: implement valid sqhd values in completions (2017-09-20 10:52:40 -0700)


Guilherme G. Piccoli (1):
  nvme-fabrics: Allow 0 as KATO value

James Smart (12):
  nvme-fc: remove use of FC-specific error codes
  nvmet-fc: remove use of FC-specific error codes
  nvmet-fcloop: remove use of FC-specific error codes
  lpfc: remove use of FC-specific error codes
  qla2xxx: remove use of FC-specific error codes
  nvme.h: remove FC transport-specific error values
  nvme: add transport SGL definitions
  nvme-fc: use transport-specific sgl format
  nvmet-fc: fix failing max io queue connections
  nvme: stop aer posting if controller state not live
  nvme: allow timed-out ios to retry
  nvmet: implement valid sqhd values in completions

Keith Busch (2):
  nvme-pci: initialize queue memory before interrupts
  nvme-pci: Print invalid SGL only once

 drivers/nvme/host/core.c  |  7 +++
 drivers/nvme/host/fabrics.c   | 18 +-
 drivers/nvme/host/fc.c| 21 +++--
 drivers/nvme/host/pci.c   | 34 --
 drivers/nvme/target/core.c|  8 
 drivers/nvme/target/fabrics-cmd.c |  9 +++--
 drivers/nvme/target/fc.c  | 15 ++-
 drivers/nvme/target/fcloop.c  |  2 +-
 drivers/nvme/target/nvmet.h   |  1 +
 drivers/scsi/lpfc/lpfc_nvme.c |  2 +-
 drivers/scsi/qla2xxx/qla_nvme.c   |  2 +-
 include/linux/nvme.h  | 19 ++-
 12 files changed, 70 insertions(+), 68 deletions(-)


[PATCH] loop: remove union of use_aio and ref in struct loop_cmd

2017-09-20 Thread Omar Sandoval
From: Omar Sandoval 

When the request is completed, lo_complete_rq() checks cmd->use_aio.
However, if this is in fact an aio request, cmd->use_aio will have
already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a
union. On x86_64, there's a hole after the union anyways, so this
doesn't make struct loop_cmd any bigger.

Fixes: 92d773324b7e ("block/loop: fix use after free")
Signed-off-by: Omar Sandoval 
---
 drivers/block/loop.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index f68c1d50802f..1f3956702993 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -67,10 +67,8 @@ struct loop_device {
 struct loop_cmd {
struct kthread_work work;
struct request *rq;
-   union {
-   bool use_aio; /* use AIO interface to handle I/O */
-   atomic_t ref; /* only for aio */
-   };
+   bool use_aio; /* use AIO interface to handle I/O */
+   atomic_t ref; /* only for aio */
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
-- 
2.14.1
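
A self-contained toy model of the bug the patch above fixes (a plain int
stands in for atomic_t, and the field names only mirror struct loop_cmd;
this is not the driver code):

	#include <stdio.h>
	#include <stdbool.h>

	union cmd_state {
		bool use_aio;	/* set at submission time */
		int ref;	/* reused as a refcount by the aio path */
	};

	int main(void)
	{
		union cmd_state c = { .use_aio = true };

		c.ref = 2;	/* lo_rw_aio() takes references ... */
		c.ref -= 2;	/* ... and completions drop them */

		/* the completion path then reads use_aio and sees the
		 * leftover refcount bytes, not the original flag */
		printf("use_aio reads as %d\n", (int)c.use_aio);
		return 0;
	}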



[PATCH] lightnvm: pblk: fix error path in pblk_lines_alloc_metadata

2017-09-20 Thread Rakesh Pandit
Use the appropriate memory free call based on the allocation type used,
and also fix the number of times free is called if kmalloc fails.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-init.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7cf4b53..470ef04 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -624,12 +624,16 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)
 
 fail_free_emeta:
while (--i >= 0) {
-   vfree(l_mg->eline_meta[i]->buf);
+   if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
+   vfree(l_mg->eline_meta[i]->buf);
+   else
+   kfree(l_mg->eline_meta[i]->buf);
kfree(l_mg->eline_meta[i]);
}
 
+   i = PBLK_DATA_LINES;
 fail_free_smeta:
-   for (i = 0; i < PBLK_DATA_LINES; i++)
+   while (--i >= 0)
kfree(l_mg->sline_meta[i]);
 
return -ENOMEM;
-- 
2.5.0



Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Coly Li
On 2017/9/20 5:40 PM, Michael Lyle wrote:
> On Wed, Sep 20, 2017 at 3:28 AM, Coly Li  wrote:
>> Even if the read request fails on file system metadata, because a stale
>> copy will finally be provided to the kernel file system code, the file
>> system probably won't complain either.
> 
> The scary case is when filesystem data that points to other filesystem
> data is cached.  E.g. the data structures representing what space is
> free on disk, or a directory, or a database btree.  Some examples:
> 
> Free space handling-- if a big file /foo is created, and the active
> free-space datastructures are in cache (and this is likely, because
> actively written places can have their writeback-writes
> cancelled/deferred indefinitely)-- and then later the caching disk
> fails, an old version of this will be read from disk.  Later, an
> effort to write a file /bar allocates the space used by /foo, and
> writes over it.
> 
> Directory entity handling-- if /var/spool/foo is an active directory
> (associated data structures in cache), and has the directory
> /var/spool/foo/bar under it, and then /bar is removed... the backing
> disk will still have a reference to bar.  If the space for bar is then
> used for something else, the kernel may end up reading something very
> different from what it expects for a directory later after a cache
> device failure.
> 
> Btrees, etc-- the same thing.  If a tree shrinks, old tree entitys can
> end up pointing to other kinds of data.
> 
> I think this change is harmful-- it is not a good idea to
> automatically, at runtime, decide to start returning data that
> violates the guarantees a block device is supposed to obey about
> ordering and persistence.

Hi Mike,

I totally agree with you. It is my fault for the misleading commit log;
if you read it again you may find we stand on the same side, which is
what I take from your response :-)

The current bcache code does provide stale data on read failure recovery.
In the v1 patch discussion people wanted to keep this behavior, so in the
v2 version I added an option to permit this "harmful" behavior, disabled
by default.

And good to know Kent does not like an option; then we can simply disable
this "harmful" behavior by default.

Thanks.

Coly


Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Coly Li
On 2017/9/20 6:07 PM, Kent Overstreet wrote:
> On Wed, Sep 20, 2017 at 06:24:33AM +0800, Coly Li wrote:
>> When bcache does read I/Os, for example in writeback or writethrough mode,
>> if a read request on the cache device fails, bcache will try to recover
>> the request by reading from the cached device. If the data on the cached
>> device is not synced with the cache device, then the requester will get
>> stale data.
>>
>> For a critical storage system like a database, providing stale data from
>> recovery may result in application-level data corruption, which is
>> unacceptable. But in some other situations, like a multimedia stream
>> cache, continuous service may be more important and it is acceptable to
>> fetch a chunk of stale data.
>>
>> This patch tries to resolve the above conflict by adding a sysfs option
>>  /sys/block/bcache/bcache/allow_stale_data_on_failure
>> which is cleared (to 0) by default, i.e. disabled. Now people can make
>> choices for different situations.
> 
> IMO this is just a bug, I'd rather not have an option to keep the buggy
> behaviour. How about this patch:
> 

Hi Kent,

OK. Last time, when I discussed this with other bcache developers, people
wanted to keep this behavior, so I made it an option in this version of
the patch. I support fixing it without an option, because there are too
many options already. Good to know you have come to a similar decision :-)


> commit 2746f9c1f962288d8c5d7dabe698bf7b3fddd405
> Author: Kent Overstreet 
> Date:   Wed Sep 20 18:06:37 2017 +0200
> 
> bcache: Don't recover from IO errors when reading dirty data
> 
> Signed-off-by: Kent Overstreet 
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 382397772a..c2d57ef953 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -532,8 +532,10 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
>  
>   PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;
>  
> - if (KEY_DIRTY(k))
> + if (KEY_DIRTY(k)) {
>   s->read_dirty_data = true;
> + s->recoverable = false;
> + }
>  

I thought of fixing it there too; the reason I gave up on modifying that
spot was that cache_lookup_fn() is called for keys in leaf nodes
(b->level == 0), and bch_btree_map_keys_recurse() needs to do I/O to fetch
the upper-level nodes before accessing the leaf node. When an SSD fails,
bch_btree_node_get() will fail before cache_lookup_fn() is executed. So
with your patch there is no chance to set s->recoverable to false, and
recovery still happens.

If you don't like an option, the following modification should be much
simpler,

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 681b4f12b05a..f397785d9c38 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -697,8 +697,10 @@ static void cached_dev_read_error(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, cl);
	struct bio *bio = &s->bio.bio;
+   struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);

-   if (s->recoverable) {
+   if (s->recoverable &&
+       (dc && !atomic_read(&dc->has_dirty))) {
/* Retry from the backing device: */
trace_bcache_read_retry(s->orig_bio);

This might be the simplest way I know for now.

Thanks.

Coly Li


Re: [PATCH v7] blktrace: Fix potential deadlock between delete & sysfs ops

2017-09-20 Thread Steven Rostedt
On Wed, 20 Sep 2017 13:09:31 -0600
Jens Axboe  wrote:
> 
> I'll take it through my tree, and I'll prune some of that comment
> as well (which should be a commit message thing, not a code comment).
> 

Agreed, and thanks.

-- Steve




Re: [PATCH 08/20] lightnvm: pblk: sched. metadata on write thread

2017-09-20 Thread Rakesh Pandit
On Wed, Sep 20, 2017 at 09:28:34PM +0300, Rakesh Pandit wrote:
> Hi Javier,
> 
> one small issue I found for error path while going through changes:
> 
> On Mon, Jun 26, 2017 at 11:57:17AM +0200, Javier González wrote:
> ..
> > +static int pblk_lines_alloc_metadata(struct pblk *pblk)
> > +{
[..]
> > +   if (!l_mg->sline_meta[i])
> > +   goto fail_free_smeta;
> 
> For this error path the the goto label at end doesn't free up
> resources correctly.  It needs a
> 
> while (--index >= 0)...
> 
> logic with appropriate adjustment.
> 
> > +   }
[..]
> > +   if (!l_mg->vsc_list)
> > +   goto fail_free_emeta;
> > +
> > +   for (i = 0; i < l_mg->nr_lines; i++)
> > +   l_mg->vsc_list[i] = cpu_to_le32(EMPTY_ENTRY);
> > +
> > +   return 0;
> > +
> > +fail_free_emeta:
> > +   while (--i >= 0) {
> > +   vfree(l_mg->eline_meta[i]->buf);
> 
> This would need l_mg->emeta_alloc_type check to decide whether
> allocation was done with kmalloc or vmalloc.
> 
> > +   kfree(l_mg->eline_meta[i]);
> > +   }
> > +
> > +fail_free_smeta:
> > +   for (i = 0; i < PBLK_DATA_LINES; i++)
> > +   pblk_mfree(l_mg->sline_meta[i], l_mg->smeta_alloc_type);
> > +
> > +   return -ENOMEM;
> > +}
> > +
> 
> Thanks,

I failed to see earlier that it has been merged already.  Will post a
patch in a while.

Thanks,


Re: [PATCH v7] blktrace: Fix potential deadlock between delete & sysfs ops

2017-09-20 Thread Waiman Long
On 09/20/2017 01:35 PM, Christoph Hellwig wrote:
>> +/*
>> + * When reading or writing the blktrace sysfs files, the references to the
>> + * opened sysfs or device files should prevent the underlying block device
>> + * from being removed. So no further delete protection is really needed.
>> + *
>> + * Protection from multiple readers and writers accessing blktrace data
>> + * concurrently is still required. The bd_mutex was used for this purpose.
>> + * That could lead to deadlock with concurrent block device deletion and
>> + * sysfs access. As a result, a new blk_trace_mutex is now added to be
>> + * used solely by the blktrace code.
>> + */
> Comments about previous locking schemes really have no business being in
> the code - those are remarks for the commit logs.  And in general
> please explain the locking scheme near the data that it protects,
> as locks should always protect data, not code, and the comments
> should follow that.

It seems to be general practice that we don't put detailed comments in
header files. The comment was put above the function with the first
use of the blk_trace_mutex. Yes, I agree that recounting the past
history may not be appropriate here. I will keep that in mind in the
future.

Thanks,
Longman




Re: [PATCH] blk-mq: Fix blk_mq_get_request() error path

2017-09-20 Thread Jens Axboe
On 09/20/2017 12:47 PM, Bart Van Assche wrote:
> blk_mq_get_tag() can modify data->ctx. This means that in the
> error path of blk_mq_get_request() data->ctx should be passed to
> blk_mq_put_ctx() instead of local_ctx.

It's just a cosmetic thing; the only part that matters is that
we balance the get and put. I'd take this as a cleanup for
the next series, but not as something that fixes anything, and
definitely not for stable.

-- 
Jens Axboe



[PATCH] blk-mq: Fix blk_mq_get_request() error path

2017-09-20 Thread Bart Van Assche
blk_mq_get_tag() can modify data->ctx. This means that in the
error path of blk_mq_get_request() data->ctx should be passed to
blk_mq_put_ctx() instead of local_ctx.

Fixes: commit 1ad43c0078b7 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")
Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc:  # v4.13+
---
 block/blk-mq.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f18cff80050..fc20016f8a9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -336,12 +336,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
struct elevator_queue *e = q->elevator;
struct request *rq;
unsigned int tag;
-   struct blk_mq_ctx *local_ctx = NULL;
+   bool put_ctx_on_error = false;
 
blk_queue_enter_live(q);
data->q = q;
-   if (likely(!data->ctx))
-   data->ctx = local_ctx = blk_mq_get_ctx(q);
+   if (likely(!data->ctx)) {
+   data->ctx = blk_mq_get_ctx(q);
+   put_ctx_on_error = true;
+   }
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
if (op & REQ_NOWAIT)
@@ -360,8 +362,8 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 
tag = blk_mq_get_tag(data);
if (tag == BLK_MQ_TAG_FAIL) {
-   if (local_ctx) {
-   blk_mq_put_ctx(local_ctx);
+   if (put_ctx_on_error) {
+   blk_mq_put_ctx(data->ctx);
data->ctx = NULL;
}
blk_queue_exit(q);
-- 
2.14.1



Re: [PATCH 08/20] lightnvm: pblk: sched. metadata on write thread

2017-09-20 Thread Rakesh Pandit
Hi Javier,

one small issue I found for error path while going through changes:

On Mon, Jun 26, 2017 at 11:57:17AM +0200, Javier González wrote:
..
> +static int pblk_lines_alloc_metadata(struct pblk *pblk)
> +{
> + struct pblk_line_mgmt *l_mg = &pblk->l_mg;
> + struct pblk_line_meta *lm = &pblk->lm;
> + int i;
> +
> + /* smeta is always small enough to fit on a kmalloc memory allocation,
> +  * emeta depends on the number of LUNs allocated to the pblk instance
> +  */
> + l_mg->smeta_alloc_type = PBLK_KMALLOC_META;
> + for (i = 0; i < PBLK_DATA_LINES; i++) {
> + l_mg->sline_meta[i] = kmalloc(lm->smeta_len, GFP_KERNEL);
> + if (!l_mg->sline_meta[i])
> + goto fail_free_smeta;

For this error path, the goto label at the end doesn't free up
resources correctly.  It needs a

while (--index >= 0)...

logic with appropriate adjustment.

> + }
> +
> + /* emeta allocates three different buffers for managing metadata with
> +  * in-memory and in-media layouts
> +  */
> + for (i = 0; i < PBLK_DATA_LINES; i++) {
> + struct pblk_emeta *emeta;
> +
> + emeta = kmalloc(sizeof(struct pblk_emeta), GFP_KERNEL);
> + if (!emeta)
> + goto fail_free_emeta;
> +
> + if (lm->emeta_len[0] > KMALLOC_MAX_CACHE_SIZE) {
> + l_mg->emeta_alloc_type = PBLK_VMALLOC_META;
> +
> + emeta->buf = vmalloc(lm->emeta_len[0]);
> + if (!emeta->buf) {
> + kfree(emeta);
> + goto fail_free_emeta;
> + }
> +
> + emeta->nr_entries = lm->emeta_sec[0];
> + l_mg->eline_meta[i] = emeta;
> + } else {
> + l_mg->emeta_alloc_type = PBLK_KMALLOC_META;
> +
> + emeta->buf = kmalloc(lm->emeta_len[0], GFP_KERNEL);
> + if (!emeta->buf) {
> + kfree(emeta);
> + goto fail_free_emeta;
> + }
> +
> + emeta->nr_entries = lm->emeta_sec[0];
> + l_mg->eline_meta[i] = emeta;
> + }
> + }
> +
> + l_mg->vsc_list = kcalloc(l_mg->nr_lines, sizeof(__le32), GFP_KERNEL);
> + if (!l_mg->vsc_list)
> + goto fail_free_emeta;
> +
> + for (i = 0; i < l_mg->nr_lines; i++)
> + l_mg->vsc_list[i] = cpu_to_le32(EMPTY_ENTRY);
> +
> + return 0;
> +
> +fail_free_emeta:
> + while (--i >= 0) {
> + vfree(l_mg->eline_meta[i]->buf);

This would need l_mg->emeta_alloc_type check to decide whether
allocation was done with kmalloc or vmalloc.

> + kfree(l_mg->eline_meta[i]);
> + }
> +
> +fail_free_smeta:
> + for (i = 0; i < PBLK_DATA_LINES; i++)
> + pblk_mfree(l_mg->sline_meta[i], l_mg->smeta_alloc_type);
> +
> + return -ENOMEM;
> +}
> +

Thanks,


Re: [PATCH v7] blktrace: Fix potential deadlock between delete & sysfs ops

2017-09-20 Thread Steven Rostedt

Christoph,

Can you give an acked-by for this patch?

Jens,

You want to take this through your tree, or do you want me to?

If you want it, here's my:

Acked-by: Steven Rostedt (VMware) 

-- Steve


On Wed, 20 Sep 2017 13:26:11 -0400
Waiman Long  wrote:

> The lockdep code had reported the following unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(s_active#228);
>                                lock(&bdev->bd_mutex/1);
>                                lock(s_active#228);
>   lock(&bdev->bd_mutex);
> 
>  *** DEADLOCK ***
> 
> The deadlock may happen when one task (CPU1) is trying to delete a
> partition in a block device and another task (CPU0) is accessing
> tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
> partition.
> 
> The s_active isn't an actual lock. It is a reference count (kn->count)
> on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
> a wait until all the references are gone. The reference count is
> treated like a rwsem using lockdep instrumentation code.
> 
> The fact that a thread is in the sysfs callback method or in the
> ioctl call means there is a reference to the opened sysfs or device
> file. That should prevent the underlying block structure from being
> removed.
> 
> Instead of using bd_mutex in the block_device structure, a new
> blk_trace_mutex is now added to the request_queue structure to protect
> access to the blk_trace structure.
> 
> Suggested-by: Christoph Hellwig 
> Signed-off-by: Waiman Long 
> ---
>  v7:
>   - Add a new blk_trace_mutex in request_queue structure for blk_trace
> protection.
> 
>  v6:
>   - Add a second patch to rename the bd_fsfreeze_mutex to
> bd_fsfreeze_blktrace_mutex.
> 
>  v5:
>   - Overload the bd_fsfreeze_mutex in block_device structure for
> blktrace protection.
> 
>  v4:
>   - Use blktrace_mutex in blk_trace_ioctl() as well.
> 
>  v3:
>   - Use a global blktrace_mutex to serialize sysfs attribute accesses
> instead of the bd_mutex.
> 
>  v2:
>   - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
>   - Check for signal in the mutex_trylock loops.
>   - Use usleep() instead of schedule() for RT tasks.
> 
>  block/blk-core.c|  3 +++
>  include/linux/blkdev.h  |  1 +
>  kernel/trace/blktrace.c | 24 ++--
>  3 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index aebe676..048be4a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -854,6 +854,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  
>   kobject_init(&q->kobj, &blk_queue_ktype);
>  
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> + mutex_init(&q->blk_trace_mutex);
> +#endif
>   mutex_init(&q->sysfs_lock);
>   spin_lock_init(&q->__queue_lock);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 460294b..02fa42d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -551,6 +551,7 @@ struct request_queue {
>   int node;
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>   struct blk_trace    *blk_trace;
> + struct mutex        blk_trace_mutex;
>  #endif
>   /*
>* for flush operations
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 2a685b4..d5cef05 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -648,6 +648,18 @@ int blk_trace_startstop(struct request_queue *q, int start)
>  }
>  EXPORT_SYMBOL_GPL(blk_trace_startstop);
>  
> +/*
> + * When reading or writing the blktrace sysfs files, the references to the
> + * opened sysfs or device files should prevent the underlying block device
> + * from being removed. So no further delete protection is really needed.
> + *
> + * Protection from multiple readers and writers accessing blktrace data
> + * concurrently is still required. The bd_mutex was used for this purpose.
> + * That could lead to deadlock with concurrent block device deletion and
> + * sysfs access. As a result, a new blk_trace_mutex is now added to be
> + * used solely by the blktrace code.
> + */
> +
>  /**
>   * blk_trace_ioctl: - handle the ioctls associated with tracing
>   * @bdev:the block device
> @@ -665,7 +677,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
>   if (!q)
>   return -ENXIO;
>  
> - mutex_lock(&bdev->bd_mutex);
> + mutex_lock(&q->blk_trace_mutex);
>  
>   switch (cmd) {
>   case BLKTRACESETUP:
> @@ -691,7 +703,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
>   break;
>   }
>  
> - mutex_unlock(&bdev->bd_mutex);
> + mutex_unlock(&q->blk_trace_mutex);
>   return ret;
>  }
>  
> @@ -1727,7 +1739,7 @@ static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
>   if (q == NULL)
>   

[PATCH v7] blktrace: Fix potential deadlock between delete & sysfs ops

2017-09-20 Thread Waiman Long
The lockdep code had reported the following unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(s_active#228);
                                lock(&bdev->bd_mutex/1);
                                lock(s_active#228);
   lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opened sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.

Suggested-by: Christoph Hellwig 
Signed-off-by: Waiman Long 
---
 v7:
  - Add a new blk_trace_mutex in request_queue structure for blk_trace
protection.

 v6:
  - Add a second patch to rename the bd_fsfreeze_mutex to
bd_fsfreeze_blktrace_mutex.

 v5:
  - Overload the bd_fsfreeze_mutex in block_device structure for
blktrace protection.

 v4:
  - Use blktrace_mutex in blk_trace_ioctl() as well.

 v3:
  - Use a global blktrace_mutex to serialize sysfs attribute accesses
instead of the bd_mutex.

 v2:
  - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
  - Check for signal in the mutex_trylock loops.
  - Use usleep() instead of schedule() for RT tasks.

 block/blk-core.c|  3 +++
 include/linux/blkdev.h  |  1 +
 kernel/trace/blktrace.c | 24 ++--
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index aebe676..048be4a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -854,6 +854,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
 
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+	mutex_init(&q->blk_trace_mutex);
+#endif
 	mutex_init(&q->sysfs_lock);
 	spin_lock_init(&q->__queue_lock);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 460294b..02fa42d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -551,6 +551,7 @@ struct request_queue {
int node;
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	struct blk_trace	*blk_trace;
+	struct mutex		blk_trace_mutex;
 #endif
/*
 * for flush operations
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 2a685b4..d5cef05 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -648,6 +648,18 @@ int blk_trace_startstop(struct request_queue *q, int start)
 }
 EXPORT_SYMBOL_GPL(blk_trace_startstop);
 
+/*
+ * When reading or writing the blktrace sysfs files, the references to the
+ * opened sysfs or device files should prevent the underlying block device
+ * from being removed. So no further delete protection is really needed.
+ *
+ * Protection from multiple readers and writers accessing blktrace data
+ * concurrently is still required. The bd_mutex was used for this purpose.
+ * That could lead to deadlock with concurrent block device deletion and
+ * sysfs access. As a result, a new blk_trace_mutex is now added to be
+ * used solely by the blktrace code.
+ */
+
 /**
  * blk_trace_ioctl: - handle the ioctls associated with tracing
  * @bdev:  the block device
@@ -665,7 +677,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
if (!q)
return -ENXIO;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock(&q->blk_trace_mutex);
 
switch (cmd) {
case BLKTRACESETUP:
@@ -691,7 +703,7 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
break;
}
 
-	mutex_unlock(&bdev->bd_mutex);
+	mutex_unlock(&q->blk_trace_mutex);
return ret;
 }
 
@@ -1727,7 +1739,7 @@ static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
if (q == NULL)
goto out_bdput;
 
-	mutex_lock(&bdev->bd_mutex);
+	mutex_lock(&q->blk_trace_mutex);
 
	if (attr == &dev_attr_enable) {
ret = sprintf(buf, "%u\n", !!q->blk_trace);
@@ -1746,7 +1758,7 @@ static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
 
 out_unlock_bdev:
-	mutex_unlock(&bdev->bd_mutex);
+	mutex_unlock(&q->blk_trace_mutex);
 out_bdput:
bdput(bdev);
 out:
@@ -1788,7 +1800,7 @@ static ssize_t 

Re: I/O hangs after resuming from suspend-to-ram

2017-09-20 Thread Martin Steigerwald
Ming Lei - 28.08.17, 21:32:
> On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote:
> > Ming Lei - 28.08.17, 20:58:
> > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > > > Hi.
> > > > 
> > > > Here is disk setup for QEMU VM:
[…]
> > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on it,
> > > > then
> > > > LVM, then ext4 for boot, swap and btrfs for /.
> > > > 
> > > > I couldn't reproduce the issue with single disk without RAID.
> > > 
> > > Could you verify if the following patch fixes your issue?
> > 
> > Could this also apply to non MD RAID systems? I am using BTRFS RAID
> > 1 with two SSDs. So far with CFQ it runs stable.
> 
> It is for fixing Oleksandr's issue wrt. blk-mq, and does not look like it is for you.

My findings are different:

On 4.12.10 with CONFIG_HZ=1000, CONFIG_PREEMPT=y and optimizations for Intel 
Core/newer Xeon I see this:

1) Running with CFQ: No hang after resume

2) Running with scsi_mod.use_blk_mq=1 + BFQ: Hang after resume within the
first 1-2 days.

However with 4.12.9 with CONFIG_HZ=250, CONFIG_PREEMPT_VOLUNTARY=y + no CPU 
optimizations: No hang after resume, no matter whether I use CFQ or scsi-mq + 
BFQ.

Both configs attached.

Also 4.13.2 with CFQ is stable so far.

I am looking forward to the fixes appearing in 4.13.x with x>2 and will
retest with BFQ then, unless you have different advice.

I am running Debian Sid/Experimental from BTRFS RAID 1 with two SSDs

merkaba:~> lsscsi | grep ATA
[0:0:0:0]diskATA  INTEL SSDSA2CW30 0362  /dev/sda 
[2:0:0:0]diskATA  Crucial_CT480M50 MU03  /dev/sdb

on ThinkPad T520 (Sandybridge).

Thanks,
-- 
Martin

config-4.12.10-tp520-btrfstrim.xz
Description: application/xz


config-4.12.9-tp520-btrfstrim+.xz
Description: application/xz


Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Kent Overstreet
On Wed, Sep 20, 2017 at 06:24:33AM +0800, Coly Li wrote:
> When bcache does read I/Os, for example in writeback or writethrough mode,
> if a read request on the cache device fails, bcache will try to recover
> the request by reading from the cached device. If the data on the cached
> device is not synced with the cache device, then the requester will get
> stale data.
> 
> For a critical storage system like a database, providing stale data from
> recovery may result in application-level data corruption, which is
> unacceptable. But in some other situations, like a multimedia stream
> cache, continuous service may be more important and it is acceptable to
> fetch a chunk of stale data.
> 
> This patch tries to resolve the above conflict by adding a sysfs option
>   /sys/block/bcache/bcache/allow_stale_data_on_failure
> which is cleared (to 0) by default, i.e. disabled. Now people can make
> choices for different situations.

IMO this is just a bug, I'd rather not have an option to keep the buggy
behaviour. How about this patch:

commit 2746f9c1f962288d8c5d7dabe698bf7b3fddd405
Author: Kent Overstreet 
Date:   Wed Sep 20 18:06:37 2017 +0200

bcache: Don't recover from IO errors when reading dirty data

Signed-off-by: Kent Overstreet 

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 382397772a..c2d57ef953 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -532,8 +532,10 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 
PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;
 
-   if (KEY_DIRTY(k))
+   if (KEY_DIRTY(k)) {
s->read_dirty_data = true;
+   s->recoverable = false;
+   }
 
n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
  KEY_OFFSET(k) - bio->bi_iter.bi_sector),


Re: [PATCH] nbd: ignore non-nbd ioctl's

2017-09-20 Thread Jens Axboe
On 09/20/2017 08:38 AM, Josef Bacik wrote:
> On Fri, May 05, 2017 at 10:25:18PM -0400, Josef Bacik wrote:
>> In testing we noticed that nbd would spew if you ran a fio job against
>> the raw device itself.  This is because fio calls a block device
>> specific ioctl, however the block layer will first pass this back to the
>> driver ioctl handler in case the driver wants to do something special.
>> Since the device was setup using netlink this caused us to spew every
>> time fio called this ioctl.  Since we don't have special handling, just
>> error out for any non-nbd specific ioctl's that come in.  This fixes the
>> spew.
>>
>> Signed-off-by: Josef Bacik 
> 
> Jens,
> 
> This fell through the cracks, could you pick it up, thanks,

Sure, applied.

-- 
Jens Axboe



Re: [PATCH 8/9] nvme: track shared namespaces

2017-09-20 Thread Christoph Hellwig
On Wed, Sep 20, 2017 at 10:36:43AM +0200, Johannes Thumshirn wrote:
> Being one of the persons who has to backport a lot of NVMe code to older
> kernels I'm not a huge fan of renaming nvme_ns.

The churn is my main worry.  Well, and that I don't have a really good
name for what currently is nvme_ns either :)

> That said, I don't have a better name for nvme_ns_head (yet) but I'll try to
> think of one as well. OTOH looking at nvme_ns_head it actually is the list
> head of the nvme_ns list.

But head also has connotations in the SAN world.  Maybe nvme_ns_chain?


Re: [PATCH] nbd: ignore non-nbd ioctl's

2017-09-20 Thread Josef Bacik
On Fri, May 05, 2017 at 10:25:18PM -0400, Josef Bacik wrote:
> In testing we noticed that nbd would spew if you ran a fio job against
> the raw device itself.  This is because fio calls a block device
> specific ioctl, however the block layer will first pass this back to the
> driver ioctl handler in case the driver wants to do something special.
> Since the device was setup using netlink this caused us to spew every
> time fio called this ioctl.  Since we don't have special handling, just
> error out for any non-nbd specific ioctl's that come in.  This fixes the
> spew.
> 
> Signed-off-by: Josef Bacik 

Jens,

This fell through the cracks, could you pick it up, thanks,

Josef


Re: [PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()

2017-09-20 Thread Ilya Dryomov
On Tue, Sep 19, 2017 at 10:32 PM, Christoph Hellwig  wrote:
> On Wed, Sep 06, 2017 at 07:38:10PM +0200, Ilya Dryomov wrote:
>> sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
>> permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
>> is set.  This means blkdev_issue_zeroout() must cope with WRITE SAME
>> failing with BLK_STS_TARGET/-EREMOTEIO and explicitly write zeroes,
>> unless BLKDEV_ZERO_NOFALLBACK is specified.
>>
>> Commit 71027e97d796 ("block: stop using discards for zeroing") added
>> the following to __blkdev_issue_zeroout() comment:
>>
>>   "Note that this function may fail with -EOPNOTSUPP if the driver signals
>>   zeroing offload support, but the device fails to process the command (for
>>   some devices there is no non-destructive way to verify whether this
>>   operation is actually supported).  In this case the caller should call
>>   retry the call to blkdev_issue_zeroout() and the fallback path will be 
>> used."
>>
>> But __blkdev_issue_zeroout() doesn't fail in this case: if WRITE SAME
>> support is indicated, a REQ_OP_WRITE_ZEROES bio is built and returned
>> to blkdev_issue_zeroout().  -EREMOTEIO is then propagated up:
>>
>>   $ fallocate -zn -l 1k /dev/sdg
>>   fallocate: fallocate failed: Remote I/O error
>>   $ fallocate -zn -l 1k /dev/sdg  # OK
>>
>> (The second fallocate(1) succeeds because sd_done() sets ->no_write_same
>> in response to a sense that would become BLK_STS_TARGET.)
>>
>> Retry __blkdev_issue_zeroout() if the I/O fails with -EREMOTEIO.  This
>> is roughly what we did until 4.12, sans BLKDEV_ZERO_NOFALLBACK knob.
>
> I'm really not sure we should check for -EREMOTEIO specifically, but
> Martin, who is more familiar with the SCSI code, might be able to
> correct me. I'd feel safer about checking for any error, which is
> what the old code did.

Yeah, I pondered that.  The old code did check for any error, but it
wouldn't attempt another WRITE SAME after a failure.  To do that here,
we'd need to refactor __blkdev_issue_zeroout(); as it is, if we retry
__blkdev_issue_zeroout() on a random error, it would happily go the
REQ_OP_WRITE_ZEROES way again.

Martin, what do you think?
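
One shape the refactor could take (a sketch only; submit_write_zeroes()
and submit_zero_pages() are hypothetical names for the two internal
paths, not existing functions):

	int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
				 sector_t nr_sects, gfp_t gfp_mask,
				 unsigned flags)
	{
		bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
		int ret;

	retry:
		if (try_write_zeroes) {
			ret = submit_write_zeroes(bdev, sector, nr_sects,
						  gfp_mask);
			if (ret) {
				/* the offload failed once: don't build
				 * another REQ_OP_WRITE_ZEROES bio on retry */
				try_write_zeroes = false;
				goto retry;
			}
		} else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
			ret = submit_zero_pages(bdev, sector, nr_sects,
						gfp_mask);
		} else {
			ret = -EOPNOTSUPP;
		}
		return ret;
	}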

Thanks,

Ilya


Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue

2017-09-20 Thread Ming Lei
On Tue, Sep 19, 2017 at 02:37:50PM -0600, Jens Axboe wrote:
> On 09/02/2017 09:17 AM, Ming Lei wrote:
> > @@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > 	if (!list_empty(&rq_list)) {
> > blk_mq_sched_mark_restart_hctx(hctx);
> > 		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > -   } else if (!has_sched_dispatch) {
> > +   } else if (!has_sched_dispatch && !q->queue_depth) {
> > +   /*
> > +* If there is no per-request_queue depth, we
> > +* flush all requests in this hw queue, otherwise
> > +* pick up request one by one from sw queue for
> > +* avoiding to mess up I/O merge when dispatch
> > +* is busy, which can be triggered easily by
> > +* per-request_queue queue depth
> > +*/
> > 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > 		blk_mq_dispatch_rq_list(q, &rq_list);
> > }
> 
> I don't like this part at all. It's a heuristic, and on top of that,
> it's a heuristic based on a weak assumption that if q->queue_depth == 0,
> then we never run into BUSY constraints. I think that would be stronger
> if it was based on "is this using shared tags", but even that is not
> great at all. It's perfectly possible to have a shared tags setup and
> never run into resource constraints. The opposite condition is also true
> - you can run without shared tags, and yet hit resource constraints
> constantly. Hence this is NOT a good test for deciding whether to flush
> everything at once or not. In short, I think a much better test case
> would be "has this device ever returned BLK_STS_RESOURCE. As it so
> happens, that's easy enough to keep track of and base this test on.

Hi Jens,

The attached patch follows your suggestion; it uses EWMA to compute the
average length of hctx->dispatch, and then only flushes all requests
from the ctxs if the average length is zero. What do you think about
this approach? Any other suggestions?
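
As a sanity check of the EWMA behaviour, here is a self-contained
user-space version of the update rule from the patch below (same
arithmetic, minus the READ_ONCE/WRITE_ONCE annotations):

	#include <stdio.h>

	static unsigned ewma_add(unsigned ewma, unsigned val, unsigned weight)
	{
		if (val == 1)
			return ewma + 1;
		return (ewma * (weight - 1) + val) / weight;
	}

	int main(void)
	{
		unsigned avg = 0;
		int i;

		/* a burst of busy dispatches pushes the average up fast ... */
		for (i = 0; i < 4; i++)
			avg = ewma_add(avg, 1, 8);
		printf("after burst: %u\n", avg);	/* 4 */

		/* ... and idle samples decay it back toward zero */
		for (i = 0; i < 16; i++)
			avg = ewma_add(avg, 0, 8);
		printf("after decay: %u\n", avg);	/* 0 */
		return 0;
	}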


diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7a27f262c96a..c420c775b9c0 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -232,6 +232,14 @@ static int hctx_flags_show(void *data, struct seq_file *m)
return 0;
 }
 
+static int hctx_dispatch_len_show(void *data, struct seq_file *m)
+{
+   struct blk_mq_hw_ctx *hctx = data;
+
+   seq_printf(m, "%u\n", hctx->dispatch_len);
+   return 0;
+}
+
 #define REQ_OP_NAME(name) [REQ_OP_##name] = #name
 static const char *const op_name[] = {
REQ_OP_NAME(READ),
@@ -763,6 +771,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
{"state", 0400, hctx_state_show},
{"flags", 0400, hctx_flags_show},
{"dispatch", 0400, .seq_ops = _dispatch_seq_ops},
+   {"dispatch_length", 0400, hctx_dispatch_len_show},
{"busy", 0400, hctx_busy_show},
{"ctx_map", 0400, hctx_ctx_map_show},
{"tags", 0400, hctx_tags_show},
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a5712dd67783..91a6eeaadaf1 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -102,7 +102,7 @@ static void blk_mq_do_dispatch_sched(struct request_queue *q,
if (!rq)
break;
 		list_add(&rq->queuelist, &rq_list);
-	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+	} while (blk_mq_dispatch_rq_list(q, &rq_list, 1));
 }
 
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -134,19 +134,51 @@ static void blk_mq_do_dispatch_ctx(struct request_queue *q,
/* round robin for fair dispatch */
ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
 
-	dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
+	dispatched = blk_mq_dispatch_rq_list(q, &rq_list, 1);
} while (dispatched);
 
if (!dispatched)
WRITE_ONCE(hctx->dispatch_from, ctx);
 }
 
+/* borrowed from bcache */
+void ewma_add(unsigned *ewma, unsigned val, unsigned weight)
+{
+   unsigned  result = READ_ONCE(*ewma);
+
+   if (val == 1) {
+   result += 1;
+   } else {
+   result *= weight - 1;
+   result += val;
+   result /= weight;
+   }
+   WRITE_ONCE(*ewma, result);
+}
+
+/*
+ * We use EWMA to compute the average length of dispatch list.
+ * It is totally lockless, but we can survive that since it
+ * is just a hint.
+ *
+ * We only flush requests from all ctxs if the average length
+ * of dispatch list is zero, otherwise the hw queue can be
+ * thought as busy and we dequeue request from ctxs one by
+ * one
+ */
+static void blk_mq_update_dispatch_length(struct blk_mq_hw_ctx *hctx,
+ unsigned cnt)
+{
+	ewma_add(&hctx->dispatch_len, cnt, 8);
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
struct 

Re: nvme multipath support V2

2017-09-20 Thread Johannes Thumshirn
Hi Christoph,

I wanted to test your patches, but I fail to see how I have to set it up.

I do have a host with two RDMA HCAs connected to the target (Linux), for "normal
dm-mpath" test I do nvme connect with the host traddr argument for both of the
HCAs and get two nvme block devices which I can add to multipath.

Using your patchset and doing the same double connect trick I get the same
two block devices of course.

How do I connect using both paths?

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Coly Li
On 2017/9/20 8:59 AM, Michael Lyle wrote:
> Coly--
> 
> It's an interesting changeset.

Hi Mike,

Yes, it's interesting :-) It fixes a silent database data corruption in
our production kernel. The most dangerous part is that it happens silently
even when an in-data checksum is used; this issue was detected by an
out-of-data checksum.

> I am not positive if it will work in practice-- the most likely
> objects to be cached are filesystem metadata.  Won't most filesystems
> fall apart if some of their data structures revert back to an earlier
> point in time?

For database workloads, most of the data cached on SSD consists of data blocks
of the database file which are replayed from the binlog (for example mysql). The
file system won't complain in such a situation, and an earlier version means all
transaction information since the last update is lost, in *silence*.

Even if the read request failed on file system metadata, because stale data
will finally be provided to the kernel file system code, the file system
probably won't complain either. Because,
- the file system reports an error when an I/O fails; if stale data from
recovery is provided to the file system, the file system just uses the stale
data until a worse failure is detected by the file system code.
- if the file system uses a metadata checksum, and the checksum is inside the
metadata block (which is quite common), then because the stale data is also
checksum consistent, the file system won't report an error either.

So the data corruption happens at the application level, even though the file
system kernel code still thinks everything is consistent on disk.
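
If it helps, a toy sketch of the checksum point (a hypothetical self-checksummed
block with a trivial additive checksum, nothing bcache-specific): the stale copy
verifies just as cleanly as the current one, so an in-block checksum cannot tell
them apart.

#include <stdint.h>
#include <stdio.h>

struct blk {
	uint8_t  data[60];
	uint32_t csum;			/* checksum stored inside the block */
};

static uint32_t blk_csum(const struct blk *b)
{
	uint32_t s = 0;
	size_t i;

	for (i = 0; i < sizeof(b->data); i++)
		s += b->data[i];
	return s;
}

int main(void)
{
	struct blk v1 = { .data = "transaction #1" };
	struct blk v2 = { .data = "transaction #2" };

	v1.csum = blk_csum(&v1);	/* both versions are self-consistent */
	v2.csum = blk_csum(&v2);

	/* a failed cache read "recovers" the stale copy from the backing
	 * device ... */
	struct blk seen = v1;

	/* ... and the embedded checksum still verifies cleanly */
	printf("checksum %s, content: %s\n",
	       blk_csum(&seen) == seen.csum ? "OK" : "BAD",
	       (const char *)seen.data);
	return 0;
}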

Thanks.

Coly Li


> On Tue, Sep 19, 2017 at 3:24 PM, Coly Li  wrote:
>> When bcache does read I/Os, for example in writeback or writethrough mode,
>> if a read request on the cache device fails, bcache will try to recover
>> the request by reading from the cached device. If the data on the cached device is
>> not synced with the cache device, then the requester will get stale data.
>>
>> For a critical storage system like a database, providing stale data from
>> recovery may result in application level data corruption, which is
>> unacceptable. But for some other situations like a multi-media stream cache,
>> continuous service may be more important and it is acceptable to fetch
>> a chunk of stale data.
>>
>> This patch tries to solve the above conflict by adding a sysfs option
>> /sys/block/bcache/bcache/allow_stale_data_on_failure
>> which is cleared (to 0) by default, i.e. disabled. Now people can make choices
>> for different situations.
>>
>> With this patch, for a failed read request in writeback or writethrough
>> mode, recovery of a recoverable read request only happens in one of the
>> following conditions,
>>  - dc->has_dirty is zero. It means all data on the cache device is synced to
>>    the cached device, so the recovered data is up-to-date.
>>  - dc->has_dirty is non-zero, and dc->allow_stale_data_on_failure is set
>>    to 1. It means there is dirty data not synced to the cached device yet, but
>>    the option allow_stale_data_on_failure is set, so receiving stale data is
>>    explicitly acceptable for the requester.
>>
>> For other cache modes in bcache, read requests will never hit
>> cached_dev_read_error(), so they don't need this patch.
>>
>> Please note, because the cache mode can be switched arbitrarily at run time, a
>> writethrough mode might have been switched from a writeback mode. Therefore
>> checking dc->has_dirty in writethrough mode still makes sense.
>>
>> Changelog:
>> v2: rename sysfs entry from allow_stale_data_on_failure  to
>> allow_stale_data_on_failure, and fix the confusing commit log.
>> v1: initial patch posted.
>>
>> Signed-off-by: Coly Li 
>> Reported-by: Arne Wolf 
>> Cc: Nix 
>> Cc: Kai Krakow 
>> Cc: Eric Wheeler 
>> Cc: Junhui Tang 
>> Cc: sta...@vger.kernel.org

[snip]



Re: [PATCH 9/9] nvme: implement multipath access to nvme subsystems

2017-09-20 Thread Johannes Thumshirn
On Mon, Sep 18, 2017 at 04:14:53PM -0700, Christoph Hellwig wrote:
> This patch adds initial multipath support to the nvme driver.  For each
> namespace we create a new block device node, which can be used to access
> that namespace through any of the controllers that refer to it.
> 
> Currently we will always send I/O to the first available path, this will
> be changed once the NVMe Asymmetric Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is not plan to implement round robin
no ^
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.

Sorry for the typo-only mail, I'll be getting around to a real review soon, but I
wanted to test the patchset in my nvme-rdma setup and just hit a panic in
ib_core with your branch merged into -rc1.

Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH V8 01/14] mmc: core: Introduce host claiming by context

2017-09-20 Thread Ulf Hansson
On 13 September 2017 at 13:40, Adrian Hunter  wrote:
> Currently the host can be claimed by a task.  Change this so that the host
> can be claimed by a context that may or may not be a task.  This provides
> for the host to be claimed by a block driver queue to support blk-mq, while
> maintaining compatibility with the existing use of mmc_claim_host().

As stated earlier, I think this is a good approach, which allows us to
move forward.

Some comments below, mostly related to how to avoid adding more
wrapper functions.

>
> Signed-off-by: Adrian Hunter 
> ---
>  drivers/mmc/core/core.c  | 92 
> 
>  drivers/mmc/core/core.h  |  6 
>  include/linux/mmc/host.h |  7 +++-
>  3 files changed, 97 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> index 66c9cf49ad2f..09fdc467b804 100644
> --- a/drivers/mmc/core/core.c
> +++ b/drivers/mmc/core/core.c
> @@ -832,9 +832,38 @@ unsigned int mmc_align_data_size(struct mmc_card *card, unsigned int sz)
>  }
>  EXPORT_SYMBOL(mmc_align_data_size);
>
> +/*
> + * Allow claiming an already claimed host if the context is the same or there is
> + * no context but the task is the same.
> + */
> +static inline bool mmc_ctx_matches(struct mmc_host *host, struct mmc_ctx *ctx,
> +  struct task_struct *task)
> +{
> +   return host->claimer == ctx ||
> +  (!ctx && task && host->claimer->task == task);
> +}
> +
> +static inline void mmc_ctx_set_claimer(struct mmc_host *host,
> +  struct mmc_ctx *ctx,
> +  struct task_struct *task)
> +{
> +   if (!host->claimer) {
> +   if (ctx)
> +   host->claimer = ctx;
> +   else
> +   host->claimer = &host->default_ctx;
> +   }
> +   if (task)
> +   host->claimer->task = task;
> +}
> +
>  /**
> - * __mmc_claim_host - exclusively claim a host
> + * __mmc_ctx_task_claim_host - exclusively claim a host
>   * @host: mmc host to claim
> + * @ctx: context that claims the host or NULL in which case the default
> + * context will be used
> + * @task: task that claims the host or NULL in which case @ctx must be
> + * provided
>   * @abort: whether or not the operation should be aborted
>   *
>   * Claim a host for a set of operations.  If @abort is non null and
> @@ -842,7 +871,8 @@ unsigned int mmc_align_data_size(struct mmc_card *card, unsigned int sz)
>   * that non-zero value without acquiring the lock.  Returns zero
>   * with the lock held otherwise.
>   */
> -int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
> +static int __mmc_ctx_task_claim_host(struct mmc_host *host, struct mmc_ctx *ctx,
> +				     struct task_struct *task, atomic_t *abort)
>  {
> DECLARE_WAITQUEUE(wait, current);
> unsigned long flags;
> @@ -856,7 +886,7 @@ int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
> while (1) {
> set_current_state(TASK_UNINTERRUPTIBLE);
> stop = abort ? atomic_read(abort) : 0;
> -   if (stop || !host->claimed || host->claimer == current)
> +   if (stop || !host->claimed || mmc_ctx_matches(host, ctx, task))
> break;
> spin_unlock_irqrestore(&host->lock, flags);
> schedule();
> @@ -865,7 +895,7 @@ int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
> set_current_state(TASK_RUNNING);
> if (!stop) {
> host->claimed = 1;
> -   host->claimer = current;
> +   mmc_ctx_set_claimer(host, ctx, task);
> host->claim_cnt += 1;
> if (host->claim_cnt == 1)
> pm = true;
> @@ -879,8 +909,19 @@ int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
>
> return stop;
>  }
> +
> +int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
> +{
> +   return __mmc_ctx_task_claim_host(host, NULL, current, abort);
> +}
>  EXPORT_SYMBOL(__mmc_claim_host);

There are currently only two users of __mmc_claim_host().

Instead of adding yet another layer of wrapper functions, let's keep
the existing __mmc_claim_host(), but change its definition to take the
new parameters. Then change the existing two users.

>
> +void mmc_ctx_claim_host(struct mmc_host *host, struct mmc_ctx *ctx)
> +{
> +   __mmc_ctx_task_claim_host(host, ctx, NULL, NULL);
> +}
> +EXPORT_SYMBOL(mmc_ctx_claim_host);

There is no need to make this an exported function; better to keep it
static and local to core.c.

As a matter of fact, I suggest you remove this function altogether,
and just call __mmc_claim_host(), with the new parameters, from
mmc_ctx_get_card().
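
Roughly like this (an untested sketch of the suggested shape, reusing the
names from the patch):

/* __mmc_claim_host() itself takes the ctx/task pair, no wrapper layer: */
int __mmc_claim_host(struct mmc_host *host, struct mmc_ctx *ctx,
		     struct task_struct *task, atomic_t *abort);

/* the existing task-based callers become: */
ret = __mmc_claim_host(host, NULL, current, abort);

/* and a context-based claim, e.g. from mmc_ctx_get_card(): */
__mmc_claim_host(host, ctx, NULL, NULL);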

> +
>  /**
>   * 

Re: [PATCH 8/9] nvme: track shared namespaces

2017-09-20 Thread Johannes Thumshirn
On Mon, Sep 18, 2017 at 04:14:52PM -0700, Christoph Hellwig wrote:
> Introduce a new struct nvme_ns_head [1] that holds information about
> an actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there
> is a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.
> 
> [1] comments welcome if you have a better name for it, the current one is
> horrible.  One idea would be to rename the current struct nvme_ns
> to struct nvme_ns_link or similar and use the nvme_ns name for the
> new structure.  But that would involve a lot of churn.

Being one of the persons who has to backport a lot of NVMe code to older
kernels, I'm not a huge fan of renaming nvme_ns.

That said, I don't have a better name for nvme_ns_head (yet) but I'll try to
think of one as well. OTOH looking at nvme_ns_head, it actually is the list
head of the nvme_ns list.
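
For anyone trying to picture it, the relationship as described boils down to
roughly this (a rough sketch with abbreviated/hypothetical field names, not
the actual declarations):

struct nvme_ns_head {
	struct list_head	list;	/* all nvme_ns paths to this namespace */
	struct srcu_struct	srcu;	/* protects lockless path lookup */
	/* subsystem-wide identifiers (EUI-64/NGUID/UUID) live here */
};

struct nvme_ns {
	struct list_head	siblings;	/* entry in head->list */
	struct nvme_ns_head	*head;
	/* per-controller state stays here */
};

/* lockless path lookup then looks roughly like: */
struct nvme_ns *ns;
int idx;

idx = srcu_read_lock(&head->srcu);
list_for_each_entry_rcu(ns, &head->list, siblings) {
	/* pick a usable path and submit the bio */
}
srcu_read_unlock(&head->srcu, idx);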

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 7/9] nvme: introduce a nvme_ns_ids structure

2017-09-20 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 9/9] nvme: implement multipath access to nvme subsystems

2017-09-20 Thread Johannes Thumshirn
On Wed, Sep 20, 2017 at 08:18:41AM +0800, Tony Yang wrote:
> Hi , Christoph
> 
>  I use the above code to recompile the kernel. The following error
> occurred. I can't find the blk_steal_bios function. What's the reason
> for that? Hope to get your help, thank you

Hi Tony,

have you pulled it from Christoph's git? I just did a git pull and it builds
fine.

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 2/2 v6] mmc: block: Delete mmc_access_rpmb()

2017-09-20 Thread Linus Walleij
This function is used by the block layer queue to bail out of
requests if the current request is towards an RPMB
"block device".

This was done to avoid boot time scanning of this "block
device" which was never really a block device, thus duct-taping
over the fact that it was badly engineered.

This problem is now gone as we removed the offending RPMB block
device in another patch and replaced it with a character
device.

Cc: Tomas Winkler 
Signed-off-by: Linus Walleij 
---
ChangeLog v5->v6:
- Update the commit description with an accurate description
  of why this was done in the first place, and why it can now
  be removed.
ChangeLog v1->v5:
- Renumber to keep together with the rest of the series.
---
 drivers/mmc/core/block.c | 12 
 drivers/mmc/core/queue.c |  2 +-
 drivers/mmc/core/queue.h |  2 --
 3 files changed, 1 insertion(+), 15 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 6421d06b66bb..48521376b17e 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1203,18 +1203,6 @@ static inline void mmc_blk_reset_success(struct mmc_blk_data *md, int type)
md->reset_done &= ~type;
 }
 
-int mmc_access_rpmb(struct mmc_queue *mq)
-{
-   struct mmc_blk_data *md = mq->blkdata;
-   /*
-* If this is a RPMB partition access, return ture
-*/
-   if (md && md->part_type == EXT_CSD_PART_CONFIG_ACC_RPMB)
-   return true;
-
-   return false;
-}
-
 /*
  * The non-block commands come back from the block layer after it queued it and
  * processed it with all other requests and then they get issued in this
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index affa7370ba82..3baccbf16f3d 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -32,7 +32,7 @@ static int mmc_prep_request(struct request_queue *q, struct request *req)
 {
struct mmc_queue *mq = q->queuedata;
 
-   if (mq && (mmc_card_removed(mq->card) || mmc_access_rpmb(mq)))
+   if (mq && mmc_card_removed(mq->card))
return BLKPREP_KILL;
 
req->rq_flags |= RQF_DONTPREP;
diff --git a/drivers/mmc/core/queue.h b/drivers/mmc/core/queue.h
index a2b6a9fcab01..7649ed6cbef7 100644
--- a/drivers/mmc/core/queue.h
+++ b/drivers/mmc/core/queue.h
@@ -89,6 +89,4 @@ extern unsigned int mmc_queue_map_sg(struct mmc_queue *,
 extern void mmc_queue_bounce_pre(struct mmc_queue_req *);
 extern void mmc_queue_bounce_post(struct mmc_queue_req *);
 
-extern int mmc_access_rpmb(struct mmc_queue *);
-
 #endif
-- 
2.13.5



[PATCH 1/2 v6] mmc: block: Convert RPMB to a character device

2017-09-20 Thread Linus Walleij
The RPMB partition on the eMMC devices is a special area used
for storing cryptographically safe information signed by a
special secret key. To write and read records from this special
area, authentication is needed.

The RPMB area is *only* and *exclusively* accessed using
ioctl():s from userspace. It is not really a block device,
as blocks cannot be read or written from the device; also,
the signed chunks that can be stored on the RPMB are actually
256 bytes, not 512, making a block device a really bad fit.

Currently the RPMB partition spawns a separate block device
named /dev/mmcblkNrpmb for each device with an RPMB partition,
including the creation of a block queue with its own kernel
thread and all overhead associated with this. On the Ux500
HREFv60 platform, for example, the two eMMCs means that two
block queues with separate threads are created for no use
whatsoever.

I have concluded that this block device design for RPMB is
actually pretty wrong. The RPMB area should have been designed
to be accessed from /dev/mmcblkN directly, using ioctl()s on
the main block device. It is however way too late to change
that, since userspace expects to open an RPMB device in
/dev/mmcblkNrpmb and we cannot break userspace.

This patch tries to amend the situation using the following
strategy:

- Stop creating a block device for the RPMB partition/area

- Instead create a custom, dynamic character device with
  the same name.

- Make this new character device support exactly the same
  set of ioctl()s as the old block device.

- Wrap the requests back to the same ioctl() handlers, but
  issue them on the block queue of the main partition/area,
  i.e. /dev/mmcblkN (see the sketch below)

We need to create a special "rpmb" bus type in order to get
udev and/or busybox hot/coldplug to instantiate the device
node properly.
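
The wrap-back mentioned above boils down to something like this (a condensed
sketch; error handling is omitted, and the exact mmc_blk_ioctl_cmd() signature
with the rpmb argument is an assumption of mine):

static long mmc_rpmb_ioctl(struct file *filp, unsigned int cmd,
			   unsigned long arg)
{
	struct mmc_rpmb_data *rpmb = filp->private_data;

	switch (cmd) {
	case MMC_IOC_CMD:
		/* re-enters the block ioctl path, queued on the main area */
		return mmc_blk_ioctl_cmd(rpmb->md,
					 (struct mmc_ioc_cmd __user *)arg,
					 rpmb);
	default:
		return -EINVAL;
	}
}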

Before the patch, this appears in 'ps aux':

101 root   0:00 [mmcqd/2rpmb]
123 root   0:00 [mmcqd/3rpmb]

After applying the patch these surplus block queue threads
are gone, but RPMB is as usable as ever using the userspace
MMC tools, such as 'mmc rpmb read-counter'.

We instead get these dynamic devices in /dev:

brw-rw----    1 root root  179,   0 Jan  1  2000 mmcblk0
brw-rw----    1 root root  179,   1 Jan  1  2000 mmcblk0p1
brw-rw----    1 root root  179,   2 Jan  1  2000 mmcblk0p2
brw-rw----    1 root root  179,   5 Jan  1  2000 mmcblk0p5
brw-rw----    1 root root  179,   8 Jan  1  2000 mmcblk2
brw-rw----    1 root root  179,  16 Jan  1  2000 mmcblk2boot0
brw-rw----    1 root root  179,  24 Jan  1  2000 mmcblk2boot1
crw-rw----    1 root root  248,   0 Jan  1  2000 mmcblk2rpmb
brw-rw----    1 root root  179,  32 Jan  1  2000 mmcblk3
brw-rw----    1 root root  179,  40 Jan  1  2000 mmcblk3boot0
brw-rw----    1 root root  179,  48 Jan  1  2000 mmcblk3boot1
brw-rw----    1 root root  179,  33 Jan  1  2000 mmcblk3p1
crw-rw----    1 root root  248,   1 Jan  1  2000 mmcblk3rpmb

Notice the (248,0) and (248,1) character devices for RPMB.

Cc: Tomas Winkler 
Signed-off-by: Linus Walleij 
---
ChangeLog v5->v6:
- Prefix the bus name with mmc_ so it becomes "mmc_rpmb"
- Prefix the RPMB-specific symbols with mmc_*
- Use the ternary operator ( = rpmb ? A : B ) for assigning IOCTL
  enums
ChangeLog v1 (RFC) -> v5:
- Rebase.
- Drop discussion comments, let's go for this unless someone
  has a better idea.
- Rename rpmb_devt and rpmb_bus_type to mmc_rpmb_devt and
  mmc_rpmb_bus_type as requested by Tomas.
- Handle multiple RPMB partitions as requested by Tomas.
- Renumber v5 to keep together with the rest of the series.
---
 drivers/mmc/core/block.c | 283 +++
 drivers/mmc/core/queue.h |   2 +
 2 files changed, 263 insertions(+), 22 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 29fc1e662891..6421d06b66bb 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -86,6 +87,7 @@ static int max_devices;
 #define MAX_DEVICES 256
 
 static DEFINE_IDA(mmc_blk_ida);
+static DEFINE_IDA(mmc_rpmb_ida);
 
 /*
  * There is one mmc_blk_data per slot.
@@ -96,6 +98,7 @@ struct mmc_blk_data {
struct gendisk  *disk;
struct mmc_queue queue;
struct list_head part;
+   struct list_head rpmbs;
 
	unsigned int	flags;
#define MMC_BLK_CMD23	(1 << 0)	/* Can do SET_BLOCK_COUNT for multiblock */
@@ -121,6 +124,32 @@ struct mmc_blk_data {
int area_type;
 };
 
+/* Device type for RPMB character devices */
+static dev_t mmc_rpmb_devt;
+
+/* Bus type for RPMB character devices */
+static struct bus_type mmc_rpmb_bus_type = {
+   .name = "mmc_rpmb",
+};
+
+/**
+ * struct mmc_rpmb_data - special RPMB device type for these areas
+ * @dev: the device for the 
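
For reference, the registration pattern for the new character devices boils
down to roughly this (a condensed sketch based on the fragments above; the
init function name is hypothetical and error unwinding is omitted):

static int __init mmc_rpmb_chrdev_setup(void)
{
	int ret;

	/* one dynamic major; minors are handed out per RPMB area via an IDA */
	ret = alloc_chrdev_region(&mmc_rpmb_devt, 0, MAX_DEVICES, "rpmb");
	if (ret)
		return ret;

	/* the bus type lets udev/busybox hot/coldplug create the nodes */
	return bus_register(&mmc_rpmb_bus_type);
}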

Re: [PATCHv2] bcache: option for allow stale data on read failure

2017-09-20 Thread Michael Lyle
Coly--

It's an interesting changeset.

I am not positive if it will work in practice-- the most likely
objects to be cached are filesystem metadata.  Won't most filesystems
fall apart if some of their data structures revert back to an earlier
point in time?

Mike

On Tue, Sep 19, 2017 at 3:24 PM, Coly Li  wrote:
> When bcache does read I/Os, for example in writeback or writethrough mode,
> if a read request on the cache device fails, bcache will try to recover
> the request by reading from the cached device. If the data on the cached device is
> not synced with the cache device, then the requester will get stale data.
>
> For a critical storage system like a database, providing stale data from
> recovery may result in application level data corruption, which is
> unacceptable. But for some other situations like a multi-media stream cache,
> continuous service may be more important and it is acceptable to fetch
> a chunk of stale data.
>
> This patch tries to solve the above conflict by adding a sysfs option
> /sys/block/bcache/bcache/allow_stale_data_on_failure
> which is cleared (to 0) by default, i.e. disabled. Now people can make choices
> for different situations.
>
> With this patch, for a failed read request in writeback or writethrough
> mode, recovery of a recoverable read request only happens in one of the
> following conditions,
>  - dc->has_dirty is zero. It means all data on the cache device is synced to
>    the cached device, so the recovered data is up-to-date.
>  - dc->has_dirty is non-zero, and dc->allow_stale_data_on_failure is set
>    to 1. It means there is dirty data not synced to the cached device yet, but
>    the option allow_stale_data_on_failure is set, so receiving stale data is
>    explicitly acceptable for the requester.
>
> For other cache modes in bcache, read requests will never hit
> cached_dev_read_error(), so they don't need this patch.
>
> Please note, because the cache mode can be switched arbitrarily at run time, a
> writethrough mode might have been switched from a writeback mode. Therefore
> checking dc->has_dirty in writethrough mode still makes sense.
>
> Changelog:
> v2: rename sysfs entry from allow_stale_data_on_failure  to
> allow_stale_data_on_failure, and fix the confusing commit log.
> v1: initial patch posted.
>
> Signed-off-by: Coly Li 
> Reported-by: Arne Wolf 
> Cc: Nix 
> Cc: Kai Krakow 
> Cc: Eric Wheeler 
> Cc: Junhui Tang 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/md/bcache/bcache.h  |  1 +
>  drivers/md/bcache/request.c | 14 +-
>  drivers/md/bcache/sysfs.c   |  4 
>  3 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index dee542fff68e..f26b174f409a 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -356,6 +356,7 @@ struct cached_dev {
> 	unsigned		partial_stripes_expensive:1;
> 	unsigned		writeback_metadata:1;
> 	unsigned		writeback_running:1;
> +	unsigned		allow_stale_data_on_failure:1;
> 	unsigned char		writeback_percent;
> 	unsigned		writeback_delay;
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 019b3df9f1c6..becbc0959ca2 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -702,8 +702,20 @@ static void cached_dev_read_error(struct closure *cl)
>  {
> struct search *s = container_of(cl, struct search, cl);
> struct bio *bio = &s->bio.bio;
> +   struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
> +   int recovery_stale_data = dc ? dc->allow_stale_data_on_failure : 0;
>
> -   if (s->recoverable) {
> +   /*
> +* If dc->has_dirty is non-zero and the recovering data is on cache
> +* device, then recover from cached device will return a stale data
> +* to requester. But in some cases people accept stale data to avoid
> +* a -EIO. So I/O error recovery only happens when,
> +* - No dirty data on cache device.
> +* - Cached device is dirty but sysfs allow_stale_data_on_failure is
> +*   explicitly set (to 1) to accept stale data from recovery.
> +*/
> +   if (s->recoverable &&
> +   (!atomic_read(&dc->has_dirty) || recovery_stale_data)) {
> /* Retry from the backing device: */
> trace_bcache_read_retry(s->orig_bio);
>
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index f90f13616980..8603756005a8 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -106,6 +106,7 @@ rw_attribute(cache_replacement_policy);
>  rw_attribute(btree_shrinker_disabled);
>  rw_attribute(copy_gc_enabled);
>  rw_attribute(size);
> +rw_attribute(allow_stale_data_on_failure);
>
>