Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Verma, Vishal L
On Tue, 2017-01-17 at 18:01 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger wrote:
> > On Jan 17, 2017, at 3:15 PM, Andiry Xu  wrote:
> > > On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma wrote:
> > > > On 01/16, Darrick J. Wong wrote:
> > > > > On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > > > > > On 01/14, Slava Dubeyko wrote:
> > > > > > > 
> > > > > > >  Original Message 
> > > > > > > Subject: [LSF/MM TOPIC] Badblocks checking/representation
> > > > > > > in filesystems
> > > > > > > Sent: Jan 13, 2017 1:40 PM
> > > > > > > From: "Verma, Vishal L" 
> > > > > > > To: lsf...@lists.linux-foundation.org
> > > > > > > Cc: linux-nvd...@lists.01.org, linux-block@vger.kernel.org
> > > > > > > , linux-fsde...@vger.kernel.org
> > > > > > > 
> > > > > > > > The current implementation of badblocks, where we
> > > > > > > > consult the
> > > > > > > > badblocks list for every IO in the block driver works,
> > > > > > > > and is a
> > > > > > > > last option failsafe, but from a user perspective, it
> > > > > > > > isn't the
> > > > > > > > easiest interface to work with.
> > > > > > > 
> > > > > > > As I remember, the FAT and HFS+ specifications contain a
> > > > > > > description of a bad blocks (physical sectors) table. I believe
> > > > > > > that this table was used for the case of floppy media. But this
> > > > > > > table has finally become a completely obsolete artefact because
> > > > > > > most storage devices are reliable enough. Why do you need
> > > > > 
> > > > > ext4 has a badblocks inode to own all the bad spots on disk,
> > > > > but ISTR it
> > > > > doesn't support(??) extents or 64-bit filesystems, and might
> > > > > just be a
> > > > > vestigial organ at this point.  XFS doesn't have anything to
> > > > > track bad
> > > > > blocks currently
> > > > > 
> > > > > > > in exposing the bad blocks at the file system level?  Do you
> > > > > > > expect that the next generation of NVM memory will be so
> > > > > > > unreliable that the file system needs to manage bad blocks? What
> > > > > > > about erasure coding schemes? Does the file system really need
> > > > > > > to suffer from the bad block issue?
> > > > > > > 
> > > > > > > Usually we use LBAs, and it is the responsibility of the storage
> > > > > > > device to map a bad physical block/page/sector onto a valid one.
> > > > > > > Do you mean that we have direct access to the physical NVM
> > > > > > > memory address? But it looks like we can have a "bad block"
> > > > > > > issue even when we access data in a page cache memory page (if
> > > > > > > we use NVM memory for the page cache, of course). So, what do
> > > > > > > you imply by the "bad block" issue?
> > > > > > 
> > > > > > We don't have direct physical access to the device's address
> > > > > > space, in
> > > > > > the sense the device is still free to perform remapping of
> > > > > > chunks of NVM
> > > > > > underneath us. The problem is that when a block or address
> > > > > > range (as
> > > > > > small as a cache line) goes bad, the device maintains a
> > > > > > poison bit for
> > > > > > every affected cache line. Behind the scenes, it may have
> > > > > > already
> > > > > > remapped the range, but the cache line poison has to be kept
> > > > > > so that
> > > > > > there is a notification to the user/owner of the data that
> > > > > > something has
> > > > > > been lost. Since NVM is byte addressable memory sitting on
> > > > > > the memory
> > > > > > bus, such a poisoned cache line results in memory errors and
> > > > > > SIGBUSes.
> > > > > > Compared to traditional storage where an app will get nice
> > > > > > and friendly
> > > > > > (relatively speaking..) -EIOs. The whole badblocks
> > > > > > implementation was
> > > > > > done so that the driver can intercept IO (i.e. reads) to
> > > > > > _known_ bad
> > > > > > locations, and short-circuit them with an EIO. If the driver
> > > > > > doesn't
> > > > > > catch these, the reads will turn into a memory bus access,
> > > > > > and the
> > > > > > poison will cause a SIGBUS.
> > > > > 
> > > > > "driver" ... you mean XFS?  Or do you mean the thing that
> > > > > makes pmem
> > > > > look kind of like a traditional block device? :)
> > > > 
> > > > Yes, the thing that makes pmem look like a block device :) --
> > > > drivers/nvdimm/pmem.c
> > > > 
> > > > > 
> > > > > > This effort is to try and make this badblock checking
> > > > > > smarter - and try
> > > > > > and reduce the penalty on every IO to a smaller range, which
> > > > > > only the
> > > > > > filesystem can do.
> > > > > 
> > > > > Though... now that XFS merged the reverse mapping support,
> > > > > I've been
> > > > > wondering if there'll be a resubmission of the device errors
> > > > > callback?
> > > > > It still would be useful to be able to inform the user that
>

Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Vishal Verma
On 01/17, Lu Zhang wrote:
> I'm curious about the fault model and corresponding hardware ECC mechanisms
> for NVDIMMs. In my understanding for memory accesses to trigger MCE, it
> means the memory controller finds a detectable but uncorrectable error
> (DUE). So if there is no hardware ECC support the media errors won't even
> be noticed, not to mention badblocks or machine checks.
> 
> Current hardware ECC support for DRAM usually employs (72, 64) single-bit
> error correction mechanism, and for advanced ECCs there are techniques like
> Chipkill or SDDC which can tolerate a single DRAM chip failure. What is the
> expected ECC mode for NVDIMMs, assuming that PCM or 3dXpoint based
> technology might have higher error rates?

I'm sure once NVDIMMs start becoming widely available, there will be
more information on how they do ECC..

> 
> If DUE does happen and is flagged to the file system via MCE (somehow...),
> and the fs finds that the error corrupts its allocated data page, or
> metadata, now if the fs wants to recover its data the intuition is that
> there needs to be a stronger error correction mechanism to correct the
> hardware-uncorrectable errors. So knowing the hardware ECC baseline is
> helpful for the file system to understand how severe are the faults in
> badblocks, and develop its recovery methods.

Like mentioned before, this discussion is more about presentation of
errors in a known consumable format, rather than recovering from errors.
While recovering from errors is interesting, we already have layers
like RAID for that, and they are as applicable to NVDIMM backed storage
as they have been for disk/SSD based storage.

> 
> Regards,
> Lu
> 
> On Tue, Jan 17, 2017 at 6:01 PM, Andiry Xu  wrote:
> 
> > On Tue, Jan 17, 2017 at 4:16 PM, Andreas Dilger  wrote:
> > > On Jan 17, 2017, at 3:15 PM, Andiry Xu  wrote:
> > >> On Tue, Jan 17, 2017 at 1:35 PM, Vishal Verma 
> > wrote:
> > >>> On 01/16, Darrick J. Wong wrote:
> >  On Fri, Jan 13, 2017 at 05:49:10PM -0700, Vishal Verma wrote:
> > > On 01/14, Slava Dubeyko wrote:
> > >>
> > >>  Original Message 
> > >> Subject: [LSF/MM TOPIC] Badblocks checking/representation in
> > filesystems
> > >> Sent: Jan 13, 2017 1:40 PM
> > >> From: "Verma, Vishal L" 
> > >> To: lsf...@lists.linux-foundation.org
> > >> Cc: linux-nvd...@lists.01.org, linux-block@vger.kernel.org,
> > linux-fsde...@vger.kernel.org
> > >>
> > >>> The current implementation of badblocks, where we consult the
> > >>> badblocks list for every IO in the block driver works, and is a
> > >>> last option failsafe, but from a user perspective, it isn't the
> > >>> easiest interface to work with.
> > >>
> > >> As I remember, the FAT and HFS+ specifications contain a description
> > >> of a bad blocks (physical sectors) table. I believe that this table
> > >> was used for the case of floppy media. But this table has finally
> > >> become a completely obsolete artefact because most storage devices are
> > >> reliable enough. Why do you need
> > 
> >  ext4 has a badblocks inode to own all the bad spots on disk, but ISTR
> > it
> >  doesn't support(??) extents or 64-bit filesystems, and might just be a
> >  vestigial organ at this point.  XFS doesn't have anything to track bad
> >  blocks currently
> > 
> > >> in exposing the bad blocks at the file system level?  Do you expect
> > >> that the next generation of NVM memory will be so unreliable that the
> > >> file system needs to manage bad blocks? What about erasure coding
> > >> schemes? Does the file system really need to suffer from the bad block
> > >> issue?
> > >>
> > >> Usually we use LBAs, and it is the responsibility of the storage
> > >> device to map a bad physical block/page/sector onto a valid one. Do
> > >> you mean that we have direct access to the physical NVM memory
> > >> address? But it looks like we can have a "bad block" issue even when
> > >> we access data in a page cache memory page (if we use NVM memory for
> > >> the page cache, of course). So, what do you imply by the "bad block"
> > >> issue?
> > >
> > > We don't have direct physical access to the device's address space,
> > in
> > > the sense the device is still free to perform remapping of chunks of
> > NVM
> > > underneath us. The problem is that when a block or address range (as
> > > small as a cache line) goes bad, the device maintains a poison bit
> > for
> > > every affected cache line. Behind the scenes, it may have already
> > > remapped the range, but the cache line poison has to be kept so that
> > > there is a notification to the user/owner of the data that something
> > has
> > > been lost. Since NVM is byte addressable memory sitting on the memory
> > > bus, such a poisoned cache line results in memory errors and
> > SIGBUSes.
> > > Compared t

Re: [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Verma, Vishal L
On Tue, 2017-01-17 at 17:58 -0800, Andiry Xu wrote:
> On Tue, Jan 17, 2017 at 3:51 PM, Vishal Verma wrote:
> > On 01/17, Andiry Xu wrote:
> > 
> > 
> > 
> > > > > 
> > > > > The pmem_do_bvec() read logic is like this:
> > > > > 
> > > > > pmem_do_bvec()
> > > > > if (is_bad_pmem())
> > > > > return -EIO;
> > > > > else
> > > > > memcpy_from_pmem();
> > > > > 
> > > > > Note memcpy_from_pmem() is calling memcpy_mcsafe(). Does this
> > > > > imply
> > > > > that even if a block is not in the badblock list, it still can
> > > > > be bad
> > > > > and causes MCE? Does the badblock list get changed during file
> > > > > system
> > > > > running? If that is the case, should the file system get a
> > > > > notification when it gets changed? If a block is good when I
> > > > > first
> > > > > read it, can I still trust it to be good for the second
> > > > > access?
> > > > 
> > > > Yes, if a block is not in the badblocks list, it can still cause
> > > > an
> > > > MCE. This is the latent error case I described above. For a
> > > > simple read()
> > > > via the pmem driver, this will get handled by memcpy_mcsafe. For
> > > > mmap,
> > > > an MCE is inevitable.
> > > > 
> > > > Yes the badblocks list may change while a filesystem is running.
> > > > The RFC
> > > > patches[1] I linked to add a notification for the filesystem
> > > > when this
> > > > happens.
> > > > 
> > > 
> > > This is really bad and it makes the file system implementation much
> > > more complicated. And the badblock notification does not help very
> > > much, because any block can potentially be bad, no matter whether it is
> > > in the badblock list or not. And the file system has to perform
> > > checking for every read, using memcpy_mcsafe. This is a disaster for a
> > > file system like NOVA, which uses pointer dereference to access data
> > > structures on pmem. Now if I want to read a field in an inode on pmem,
> > > I have to copy it to DRAM first and make sure memcpy_mcsafe() does not
> > > report anything wrong.
> > 
> > You have a good point, and I don't know if I have an answer for
> > this..
> > Assuming a system with MCE recovery, maybe NOVA can add a mce
> > handler
> > similar to nfit_handle_mce(), and handle errors as they happen, but
> > I'm
> > being very hand-wavey here and don't know how much/how well that
> > might
> > work..
> > 
> > > 
> > > > No, if the media, for some reason, 'develops' a bad cell, a
> > > > second
> > > > consecutive read does have a chance of being bad. Once a
> > > > location has
> > > > been marked as bad, it will stay bad till the ACPI clear error
> > > > 'DSM' has
> > > > been called to mark it as clean.
> > > > 
> > > 
> > > I wonder what happens to a write in this case? If a block is bad but
> > > not reported in the badblock list, and I write to it without reading
> > > first, do I clear the poison with the write? Or is an ACPI DSM still
> > > required?
> > 
> > With writes, my understanding is there is still a possibility that
> > an
> > internal read-modify-write can happen, and cause a MCE (this is the
> > same
> > as writing to a bad DRAM cell, which can also cause an MCE). You
> > can't
> > really use the ACPI DSM preemptively because you don't know whether
> > the
> > location was bad. The error flow will be something like write causes
> > the
> > MCE, a badblock gets added (either through the mce handler or after
> > the
> > next reboot), and the recovery path is now the same as a regular
> > badblock.
> > 
> 
> This is different from my understanding. Right now write_pmem() in
> pmem_do_bvec() does not use memcpy_mcsafe(). If the block is bad it
> clears poison and writes to pmem again. Seems to me writing to bad
> blocks does not cause MCE. Do we need memcpy_mcsafe for pmem stores?

You are right, writes don't use memcpy_mcsafe, and will not directly
cause an MCE. However a write can cause an asynchronous 'CMCI' -
corrected machine check interrupt, but this is not critical, and won't be
a memory error as the core didn't consume poison. memcpy_mcsafe cannot
protect against this because the write is 'posted' and the CMCI is not
synchronous. Note that this is only in the latent error or memmap-store
case.
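
For the read side discussed further up in the thread, here is a minimal
sketch of the kind of guarded load a pmem-native filesystem like NOVA would
need before dereferencing anything on persistent memory. It assumes the
memcpy_mcsafe() of this era (returning 0 on success and a negative error when
poison is consumed; the return convention has changed in later kernels), and
the inode layout is purely hypothetical:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/errno.h>

struct example_pmem_inode {		/* hypothetical on-pmem layout */
	__le64	i_size;
	__le64	i_blocks;
};

static int example_read_pmem_inode(const struct example_pmem_inode *pmem_ino,
				   struct example_pmem_inode *dram_copy)
{
	/* Copy to DRAM through the machine-check-safe path ... */
	if (memcpy_mcsafe(dram_copy, pmem_ino, sizeof(*dram_copy)))
		return -EIO;	/* poison consumed: treat it as a media error */

	/* ... and only dereference the DRAM copy from here on. */
	return 0;
}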

> 
> Thanks,
> Andiry
> 
> > > 
> > > > [1]: http://www.linux.sgi.com/archives/xfs/2016-06/msg00299.html
> > > > 
> > > 
> > > Thank you for the patchset. I will look into it.

Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Jens Axboe
On 01/18/2017 06:02 AM, Hannes Reinecke wrote:
> On 01/17/2017 05:50 PM, Sagi Grimberg wrote:
>>
>>> So it looks like we are super not efficient because most of the
>>> times we catch 1
>>> completion per interrupt and the whole point is that we need to find
>>> more! This fio
>>> is single threaded with QD=32 so I'd expect that we be somewhere in
>>> 8-31 almost all
>>> the time... I also tried QD=1024, histogram is still the same.
>>>
>>>
>>> It looks like it takes you longer to submit an I/O than to service an
>>> interrupt,
>>
>> Well, with irq-poll we do practically nothing in the interrupt handler,
>> only schedule irq-poll. Note that the latency measures are only from
>> the point the interrupt arrives and the point we actually service it
>> by polling for completions.
>>
>>> so increasing queue depth in the singe-threaded case doesn't
>>> make much difference. You might want to try multiple threads per core
>>> with QD, say, 32
>>
>> This is how I ran, QD=32.
> 
> The one thing which I found _really_ curious is this:
> 
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>   issued: total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>  latency   : target=0, window=0, percentile=100.00%, depth=256
> 
> (note the lines starting with 'submit' and 'complete').
> They are _always_ 4, irrespective of the hardware and/or tests which I
> run. Jens, what are these numbers supposed to mean?
> Is this intended?

It's bucketized. 0=0.0% means that 0% of the submissions didn't submit
anything (unsurprisingly), and ditto for the complete side. The next bucket
is 1..4, so 100% of submissions and completions was in that range.
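
For illustration only, a rough sketch of that bucketing, with bin edges
inferred from the labels in the output above (this is not fio's actual code):

/* Map a per-call submit/complete count to the bucket label it would be
 * reported under.  Bin edges are inferred from the fio output above. */
static const char *depth_bucket(unsigned int n)
{
	if (n == 0)
		return "0";
	if (n <= 4)
		return "4";	/* the 1..4 bucket referred to above */
	if (n <= 8)
		return "8";
	if (n <= 16)
		return "16";
	if (n <= 32)
		return "32";
	if (n <= 64)
		return "64";
	return ">=64";
}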

-- 
Jens Axboe



Re: [PATCH] nbd: only set MSG_MORE when we have more to send

2017-01-19 Thread Jens Axboe
On 01/19/2017 01:08 PM, Josef Bacik wrote:
> A user noticed that write performance was horrible over loopback and we
> traced it to an inversion of when we need to set MSG_MORE.  It should be
> set when we have more bvec's to send, not when we are on the last bvec.
> This patch made the test go from 20 iops to 78k iops.

Added for 4.10, thanks.

-- 
Jens Axboe



Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Vishal Verma
On 01/18, Jan Kara wrote:
> On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > I do mean that in the filesystem, for every IO, the badblocks will be
> > checked. Currently, the pmem driver does this, and the hope is that the
> > filesystem can do a better job at it. The driver unconditionally checks
> > every IO for badblocks on the whole device. Depending on how the
> > badblocks are represented in the filesystem, we might be able to quickly
> > tell if a file/range has existing badblocks, and error out the IO
> > accordingly.
> > 
> > At mount, the fs would read the existing badblocks on the block
> > device, and build its own representation of them. Then during normal
> > use, if the underlying badblocks change, the fs would get a notification
> > that would allow it to also update its own representation.
> 
> So I believe we have to distinguish three cases so that we are on the same
> page.
> 
> 1) PMEM is exposed only via a block interface for legacy filesystems to
> use. Here, all the bad blocks handling IMO must happen in NVDIMM driver.
> Looking from outside, the IO either returns with EIO or succeeds. As a
> result you cannot ever get rid of bad blocks handling in the NVDIMM driver.

Correct.

> 
> 2) PMEM is exposed for DAX aware filesystem. This seems to be what you are
> mostly interested in. We could possibly do something more efficient than
> what NVDIMM driver does however the complexity would be relatively high and
> frankly I'm far from convinced this is really worth it. If there are so
> many badblocks this would matter, the HW has IMHO bigger problems than
> performance.

Correct, and Dave was of the opinion that once at least XFS has reverse
mapping support (which it does now), adding badblocks information to
that should not be a hard lift, and should be a better solution. I
suppose I should try to benchmark how much of a penalty the current badblock
checking in the NVDIMM driver imposes. The penalty is not because there
may be a large number of badblocks, but just due to the fact that we
have to do this check for every IO, in fact, every 'bvec' in a bio.
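
To show where that per-bvec cost comes from, a schematic sketch of the range
check involved; the real driver uses the kernel's badblocks infrastructure,
and the structure below is purely illustrative:

#include <linux/types.h>

struct bad_range {		/* illustrative, not the kernel's struct badblocks */
	sector_t	start;
	sector_t	len;
};

/* Return true if [sector, sector + sectors) overlaps any known bad range. */
static bool range_has_badblock(const struct bad_range *bb, int nr,
			       sector_t sector, unsigned int sectors)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (sector < bb[i].start + bb[i].len &&
		    bb[i].start < sector + sectors)
			return true;	/* overlap: the bvec is failed with -EIO */
	}
	return false;
}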

> 
> 3) PMEM filesystem - there things are even more difficult as was already
> noted elsewhere in the thread. But for now I'd like to leave those aside
> not to complicate things too much.

Agreed that that merits consideration and a whole discussion by itself,
based on the points Andiry raised.

> 
> Now my question: Why do we bother with badblocks at all? In cases 1) and 2)
> if the platform can recover from MCE, we can just always access persistent
> memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> seems to already happen so we just need to make sure all places handle
> returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
> No need for bad blocks list at all, no slow down unless we hit a bad cell
> and in that case who cares about performance when the data is gone...

Even when we have MCE recovery, we cannot do away with the badblocks
list:
1. My understanding is that the hardware's ability to do MCE recovery is
limited/best-effort, and is not guaranteed. There can be circumstances
that cause a "Processor Context Corrupt" state, which is unrecoverable.
2. We still need to maintain a badblocks list so that we know what
blocks need to be cleared (via the ACPI method) on writes.
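
To make point 2 concrete, a rough sketch of the write-side flow the badblocks
list enables. Every helper name below is a hypothetical stand-in for the
driver's badblocks lookup, the ACPI clear-error call, and the address
translation:

#include <linux/types.h>
#include <linux/string.h>

struct example_pmem;	/* hypothetical driver state */

bool example_range_is_known_bad(struct example_pmem *p, sector_t s, unsigned int len);
void example_clear_poison(struct example_pmem *p, sector_t s, unsigned int len);
void example_forget_badblocks(struct example_pmem *p, sector_t s, unsigned int len);
void *example_pmem_address(struct example_pmem *p, sector_t s);

static int example_pmem_write(struct example_pmem *pmem, const void *src,
			      sector_t sector, unsigned int len)
{
	if (example_range_is_known_bad(pmem, sector, len)) {
		/* Ask the platform (ACPI clear-error method) to scrub the poison... */
		example_clear_poison(pmem, sector, len);
		/* ...then drop the range from the badblocks list. */
		example_forget_badblocks(pmem, sector, len);
	}
	/* Ordinary (non-mcsafe) store of the new data. */
	memcpy(example_pmem_address(pmem, sector), src, len);
	return 0;
}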

> 
> For platforms that cannot recover from MCE - just buy better hardware ;).
> Seriously, I have doubts people can seriously use a machine that will
> unavoidably randomly reboot (as there is always a risk you hit error that
> has not been uncovered by background scrub). But maybe for big cloud providers
> the cost savings may offset for the inconvenience, I don't know. But still
> for that case a bad blocks handling in NVDIMM code like we do now looks
> good enough?

The current handling is good enough for those systems, yes.

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR


[PATCH] nbd: only set MSG_MORE when we have more to send

2017-01-19 Thread Josef Bacik
A user noticed that write performance was horrible over loopback and we
traced it to an inversion of when we need to set MSG_MORE.  It should be
set when we have more bvec's to send, not when we are on the last bvec.
This patch made the test go from 20 iops to 78k iops.

Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 454f770..510ae02 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -275,7 +275,7 @@ static inline int sock_send_bvec(struct nbd_device *nbd, int index,
 static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 {
struct request *req = blk_mq_rq_from_pdu(cmd);
-   int result, flags;
+   int result;
struct nbd_request request;
unsigned long size = blk_rq_bytes(req);
struct bio *bio;
@@ -314,7 +314,6 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
if (type != NBD_CMD_WRITE)
return 0;
 
-   flags = 0;
bio = req->bio;
while (bio) {
struct bio *next = bio->bi_next;
@@ -323,9 +322,8 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 
bio_for_each_segment(bvec, bio, iter) {
bool is_last = !next && bio_iter_last(bvec, iter);
+   int flags = is_last ? 0 : MSG_MORE;
 
-   if (is_last)
-   flags = MSG_MORE;
dev_dbg(nbd_to_dev(nbd), "request %p: sending %d bytes data\n",
cmd, bvec.bv_len);
result = sock_send_bvec(nbd, index, &bvec, flags);
-- 
2.5.5



Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Jeff Moyer
Hi, Slava,

Slava Dubeyko  writes:

>>The data is lost, that's why you're getting an ECC.  It's tantamount
>>to -EIO for a disk block access.
>
> I see the three possible cases here:
> (1) bad block has been discovered (no remap, no recovering) -> data is
> lost; -EIO for a disk block access, block is always bad;

This is, of course, a possibility.  In that case, attempts to clear the
error will not succeed.

> (2) bad block has been discovered and remapped -> data is lost; -EIO
> for a disk block access.

Right, and the error is cleared when new data is provided (i.e. through
a write system call or fallocate).

> (3) bad block has been discovered, remapped and recovered -> no data is lost.

This is transparent to the OS and the application.

>>> Let's imagine that the affected address range will equal to 64 bytes. 
>>> It sounds for me that for the case of block device it will affect the 
>>> whole logical block (4 KB).
>>
>> 512 bytes, and yes, that's the granularity at which we track errors
>> in the block layer, so that's the minimum amount of data you lose.
>
> I think it depends what granularity hardware supports. It could be 512
> bytes, 4 KB, maybe greater.

Of course, though I expect the ECC protection in the NVDIMMs to cover a
range much smaller than a page.

>>> The situation is more critical for the case of DAX approach. Correct 
>>> me if I wrong but my understanding is the goal of DAX is to provide 
>>> the direct access to file's memory pages with minimal file system 
>>> overhead. So, it looks like that raising bad block issue on file 
>>> system level will affect a user-space application. Because, finally, 
>>> user-space application will need to process such trouble (bad block 
>>> issue). It sounds for me as really weird situation. What can protect a 
>>> user-space application from encountering the issue with partially 
>>> incorrect memory page?
>>
>> Applications need to deal with -EIO today.  This is the same sort of thing.
>> If an application trips over a bad block during a load from persistent 
>> memory,
>> they will get a signal, and they can either handle it or not.
>>
>> Have a read through this specification and see if it clears anything up for 
>> you:
>>  http://www.snia.org/tech_activities/standards/curr_standards/npm
>
> Thank you for sharing this. So, if a user-space application follows the
> NVM Programming Model then it will be able to survive by catching and
> processing the exceptions. But these applications have yet to be
> implemented. Also, such applications need special recovery techniques.
> It sounds like legacy user-space applications are unable to survive in
> the NVM.PM.FILE mode in the case of a load/store operation failure.

By legacy, I assume you mean those applications which mmap file data and
use msync.  Those applications already have to deal with SIGBUS today
when a disk block is bad.  There is no change in behavior.

If you meant legacy applications that use read/write, they also should
see no change in behavior.  Bad blocks are tracked in the block layer,
and any attempt to read from a bad area of memory will get -EIO.
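
As a purely illustrative userspace sketch of "handle it or not": guard loads
from an mmap'ed (DAX) file with a SIGBUS handler, so a poisoned line becomes
an application-level error instead of killing the process:

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf sigbus_env;

static void sigbus_handler(int sig)
{
	siglongjmp(sigbus_env, 1);
}

/* Returns 0 on success, -1 if the load tripped over poisoned media. */
static int guarded_read(void *dst, const void *pmem_src, size_t len)
{
	struct sigaction sa, old;
	int ret = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigbus_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(sigbus_env, 1))
		ret = -1;			/* data in this range is lost */
	else
		memcpy(dst, pmem_src, len);	/* may fault on a poisoned line */

	sigaction(SIGBUS, &old, NULL);
	return ret;
}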

Cheers,
Jeff


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Vishal Verma
On 01/19, Jan Kara wrote:
> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> > >  wrote:
> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > > > Jan Kara  writes:
> > > > > 
> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > > > Your note on the online repair does raise another tangentially
> > > > > > > related
> > > > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > > > submission
> > > > > > > path will clear the error (if the hardware is able to remap
> > > > > > > the bad
> > > > > > > locations). However, if the filesystem is mounted with DAX,
> > > > > > > even
> > > > > > > non-mmap operations - read() and write() will go through the
> > > > > > > dax paths
> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > > > perform
> > > > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > > > filesystem
> > > > > > > has badblocks, the only way to clear those badblocks is to
> > > > > > > mount it
> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > > > pretty
> > > > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > > > a better
> > > > > > > way.
> > > > > > 
> > > > > > Please remind me, what is the problem with DAX code doing
> > > > > > necessary work to
> > > > > > clear the error when it gets EIO from memcpy on write?
> > > > > 
> > > > > You won't get an MCE for a store;  only loads generate them.
> > > > > 
> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > > > -o dax?
> > > > 
> > > > Not necessarily; XFS usually implements this by punching out the
> > > > range
> > > > and then reallocating it as unwritten blocks.
> > > > 
> > > 
> > > That does clear the error because the unwritten blocks are zeroed and
> > > errors cleared when they become allocated again.
> > 
> > Yes, the problem was that writes won't clear errors. zeroing through
> > either hole-punch, truncate, unlinking the file should all work
> > (assuming the hole-punch or truncate ranges wholly contain the
> > 'badblock' sector).
> 
> Let me repeat my question: You have mentioned that if we do IO through DAX,
> writes won't clear errors and we should fall back to normal block path to
> do write to clear the error. What does prevent us from directly clearing
> the error from DAX path?
> 
With DAX, all IO goes through DAX paths. There are two cases:
1. mmap and loads/stores: Obviously there is no kernel intervention
here, and no badblocks handling is possible.
2. read() or write() IO: In the absence of dax, this would go through
the bio submission path, through the pmem driver, and that would handle
error clearing. With DAX, this goes through dax_iomap_actor, which also
doesn't go through the pmem driver (it does a dax mapping, followed by
essentially memcpy), and hence cannot handle badblocks.


>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Dan Williams
On Thu, Jan 19, 2017 at 10:59 AM, Vishal Verma  wrote:
> On 01/19, Jan Kara wrote:
>> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
>> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
>> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
>> > >  wrote:
>> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
>> > > > > Jan Kara  writes:
>> > > > >
>> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> > > > > > > Your note on the online repair does raise another tangentially
>> > > > > > > related
>> > > > > > > topic. Currently, if there are badblocks, writes via the bio
>> > > > > > > submission
>> > > > > > > path will clear the error (if the hardware is able to remap
>> > > > > > > the bad
>> > > > > > > locations). However, if the filesystem is mounted with DAX,
>> > > > > > > even
>> > > > > > > non-mmap operations - read() and write() will go through the
>> > > > > > > dax paths
>> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
>> > > > > > > perform
>> > > > > > > error-clearing in this case. So currently, if a dax mounted
>> > > > > > > filesystem
>> > > > > > > has badblocks, the only way to clear those badblocks is to
>> > > > > > > mount it
>> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
>> > > > > > > pretty
>> > > > > > > terrible user experience, and I'm hoping this can be solved in
>> > > > > > > a better
>> > > > > > > way.
>> > > > > >
>> > > > > > Please remind me, what is the problem with DAX code doing
>> > > > > > necessary work to
>> > > > > > clear the error when it gets EIO from memcpy on write?
>> > > > >
>> > > > > You won't get an MCE for a store;  only loads generate them.
>> > > > >
>> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
>> > > > > -o dax?
>> > > >
>> > > > Not necessarily; XFS usually implements this by punching out the
>> > > > range
>> > > > and then reallocating it as unwritten blocks.
>> > > >
>> > >
>> > > That does clear the error because the unwritten blocks are zeroed and
>> > > errors cleared when they become allocated again.
>> >
>> > Yes, the problem was that writes won't clear errors. zeroing through
>> > either hole-punch, truncate, unlinking the file should all work
>> > (assuming the hole-punch or truncate ranges wholly contain the
>> > 'badblock' sector).
>>
>> Let me repeat my question: You have mentioned that if we do IO through DAX,
>> writes won't clear errors and we should fall back to normal block path to
>> do write to clear the error. What does prevent us from directly clearing
>> the error from DAX path?
>>
> With DAX, all IO goes through DAX paths. There are two cases:
> 1. mmap and loads/stores: Obviously there is no kernel intervention
> here, and no badblocks handling is possible.
> 2. read() or write() IO: In the absence of dax, this would go through
> the bio submission path, through the pmem driver, and that would handle
> error clearing. With DAX, this goes through dax_iomap_actor, which also
> doesn't go through the pmem driver (it does a dax mapping, followed by
> essentially memcpy), and hence cannot handle badblocks.

Hmm, that may no longer be true after my changes to push dax flushing
to the driver. I.e. we could have a copy_from_iter() implementation
that attempts to clear errors... I'll get that series out and we can
discuss there.
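
A very rough sketch of that idea, just to make it concrete; everything below
except copy_from_iter() is a hypothetical name, and the real interface is
whatever that series ends up defining:

#include <linux/uio.h>
#include <linux/types.h>

struct example_dax_dev;		/* hypothetical per-device state */

bool example_range_is_bad(struct example_dax_dev *d, void *kaddr, size_t bytes);
void example_dax_clear_poison(struct example_dax_dev *d, void *kaddr, size_t bytes);

static size_t example_dax_copy_from_iter(struct example_dax_dev *dev,
					 void *kaddr, size_t bytes,
					 struct iov_iter *i)
{
	/* If the destination range is known-bad, scrub it first so the
	 * subsequent store lands on clean media and clears the badblock. */
	if (example_range_is_bad(dev, kaddr, bytes))
		example_dax_clear_poison(dev, kaddr, bytes);

	/* copy_from_iter() is the normal, non-mcsafe copy into pmem. */
	return copy_from_iter(kaddr, bytes, i);
}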


[PATCH] blk-mq: allow resize of scheduler requests

2017-01-19 Thread Jens Axboe

Add support for growing the tags associated with a hardware queue, for
the scheduler tags. Currently we only support resizing within the
limits of the original depth, change that so we can grow it as well by
allocating and replacing the existing scheduler tag set.

This is similar to how we could increase the software queue depth with
the legacy IO stack and schedulers.

Signed-off-by: Jens Axboe 

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5504eb7ed10b..3ab514f6f288 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -387,19 +387,52 @@ void blk_mq_free_tags(struct blk_mq_tags *tags)
kfree(tags);
 }
 
-int blk_mq_tag_update_depth(struct blk_mq_tags *tags, unsigned int tdepth)
+int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_tags **tagsptr, unsigned int tdepth,
+   bool can_grow)
 {
+   struct blk_mq_tags *tags = *tagsptr;
+
tdepth -= tags->nr_reserved_tags;
-   if (tdepth > tags->nr_tags)
-   return -EINVAL;
+
+   /*
+* If we are allowed to grow beyond the original size, allocate
+* a new set of tags before freeing the old one.
+*/
+   if (tdepth > tags->nr_tags) {
+   struct blk_mq_tag_set *set = hctx->queue->tag_set;
+   struct blk_mq_tags *new;
+   bool ret;
+
+   if (!can_grow)
+   return -EINVAL;
+
+   /*
+* We need some sort of upper limit, set it high enough that
+* no valid use cases should require more.
+*/
+   if (tdepth > 16 * BLKDEV_MAX_RQ)
+   return -EINVAL;
+
+   new = blk_mq_alloc_rq_map(set, hctx->queue_num, tdepth, 0);
+   if (!new)
+   return -ENOMEM;
+   ret = blk_mq_alloc_rqs(set, new, hctx->queue_num, tdepth);
+   if (ret) {
+   blk_mq_free_rq_map(new);
+   return -ENOMEM;
+   }
+
+   blk_mq_free_rqs(set, *tagsptr, hctx->queue_num);
+   blk_mq_free_rq_map(*tagsptr);
+   tags = *tagsptr = new;
+   }
 
/*
 * Don't need (or can't) update reserved tags here, they remain
 * static and should never need resizing.
 */
sbitmap_queue_resize(&tags->bitmap_tags, tdepth);
-
-   blk_mq_tag_wakeup_all(tags, false);
return 0;
 }
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 41cd15fd1afd..ac22878462e7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -29,7 +29,9 @@ extern void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
   struct blk_mq_ctx *ctx, unsigned int tag);
 extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags);
 extern ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page);
-extern int blk_mq_tag_update_depth(struct blk_mq_tags *tags, unsigned int depth);
+extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_tags **tags,
+   unsigned int depth, bool can_grow);
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b365cde4c909..ee69e5e89769 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2561,6 +2561,9 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
if (!set)
return -EINVAL;
 
+   blk_mq_freeze_queue(q);
+   blk_mq_quiesce_queue(q);
+
ret = 0;
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx->tags)
@@ -2569,11 +2572,14 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 * If we're using an MQ scheduler, just update the scheduler
 * queue depth. This is similar to what the old code would do.
 */
-   if (!hctx->sched_tags)
-   ret = blk_mq_tag_update_depth(hctx->tags,
-   min(nr, set->queue_depth));
-   else
-   ret = blk_mq_tag_update_depth(hctx->sched_tags, nr);
+   if (!hctx->sched_tags) {
+   ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
+   min(nr, set->queue_depth),
+   false);
+   } else {
+   ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
+   nr, true);
+   }
if (ret)
break;
}
@@ -2581,6 +2587,9 @@ int blk_mq_update_nr_requests(struct r

Re: requeue failure with blk-mq-sched

2017-01-19 Thread Jens Axboe
On 01/19/2017 06:24 AM, Hannes Reinecke wrote:
> On 01/19/2017 03:09 PM, Jens Axboe wrote:
>> On 01/19/2017 04:27 AM, Hannes Reinecke wrote:
>>> Hi Jens,
>>>
>>> upon further testing with your blk-mq-sched branch I hit a queue stall
>>> during requeing:
>>>
>>> [  202.340959] sd 3:0:4:1: tag#473 Send: scmd 0x880422e7a440
>>> [  202.340962] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
>>> [  202.341161] sd 3:0:4:1: tag#473 Done: ADD_TO_MLQUEUE Result:
>>> hostbyte=DID_OK driverbyte=DRIVER_OK
>>> [  202.341164] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
>>> [  202.341167] sd 3:0:4:1: tag#473 Sense Key : Unit Attention [current]
>>> [  202.341171] sd 3:0:4:1: tag#473 Add. Sense: Power on, reset, or bus
>>> device reset occurred
>>> [  202.341173] sd 3:0:4:1: tag#473 scsi host busy 1 failed 0
>>> [  202.341176] sd 3:0:4:1: tag#473 Inserting command 880422e7a440
>>> into mlqueue
>>>
>>> ... and that is the last ever heard of that device.
>>> The 'device_busy' count remains at '1' and no further commands will be
>>> sent to the device.
>>
>> If device_busy is at 1, then it should have a command pending. Where did
> you log this - it would be handy if you attached whatever debug patch
>> you put in, so we can see where the printks are coming from. If we get a
>> BUSY with nothing pending, the driver should be ensuring that the queue
>> gets run again later through blk_mq_delay_queue(), for instance.
>>
>> When the device is stuck, does it restart if you send it IO?
>>
> Meanwhile I've found it.
> 
> Problem is that scsi_queue_rq() will not stop the queue when hitting a
> busy condition before sending commands down to the driver, but still
> calls blk_mq_delay_queue():
> 
>   switch (ret) {
>   case BLK_MQ_RQ_QUEUE_BUSY:
>   if (atomic_read(&sdev->device_busy) == 0 &&
>   !scsi_device_blocked(sdev))
>   blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
>   break;
> 
> As the queue isn't stopped blk_mq_delay_queue() won't do anything,
> so queue_rq() will never be called.
> I've sent a patch to linux-scsi.
> 
> BTW: Is it a hard requirement that the queue has to be stopped when
> returning BLK_MQ_RQ_QUEUE_BUSY?

No, but currently it is apparently a hard requirement that the queue be
stopped when you call delay. Which does make sense, since there's little
point in doing the delay if the queue is run anyway.

> The comments indicate as such, but none of the drivers do so...
> Also, blk_mq_delay_queue() is a bit odd, in that it'll only start
> stopped hardware queues. I would at least document that the queue has to
> be stopped when calling that.
> Better still, can't we have blk_mq_delay_queue start the queues
> unconditionally?

I think so, it doesn't make sense to have blk_mq_delay_queue() NOT stop
the queue, yet its own work handler requires it to be set to actually
run.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index fa1f8619bfe7..739a29208a63 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1161,8 +1161,8 @@ static void blk_mq_delay_work_fn(struct work_struct *work)
 
hctx = container_of(work, struct blk_mq_hw_ctx, delay_work.work);
 
-   if (test_and_clear_bit(BLK_MQ_S_STOPPED, &hctx->state))
-   __blk_mq_run_hw_queue(hctx);
+   clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+   __blk_mq_run_hw_queue(hctx);
 }
 
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)

-- 
Jens Axboe



Re: requeue failure with blk-mq-sched

2017-01-19 Thread Hannes Reinecke
On 01/19/2017 03:09 PM, Jens Axboe wrote:
> On 01/19/2017 04:27 AM, Hannes Reinecke wrote:
>> Hi Jens,
>>
>> upon further testing with your blk-mq-sched branch I hit a queue stall
>> during requeing:
>>
>> [  202.340959] sd 3:0:4:1: tag#473 Send: scmd 0x880422e7a440
>> [  202.340962] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
>> [  202.341161] sd 3:0:4:1: tag#473 Done: ADD_TO_MLQUEUE Result:
>> hostbyte=DID_OK driverbyte=DRIVER_OK
>> [  202.341164] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
>> [  202.341167] sd 3:0:4:1: tag#473 Sense Key : Unit Attention [current]
>> [  202.341171] sd 3:0:4:1: tag#473 Add. Sense: Power on, reset, or bus
>> device reset occurred
>> [  202.341173] sd 3:0:4:1: tag#473 scsi host busy 1 failed 0
>> [  202.341176] sd 3:0:4:1: tag#473 Inserting command 880422e7a440
>> into mlqueue
>>
>> ... and that is the last ever heard of that device.
>> The 'device_busy' count remains at '1' and no further commands will be
>> sent to the device.
> 
> If device_busy is at 1, then it should have a command pending. Where did
> you log this - it would be handy if you attached whatever debug patch
> you put in, so we can see where the printks are coming from. If we get a
> BUSY with nothing pending, the driver should be ensuring that the queue
> gets run again later through blk_mq_delay_queue(), for instance.
> 
> When the device is stuck, does it restart if you send it IO?
> 
Meanwhile I've found it.

Problem is that scsi_queue_rq() will not stop the queue when hitting a
busy condition before sending commands down to the driver, but still
calls blk_mq_delay_queue():

switch (ret) {
case BLK_MQ_RQ_QUEUE_BUSY:
if (atomic_read(&sdev->device_busy) == 0 &&
!scsi_device_blocked(sdev))
blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
break;

As the queue isn't stopped blk_mq_delay_queue() won't do anything,
so queue_rq() will never be called.
I've sent a patch to linux-scsi.
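
As a hedged sketch only, one possible shape of such a change in the
scsi_queue_rq() fragment quoted above (the actual patch went to linux-scsi
and may well differ): stop the hardware queue before requesting the delayed
re-run, so the delayed work has something to restart:

	case BLK_MQ_RQ_QUEUE_BUSY:
		if (atomic_read(&sdev->device_busy) == 0 &&
		    !scsi_device_blocked(sdev)) {
			/* Make sure the delayed work actually re-runs the queue. */
			blk_mq_stop_hw_queue(hctx);
			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
		}
		break;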

BTW: Is it a hard requirement that the queue has to be stopped when
returning BLK_MQ_RQ_QUEUE_BUSY?
The comments indicate as such, but none of the drivers do so...
Also, blk_mq_delay_queue() is a bit odd, in that it'll only start
stopped hardware queues. I would at least document that the queue has to
be stopped when calling that.
Better still, can't we have blk_mq_delay_queue start the queues
unconditionally?

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [bug report] blk-mq-tag: cleanup the normal/reserved tag allocation

2017-01-19 Thread Jens Axboe
On 01/19/2017 02:54 AM, Dan Carpenter wrote:
> Hello Jens Axboe,
> 
> This is a semi-automatic email about new static checker warnings.
> 
> The patch 4941115bef2b: "blk-mq-tag: cleanup the normal/reserved tag 
> allocation" from Jan 13, 2017, leads to the following Smatch 
> complaint:
> 
> block/blk-mq-tag.c:142 blk_mq_get_tag()
>warn: variable dereferenced before check 'data->hctx' (see line 102)
> 
> block/blk-mq-tag.c
>101{
>102struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
>^^^
> We added a new "data->hctx" dereference here.
> 
>103struct sbitmap_queue *bt;
>104struct sbq_wait_state *ws;
>105DEFINE_WAIT(wait);
>106unsigned int tag_offset;
>107int tag;
>108
>109if (data->flags & BLK_MQ_REQ_RESERVED) {
>110if (unlikely(!tags->nr_reserved_tags)) {
>111WARN_ON_ONCE(1);
>112return BLK_MQ_TAG_FAIL;
>113}
>114bt = &tags->breserved_tags;
>115tag_offset = 0;
>116} else {
>117bt = &tags->bitmap_tags;
>118tag_offset = tags->nr_reserved_tags;
>119}
>120
>121tag = __blk_mq_get_tag(data->hctx, bt);
>122if (tag != -1)
>123goto found_tag;
>124
>125if (data->flags & BLK_MQ_REQ_NOWAIT)
>126return BLK_MQ_TAG_FAIL;
>127
>128ws = bt_wait_ptr(bt, data->hctx);
>129do {
>130prepare_to_wait(&ws->wait, &wait, 
> TASK_UNINTERRUPTIBLE);
>131
>132tag = __blk_mq_get_tag(data->hctx, bt);
>133if (tag != -1)
>134break;
>135
>136/*
>137 * We're out of tags on this hardware queue, 
> kick any
>138 * pending IO submits before going to sleep 
> waiting for
>139 * some to complete. Note that hctx can be NULL 
> here for
>140 * reserved tag allocation.
>141 */
>142if (data->hctx)
> 
> We're in a loop here so it's possible we don't need to check the first
> item in the list, but it's kind of new so I thought I would forward the
> message anyway even though it might be a false positive.

It's no longer possible for 'hctx' to be passed in as NULL, so the above
dereference is safe. I'll tend to this superfluous hctx check.


-- 
Jens Axboe



Re: requeue failure with blk-mq-sched

2017-01-19 Thread Jens Axboe
On 01/19/2017 04:27 AM, Hannes Reinecke wrote:
> Hi Jens,
> 
> upon further testing with your blk-mq-sched branch I hit a queue stall
> during requeing:
> 
> [  202.340959] sd 3:0:4:1: tag#473 Send: scmd 0x880422e7a440
> [  202.340962] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
> [  202.341161] sd 3:0:4:1: tag#473 Done: ADD_TO_MLQUEUE Result:
> hostbyte=DID_OK driverbyte=DRIVER_OK
> [  202.341164] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
> [  202.341167] sd 3:0:4:1: tag#473 Sense Key : Unit Attention [current]
> [  202.341171] sd 3:0:4:1: tag#473 Add. Sense: Power on, reset, or bus
> device reset occurred
> [  202.341173] sd 3:0:4:1: tag#473 scsi host busy 1 failed 0
> [  202.341176] sd 3:0:4:1: tag#473 Inserting command 880422e7a440
> into mlqueue
> 
> ... and that is the last ever heard of that device.
> The 'device_busy' count remains at '1' and no further commands will be
> sent to the device.

If device_busy is at 1, then it should have a command pending. Where did
you log this - it would be handy if you attached whatever debug patch
you put in, so we can see where the printks are coming from. If we get a
BUSY with nothing pending, the driver should be ensuring that the queue
gets run again later through blk_mq_delay_queue(), for instance.

When the device is stuck, does it restart if you send it IO?

-- 
Jens Axboe



Re: [bug report] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-19 Thread Jens Axboe
On 01/18/2017 11:32 PM, Dan Carpenter wrote:
> Hello Jens Axboe,
> 
> This is a semi-automatic email about new static checker warnings.
> 
> The patch bd166ef183c2: "blk-mq-sched: add framework for MQ capable
> IO schedulers" from Jan 17, 2017, leads to the following Smatch
> complaint:
> 
> block/elevator.c:234 elevator_init()
>error: we previously assumed 'e' could be null (see line 229)
> 
> block/elevator.c
>228
>229if (!e) {
> ^^
> Null.
> 
>230printk(KERN_ERR
>231"Default I/O scheduler not 
> found. " \
>232"Using noop/none.\n");
>233if (q->mq_ops) {
>234elevator_put(e);
> ^^^
> This will Oops.
> 
>235return 0;
>236}

Good find, thanks Dan. I've killed the elevator_put().

-- 
Jens Axboe



requeue failure with blk-mq-sched

2017-01-19 Thread Hannes Reinecke
Hi Jens,

upon further testing with your blk-mq-sched branch I hit a queue stall
during requeing:

[  202.340959] sd 3:0:4:1: tag#473 Send: scmd 0x880422e7a440
[  202.340962] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
[  202.341161] sd 3:0:4:1: tag#473 Done: ADD_TO_MLQUEUE Result:
hostbyte=DID_OK driverbyte=DRIVER_OK
[  202.341164] sd 3:0:4:1: tag#473 CDB: Test Unit Ready 00 00 00 00 00 00
[  202.341167] sd 3:0:4:1: tag#473 Sense Key : Unit Attention [current]
[  202.341171] sd 3:0:4:1: tag#473 Add. Sense: Power on, reset, or bus
device reset occurred
[  202.341173] sd 3:0:4:1: tag#473 scsi host busy 1 failed 0
[  202.341176] sd 3:0:4:1: tag#473 Inserting command 880422e7a440
into mlqueue

... and that is the last ever heard of that device.
The 'device_busy' count remains at '1' and no further commands will be
sent to the device.

Debugging continues.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Hannes Reinecke
On 01/19/2017 11:57 AM, Ming Lei wrote:
> On Wed, Jan 11, 2017 at 11:07 PM, Jens Axboe  wrote:
>> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
>>> Hi all,
>>>
>>> I'd like to attend LSF/MM and would like to discuss polling for block 
>>> drivers.
>>>
>>> Currently there is blk-iopoll but it is neither as widely used as NAPI in 
>>> the
>>> networking field and according to Sagi's findings in [1] performance with
>>> polling is not on par with IRQ usage.
>>>
>>> On LSF/MM I'd like to discuss whether it is desirable to have NAPI like polling in
>>> more block drivers and how to overcome the currently seen performance 
>>> issues.
>>
>> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
>> isn't used more widely.
> 
> I remembered that Keith and I discussed some issues of blk-iopoll:
> 
> http://marc.info/?l=linux-block&m=147576999016407&w=2
> 
> seems which isn't addressed yet.
> 
That's a different poll.

For some obscure reasons you have a blk-mq-poll function (via
q->mq_ops->poll) and an irqpoll function.
The former is for polling completion of individual block-layer tags, the
latter for polling completions from the hardware instead of relying on
interrupts.

We're discussing the latter here, so that thread isn't really applicable
here. However, there have been requests to discuss the former at LSF/MM,
too. So there might be a chance of restarting that discussion.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Ming Lei
On Wed, Jan 11, 2017 at 11:07 PM, Jens Axboe  wrote:
> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to discuss polling for block 
>> drivers.
>>
>> Currently there is blk-iopoll but it is neither as widely used as NAPI in the
>> networking field and accoring to Sagi's findings in [1] performance with
>> polling is not on par with IRQ usage.
>>
>> On LSF/MM I'd like to discuss whether it is desirable to have NAPI like polling in
>> more block drivers and how to overcome the currently seen performance issues.
>
> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> isn't used more widely.

I remembered that Keith and I discussed some issues of blk-iopoll:

http://marc.info/?l=linux-block&m=147576999016407&w=2

seems which isn't addressed yet.


Thanks,
Ming Lei


[bug report] blk-mq-tag: cleanup the normal/reserved tag allocation

2017-01-19 Thread Dan Carpenter
Hello Jens Axboe,

This is a semi-automatic email about new static checker warnings.

The patch 4941115bef2b: "blk-mq-tag: cleanup the normal/reserved tag 
allocation" from Jan 13, 2017, leads to the following Smatch 
complaint:

block/blk-mq-tag.c:142 blk_mq_get_tag()
 warn: variable dereferenced before check 'data->hctx' (see line 102)

block/blk-mq-tag.c
   101  {
   102  struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
   ^^^
We added a new "data->hctx" dereference here.

   103  struct sbitmap_queue *bt;
   104  struct sbq_wait_state *ws;
   105  DEFINE_WAIT(wait);
   106  unsigned int tag_offset;
   107  int tag;
   108  
   109  if (data->flags & BLK_MQ_REQ_RESERVED) {
   110  if (unlikely(!tags->nr_reserved_tags)) {
   111  WARN_ON_ONCE(1);
   112  return BLK_MQ_TAG_FAIL;
   113  }
   114  bt = &tags->breserved_tags;
   115  tag_offset = 0;
   116  } else {
   117  bt = &tags->bitmap_tags;
   118  tag_offset = tags->nr_reserved_tags;
   119  }
   120  
   121  tag = __blk_mq_get_tag(data->hctx, bt);
   122  if (tag != -1)
   123  goto found_tag;
   124  
   125  if (data->flags & BLK_MQ_REQ_NOWAIT)
   126  return BLK_MQ_TAG_FAIL;
   127  
   128  ws = bt_wait_ptr(bt, data->hctx);
   129  do {
   130  prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
   131  
   132  tag = __blk_mq_get_tag(data->hctx, bt);
   133  if (tag != -1)
   134  break;
   135  
   136  /*
   137   * We're out of tags on this hardware queue, kick any
   138   * pending IO submits before going to sleep waiting for
   139   * some to complete. Note that hctx can be NULL here for
   140   * reserved tag allocation.
   141   */
   142  if (data->hctx)

We're in a loop here so it's possible we don't need to check the first
item in the list, but it's kind of new so I thought I would forward the
message anyway even though it might be a false positive.

   143  blk_mq_run_hw_queue(data->hctx, false);
   144  

regards,
dan carpenter


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Johannes Thumshirn
On Thu, Jan 19, 2017 at 10:23:28AM +0200, Sagi Grimberg wrote:
> Christoph suggest to me once that we can take a hybrid
> approach where we consume a small amount of completions (say 4)
> right away from the interrupt handler and if we have more
> we schedule irq-poll to reap the rest. But back then it
> didn't work better which is not aligned with my observations
> that we consume only 1 completion per interrupt...
> 
> I can give it another go... What do people think about it?

This could be good.

What's also possible (see answer to my previous mail) is measuring the time it
takes for a completion to arrive and if the average time is lower than the
context switch time, just busy loop instead of waiting for the IRQ to arrive. If
it is higher we can always schedule a timer to hit _before_ the IRQ will
likely arrive and start polling. Is this something that sounds reasonable to
you guys as well?
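
For illustration, a hedged userspace sketch of such a heuristic; all names,
thresholds and latency numbers are made up, nothing here is driver code.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Exponentially weighted moving average of completion latency (ns). */
struct lat_stats {
	uint64_t avg_ns;
};

static void lat_update(struct lat_stats *s, uint64_t sample_ns)
{
	/* EWMA, new samples weighted 1/8. */
	s->avg_ns = s->avg_ns ? (s->avg_ns * 7 + sample_ns) / 8 : sample_ns;
}

enum wait_mode { WAIT_IRQ, WAIT_BUSY_POLL, WAIT_TIMER_THEN_POLL };

/*
 * Busy-poll when completions typically arrive faster than a context
 * switch; otherwise sleep and arm a timer slightly before the expected
 * completion so polling can start ahead of the interrupt.
 */
static enum wait_mode choose_wait_mode(const struct lat_stats *s,
				       uint64_t ctx_switch_ns)
{
	if (!s->avg_ns)
		return WAIT_IRQ;		/* no data yet, fall back to IRQ */
	if (s->avg_ns < ctx_switch_ns)
		return WAIT_BUSY_POLL;
	return WAIT_TIMER_THEN_POLL;
}

int main(void)
{
	struct lat_stats s = { 0 };
	const uint64_t ctx_switch_ns = 2000;	/* assumed ~2us, made up */
	const uint64_t samples[] = { 1500, 1800, 1600, 9000, 12000 };

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		lat_update(&s, samples[i]);
		printf("avg %llu ns -> mode %d\n",
		       (unsigned long long)s.avg_ns,
		       choose_wait_mode(&s, ctx_switch_ns));
	}
	return 0;
}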

Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Johannes Thumshirn
On Thu, Jan 19, 2017 at 10:12:17AM +0200, Sagi Grimberg wrote:
> 
> >>>I think you missed:
> >>>http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
> >>
> >>I indeed did, thanks.
> >>
> >But it doesn't help.
> >
> >We're still having to wait for the first interrupt, and if we're really
> >fast that's the only completion we have to process.
> >
> >Try this:
> >
> >
> >diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> >index b4b32e6..e2dd9e2 100644
> >--- a/drivers/nvme/host/pci.c
> >+++ b/drivers/nvme/host/pci.c
> >@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >}
> >__nvme_submit_cmd(nvmeq, &cmnd);
> >spin_unlock(&nvmeq->sq_lock);
> >+   disable_irq_nosync(nvmeq_irq(irq));
> >+   irq_poll_sched(&nvmeq->iop);
> 
> a. This would trigger a condition where we disable the irq twice, which
> is wrong, if only because it will generate a warning.
> 
> b. This would trigger ksoftirqd way too often. For it to be effective,
> we need it to run only when it should, and ideally only when the
> completion queue has a batch of completions waiting.
> 
> After a deeper analysis, I agree with Bart that interrupt coalescing is
> needed for this to work. The problem with nvme coalescing, as Jens said,
> is the death penalty of its 100us granularity. Hannes, Johannes, how
> does it look with the devices you are testing with?

I haven't had a look at AHCI's Command Completion Coalescing yet, but hopefully
I'll find the time today (+SSD testing!!!).

I don't know if Hannes did (but I _think_ not). The problem is that we've
already maxed out our test HW without irq_poll, so the only change we're
currently seeing is an increase in wasted CPU cycles. Not what we wanted.

> 
> Also, I think that adaptive moderation is needed in order for this to
> work well. I know that some networking drivers implemented adaptive
> moderation in SW before having HW support for it. It can be done by
> maintaining stats and having a periodic work item that looks at them
> and adjusts the moderation parameters.
> 
> Does anyone think that this is something we should consider?

Yes, we've been discussing this internally as well, and it sounds good, but
that's still all pure theory; nothing has actually been implemented or tested.

Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM ATTEND] block: multipage bvec upstreaming & next step

2017-01-19 Thread Coly Li
On 2017/1/17 11:56 AM, Ming Lei wrote:
> Hi Guys,
> 
> The multipage bvec patchset[1] has been posted for a while; xfstests
> (ext4, xfs and btrfs) have been run and no regressions were observed.
> I hope talking about this in person would help move it forward.
> 
> Secondly, I'd like to discuss the following next steps for mp-bvec:
> 
> 1) clean up raid1/raid5/raid10 to remove the single-page-only bvec path

Hi Ming,

I am interested in this topic. Could you please give a hint about the
proper way to remove the single-page bvec usage from the md raid code? And
will the dm code be involved as well?

Thanks.

Coly


Re: [LSF/MM ATTEND] block: multipage bvec upstreaming & next step

2017-01-19 Thread Ming Lei
Hi Coly,

On Thu, Jan 19, 2017 at 3:57 PM, Coly Li  wrote:
> On 2017/1/17 11:56 AM, Ming Lei wrote:
>> Hi Guys,
>>
>> The multipage bvec patchset[1] has been posted for a while; xfstests
>> (ext4, xfs and btrfs) have been run and no regressions were observed.
>> I hope talking about this in person would help move it forward.
>>
>> Secondly, I'd like to discuss the following next steps for mp-bvec:
>>
>> 1) clean up raid1/raid5/raid10 to remove the single-page-only bvec path
>
> Hi Ming,
>
> I am interested in this topic. Could you please give a hint about the
> proper way to remove the single-page bvec usage from the md raid code?

Basically speaking, the raid code can't access the bvec table directly
(e.g. by referencing .bi_vcnt or .bi_io_vec) once multipage bvecs are
supported, especially after pages have been added to the bio via
bio_add_page().
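
As a hedged sketch of the direction (kernel-style code that only compiles
in a kernel context; count_bio_bytes() is a made-up example, not a proposed
raid1 change): walk the bio with the bvec iterator instead of indexing the
table directly.

#include <linux/bio.h>

/*
 * Instead of walking bio->bi_io_vec[0..bi_vcnt), which bakes in the
 * one-page-per-bvec assumption, iterate with the bvec iterator so the
 * code keeps working when a single bvec covers multiple pages.
 */
static unsigned int count_bio_bytes(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int bytes = 0;

	bio_for_each_segment(bv, bio, iter)
		bytes += bv.bv_len;

	return bytes;
}
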

I am looking at the raid1 code, and maybe we can discuss a bit
about the cleanup.

> And
> will the dm code be involved as well?

No, nothing at all, and no such usage in dm code.


Thanks,
Ming Lei


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Sagi Grimberg

Christoph suggested to me once that we could take a hybrid
approach where we consume a small number of completions (say 4)
right away in the interrupt handler, and if there are more we
schedule irq-poll to reap the rest. But back then it didn't work
better, which is not in line with my observation that we consume
only 1 completion per interrupt...

I can give it another go... What do people think about it?
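
A hedged sketch of the shape of such a hybrid handler; my_queue and
reap_completions() are hypothetical and stubbed here, only irq_poll_sched()
is the real API.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define IRQ_BUDGET	4	/* completions reaped directly in hard-irq context */

struct my_queue {
	struct irq_poll iop;
	/* ... device-specific completion-queue state ... */
};

/* Device-specific; returns how many completions were processed,
 * stopping once the budget is exhausted. Stubbed for the sketch. */
static int reap_completions(struct my_queue *q, int budget)
{
	return 0;
}

static irqreturn_t my_queue_irq(int irq, void *data)
{
	struct my_queue *q = data;
	int done = reap_completions(q, IRQ_BUDGET);

	/* Budget exhausted: more may be pending, so let irq_poll drain
	 * the rest in softirq context instead of staying in hard-irq. */
	if (done == IRQ_BUDGET)
		irq_poll_sched(&q->iop);

	return done ? IRQ_HANDLED : IRQ_NONE;
}
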


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Jan Kara
On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> >  wrote:
> > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > > Jan Kara  writes:
> > > > 
> > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > > Your note on the online repair does raise another tangentially
> > > > > > related
> > > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > > submission
> > > > > > path will clear the error (if the hardware is able to remap
> > > > > > the bad
> > > > > > locations). However, if the filesystem is mounted with DAX,
> > > > > > even
> > > > > > non-mmap operations - read() and write() will go through the
> > > > > > dax paths
> > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > > perform
> > > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > > filesystem
> > > > > > has badblocks, the only way to clear those badblocks is to
> > > > > > mount it
> > > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > > pretty
> > > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > > a better
> > > > > > way.
> > > > > 
> > > > > Please remind me, what is the problem with DAX code doing
> > > > > necessary work to
> > > > > clear the error when it gets EIO from memcpy on write?
> > > > 
> > > > You won't get an MCE for a store;  only loads generate them.
> > > > 
> > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > > -o dax?
> > > 
> > > Not necessarily; XFS usually implements this by punching out the
> > > range
> > > and then reallocating it as unwritten blocks.
> > > 
> > 
> > That does clear the error because the unwritten blocks are zeroed and
> > errors cleared when they become allocated again.
> 
> Yes, the problem was that writes won't clear errors. Zeroing through
> hole-punch, truncate, or unlinking the file should all work
> (assuming the hole-punch or truncate range wholly contains the
> 'badblock' sector).
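
For concreteness, a hedged userspace sketch of the zeroing workaround
described in the quoted paragraph; whether it actually clears the poison
depends on how the filesystem implements the operation (see the earlier
discussion of XFS punching and reallocating unwritten blocks), and the
offset/length are placeholders supplied by the user.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Zero a range that wholly contains the bad sector so the driver gets a
 * chance to clear the poison when the blocks are rewritten. */
int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <file> <offset> <length>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	off_t off = strtoll(argv[2], NULL, 0);
	off_t len = strtoll(argv[3], NULL, 0);

	/* FALLOC_FL_ZERO_RANGE keeps the blocks allocated but zeroed;
	 * FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE would deallocate
	 * them instead. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) < 0) {
		perror("fallocate");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}
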

Let me repeat my question: you have mentioned that if we do IO through DAX,
writes won't clear errors, and that we should fall back to the normal block
path and do a write there to clear the error. What prevents us from clearing
the error directly from the DAX path?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-19 Thread Sagi Grimberg



>>> I think you missed:
>>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
>>
>> I indeed did, thanks.
>>
> But it doesn't help.
>
> We're still having to wait for the first interrupt, and if we're really
> fast that's the only completion we have to process.
>
> Try this:
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b4b32e6..e2dd9e2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> }
> __nvme_submit_cmd(nvmeq, &cmnd);
> spin_unlock(&nvmeq->sq_lock);
> +   disable_irq_nosync(nvmeq_irq(irq));
> +   irq_poll_sched(&nvmeq->iop);


a. This would trigger a condition where we disable the irq twice, which
is wrong, if only because it will generate a warning.

b. This would trigger ksoftirqd way too often. For it to be effective,
we need it to run only when it should, and ideally only when the
completion queue has a batch of completions waiting.

After a deeper analysis, I agree with Bart that interrupt coalescing is
needed for this to work. The problem with nvme coalescing, as Jens said,
is the death penalty of its 100us granularity. Hannes, Johannes, how
does it look with the devices you are testing with?

Also, I think that adaptive moderation is needed in order for this to
work well. I know that some networking drivers implemented adaptive
moderation in SW before having HW support for it. It can be done by
maintaining stats and having a periodic work item that looks at them
and adjusts the moderation parameters.

Does anyone think that this is something we should consider?
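
A hedged sketch of that bookkeeping (kernel-style, hypothetical names;
set_coalescing() stands in for whatever mechanism the driver would use to
program the device, and the thresholds are placeholders):

#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/workqueue.h>

struct adapt_mod {
	struct delayed_work	work;
	atomic64_t		irqs;		/* bumped in the irq handler */
	atomic64_t		completions;	/* bumped per reaped completion */
	unsigned int		usecs;		/* current coalescing time */
};

/* Device-specific hook to program the new coalescing value; stubbed. */
static void set_coalescing(struct adapt_mod *am)
{
}

static void adapt_mod_work(struct work_struct *w)
{
	struct adapt_mod *am = container_of(to_delayed_work(w),
					    struct adapt_mod, work);
	u64 irqs  = atomic64_xchg(&am->irqs, 0);
	u64 comps = atomic64_xchg(&am->completions, 0);

	/* Few completions per interrupt: back off coalescing to keep
	 * latency low; many per interrupt: coalesce harder. */
	if (irqs && comps / irqs < 2 && am->usecs > 0)
		am->usecs--;
	else if (irqs && comps / irqs > 8 && am->usecs < 100)
		am->usecs++;

	set_coalescing(am);
	schedule_delayed_work(&am->work, msecs_to_jiffies(100));
}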