Re: [PATCH 5/6] blk-mq: Fix queue freeze deadlock

2017-01-18 Thread Sagi Grimberg



If hardware queues are stopped for some event, like the device has been
suspended by power management, requests allocated on that hardware queue
are indefinitely stuck causing a queue freeze to wait forever.


I have a problem with this patch. IMO, this is a general issue, so
why do we tie the fix to calling blk_mq_update_nr_hw_queues()? We might
not need to update nr_hw_queues at all. I'm fine with
blk_mq_abandon_stopped_requests but not with its call-site.

Usually a driver knows when it wants to abandon all busy requests via
blk_mq_tagset_busy_iter(); maybe the right approach is to add a hook
for all allocated tags? Or have blk_mq_quiesce_queue() take a
fail-all-requests parameter from the callers?
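
(For illustration, a minimal sketch of the kind of driver-side "abandon
everything" pass being suggested here, loosely modelled on how NVMe cancels
outstanding requests. The my_* names are made up, and it assumes the
4.10-era blk_mq_complete_request(rq, error) signature.)

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/device.h>

/* Hypothetical sketch only: fail every started request on a tag set, the
 * way a driver could abandon requests once it gives up on the device.
 */
static void my_cancel_request(struct request *req, void *data, bool reserved)
{
	struct device *dev = data;

	if (!blk_mq_request_started(req))
		return;

	dev_dbg_ratelimited(dev, "abandoning request tag %d\n", req->tag);
	blk_mq_complete_request(req, -EIO);
}

static void my_abandon_all_requests(struct blk_mq_tag_set *set,
				    struct device *dev)
{
	blk_mq_tagset_busy_iter(set, my_cancel_request, dev);
}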


This patch is overly aggressive about failing allocated requests. There
are scenarios where we wouldn't want to abandon them, like if the hw
context is about to be brought back online, but this patch assumes all
need to be abandoned. I'll see if there are some other tricks we can have
a driver do. Thanks for the suggestions.


I agree,

I do think, though, that this should be driven from the driver, because
for fabrics we might have some fabric error that triggers a periodic
reconnect. So the "hw context is about to be brought back" case is unknown
from the driver's PoV, and when we delete the controller (because we give
up), that is exactly where we need to abandon the allocated requests.


Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes

2017-01-18 Thread Sagi Grimberg



@@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
}



Christoph, a little above here we still look at blk_rq_bytes(),
shouldn't that look at blk_rq_payload_bytes() too?


The check is ok for now as it's just zero vs non-zero.  It's somewhat
broken for Write Zeroes, though.  I've fixed it in this series
which I need to submit:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/write-zeroes-take-2



I see, thanks for the clarification.


[bug report] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-18 Thread Dan Carpenter
Hello Jens Axboe,

This is a semi-automatic email about new static checker warnings.

The patch bd166ef183c2: "blk-mq-sched: add framework for MQ capable
IO schedulers" from Jan 17, 2017, leads to the following Smatch
complaint:

block/elevator.c:234 elevator_init()
 error: we previously assumed 'e' could be null (see line 229)

block/elevator.c
   228  
   229  if (!e) {
^^
Null.

   230  printk(KERN_ERR
   231  "Default I/O scheduler not found. " \
   232  "Using noop/none.\n");
   233  if (q->mq_ops) {
   234  elevator_put(e);
^^^
This will Oops.

   235  return 0;
   236  }
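
(A minimal sketch of one way the NULL dereference could be avoided while
keeping the fallback behaviour of the surrounding code; the actual fix may
well look different.)

	if (!e) {
		printk(KERN_ERR
			"Default I/O scheduler not found. Using noop/none.\n");
		/* 'e' is NULL here, so there is no reference to drop. */
		if (q->mq_ops)
			return 0;
		e = elevator_get("noop", false);
	}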

regards,
dan carpenter


Re: [LSF/MM TOPIC] block level event logging for storage media management

2017-01-18 Thread Hannes Reinecke
On 01/19/2017 12:34 AM, Song Liu wrote:
> 
> Media health monitoring is very important for large scale distributed
> storage systems. Traditionally, enterprise storage controllers maintain
> event logs for attached storage devices. However, these controller-managed
> logs do not scale well for large scale distributed systems.
> 
> While designing a more flexible and scalable event logging system, we
> think it is better to build the log in the block layer. Block level event
> logging covers all major storage media (SCSI, SATA, NVMe), and thus
> minimizes redundant work for different protocols.
> 
> In this LSF/MM, we would like to discuss the following topics with the
> community:
> 1. Mechanism for drivers to report events (or errors) to the block layer.
>    Basically, we will need a traceable function for the drivers to report
>    errors (most likely right before calling end_request or bio_endio).
> 
> 2. What mechanism (ftrace, BPF, etc.) is most preferred for the event
>    logging?
> 
> 3. How should we categorize different events?
>    Currently, there is existing code that translates ATA errors
>    (ata_to_sense_error) and NVMe errors (nvme_trans_status_code) to SCSI
>    sense codes, so we can leverage the SCSI Key Code Qualifier for event
>    categorization.
> 
> 4. Detailed discussions on the data structure for event logging.
> 
> We will be able to show a prototype implementation during LSF/MM.
> 
Very good topic; I'm very much in favour of it.

That ties in rather nicely with my multipath redesign, where I've added
a notifier chain for block events.
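
(For context, a rough sketch of what such a notifier chain for block events
could look like; all names here are hypothetical, only the notifier API
itself is real, and this is not the actual multipath redesign code.)

#include <linux/notifier.h>
#include <linux/genhd.h>

/* Hypothetical chain: drivers publish block events, interested parties
 * (e.g. multipath) subscribe to them.
 */
static BLOCKING_NOTIFIER_HEAD(blk_event_chain);

int blk_event_register(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&blk_event_chain, nb);
}

int blk_event_unregister(struct notifier_block *nb)
{
	return blocking_notifier_chain_unregister(&blk_event_chain, nb);
}

/* Called by a driver when it has a media event to report. */
void blk_event_notify(struct gendisk *disk, unsigned long event)
{
	blocking_notifier_call_chain(&blk_event_chain, event, disk);
}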

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [LSF/MM TOPIC] block level event logging for storage media management

2017-01-18 Thread Coly Li
On 2017/1/19 7:34 AM, Song Liu wrote:
> 
> Media health monitoring is very important for large scale distributed
> storage systems. Traditionally, enterprise storage controllers maintain
> event logs for attached storage devices. However, these controller-managed
> logs do not scale well for large scale distributed systems.
> 
> While designing a more flexible and scalable event logging system, we
> think it is better to build the log in the block layer. Block level event
> logging covers all major storage media (SCSI, SATA, NVMe), and thus
> minimizes redundant work for different protocols.
> 
> In this LSF/MM, we would like to discuss the following topics with the
> community:
> 1. Mechanism for drivers to report events (or errors) to the block layer.
>    Basically, we will need a traceable function for the drivers to report
>    errors (most likely right before calling end_request or bio_endio).
> 
> 2. What mechanism (ftrace, BPF, etc.) is most preferred for the event
>    logging?
> 
> 3. How should we categorize different events?
>    Currently, there is existing code that translates ATA errors
>    (ata_to_sense_error) and NVMe errors (nvme_trans_status_code) to SCSI
>    sense codes, so we can leverage the SCSI Key Code Qualifier for event
>    categorization.
> 
> 4. Detailed discussions on the data structure for event logging.
> 
> We will be able to show a prototype implementation during LSF/MM.

This is an interesting topic. For stacked block devices, all layers
above the failing layer will observe the media error; reporting the
underlying failure in every layer may introduce quite a lot of noise.

Yes, I am willing to attend this discussion.

Thanks.

Coly Li


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Darrick J. Wong
On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> Jan Kara  writes:
> 
> > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> Your note on the online repair does raise another tangentially related
> >> topic. Currently, if there are badblocks, writes via the bio submission
> >> path will clear the error (if the hardware is able to remap the bad
> >> locations). However, if the filesystem is mounted with DAX, even
> >> non-mmap operations - read() and write() will go through the dax paths
> >> (dax_do_io()). We haven't found a good/agreeable way to perform
> >> error-clearing in this case. So currently, if a dax mounted filesystem
> >> has badblocks, the only way to clear those badblocks is to mount it
> >> without DAX, and overwrite/zero the bad locations. This is a pretty
> >> terrible user experience, and I'm hoping this can be solved in a better
> >> way.
> >
> > Please remind me, what is the problem with DAX code doing necessary work to
> > clear the error when it gets EIO from memcpy on write?
> 
> You won't get an MCE for a store;  only loads generate them.
> 
> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?

Not necessarily; XFS usually implements this by punching out the range
and then reallocating it as unwritten blocks.

--D

> 
> Cheers,
> Jeff


RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Slava Dubeyko

-Original Message-
From: Jeff Moyer [mailto:jmo...@redhat.com] 
Sent: Wednesday, January 18, 2017 12:48 PM
To: Slava Dubeyko 
Cc: Jan Kara ; linux-nvd...@lists.01.org 
; linux-block@vger.kernel.org; Viacheslav Dubeyko 
; Linux FS Devel ; 
lsf...@lists.linux-foundation.org
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in 
filesystems

>>> Well, the situation with NVM is more like with DRAM AFAIU. It is 
>>> quite reliable but given the size the probability *some* cell has degraded 
>>> is quite high.
>>> And similar to DRAM you'll get MCE (Machine Check Exception) when you 
>>> try to read such cell. As Vishal wrote, the hardware does some 
>>> background scrubbing and relocates stuff early if needed but nothing is 
>>> 100%.
>>
>> My understanding that hardware does the remapping the affected address 
>> range (64 bytes, for example) but it doesn't move/migrate the stored 
>> data in this address range. So, it sounds slightly weird. Because it 
>> means that no guarantee to retrieve the stored data. It sounds that 
>> file system should be aware about this and has to be heavily protected 
>> by some replication or erasure coding scheme. Otherwise, if the 
>> hardware does everything for us (remap the affected address region and 
>> move data into a new address region) then why does file system need to 
>> know about the affected address regions?
>
>The data is lost, that's why you're getting an ECC.  It's tantamount to -EIO 
>for a disk block access.

I see three possible cases here:
(1) a bad block has been discovered (no remap, no recovery) -> data is lost;
-EIO for a disk block access, block is always bad;
(2) a bad block has been discovered and remapped -> data is lost; -EIO for a
disk block access;
(3) a bad block has been discovered, remapped and recovered -> no data is lost.

>> Let's imagine that the affected address range will equal to 64 bytes. 
>> It sounds for me that for the case of block device it will affect the 
>> whole logical block (4 KB).
>
> 512 bytes, and yes, that's the granularity at which we track errors in the 
> block layer, so that's the minimum amount of data you lose.

I think it depends on what granularity the hardware supports. It could be
512 bytes, 4 KB, or maybe greater.

>> The situation is more critical for the case of DAX approach. Correct 
>> me if I wrong but my understanding is the goal of DAX is to provide 
>> the direct access to file's memory pages with minimal file system 
>> overhead. So, it looks like that raising bad block issue on file 
>> system level will affect a user-space application. Because, finally, 
>> user-space application will need to process such trouble (bad block 
>> issue). It sounds for me as really weird situation. What can protect a 
>> user-space application from encountering the issue with partially 
>> incorrect memory page?
>
> Applications need to deal with -EIO today.  This is the same sort of thing.
> If an application trips over a bad block during a load from persistent memory,
> they will get a signal, and they can either handle it or not.
>
> Have a read through this specification and see if it clears anything up for 
> you:
>  http://www.snia.org/tech_activities/standards/curr_standards/npm

Thank you for sharing this. So, if a user-space application follows the
NVM Programming Model, then it will be able to survive by catching and
processing the exceptions. But these applications have yet to be implemented,
and such applications also need special recovery techniques. It sounds like
legacy user-space applications are unable to survive in the NVM.PM.FILE mode
in the case of a load/store operation failure.

Thanks,
Vyacheslav Dubeyko.



Re: [LSF/MM TOPIC] block level event logging for storage media management

2017-01-18 Thread Bart Van Assche
On Wed, 2017-01-18 at 23:34 +, Song Liu wrote:
> Media health monitoring is very important for large scale distributed
> storage systems. Traditionally, enterprise storage controllers maintain
> event logs for attached storage devices. However, these controller-managed
> logs do not scale well for large scale distributed systems.
> 
> While designing a more flexible and scalable event logging system, we
> think it is better to build the log in the block layer. Block level event
> logging covers all major storage media (SCSI, SATA, NVMe), and thus
> minimizes redundant work for different protocols.
> 
> In this LSF/MM, we would like to discuss the following topics with the
> community:
> 1. Mechanism for drivers to report events (or errors) to the block layer.
>    Basically, we will need a traceable function for the drivers to report
>    errors (most likely right before calling end_request or bio_endio).
> 
> 2. What mechanism (ftrace, BPF, etc.) is most preferred for the event
>    logging?
> 
> 3. How should we categorize different events?
>    Currently, there is existing code that translates ATA errors
>    (ata_to_sense_error) and NVMe errors (nvme_trans_status_code) to SCSI
>    sense codes, so we can leverage the SCSI Key Code Qualifier for event
>    categorization.
> 
> 4. Detailed discussions on the data structure for event logging.
> 
> We will be able to show a prototype implementation during LSF/MM.

I'd like to participate in this discussion.

Bart.


[LSF/MM TOPIC] block level event logging for storage media management

2017-01-18 Thread Song Liu

Media health monitoring is very important for large scale distributed
storage systems. Traditionally, enterprise storage controllers maintain
event logs for attached storage devices. However, these controller-managed
logs do not scale well for large scale distributed systems.

While designing a more flexible and scalable event logging system, we
think it is better to build the log in the block layer. Block level event
logging covers all major storage media (SCSI, SATA, NVMe), and thus
minimizes redundant work for different protocols.

In this LSF/MM, we would like to discuss the following topics with the
community:
1. Mechanism for drivers to report events (or errors) to the block layer.
   Basically, we will need a traceable function for the drivers to report
   errors (most likely right before calling end_request or bio_endio);
   a hypothetical sketch follows below.

2. What mechanism (ftrace, BPF, etc.) is most preferred for the event
   logging?

3. How should we categorize different events?
   Currently, there is existing code that translates ATA errors
   (ata_to_sense_error) and NVMe errors (nvme_trans_status_code) to SCSI
   sense codes, so we can leverage the SCSI Key Code Qualifier for event
   categorization.

4. Detailed discussions on the data structure for event logging.

We will be able to show a prototype implementation during LSF/MM.
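
(To make topic 1 a bit more concrete, a purely hypothetical sketch of such a
traceable reporting hook; neither blk_report_media_event() nor
enum blk_media_event exists today, and the actual interface is exactly what
we would like to discuss.)

#include <linux/kernel.h>
#include <linux/genhd.h>
#include <linux/types.h>

/* Hypothetical event classes, loosely following SCSI sense keys. */
enum blk_media_event {
	BLK_EVT_MEDIA_ERROR,
	BLK_EVT_RECOVERED_ERROR,
	BLK_EVT_WEAR,
	BLK_EVT_TEMPERATURE,
};

/*
 * A driver would call this right before end_request()/bio_endio(), so that
 * ftrace/BPF consumers can attach to it and log events per device.  A real
 * implementation would be a tracepoint; trace_printk() stands in here.
 */
static inline void blk_report_media_event(struct gendisk *disk, sector_t sector,
					  enum blk_media_event type, int sense_key)
{
	trace_printk("%s: media event %d at sector %llu (sense key %#x)\n",
		     disk->disk_name, type,
		     (unsigned long long)sector, sense_key);
}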

Thanks,
Song


Re: [PATCH for-4.10] blk-mq: Remove unused variable

2017-01-18 Thread Keith Busch
On Wed, Jan 18, 2017 at 02:21:48PM -0800, Jens Axboe wrote:
> On 01/18/2017 02:16 PM, Jens Axboe wrote:
> > On 01/18/2017 02:21 PM, Keith Busch wrote:
> >> Signed-off-by: Keith Busch 
> >> Reviewed-by: Christoph Hellwig 
> >> Reviewed-by: Sagi Grimberg 
> > 
> > Does it cause a warning anywhere? If not, I'd rather not apply
> > it, since it'll cause a merge conflict for 4.11 (and the problem
> > doesn't exist there).
> 
> Ah screw it, the fact that we have that in 4.10 is an eye sore.
> I'd rather deal with the conflict, I'll apply it for 4.10. Thanks
> Keith.

Okay, thanks. I was just clearing my backlog of emails and this was on
my todo list from a couple of weeks ago, but I hadn't noticed that 4.11
corrected this in the mq scheduler set.


Re: [PATCH for-4.10] blk-mq: Remove unused variable

2017-01-18 Thread Jens Axboe
On 01/18/2017 02:16 PM, Jens Axboe wrote:
> On 01/18/2017 02:21 PM, Keith Busch wrote:
>> Signed-off-by: Keith Busch 
>> Reviewed-by: Christoph Hellwig 
>> Reviewed-by: Sagi Grimberg 
> 
> Does it cause a warning anywhere? If not, I'd rather not apply
> it, since it'll cause a merge conflict for 4.11 (and the problem
> doesn't exist there).

Ah screw it, the fact that we have that in 4.10 is an eye sore.
I'd rather deal with the conflict, I'll apply it for 4.10. Thanks
Keith.

-- 
Jens Axboe



Re: [PATCH for-4.10] blk-mq: Remove unused variable

2017-01-18 Thread Jens Axboe
On 01/18/2017 02:21 PM, Keith Busch wrote:
> Signed-off-by: Keith Busch 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 

Does it cause a warning anywhere? If not, I'd rather not apply
it, since it'll cause a merge conflict for 4.11 (and the problem
doesn't exist there).

-- 
Jens Axboe



[PATCH for-4.10] blk-mq: Remove unused variable

2017-01-18 Thread Keith Busch
Signed-off-by: Keith Busch 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
---
 block/blk-mq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a8e67a1..c3400b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -912,7 +912,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, 
struct list_head *list)
 static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 {
LIST_HEAD(rq_list);
-   LIST_HEAD(driver_list);
 
if (unlikely(blk_mq_hctx_stopped(hctx)))
return;
-- 
2.5.5



Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Verma, Vishal L
On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
>  wrote:
> > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > Jan Kara  writes:
> > > 
> > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > Your note on the online repair does raise another tangentially
> > > > > related
> > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > submission
> > > > > path will clear the error (if the hardware is able to remap
> > > > > the bad
> > > > > locations). However, if the filesystem is mounted with DAX,
> > > > > even
> > > > > non-mmap operations - read() and write() will go through the
> > > > > dax paths
> > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > perform
> > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > filesystem
> > > > > has badblocks, the only way to clear those badblocks is to
> > > > > mount it
> > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > pretty
> > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > a better
> > > > > way.
> > > > 
> > > > Please remind me, what is the problem with DAX code doing
> > > > necessary work to
> > > > clear the error when it gets EIO from memcpy on write?
> > > 
> > > You won't get an MCE for a store;  only loads generate them.
> > > 
> > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > -o dax?
> > 
> > Not necessarily; XFS usually implements this by punching out the
> > range
> > and then reallocating it as unwritten blocks.
> > 
> 
> That does clear the error because the unwritten blocks are zeroed and
> errors cleared when they become allocated again.

Yes, the problem was that writes won't clear errors. Zeroing through
hole-punch, truncate, or unlinking the file should all work
(assuming the hole-punch or truncate range wholly contains the
'badblock' sector).
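
(For illustration, a minimal userspace sketch of the hole-punch route
described above, assuming the punched range wholly covers the bad sectors;
offsets and the helper name are illustrative only.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

/* Punch out the range containing the bad blocks so the filesystem
 * reallocates it as unwritten (zeroed) extents on the next write.
 */
int clear_bad_range(const char *path, off_t offset, off_t len)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len) < 0) {
		perror("fallocate");
		close(fd);
		return -1;
	}
	return close(fd);
}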



Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Dan Williams
On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
 wrote:
> On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
>> Jan Kara  writes:
>>
>> > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> >> Your note on the online repair does raise another tangentially related
>> >> topic. Currently, if there are badblocks, writes via the bio submission
>> >> path will clear the error (if the hardware is able to remap the bad
>> >> locations). However, if the filesystem is mounted with DAX, even
>> >> non-mmap operations - read() and write() will go through the dax paths
>> >> (dax_do_io()). We haven't found a good/agreeable way to perform
>> >> error-clearing in this case. So currently, if a dax mounted filesystem
>> >> has badblocks, the only way to clear those badblocks is to mount it
>> >> without DAX, and overwrite/zero the bad locations. This is a pretty
>> >> terrible user experience, and I'm hoping this can be solved in a better
>> >> way.
>> >
>> > Please remind me, what is the problem with DAX code doing necessary work to
>> > clear the error when it gets EIO from memcpy on write?
>>
>> You won't get an MCE for a store;  only loads generate them.
>>
>> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?
>
> Not necessarily; XFS usually implements this by punching out the range
> and then reallocating it as unwritten blocks.
>

That does clear the error because the unwritten blocks are zeroed and
errors cleared when they become allocated again.


Re: [PATCH 1/2] sbitmap: use smp_mb__after_atomic() in sbq_wake_up()

2017-01-18 Thread Jens Axboe
On 01/18/2017 11:55 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> We always do an atomic clear_bit() right before we call sbq_wake_up(),
> so we can use smp_mb__after_atomic(). While we're here, comment the
> memory barriers in here a little more.

Thanks Omar, applied 1-2.

-- 
Jens Axboe



Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jeff Moyer
Jan Kara  writes:

> On Tue 17-01-17 15:14:21, Vishal Verma wrote:
>> Your note on the online repair does raise another tangentially related
>> topic. Currently, if there are badblocks, writes via the bio submission
>> path will clear the error (if the hardware is able to remap the bad
>> locations). However, if the filesystem is mounted with DAX, even
>> non-mmap operations - read() and write() will go through the dax paths
>> (dax_do_io()). We haven't found a good/agreeable way to perform
>> error-clearing in this case. So currently, if a dax mounted filesystem
>> has badblocks, the only way to clear those badblocks is to mount it
>> without DAX, and overwrite/zero the bad locations. This is a pretty
>> terrible user experience, and I'm hoping this can be solved in a better
>> way.
>
> Please remind me, what is the problem with DAX code doing necessary work to
> clear the error when it gets EIO from memcpy on write?

You won't get an MCE for a store;  only loads generate them.

Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?

Cheers,
Jeff


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jeff Moyer
Slava Dubeyko  writes:

>> Well, the situation with NVM is more like with DRAM AFAIU. It is quite 
>> reliable
>> but given the size the probability *some* cell has degraded is quite high.
>> And similar to DRAM you'll get MCE (Machine Check Exception) when you try
>> to read such cell. As Vishal wrote, the hardware does some background 
>> scrubbing
>> and relocates stuff early if needed but nothing is 100%.
>
> My understanding that hardware does the remapping the affected address
> range (64 bytes, for example) but it doesn't move/migrate the stored
> data in this address range. So, it sounds slightly weird. Because it
> means that no guarantee to retrieve the stored data. It sounds that
> file system should be aware about this and has to be heavily protected
> by some replication or erasure coding scheme. Otherwise, if the
> hardware does everything for us (remap the affected address region and
> move data into a new address region) then why does file system need to
> know about the affected address regions?

The data is lost, that's why you're getting an ECC.  It's tantamount to
-EIO for a disk block access.

>> The reason why we play games with badblocks is to avoid those MCEs
>> (i.e., even trying to read the data we know that are bad). Even if it would
>> be rare event, MCE may mean the machine just immediately reboots
>> (although I find such platforms hardly usable with NVM then) and that
>> is no good. And even on hardware platforms that allow for more graceful
>> recovery from MCE it is asynchronous in its nature and our error handling
>> around IO is all synchronous so it is difficult to join these two models 
>> together.
>>
>> But I think it is a good question to ask whether we cannot improve on MCE 
>> handling
>> instead of trying to avoid them and pushing around responsibility for 
>> handling
>> bad blocks. Actually I thought someone was working on that.
>> Cannot we e.g. wrap in-kernel accesses to persistent memory (those are now
>> well identified anyway so that we can consult the badblocks list) so that it 
>> MCE
>> happens during these accesses, we note it somewhere and at the end of the 
>> magic
>> block we will just pick up the errors and report them back?
>
> Let's imagine that the affected address range will equal to 64 bytes. It 
> sounds for me
> that for the case of block device it will affect the whole logical
> block (4 KB).

512 bytes, and yes, that's the granularity at which we track errors in
the block layer, so that's the minimum amount of data you lose.

> If the failure rate of address ranges could be significant then it
> would affect a lot of logical blocks.

Who would buy hardware like that?

> The situation is more critical for the case of DAX approach. Correct
> me if I wrong but my understanding is the goal of DAX is to provide
> the direct access to file's memory pages with minimal file system
> overhead. So, it looks like that raising bad block issue on file
> system level will affect a user-space application. Because, finally,
> user-space application will need to process such trouble (bad block
> issue). It sounds for me as really weird situation. What can protect a
> user-space application from encountering the issue with partially
> incorrect memory page?

Applications need to deal with -EIO today.  This is the same sort of
thing.  If an application trips over a bad block during a load from
persistent memory, they will get a signal, and they can either handle it
or not.

Have a read through this specification and see if it clears anything up
for you:
  http://www.snia.org/tech_activities/standards/curr_standards/npm
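
(A minimal sketch of the "get a signal and handle it or not" model; purely
illustrative under the assumption of an MCE-capable platform delivering
SIGBUS on a poisoned load. A real application would recover the affected
range, e.g. from a replica, rather than just exit.)

#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* SIGBUS arrives when a load consumes a poisoned persistent-memory line
 * (si_code BUS_MCEERR_AR on MCE-capable platforms).  Only async-signal-safe
 * calls are used here.
 */
static void bus_handler(int sig, siginfo_t *info, void *ctx)
{
	static const char msg[] = "SIGBUS: bad block consumed on load\n";

	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(EXIT_FAILURE);
}

static void install_sigbus_handler(void)
{
	struct sigaction sa = {
		.sa_sigaction = bus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);
}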

Cheers,
Jeff


[PATCH 1/2] sbitmap: use smp_mb__after_atomic() in sbq_wake_up()

2017-01-18 Thread Omar Sandoval
From: Omar Sandoval 

We always do an atomic clear_bit() right before we call sbq_wake_up(),
so we can use smp_mb__after_atomic(). While we're here, comment the
memory barriers in here a little more.

Signed-off-by: Omar Sandoval 
---
 lib/sbitmap.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2cecf05c82fd..df4e472df8a3 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -299,8 +299,14 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
struct sbq_wait_state *ws;
int wait_cnt;
 
-   /* Ensure that the wait list checks occur after clear_bit(). */
-   smp_mb();
+   /*
+* Pairs with the memory barrier in set_current_state() to ensure the
+* proper ordering of clear_bit()/waitqueue_active() in the waker and
+* test_and_set_bit()/prepare_to_wait()/finish_wait() in the waiter. See
+* the comment on waitqueue_active(). This is __after_atomic because we
+* just did clear_bit() in the caller.
+*/
+   smp_mb__after_atomic();
 
ws = sbq_wake_ptr(sbq);
if (!ws)
@@ -331,7 +337,8 @@ void sbitmap_queue_wake_all(struct sbitmap_queue *sbq)
int i, wake_index;
 
/*
-* Make sure all changes prior to this are visible from other CPUs.
+* Pairs with the memory barrier in set_current_state() like in
+* sbq_wake_up().
 */
smp_mb();
wake_index = atomic_read(&sbq->wake_index);
-- 
2.11.0



[PATCH 2/2] sbitmap: fix wakeup hang after sbq resize

2017-01-18 Thread Omar Sandoval
From: Omar Sandoval 

When we resize a struct sbitmap_queue, we update the wakeup batch size,
but we don't update the wait count in the struct sbq_wait_states. If we
resized down from a size which could use a bigger batch size, these
counts could be too large and cause us to miss necessary wakeups. To fix
this, update the wait counts when we resize (ensuring some careful
memory ordering so that it's safe w.r.t. concurrent clears).

This also fixes a theoretical issue where two threads could end up
bumping the wait count up by the batch size, which could also
potentially lead to hangs.

Reported-by: Martin Raiber 
Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
Fixes: 2971c35f3588 ("blk-mq: bitmap tag: fix race on 
blk_mq_bitmap_tags::wake_cnt")
Signed-off-by: Omar Sandoval 
---
 lib/sbitmap.c | 35 ++-
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index df4e472df8a3..8f5c3b268c77 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -239,7 +239,19 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
 
 void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth)
 {
-   sbq->wake_batch = sbq_calc_wake_batch(depth);
+   unsigned int wake_batch = sbq_calc_wake_batch(depth);
+   int i;
+
+   if (sbq->wake_batch != wake_batch) {
+   WRITE_ONCE(sbq->wake_batch, wake_batch);
+   /*
+* Pairs with the memory barrier in sbq_wake_up() to ensure that
+* the batch size is updated before the wait counts.
+*/
+   smp_mb__before_atomic();
+   for (i = 0; i < SBQ_WAIT_QUEUES; i++)
+   atomic_set(&sbq->ws[i].wait_cnt, 1);
+   }
sbitmap_resize(&sbq->sb, depth);
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_resize);
@@ -297,6 +309,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct 
sbitmap_queue *sbq)
 static void sbq_wake_up(struct sbitmap_queue *sbq)
 {
struct sbq_wait_state *ws;
+   unsigned int wake_batch;
int wait_cnt;
 
/*
@@ -313,10 +326,22 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
return;
 
wait_cnt = atomic_dec_return(&ws->wait_cnt);
-   if (unlikely(wait_cnt < 0))
-   wait_cnt = atomic_inc_return(&ws->wait_cnt);
-   if (wait_cnt == 0) {
-   atomic_add(sbq->wake_batch, &ws->wait_cnt);
+   if (wait_cnt <= 0) {
+   wake_batch = READ_ONCE(sbq->wake_batch);
+   /*
+* Pairs with the memory barrier in sbitmap_queue_resize() to
+* ensure that we see the batch size update before the wait
+* count is reset.
+*/
+   smp_mb__before_atomic();
+   /*
+* If there are concurrent callers to sbq_wake_up(), the last
+* one to decrement the wait count below zero will bump it back
+* up. If there is a concurrent resize, the count reset will
+* either cause the cmpxchg to fail or overwrite after the
+* cmpxchg.
+*/
+   atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wait_cnt + wake_batch);
sbq_index_atomic_inc(&sbq->wake_index);
wake_up(&ws->wait);
}
-- 
2.11.0



Re: [LSF/MM TOPIC] Future direction of DAX

2017-01-18 Thread Ross Zwisler
On Tue, Jan 17, 2017 at 09:25:33PM -0800, wi...@bombadil.infradead.org wrote:
> On Fri, Jan 13, 2017 at 05:20:08PM -0700, Ross Zwisler wrote:
> > We still have a lot of work to do, though, and I'd like to propose a 
> > discussion
> > around what features people would like to see enabled in the coming year as
> > well as what what use cases their customers have that we might not be aware 
> > of.
> 
> +1 to the discussion
> 
> > - Jan suggested [2] that we could use the radix tree as a cache to service 
> > DAX
> >   faults without needing to call into the filesystem.  Are there any issues
> >   with this approach, and should we move forward with it as an optimization?
> 
> Ahem.  I believe I proposed this at last year's LSFMM.  And I sent
> patches to start that work.  And Dan blocked it.  So I'm not terribly
> amused to see somebody else given credit for the idea.
> 
> It's not just an optimisation.  It's also essential for supporting
> filesystems which don't have block devices.  I'm aware of at least two
> customer demands for this in different domains.
> 
> 1. Embedded uses with NOR flash
> 2. Cloud/virt uses with multiple VMs on a single piece of hardware

Yea, I didn't mean the full move to having PFNs in the tree, just using the
sector number in the radix tree instead of calling into the filesystem.

My apologies if you feel I didn't give you proper credit.


Re: [PATCHSET v4] blk-mq-scheduling framework

2017-01-18 Thread Jens Axboe
On 01/18/2017 08:14 AM, Paolo Valente wrote:
> according to the function blk_mq_sched_put_request, the
> mq.completed_request hook seems to always be invoked (if set) for a
> request for which the mq.put_rq_priv is invoked (if set).

Correct, any request that came out of blk_mq_sched_get_request()
will always have completed called on it, regardless of whether it
had IO started on it or not.

-- 
Jens Axboe



Re: [PATCHSET v4] blk-mq-scheduling framework

2017-01-18 Thread Paolo Valente

> Il giorno 17 gen 2017, alle ore 11:49, Paolo Valente 
>  ha scritto:
> 
> [NEW RESEND ATTEMPT]
> 
>> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe  ha scritto:
>> 
>> On 12/22/2016 08:28 AM, Paolo Valente wrote:
>>> 
 Il giorno 19 dic 2016, alle ore 22:05, Jens Axboe  ha 
 scritto:
 
 On 12/19/2016 11:21 AM, Paolo Valente wrote:
> 
>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe  ha 
>> scritto:
>> 
>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>> 
 Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe  ha 
 scritto:
 
 This is version 4 of this patchset, version 3 was posted here:
 
 https://marc.info/?l=linux-block&m=148178513407631&w=2
 
 From the discussion last time, I looked into the feasibility of having
 two sets of tags for the same request pool, to avoid having to copy
 some of the request fields at dispatch and completion time. To do that,
 we'd have to replace the driver tag map(s) with our own, and augment
 that with tag map(s) on the side representing the device queue depth.
 Queuing IO with the scheduler would allocate from the new map, and
 dispatching would acquire the "real" tag. We would need to change
 drivers to do this, or add an extra indirection table to map a real
 tag to the scheduler tag. We would also need a 1:1 mapping between
 scheduler and hardware tag pools, or additional info to track it.
 Unless someone can convince me otherwise, I think the current approach
 is cleaner.
 
 I wasn't going to post v4 so soon, but I discovered a bug that led
 to drastically decreased merging. Especially on rotating storage,
 this release should be fast, and on par with the merging that we
 get through the legacy schedulers.
 
>>> 
>>> I'm to modifying bfq.  You mentioned other missing pieces to come.  Do
>>> you already have an idea of what they are, so that I am somehow
>>> prepared to what won't work even if my changes are right?
>> 
>> I'm mostly talking about elevator ops hooks that aren't there in the new
>> framework, but exist in the old one. There should be no hidden
>> surprises, if that's what you are worried about.
>> 
>> On the ops side, the only ones I can think of are the activate and
>> deactivate, and those can be done in the dispatch_request hook for
>> activate, and put/requeue for deactivate.
>> 
> 
> You mean that there is no conceptual problem in moving the code of the
> activate interface function into the dispatch function, and the code
> of the deactivate into the put_request? (for a requeue it is a little
> less clear to me, so one step at a time)  Or am I missing
> something more complex?
 
 Yes, what I mean is that there isn't a 1:1 mapping between the old ops
 and the new ops. So you'll have to consider the cases.
 
 
>>> 
>>> Problem: whereas it seems easy and safe to do the simple increment
>>> that was done in activate_request somewhere else, I wonder whether a
>>> request may be deactivated before being completed.  If that may
>>> happen, then, without a deactivate_request hook, the increments would
>>> remain unbalanced.  Or are request completions always guaranteed,
>>> unless some hw/sw component breaks?
>> 
>> You should be able to do it in get/put_request. But you might need some
>> extra tracking, I'd need to double check.
> 
> Exactly, AFAICT something extra is apparently needed.  In particular,
> get is not ok, because dispatch is a different event (but dispatch is
> however an already controlled event), while put could be used,
> provided that it is guaranteed to be executed only after dispatch.  If
> it is not, then I think that an extra flag or something should be
> added to the request.  I don't know whether adding this extra piece
> would be worse than adding an extra hook.
> 
>> 
>> I'm trying to avoid adding
>> hooks that we don't truly need, the old interface had a lot of that. If
>> you find that you need a hook and it isn't there, feel free to add it.
>> activate/deactivate might be a good change.
>> 
> 
> If my comments above do not trigger any proposal of a better solution,
> then I will try by adding only one extra 'deactivate' hook.  Unless
> unbalanced hooks are a bad idea too.
> 

Jens,
according to the function blk_mq_sched_put_request, the
mq.completed_request hook seems to always be invoked (if set) for a
request for which the mq.put_rq_priv is invoked (if set).

If you don't warn me that I'm wrong, I will base on the above
assumption, and complete bfq without any additional hook or flag.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> -- 
>> Jens Axboe
> 

Re: [patch] block: add blktrace C events for bio-based drivers

2017-01-18 Thread Jeff Moyer
Hi, Jens,

Jens Axboe  writes:

> I like the change, hate the naming. I'd prefer one of two things:
>
> - Add bio_endio_complete() instead. That name sucks too, the
>   important part is flipping the __name() to have a trace
>   version instead.

ITYM a notrace version.  By default, we want tracing for bio_endio.  The
only callers that need the inverse of that are in the request-based
path, and there are only 2 of them.

> - Mark the bio as trace completed, and keep the naming. Since
>   it's only off the completion path, that can be just marking
>   the bi_flags non-atomically.

One issue with this is in generic_make_request_checks, where we can call
bio_endio without having called trace_block_bio_queue (so you could get
a C event with no corresponding Q).  To address that, we could make the
flag indicate that trace_block_bio_queue was performed, and clear it in
bio_complete, like so:

if (test_and_clear_bit(BIO_QUEUE_TRACED, &bio->bi_flags))
trace_block_bio_complete(...);

That would solve the problem of duplicate completions, but requires
setting the flag in the submission path and clearing it in the
completion path.  I think the former can be done with just a
bio_set_flag (i.e. non-atomic), right?  Of course, where to stick that
bio_set_flag call is another bike-shedding discussion waiting to happen
(i.e. does it go in the tracepoint itself?).
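
(A rough sketch of the flag-based variant under discussion, using the
hypothetical BIO_QUEUE_TRACED flag from above; the helper names and the flag
bit value are made up, and it side-steps the test_and_clear question by
using the non-atomic bio flag helpers, which is fine since completion of a
given bio happens once.)

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <trace/events/block.h>

/* Hypothetical flag; the bit value is arbitrary for this sketch and would
 * need a real slot in the bio flags enum.
 */
#define BIO_QUEUE_TRACED	13

/* Mark the bio when we emit the Q event on submission... */
static inline void blk_trace_bio_queue(struct request_queue *q, struct bio *bio)
{
	trace_block_bio_queue(q, bio);
	bio_set_flag(bio, BIO_QUEUE_TRACED);	/* non-atomic, submit path */
}

/* ...and emit the C event at most once on completion (4.10-era fields). */
static inline void blk_trace_bio_complete(struct bio *bio)
{
	if (bio_flagged(bio, BIO_QUEUE_TRACED)) {
		bio_clear_flag(bio, BIO_QUEUE_TRACED);
		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio,
					 bio->bi_error);
	}
}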

Alternatively, we could set the trace_completed flag in the paths where
we end I/O without having done the trace_block_bio_queue, but that seems
way uglier to me.

Can you think of any other options?  If we're choosing from the above,
my preference is for adding the bio_endio_notrace(), since it's so much
simpler.

Cheers,
Jeff


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Hannes Reinecke
On 01/18/2017 04:16 PM, Johannes Thumshirn wrote:
> On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
>>
>>> Hannes just spotted this:
>>> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>> const struct blk_mq_queue_data *bd)
>>> {
>>> [...]
>>>__nvme_submit_cmd(nvmeq, &cmnd);
>>>nvme_process_cq(nvmeq);
>>>spin_unlock_irq(&nvmeq->q_lock);
>>>return BLK_MQ_RQ_QUEUE_OK;
>>> out_cleanup_iod:
>>>nvme_free_iod(dev, req);
>>> out_free_cmd:
>>>nvme_cleanup_cmd(req);
>>>return ret;
>>> }
>>>
>>> So we're draining the CQ on submit. This of course makes polling for
>>> completions in the IRQ handler rather pointless as we already did in the
>>> submission path.
>>
>> I think you missed:
>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
> 
> I indeed did, thanks.
> 
But it doesn't help.

We still have to wait for the first interrupt, and if we're really
fast that's the only completion we have to process.

Try this:


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4b32e6..e2dd9e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
}
__nvme_submit_cmd(nvmeq, &cmnd);
spin_unlock(&nvmeq->sq_lock);
+   disable_irq_nosync(nvmeq_irq(nvmeq));
+   irq_poll_sched(&nvmeq->iop);
return BLK_MQ_RQ_QUEUE_OK;
 out_cleanup_iod:
nvme_free_iod(dev, req);

That should avoid the first interrupt, and with a bit of luck reduce the
number of interrupts _drastically_.
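
(For reference, a bare-bones sketch of the irq-poll pattern this is aiming
at; the my_queue bits and my_queue_reap_completions() are illustrative
stand-ins for driver code, only the irq_poll_*()/disable_irq_nosync()/
enable_irq() calls are real API, and the poll callback would be registered
with irq_poll_init(&q->iop, budget, my_queue_poll).)

#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/kernel.h>

struct my_queue {
	struct irq_poll iop;
	int irq;
};

/* Driver-specific CQ processing, not shown here. */
static int my_queue_reap_completions(struct my_queue *q, int budget);

/* Hard-IRQ handler: just disable the line and hand off to the poller. */
static irqreturn_t my_queue_irq(int irq, void *data)
{
	struct my_queue *q = data;

	disable_irq_nosync(irq);
	irq_poll_sched(&q->iop);
	return IRQ_HANDLED;
}

/* Softirq poll callback: reap up to 'budget' completions and re-enable
 * the interrupt once the queue is drained.
 */
static int my_queue_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done = my_queue_reap_completions(q, budget);

	if (done < budget) {
		irq_poll_complete(iop);
		enable_irq(q->irq);
	}
	return done;
}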

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Andrey Kuzmin
On Wed, Jan 18, 2017 at 5:40 PM, Sagi Grimberg  wrote:
>
>> Your report provided this stats with one-completion dominance for the
>> single-threaded case. Does it also hold if you run multiple fio
>> threads per core?
>
>
> It's useless to run more threads on that core, it's already fully
> utilized. That single thread is already posting a fair amount of
> submissions, so I don't see how adding more fio jobs can help in any
> way.

With a single thread, your completion processing and submission are
completely serialized. From my experience, it takes fio a couple of
microseconds to process a completion and submit the next request, and
that's (much) larger than your interrupt processing time.

Regards,
Andrey


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Johannes Thumshirn
On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
> 
> >Hannes just spotted this:
> >static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> > const struct blk_mq_queue_data *bd)
> >{
> >[...]
> >__nvme_submit_cmd(nvmeq, &cmnd);
> >nvme_process_cq(nvmeq);
> >spin_unlock_irq(&nvmeq->q_lock);
> >return BLK_MQ_RQ_QUEUE_OK;
> >out_cleanup_iod:
> >nvme_free_iod(dev, req);
> >out_free_cmd:
> >nvme_cleanup_cmd(req);
> >return ret;
> >}
> >
> >So we're draining the CQ on submit. This of course makes polling for
> >completions in the IRQ handler rather pointless as we already did in the
> >submission path.
> 
> I think you missed:
> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007

I indeed did, thanks.

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Sagi Grimberg



Hannes just spotted this:
static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 const struct blk_mq_queue_data *bd)
{
[...]
__nvme_submit_cmd(nvmeq, &cmnd);
nvme_process_cq(nvmeq);
spin_unlock_irq(&nvmeq->q_lock);
return BLK_MQ_RQ_QUEUE_OK;
out_cleanup_iod:
nvme_free_iod(dev, req);
out_free_cmd:
nvme_cleanup_cmd(req);
return ret;
}

So we're draining the CQ on submit. This of course makes polling for
completions in the IRQ handler rather pointless as we already did in the
submission path.


I think you missed:
http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Johannes Thumshirn
On Wed, Jan 18, 2017 at 04:27:24PM +0200, Sagi Grimberg wrote:
> 
> >So what you say is you saw a consumed == 1 [1] most of the time?
> >
> >[1] from 
> >http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836
> 
> Exactly. By processing 1 completion per interrupt it makes perfect sense
> why this performs poorly, it's not worth paying the soft-irq schedule
> for only a single completion.
> 
> What I'm curious is how consistent is this with different devices (wish
> I had some...)

Hannes just spotted this:
static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 const struct blk_mq_queue_data *bd)
{
[...]
__nvme_submit_cmd(nvmeq, &cmnd);
nvme_process_cq(nvmeq);
spin_unlock_irq(&nvmeq->q_lock);
return BLK_MQ_RQ_QUEUE_OK;
out_cleanup_iod:
nvme_free_iod(dev, req);
out_free_cmd:
nvme_cleanup_cmd(req);
return ret;
}

So we're draining the CQ on submit. This of course makes polling for
completions in the IRQ handler rather pointless as we already did in the
submission path. 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Sagi Grimberg



Your report provided this stats with one-completion dominance for the
single-threaded case. Does it also hold if you run multiple fio
threads per core?


It's useless to run more threads on that core; it's already fully
utilized. That single thread is already posting a fair amount of
submissions, so I don't see how adding more fio jobs can help in any
way.


Re: kernel oops with blk-mq-sched latest

2017-01-18 Thread Jens Axboe
On 01/18/2017 03:48 AM, Hannes Reinecke wrote:
> Nearly there.
> You're missing a 'blk_mq_start_hw_queues(q)' after
> blk_mq_unfreeze_queue(); without it the queue will stall after switching
> the scheduler.

Yes indeed, forgot that. Needed after the quiesce.
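
(Roughly, the ordering being agreed on here, as a hedged sketch only; the
actual elevator-switch code in the blk-mq-sched branch may differ.)

	/* Sketch of the agreed sequence for switching schedulers on blk-mq. */
	blk_mq_freeze_queue(q);
	blk_mq_quiesce_queue(q);

	/* ... tear down the old scheduler and set up the new one ... */

	blk_mq_unfreeze_queue(q);
	blk_mq_start_hw_queues(q);	/* without this the queue stalls */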

> Also what's quite suspicious is this:
> 
> struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>   struct request_queue *q)
> {
>   struct blkcg_gq *blkg;
> 
>   WARN_ON_ONCE(!rcu_read_lock_held());
>   lockdep_assert_held(q->queue_lock);
> 
>   /*
>* This could be the first entry point of blkcg implementation and
>* we shouldn't allow anything to go through for a bypassing queue.
>*/
>   if (unlikely(blk_queue_bypass(q)))
>   return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
> 
> which now won't work as the respective flags aren't set anymore.
> Not sure if that's a problem, though.
> But you might want to look at that, too.

dying is still used on blk-mq, but yes, the bypass check should now be
frozen for blk-mq. Not really directly related to the above change,
but it should be fixed up.

> Nevertheless, with the mentioned modifications to your patch the crashes
> don't occur anymore.

Great

> Sad news is that it doesn't help _that_ much on spinning rust mpt3sas;
> there I still see a ~50% performance penalty on reads.
> Write's slightly better than sq performance, though.

What is the test case? Full details please, from hardware to what you
are running. As I've mentioned before, I don't necessarily think your
performance issues are related to scheduling. Would be nice to get
to the bottom of it, though. And for that, I need more details.

-- 
Jens Axboe



Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Andrey Kuzmin
On Wed, Jan 18, 2017 at 5:27 PM, Sagi Grimberg  wrote:
>
>> So what you say is you saw a consumed == 1 [1] most of the time?
>>
>> [1] from
>> http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836
>
>
> Exactly. By processing 1 completion per interrupt it makes perfect sense
> why this performs poorly, it's not worth paying the soft-irq schedule
> for only a single completion.

Your report provided these stats, with one-completion dominance, for the
single-threaded case. Does it also hold if you run multiple fio
threads per core?

Regards,
Andrey

>
> What I'm curious is how consistent is this with different devices (wish
> I had some...)
>
>


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Sagi Grimberg



So what you say is you saw a consumed == 1 [1] most of the time?

[1] from 
http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836


Exactly. By processing 1 completion per interrupt it makes perfect sense
why this performs poorly; it's not worth paying the soft-irq schedule
for only a single completion.

What I'm curious about is how consistent this is with different devices
(wish I had some...)


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Hannes Reinecke
On 01/17/2017 05:50 PM, Sagi Grimberg wrote:
> 
>> So it looks like we are super not efficient, because most of the time
>> we catch 1 completion per interrupt, and the whole point is that we
>> need to find more! This fio is single threaded with QD=32, so I'd
>> expect us to be somewhere in 8-31 almost all the time... I also tried
>> QD=1024, histogram is still the same.
>>
>>
>> It looks like it takes you longer to submit an I/O than to service an
>> interrupt,
> 
> Well, with irq-poll we do practically nothing in the interrupt handler,
> only schedule irq-poll. Note that the latency measures are only from
> the point the interrupt arrives and the point we actually service it
> by polling for completions.
> 
>> so increasing queue depth in the singe-threaded case doesn't
>> make much difference. You might want to try multiple threads per core
>> with QD, say, 32
> 
> This is how I ran, QD=32.

The one thing which I found _really_ curious is this:

  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued    : total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=256

(note the lines starting with 'submit' and 'complete').
They are _always_ 4, irrespective of the hardware and/or tests which I
run. Jens, what are these numbers supposed to mean?
Is this intended?
ATM the information content of those two lines is essentially 0,
seeing that they never change irrespective of the tests I'm doing
(and which fio version I'm using ...).

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-18 Thread Johannes Thumshirn
On Tue, Jan 17, 2017 at 06:38:43PM +0200, Sagi Grimberg wrote:
> 
> >Just for the record, all tests you've run are with the upper
> >irq_poll_budget of 256 [1]?
> 
> Yes, but that's the point, I never ever reach this budget because
> I'm only processing 1-2 completions per interrupt.
> 
> >We (Hannes and me) recently stumbed across this when trying to poll for more
> >than 256 queue entries in the drivers we've been testing.

s/stumbed/stumbled/

> 
> What do you mean by stumbed? irq-poll should be agnostic to the fact
> that drivers can poll more than their given budget?

So what you say is you saw a consomed == 1 [1] most of the time?

[1] from 
http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: kernel oops with blk-mq-sched latest

2017-01-18 Thread Hannes Reinecke
On 01/17/2017 02:00 PM, Jens Axboe wrote:
> On 01/17/2017 04:47 AM, Jens Axboe wrote:
>> On 01/17/2017 12:57 AM, Hannes Reinecke wrote:
>>> Hi Jens,
>>>
>>> I tried your latest patchset from
>>>
>>> git.kernel.dk/linux-block blk-mq-sched
>>>
>>> I see a kernel oops when shutting down:
>>>
>>> [ 2132.708929] systemd-shutdown[1]: Detaching DM devices.
>>> [ 2132.965107] BUG: unable to handle kernel NULL pointer dereference at
>>> 
>>> 0001
>>> [ 2133.037182] IP: dd_merged_requests+0x6/0x60
>>> [ 2133.077816] PGD 0
>>> [ 2133.077818]
>>> [ 2133.113087] Oops:  [#1] SMP
>>> [ list of modules removed ]
>>> [ 2133.925265] CPU: 20 PID: 1 Comm: systemd-shutdow Not tainted
>>> 4.10.0-rc4+ #543
>>> [ 2133.990034] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 09/18/2013
>>> [ 2134.050522] task: 88042d614040 task.stack: c9000315
>>> [ 2134.106915] RIP: 0010:dd_merged_requests+0x6/0x60
>>> [ 2134.150593] RSP: 0018:c90003153b18 EFLAGS: 00010002
>>> [ 2134.198740] RAX: 81cc6de0 RBX: 8804296d5040 RCX:
>>> 0001
>>> [ 2134.262708] RDX: 0001 RSI: 0001 RDI:
>>> 8804296d5040
>>> [ 2134.326987] RBP: c90003153b30 R08:  R09:
>>> 
>>> [ 2134.391054] R10:  R11: 0001f8180001f815 R12:
>>> 
>>> [ 2134.456095] R13: 8804296d5040 R14: 8804099801f0 R15:
>>> 0004
>>> [ 2134.521196] FS:  7fd64d3bf840() GS:88042f90()
>>> knlGS:
>>> [ 2134.595178] CS:  0010 DS:  ES:  CR0: 80050033
>>> [ 2134.648637] CR2: 0001 CR3: 00081b892000 CR4:
>>> 000406e0
>>> [ 2134.713349] Call Trace:
>>> [ 2134.737168]  ? elv_drain_elevator+0x29/0xa0
>>> [ 2134.775821]  __blk_drain_queue+0x52/0x1a0
>>> [ 2134.812473]  blk_queue_bypass_start+0x6e/0xa0
>>> [ 2134.854009]  blkcg_deactivate_policy+0x30/0xf0
>>> [ 2134.894993]  blk_throtl_exit+0x34/0x50
>>> [ 2134.929450]  blkcg_exit_queue+0x35/0x40
>>> [ 2134.965089]  blk_release_queue+0x33/0xe0
>>> [ 2135.001364]  kobject_cleanup+0x63/0x170
>>> [ 2135.037412]  kobject_put+0x25/0x50
>>> [ 2135.068622]  blk_cleanup_queue+0x198/0x260
>>> [ 2135.107780]  cleanup_mapped_device+0xb5/0xf0 [dm_mod]
>>> [ 2135.154741]  __dm_destroy+0x1a5/0x290 [dm_mod]
>>> [ 2135.195009]  dm_destroy+0x13/0x20 [dm_mod]
>>> [ 2135.232149]  dev_remove+0xde/0x120 [dm_mod]
>>> [ 2135.270184]  ? dev_suspend+0x210/0x210 [dm_mod]
>>> [ 2135.311478]  ctl_ioctl+0x20b/0x510 [dm_mod]
>>> [ 2135.349680]  ? terminate_walk+0xc3/0x140
>>> [ 2135.385442]  ? kmem_cache_free+0x10/0x260
>>> [ 2135.422315]  dm_ctl_ioctl+0x13/0x20 [dm_mod]
>>>
>>> Known issue?
>>
>> Try with this; it looks like we're calling the old bypass path for the
>> new (mq) path, which is now triggering because q->elevator is true.
> 
> Should have grepped, there's one more path that uses the old bypass
> path for q->mq_ops, potentially. This one handles that one, too.
> 
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 8ba0af780e88..2630f64bed19 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1223,7 +1223,11 @@ int blkcg_activate_policy(struct request_queue *q,
>   if (blkcg_policy_enabled(q, pol))
>   return 0;
>  
> - blk_queue_bypass_start(q);
> + if (q->mq_ops) {
> + blk_mq_freeze_queue(q);
> + blk_mq_quiesce_queue(q);
> + } else
> + blk_queue_bypass_start(q);
>  pd_prealloc:
>   if (!pd_prealloc) {
>   pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q->node);
> @@ -1261,7 +1265,10 @@ int blkcg_activate_policy(struct request_queue *q,
>  
>   spin_unlock_irq(q->queue_lock);
>  out_bypass_end:
> - blk_queue_bypass_end(q);
> + if (q->mq_ops)
> + blk_mq_unfreeze_queue(q);
> + else
> + blk_queue_bypass_end(q);
>   if (pd_prealloc)
>   pol->pd_free_fn(pd_prealloc);
>   return ret;
> @@ -1284,7 +1291,12 @@ void blkcg_deactivate_policy(struct request_queue *q,
>   if (!blkcg_policy_enabled(q, pol))
>   return;
>  
> - blk_queue_bypass_start(q);
> + if (q->mq_ops) {
> + blk_mq_freeze_queue(q);
> + blk_mq_quiesce_queue(q);
> + } else
> + blk_queue_bypass_start(q);
> +
>   spin_lock_irq(q->queue_lock);
>  
>   __clear_bit(pol->plid, q->blkcg_pols);
> @@ -1304,7 +1316,11 @@ void blkcg_deactivate_policy(struct request_queue *q,
>   }
>  
>   spin_unlock_irq(q->queue_lock);
> - blk_queue_bypass_end(q);
> +
> + if (q->mq_ops)
> + blk_mq_unfreeze_queue(q);
> + else
> + blk_queue_bypass_end(q);
>  }
>  EXPORT_SYMBOL_GPL(blkcg_deactivate_policy);
>  
> diff --git a/block/elevator.c b/block/elevator.c
> index d14cb87e6564..464372840774 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -613,6 +613,9 @@ void elv_drain_elevator(struct request_queue *q)
>  {
>   static int prin

Re: [Lsf-pc] [LSF/MM ATTEND] Un-addressable device memory and block/fs implications

2017-01-18 Thread Jan Kara
On Fri 16-12-16 08:44:11, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> 
> > I would like to discuss un-addressable device memory in the context of
> > filesystem and block device. Specifically, how to handle write-back, read,
> > ... when a filesystem page is migrated to device memory that the CPU cannot
> > access.
> >
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this is
> > worth discussing during summit see how people feels about such plan and
> > if they have better ideas.
> >
> >
> > I also like to join discussions on:
> >   - Peer-to-Peer DMAs between PCIe devices
> >   - CDM coherent device memory
> >   - PMEM
> >   - overall mm discussions
> 
> I would like to attend this discussion. I can talk about coherent device
> memory and how having HMM handle that will make it easy to have one
> interface for device drivers. For the coherent device case we definitely
> need page cache migration support.

Aneesh, did you intend this as your request to attend? You posted it as a
reply to another email so it is not really clear. Note that each attend
request should be a separate email so that it does not get lost...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jan Kara
On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> Your note on the online repair does raise another tangentially related
> topic. Currently, if there are badblocks, writes via the bio submission
> path will clear the error (if the hardware is able to remap the bad
> locations). However, if the filesystem is mounted with DAX, even
> non-mmap operations - read() and write() will go through the dax paths
> (dax_do_io()). We haven't found a good/agreeable way to perform
> error-clearing in this case. So currently, if a dax mounted filesystem
> has badblocks, the only way to clear those badblocks is to mount it
> without DAX, and overwrite/zero the bad locations. This is a pretty
> terrible user experience, and I'm hoping this can be solved in a better
> way.

Please remind me, what is the problem with the DAX code doing the necessary
work to clear the error when it gets EIO from memcpy on write?

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH] genhd: Do not hold event lock when scheduling workqueue elements

2017-01-18 Thread Hannes Reinecke
When scheduling workqueue elements the callback function might be called
directly, so holding the event lock is potentially dangerous as it might
lead to a deadlock:

[  989.542827] INFO: task systemd-udevd:459 blocked for more than 480 seconds.
[  989.609721]   Not tainted 4.10.0-rc4+ #546
[  989.648545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  989.716429] systemd-udevd   D13368   459  1 0x0004
[  989.716435] Call Trace:
[  989.716444]  __schedule+0x2f2/0xb10
[  989.716447]  schedule+0x3d/0x90
[  989.716449]  schedule_timeout+0x2fc/0x600
[  989.716451]  ? wait_for_completion+0xac/0x110
[  989.716456]  ? mark_held_locks+0x66/0x90
[  989.716458]  ? _raw_spin_unlock_irq+0x2c/0x40
[  989.716460]  ? trace_hardirqs_on_caller+0x111/0x1e0
[  989.716461]  wait_for_completion+0xb4/0x110
[  989.716464]  ? wake_up_q+0x80/0x80
[  989.716469]  flush_work+0x1ea/0x2a0
[  989.716470]  ? flush_work+0x24e/0x2a0
[  989.716472]  ? destroy_worker+0xd0/0xd0
[  989.716474]  __cancel_work_timer+0x11a/0x1e0
[  989.716476]  ? trace_hardirqs_on_caller+0x111/0x1e0
[  989.716477]  cancel_delayed_work_sync+0x13/0x20
[  989.716482]  disk_block_events+0x82/0x90
[  989.716487]  __blkdev_get+0x58/0x450
[  989.716488]  blkdev_get+0x1ce/0x340
[  989.716490]  ? _raw_spin_unlock+0x27/0x40
[  989.716492]  blkdev_open+0x5b/0x70
[  989.716501]  do_dentry_open+0x213/0x310
[  989.716505]  ? blkdev_get_by_dev+0x50/0x50
[  989.716507]  vfs_open+0x4f/0x80
[  989.716518]  ? may_open+0x9b/0x100
[  989.716521]  path_openat+0x48a/0xdc0
[  989.716527]  ? _crng_backtrack_protect+0x30/0x80
[  989.716530]  do_filp_open+0x7e/0xd0
[  989.716533]  ? _raw_spin_unlock+0x27/0x40
[  989.716537]  ? __alloc_fd+0xf7/0x210
[  989.716539]  do_sys_open+0x115/0x1f0
[  989.716542]  SyS_open+0x1e/0x20
[  989.716546]  entry_SYSCALL_64_fastpath+0x23/0xc6

Signed-off-by: Hannes Reinecke 

diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4f..ae46caa 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1426,8 +1426,7 @@ struct disk_events {
struct gendisk  *disk;  /* the associated disk */
spinlock_t  lock;
 
-   struct mutexblock_mutex;/* protects blocking */
-   int block;  /* event blocking depth */
+   atomic_tblock;  /* event blocking depth */
unsigned intpending;/* events already sent out */
unsigned intclearing;   /* events being cleared */
 
@@ -1488,26 +1487,13 @@ static unsigned long disk_events_poll_jiffies(struct gendisk *disk)
 void disk_block_events(struct gendisk *disk)
 {
struct disk_events *ev = disk->ev;
-   unsigned long flags;
-   bool cancel;
 
if (!ev)
return;
 
-   /*
-* Outer mutex ensures that the first blocker completes canceling
-* the event work before further blockers are allowed to finish.
-*/
-   mutex_lock(&ev->block_mutex);
-
-   spin_lock_irqsave(&ev->lock, flags);
-   cancel = !ev->block++;
-   spin_unlock_irqrestore(&ev->lock, flags);
-
-   if (cancel)
+   if (atomic_inc_return(&ev->block) == 1)
cancel_delayed_work_sync(&disk->ev->dwork);
 
-   mutex_unlock(&ev->block_mutex);
 }
 
 static void __disk_unblock_events(struct gendisk *disk, bool check_now)
@@ -1516,23 +1502,18 @@ static void __disk_unblock_events(struct gendisk *disk, bool check_now)
unsigned long intv;
unsigned long flags;
 
-   spin_lock_irqsave(&ev->lock, flags);
-
-   if (WARN_ON_ONCE(ev->block <= 0))
-   goto out_unlock;
-
-   if (--ev->block)
-   goto out_unlock;
+   if (atomic_dec_return(&ev->block) > 0)
+   return;
 
+   spin_lock_irqsave(&ev->lock, flags);
intv = disk_events_poll_jiffies(disk);
+   spin_unlock_irqrestore(&ev->lock, flags);
if (check_now)
queue_delayed_work(system_freezable_power_efficient_wq,
&ev->dwork, 0);
else if (intv)
queue_delayed_work(system_freezable_power_efficient_wq,
&ev->dwork, intv);
-out_unlock:
-   spin_unlock_irqrestore(&ev->lock, flags);
 }
 
 /**
@@ -1572,10 +1553,10 @@ void disk_flush_events(struct gendisk *disk, unsigned int mask)
 
spin_lock_irq(&ev->lock);
ev->clearing |= mask;
-   if (!ev->block)
+   spin_unlock_irq(&ev->lock);
+   if (!atomic_read(&ev->block))
mod_delayed_work(system_freezable_power_efficient_wq,
&ev->dwork, 0);
-   spin_unlock_irq(&ev->lock);
 }
 
 /**
@@ -1666,12 +1647,11 @@ static void disk_check_events(struct disk_events *ev,
*clearing_ptr &= ~clearing;
 
intv = disk_events_poll_jiffies(disk);
-   if (!ev->block && intv)
+   spin_unlock_irq(&ev->lock);
+   if (!atomic_read(
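
Distilled down, the pattern the patch moves to is roughly the sketch below
(illustrative type and function names only, not the actual block/genhd.c
code): the blocking depth becomes an atomic counter, so the first blocker
cancels the polling work and the last unblocker re-arms it, without
ev->lock being held across cancel_delayed_work_sync().

#include <linux/atomic.h>
#include <linux/workqueue.h>

/* illustrative stand-in for struct disk_events */
struct ev_sketch {
	atomic_t		block;	/* event blocking depth */
	struct delayed_work	dwork;
};

static void ev_block_sketch(struct ev_sketch *ev)
{
	/*
	 * 0 -> 1 transition: we are the first blocker, so cancel the polling
	 * work.  No lock is held here, which is the point of the patch:
	 * cancel_delayed_work_sync() may end up flushing the work item.
	 */
	if (atomic_inc_return(&ev->block) == 1)
		cancel_delayed_work_sync(&ev->dwork);
}

static void ev_unblock_sketch(struct ev_sketch *ev, unsigned long intv)
{
	/* only the last unblocker re-arms the polling work */
	if (atomic_dec_return(&ev->block) > 0)
		return;

	if (intv)
		queue_delayed_work(system_freezable_power_efficient_wq,
				   &ev->dwork, intv);
}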

Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jan Kara
On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
> 
> At mount, the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.

So I believe we have to distinguish three cases so that we are on the same
page.

1) PMEM is exposed only via a block interface for legacy filesystems to
use. Here, all the bad blocks handling IMO must happen in the NVDIMM driver.
Looking from outside, the IO either returns with EIO or succeeds. As a
result you cannot ever get rid of bad blocks handling in the NVDIMM driver.

2) PMEM is exposed to a DAX-aware filesystem. This seems to be what you are
mostly interested in. We could possibly do something more efficient than
what the NVDIMM driver does; however, the complexity would be relatively
high and frankly I'm far from convinced this is really worth it. If there
are so many badblocks that this would matter, the HW has IMHO bigger
problems than performance.

3) PMEM filesystem - there, things are even more difficult, as was already
noted elsewhere in the thread. But for now I'd like to leave that aside so
as not to complicate things too much.

Now my question: Why do we bother with badblocks at all? In cases 1) and 2),
if the platform can recover from MCE, we can just always access persistent
memory using memcpy_mcsafe(); if that fails, return -EIO. Actually that
already seems to happen, so we just need to make sure all places handle the
returned errors properly (e.g. fs/dax.c does not seem to) and we are done.
No need for a bad blocks list at all, and no slowdown unless we hit a bad
cell, in which case who cares about performance when the data is gone...
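
A minimal sketch of that approach, assuming memcpy_mcsafe() keeps its
current x86 semantics of returning 0 on success and a negative error when
the copy trips over poisoned memory (the helper name below is made up; this
is not the actual fs/dax.c or pmem code):

#include <linux/errno.h>
#include <linux/string.h>	/* memcpy_mcsafe() */
#include <linux/types.h>

/*
 * Hypothetical read-side helper: no badblocks lookup at all, just let the
 * machine-check-safe copy tell us whether the media is poisoned.
 */
static ssize_t pmem_copy_to_caller(void *dst, const void *pmem_addr,
				   size_t len)
{
	if (memcpy_mcsafe(dst, pmem_addr, len))
		return -EIO;	/* surface the poisoned range as an I/O error */
	return len;
}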

For platforms that cannot recover from MCE - just buy better hardware ;).
Seriously, I doubt people can seriously use a machine that will unavoidably
randomly reboot (as there is always a risk you hit an error that has not
been uncovered by the background scrub). But maybe for big cloud providers
the cost savings may offset the inconvenience, I don't know. But still, for
that case the bad blocks handling in the NVDIMM code as we do it now looks
good enough?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes

2017-01-18 Thread Christoph Hellwig
On Tue, Jan 17, 2017 at 10:06:51PM +0200, Sagi Grimberg wrote:
>
>> @@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue 
>> *queue,
>>  }
>>
>
> Christoph, a little above here we still look at blk_rq_bytes(),
> shouldn't that look at blk_rq_payload_bytes() too?

The check is ok for now as it's just zero vs non-zero.  It's somewhat
broken for Write Zeroes, though.  I've fixed it in this series
which I need to submit:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/write-zeroes-take-2
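
For readers following along, the distinction is roughly the one below - a
paraphrase of the helper the series adds plus a hypothetical call site, not
a verbatim quote of the tree:

#include <linux/blkdev.h>

/*
 * Roughly what the new helper does (paraphrased): for requests that carry a
 * special payload (discard, write same, and later write zeroes), report the
 * bytes that actually go on the wire rather than the range the request
 * operates on.
 */
static inline unsigned int payload_bytes_sketch(struct request *rq)
{
	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
		return rq->special_vec.bv_len;
	return blk_rq_bytes(rq);
}

/*
 * Hypothetical map_data-style call site: a plain zero check only needs to
 * know whether there is any data at all, so blk_rq_bytes() happens to work
 * there, but anything that sizes SGLs or MRs has to use the payload bytes.
 */
static int map_data_sketch(struct request *rq)
{
	if (!payload_bytes_sketch(rq))
		return 0;	/* nothing to map */
	/* ... set up an SGL sized by payload_bytes_sketch(rq) ... */
	return 0;
}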