Re: [PATCH 5/6] blk-mq: Fix queue freeze deadlock
If hardware queues are stopped for some event, like the device has been suspended by power management, requests allocated on that hardware queue are stuck indefinitely, causing a queue freeze to wait forever.

I have a problem with this patch. IMO, this is a general issue, so why do we tie the fix to calling blk_mq_update_nr_hw_queues()? We might not need to update nr_hw_queues at all. I'm fine with blk_mq_abandon_stopped_requests(), but not with its call-site. Usually a driver knows when it wants to abandon all busy requests via blk_mq_tagset_busy_iter(); maybe the right approach is to add a hook for all allocated tags? Or have blk_mq_quiesce_queue() take a "fail all requests" parameter from the callers? This patch is overly aggressive about failing allocated requests. There are scenarios where we wouldn't want to abandon them, like when the hw context is about to be brought back online, but this patch assumes all need to be abandoned.

I'll see if there are some other tricks we can have a driver do.

Thanks for the suggestions. I agree, though I do think this should be driven from the driver, because for fabrics we might have some fabric error that triggers a periodic reconnect. So "the hw context is about to be brought back" is unknown from the driver's point of view, and when we delete the controller (because we give up) is exactly where we need to abandon the allocated requests.

-- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
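For illustration, a driver-driven abandon pass might look roughly like the sketch below. This is only a sketch, loosely modeled on the NVMe driver's cancel path; the demo_* names and the demo_ctrl structure are hypothetical, and note that blk_mq_tagset_busy_iter() only visits started requests, so the "hook for all allocated tags" suggested above would still be needed to cover merely-allocated ones:

```c
/* Sketch only: shape of a driver-driven "abandon all busy requests"
 * pass, using hypothetical demo_* names.  Not buildable standalone. */
static void demo_abandon_request(struct request *req, void *data,
				 bool reserved)
{
	/* Fail the request back to its submitter instead of leaving it
	 * stuck on a stopped hardware queue. */
	blk_mq_end_request(req, -EIO);
}

static void demo_abandon_busy_requests(struct demo_ctrl *ctrl)
{
	/* Walk every started request on the tag set and fail it. */
	blk_mq_tagset_busy_iter(&ctrl->tagset, demo_abandon_request, ctrl);
}
```

A driver would call something like demo_abandon_busy_requests() from its own teardown path (e.g. when a fabrics controller is deleted after reconnect attempts are given up), rather than having the block core decide when to abandon.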
Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes
@@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue, }

Christoph, a little above here we still look at blk_rq_bytes(); shouldn't that look at blk_rq_payload_bytes() too?

The check is ok for now as it's just zero vs non-zero. It's somewhat broken for Write Zeroes, though. I've fixed it in this series which I need to submit: http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/write-zeroes-take-2

I see, thanks for the clarification.
[bug report] blk-mq-sched: add framework for MQ capable IO schedulers
Hello Jens Axboe, This is a semi-automatic email about new static checker warnings. The patch bd166ef183c2: "blk-mq-sched: add framework for MQ capable IO schedulers" from Jan 17, 2017, leads to the following Smatch complaint:

block/elevator.c:234 elevator_init() error: we previously assumed 'e' could be null (see line 229)

block/elevator.c
   228
   229		if (!e) {
   		    ^^
Null.

   230			printk(KERN_ERR
   231				"Default I/O scheduler not found. " \
   232				"Using noop/none.\n");
   233			if (q->mq_ops) {
   234				elevator_put(e);
   				^^^
This will Oops.

   235				return 0;
   236			}

regards, dan carpenter
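One way the branch could avoid the NULL dereference is simply not to call elevator_put() in the !e case, since elevator_put() drops a reference on an elevator type that was never obtained here. A minimal sketch of the shape of such a fix (not necessarily the committed one; the elevator_get("noop", false) fallback after the branch is assumed from the printed message):

```c
	if (!e) {
		printk(KERN_ERR
			"Default I/O scheduler not found. " \
			"Using noop/none.\n");
		if (q->mq_ops)
			return 0;	/* no elevator_put(): 'e' is NULL here */
		e = elevator_get("noop", false);
	}
```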
Re: [LSF/MM TOPIC] block level event logging for storage media management
On 01/19/2017 12:34 AM, Song Liu wrote: > > Media health monitoring is very important for large scale distributed storage > systems. > Traditionally, enterprise storage controllers maintain event logs for > attached storage > devices. However, these controller managed logs do not scale well for large > scale > distributed systems. > > While designing a more flexible and scalable event logging systems, we think > it is better > to build the log in block layer. Block level event logging covers all major > storage media > (SCSI, SATA, NVMe), and thus minimizes redundant work for different > protocols. > > In this LSF/MM, we would like to discuss the following topics with the > community: > 1. Mechanism for drivers report events (or errors) to block layer. >Basically, we will need a traceable function for the drivers to report > errors >(most likely right before calling end_request or bio_endio). > > 2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event > logging? > > 3. How should we categorize different events? >Currently, there are existing code that translates ATA error > (ata_to_sense_error) >and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can >leverage SCSI Key Code Qualifier for event categorizations. > > 4. Detailed discussions on data structure for event logging. > > We will be able to show a prototype implementation during LSF/MM. Very good topic; I'm very much in favour of it. That ties in rather nicely with my multipath redesign, where I've added a notifier chain for block events. Cheers, Hannes -- Dr. Hannes Reinecke   Teamlead Storage & Networking   h...@suse.de   +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [LSF/MM TOPIC] block level event logging for storage media management
On 2017/1/19 7:34 AM, Song Liu wrote: > > Media health monitoring is very important for large scale distributed storage > systems. > Traditionally, enterprise storage controllers maintain event logs for > attached storage > devices. However, these controller managed logs do not scale well for large > scale > distributed systems. > > While designing a more flexible and scalable event logging systems, we think > it is better > to build the log in block layer. Block level event logging covers all major > storage media > (SCSI, SATA, NVMe), and thus minimizes redundant work for different > protocols. > > In this LSF/MM, we would like to discuss the following topics with the > community: > 1. Mechanism for drivers report events (or errors) to block layer. >Basically, we will need a traceable function for the drivers to report > errors >(most likely right before calling end_request or bio_endio). > > 2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event > logging? > > 3. How should we categorize different events? >Currently, there are existing code that translates ATA error > (ata_to_sense_error) >and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can >leverage SCSI Key Code Qualifier for event categorizations. > > 4. Detailed discussions on data structure for event logging. > > We will be able to show a prototype implementation during LSF/MM. This is an interesting topic. For stacked block devices, all layers higher than the fault layer will observe the media error; reporting the underlying failure in every layer may introduce quite a lot of noise. Yes, I am willing to attend this discussion. Thanks. Coly Li
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote: > Jan Kara writes: > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote: > >> Your note on the online repair does raise another tangentially related > >> topic. Currently, if there are badblocks, writes via the bio submission > >> path will clear the error (if the hardware is able to remap the bad > >> locations). However, if the filesystem is mounted with DAX, even > >> non-mmap operations - read() and write() will go through the dax paths > >> (dax_do_io()). We haven't found a good/agreeable way to perform > >> error-clearing in this case. So currently, if a dax mounted filesystem > >> has badblocks, the only way to clear those badblocks is to mount it > >> without DAX, and overwrite/zero the bad locations. This is a pretty > >> terrible user experience, and I'm hoping this can be solved in a better > >> way. > > > > Please remind me, what is the problem with DAX code doing necessary work to > > clear the error when it gets EIO from memcpy on write? > > You won't get an MCE for a store; only loads generate them. > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax? Not necessarily; XFS usually implements this by punching out the range and then reallocating it as unwritten blocks. --D
RE: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
-----Original Message----- From: Jeff Moyer [mailto:jmo...@redhat.com] Sent: Wednesday, January 18, 2017 12:48 PM To: Slava Dubeyko Cc: Jan Kara ; linux-nvd...@lists.01.org ; linux-block@vger.kernel.org; Viacheslav Dubeyko ; Linux FS Devel ; lsf...@lists.linux-foundation.org Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems >>> Well, the situation with NVM is more like with DRAM AFAIU. It is >>> quite reliable but given the size the probability *some* cell has degraded >>> is quite high. >>> And similar to DRAM you'll get MCE (Machine Check Exception) when you >>> try to read such cell. As Vishal wrote, the hardware does some >>> background scrubbing and relocates stuff early if needed but nothing is >>> 100%. >> >> My understanding that hardware does the remapping the affected address >> range (64 bytes, for example) but it doesn't move/migrate the stored >> data in this address range. So, it sounds slightly weird. Because it >> means that no guarantee to retrieve the stored data. It sounds that >> file system should be aware about this and has to be heavily protected >> by some replication or erasure coding scheme. Otherwise, if the >> hardware does everything for us (remap the affected address region and >> move data into a new address region) then why does file system need to >> know about the affected address regions? > >The data is lost, that's why you're getting an ECC. It's tantamount to -EIO >for a disk block access. I see the three possible cases here: (1) bad block has been discovered (no remap, no recovering) -> data is lost; -EIO for a disk block access, block is always bad; (2) bad block has been discovered and remapped -> data is lost; -EIO for a disk block access. (3) bad block has been discovered, remapped and recovered -> no data is lost. >> Let's imagine that the affected address range will equal to 64 bytes. >> It sounds for me that for the case of block device it will affect the >> whole logical block (4 KB). 
> > 512 bytes, and yes, that's the granularity at which we track errors in the > block layer, so that's the minimum amount of data you lose. I think it depends on what granularity the hardware supports. It could be 512 bytes, 4 KB, maybe greater. >> The situation is more critical for the case of DAX approach. Correct >> me if I wrong but my understanding is the goal of DAX is to provide >> the direct access to file's memory pages with minimal file system >> overhead. So, it looks like that raising bad block issue on file >> system level will affect a user-space application. Because, finally, >> user-space application will need to process such trouble (bad block >> issue). It sounds for me as really weird situation. What can protect a >> user-space application from encountering the issue with partially >> incorrect memory page? > > Applications need to deal with -EIO today. This is the same sort of thing. > If an application trips over a bad block during a load from persistent memory, > they will get a signal, and they can either handle it or not. > > Have a read through this specification and see if it clears anything up for > you: > http://www.snia.org/tech_activities/standards/curr_standards/npm Thank you for sharing this. So, if a user-space application follows the NVM Programming Model then it will be able to survive by catching and processing the exceptions. But these applications have yet to be implemented, and they will also need special recovery techniques. It sounds like legacy user-space applications will be unable to survive a load/store failure in the NVM.PM.FILE mode. Thanks, Vyacheslav Dubeyko.
Re: [LSF/MM TOPIC] block level event logging for storage media management
On Wed, 2017-01-18 at 23:34 +0000, Song Liu wrote: > Media health monitoring is very important for large scale distributed storage > systems. > Traditionally, enterprise storage controllers maintain event logs for > attached storage > devices. However, these controller managed logs do not scale well for large > scale > distributed systems. > > While designing a more flexible and scalable event logging systems, we think > it is better > to build the log in block layer. Block level event logging covers all major > storage media > (SCSI, SATA, NVMe), and thus minimizes redundant work for different > protocols. > > In this LSF/MM, we would like to discuss the following topics with the > community: > 1. Mechanism for drivers report events (or errors) to block layer. >Basically, we will need a traceable function for the drivers to report > errors >(most likely right before calling end_request or bio_endio). > > 2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event > logging? > > 3. How should we categorize different events? >Currently, there are existing code that translates ATA error > (ata_to_sense_error) >and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can >leverage SCSI Key Code Qualifier for event categorizations. > > 4. Detailed discussions on data structure for event logging. > > We will be able to show a prototype implementation during LSF/MM. I'd like to participate in this discussion. Bart.
[LSF/MM TOPIC] block level event logging for storage media management
Media health monitoring is very important for large scale distributed storage systems. Traditionally, enterprise storage controllers maintain event logs for attached storage devices. However, these controller-managed logs do not scale well for large scale distributed systems.

While designing a more flexible and scalable event logging system, we think it is better to build the log in the block layer. Block level event logging covers all major storage media (SCSI, SATA, NVMe), and thus minimizes redundant work for different protocols.

In this LSF/MM, we would like to discuss the following topics with the community:

1. Mechanisms for drivers to report events (or errors) to the block layer. Basically, we will need a traceable function for the drivers to report errors (most likely right before calling end_request or bio_endio).

2. What mechanism (ftrace, BPF, etc.) is preferred for the event logging?

3. How should we categorize different events? Currently, there is existing code that translates ATA errors (ata_to_sense_error) and NVMe errors (nvme_trans_status_code) to SCSI sense codes, so we can leverage the SCSI Key Code Qualifier for event categorization.

4. Detailed discussion of the data structure for event logging.

We will be able to show a prototype implementation during LSF/MM.

Thanks, Song
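As a strawman for topic 1, the traceable reporting function could be a tracepoint that drivers hit just before completing a failed request. The sketch below uses entirely hypothetical names (block_media_event and its fields are illustrative, not an existing kernel API); the sense_key field reflects the SCSI Key Code Qualifier idea from topic 3:

```c
/* Hypothetical tracepoint for block-level media events (sketch only). */
TRACE_EVENT(block_media_event,
	TP_PROTO(struct gendisk *disk, sector_t sector,
		 unsigned int nr_sect, int sense_key),
	TP_ARGS(disk, sector, nr_sect, sense_key),
	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(sector_t, sector)
		__field(unsigned int, nr_sect)
		__field(int, sense_key)	/* e.g. a SCSI Key Code Qualifier */
	),
	TP_fast_assign(
		__entry->dev = disk_devt(disk);
		__entry->sector = sector;
		__entry->nr_sect = nr_sect;
		__entry->sense_key = sense_key;
	),
	TP_printk("dev %d:%d sector %llu+%u key %d",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long long)__entry->sector,
		  __entry->nr_sect, __entry->sense_key)
);
```

A tracepoint of this shape would answer topic 2 as well: it is consumable from ftrace directly and attachable from BPF, without committing the block layer to a particular log store.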
Re: [PATCH for-4.10] blk-mq: Remove unused variable
On Wed, Jan 18, 2017 at 02:21:48PM -0800, Jens Axboe wrote: > On 01/18/2017 02:16 PM, Jens Axboe wrote: > > On 01/18/2017 02:21 PM, Keith Busch wrote: > >> Signed-off-by: Keith Busch > >> Reviewed-by: Christoph Hellwig > >> Reviewed-by: Sagi Grimberg > > > > Does it cause a warning anywhere? If not, I'd rather not apply > > it, since it'll cause a merge conflict for 4.11 (and the problem > > doesn't exist there). > > Ah screw it, the fact that we have that in 4.10 is an eye sore. > I'd rather deal with the conflict, I'll apply it for 4.10. Thanks > Keith. Okay, thanks. I was just clearing my backlog of emails and this was on my todo list from a couple of weeks ago, but I hadn't noticed that 4.11 corrected this in the mq scheduler set.
Re: [PATCH for-4.10] blk-mq: Remove unused variable
On 01/18/2017 02:16 PM, Jens Axboe wrote: > On 01/18/2017 02:21 PM, Keith Busch wrote: >> Signed-off-by: Keith Busch >> Reviewed-by: Christoph Hellwig >> Reviewed-by: Sagi Grimberg > > Does it cause a warning anywhere? If not, I'd rather not apply > it, since it'll cause a merge conflict for 4.11 (and the problem > doesn't exist there). Ah screw it, the fact that we have that in 4.10 is an eye sore. I'd rather deal with the conflict, I'll apply it for 4.10. Thanks Keith. -- Jens Axboe
Re: [PATCH for-4.10] blk-mq: Remove unused variable
On 01/18/2017 02:21 PM, Keith Busch wrote: > Signed-off-by: Keith Busch > Reviewed-by: Christoph Hellwig > Reviewed-by: Sagi Grimberg Does it cause a warning anywhere? If not, I'd rather not apply it, since it'll cause a merge conflict for 4.11 (and the problem doesn't exist there). -- Jens Axboe
[PATCH for-4.10] blk-mq: Remove unused variable
Signed-off-by: Keith Busch
Reviewed-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
---
 block/blk-mq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a8e67a1..c3400b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -912,7 +912,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 {
 	LIST_HEAD(rq_list);
-	LIST_HEAD(driver_list);
 
 	if (unlikely(blk_mq_hctx_stopped(hctx)))
 		return;
-- 
2.5.5
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote: > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong > wrote: > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote: > > > Jan Kara writes: > > > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote: > > > > > Your note on the online repair does raise another tangentially > > > > > related > > > > > topic. Currently, if there are badblocks, writes via the bio > > > > > submission > > > > > path will clear the error (if the hardware is able to remap > > > > > the bad > > > > > locations). However, if the filesystem is mounted with DAX, > > > > > even > > > > > non-mmap operations - read() and write() will go through the > > > > > dax paths > > > > > (dax_do_io()). We haven't found a good/agreeable way to > > > > > perform > > > > > error-clearing in this case. So currently, if a dax mounted > > > > > filesystem > > > > > has badblocks, the only way to clear those badblocks is to > > > > > mount it > > > > > without DAX, and overwrite/zero the bad locations. This is a > > > > > pretty > > > > > terrible user experience, and I'm hoping this can be solved in > > > > > a better > > > > > way. > > > > > > > > Please remind me, what is the problem with DAX code doing > > > > necessary work to > > > > clear the error when it gets EIO from memcpy on write? > > > > > > You won't get an MCE for a store; only loads generate them. > > > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with > > > -o dax? > > > > Not necessarily; XFS usually implements this by punching out the > > range > > and then reallocating it as unwritten blocks. > > > > That does clear the error because the unwritten blocks are zeroed and > errors cleared when they become allocated again. Yes, the problem was that writes won't clear errors. Zeroing through hole-punch, truncate, or unlinking the file should all work (assuming the hole-punch or truncate range wholly contains the 'badblock' sector).
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong wrote: > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote: >> Jan Kara writes: >> >> > On Tue 17-01-17 15:14:21, Vishal Verma wrote: >> >> Your note on the online repair does raise another tangentially related >> >> topic. Currently, if there are badblocks, writes via the bio submission >> >> path will clear the error (if the hardware is able to remap the bad >> >> locations). However, if the filesystem is mounted with DAX, even >> >> non-mmap operations - read() and write() will go through the dax paths >> >> (dax_do_io()). We haven't found a good/agreeable way to perform >> >> error-clearing in this case. So currently, if a dax mounted filesystem >> >> has badblocks, the only way to clear those badblocks is to mount it >> >> without DAX, and overwrite/zero the bad locations. This is a pretty >> >> terrible user experience, and I'm hoping this can be solved in a better >> >> way. >> > >> > Please remind me, what is the problem with DAX code doing necessary work to >> > clear the error when it gets EIO from memcpy on write? >> >> You won't get an MCE for a store; only loads generate them. >> >> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax? > > Not necessarily; XFS usually implements this by punching out the range > and then reallocating it as unwritten blocks. > That does clear the error because the unwritten blocks are zeroed and errors cleared when they become allocated again.
Re: [PATCH 1/2] sbitmap: use smp_mb__after_atomic() in sbq_wake_up()
On 01/18/2017 11:55 AM, Omar Sandoval wrote: > From: Omar Sandoval > > We always do an atomic clear_bit() right before we call sbq_wake_up(), > so we can use smp_mb__after_atomic(). While we're here, comment the > memory barriers in here a little more. Thanks Omar, applied 1-2. -- Jens Axboe
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Jan Kara writes: > On Tue 17-01-17 15:14:21, Vishal Verma wrote: >> Your note on the online repair does raise another tangentially related >> topic. Currently, if there are badblocks, writes via the bio submission >> path will clear the error (if the hardware is able to remap the bad >> locations). However, if the filesystem is mounted with DAX, even >> non-mmap operations - read() and write() will go through the dax paths >> (dax_do_io()). We haven't found a good/agreeable way to perform >> error-clearing in this case. So currently, if a dax mounted filesystem >> has badblocks, the only way to clear those badblocks is to mount it >> without DAX, and overwrite/zero the bad locations. This is a pretty >> terrible user experience, and I'm hoping this can be solved in a better >> way. > > Please remind me, what is the problem with DAX code doing necessary work to > clear the error when it gets EIO from memcpy on write? You won't get an MCE for a store; only loads generate them. Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax? Cheers, Jeff
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
Slava Dubeyko writes: >> Well, the situation with NVM is more like with DRAM AFAIU. It is quite >> reliable >> but given the size the probability *some* cell has degraded is quite high. >> And similar to DRAM you'll get MCE (Machine Check Exception) when you try >> to read such cell. As Vishal wrote, the hardware does some background >> scrubbing >> and relocates stuff early if needed but nothing is 100%. > > My understanding that hardware does the remapping the affected address > range (64 bytes, for example) but it doesn't move/migrate the stored > data in this address range. So, it sounds slightly weird. Because it > means that no guarantee to retrieve the stored data. It sounds that > file system should be aware about this and has to be heavily protected > by some replication or erasure coding scheme. Otherwise, if the > hardware does everything for us (remap the affected address region and > move data into a new address region) then why does file system need to > know about the affected address regions? The data is lost, that's why you're getting an ECC. It's tantamount to -EIO for a disk block access. >> The reason why we play games with badblocks is to avoid those MCEs >> (i.e., even trying to read the data we know that are bad). Even if it would >> be rare event, MCE may mean the machine just immediately reboots >> (although I find such platforms hardly usable with NVM then) and that >> is no good. And even on hardware platforms that allow for more graceful >> recovery from MCE it is asynchronous in its nature and our error handling >> around IO is all synchronous so it is difficult to join these two models >> together. >> >> But I think it is a good question to ask whether we cannot improve on MCE >> handling >> instead of trying to avoid them and pushing around responsibility for >> handling >> bad blocks. Actually I thought someone was working on that. >> Cannot we e.g. 
wrap in-kernel accesses to persistent memory (those are now >> well identified anyway so that we can consult the badblocks list) so that if >> an MCE >> happens during these accesses, we note it somewhere and at the end of the >> magic >> block we will just pick up the errors and report them back? > > Let's imagine that the affected address range will equal to 64 bytes. It > sounds for me > that for the case of block device it will affect the whole logical > block (4 KB). 512 bytes, and yes, that's the granularity at which we track errors in the block layer, so that's the minimum amount of data you lose. > If the failure rate of address ranges could be significant then it > would affect a lot of logical blocks. Who would buy hardware like that? > The situation is more critical for the case of DAX approach. Correct > me if I wrong but my understanding is the goal of DAX is to provide > the direct access to file's memory pages with minimal file system > overhead. So, it looks like that raising bad block issue on file > system level will affect a user-space application. Because, finally, > user-space application will need to process such trouble (bad block > issue). It sounds for me as really weird situation. What can protect a > user-space application from encountering the issue with partially > incorrect memory page? Applications need to deal with -EIO today. This is the same sort of thing. If an application trips over a bad block during a load from persistent memory, they will get a signal, and they can either handle it or not. Have a read through this specification and see if it clears anything up for you: http://www.snia.org/tech_activities/standards/curr_standards/npm Cheers, Jeff
[PATCH 1/2] sbitmap: use smp_mb__after_atomic() in sbq_wake_up()
From: Omar Sandoval

We always do an atomic clear_bit() right before we call sbq_wake_up(), so we can use smp_mb__after_atomic(). While we're here, comment the memory barriers in here a little more.

Signed-off-by: Omar Sandoval
---
 lib/sbitmap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2cecf05c82fd..df4e472df8a3 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -299,8 +299,14 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
 	struct sbq_wait_state *ws;
 	int wait_cnt;
 
-	/* Ensure that the wait list checks occur after clear_bit(). */
-	smp_mb();
+	/*
+	 * Pairs with the memory barrier in set_current_state() to ensure the
+	 * proper ordering of clear_bit()/waitqueue_active() in the waker and
+	 * test_and_set_bit()/prepare_to_wait()/finish_wait() in the waiter. See
+	 * the comment on waitqueue_active(). This is __after_atomic because we
+	 * just did clear_bit() in the caller.
+	 */
+	smp_mb__after_atomic();
 
 	ws = sbq_wake_ptr(sbq);
 	if (!ws)
@@ -331,7 +337,8 @@ void sbitmap_queue_wake_all(struct sbitmap_queue *sbq)
 	int i, wake_index;
 
 	/*
-	 * Make sure all changes prior to this are visible from other CPUs.
+	 * Pairs with the memory barrier in set_current_state() like in
+	 * sbq_wake_up().
 	 */
 	smp_mb();
 
 	wake_index = atomic_read(&sbq->wake_index);
-- 
2.11.0
[PATCH 2/2] sbitmap: fix wakeup hang after sbq resize
From: Omar Sandoval

When we resize a struct sbitmap_queue, we update the wakeup batch size, but we don't update the wait count in the struct sbq_wait_states. If we resized down from a size which could use a bigger batch size, these counts could be too large and cause us to miss necessary wakeups. To fix this, update the wait counts when we resize (ensuring some careful memory ordering so that it's safe w.r.t. concurrent clears).

This also fixes a theoretical issue where two threads could end up bumping the wait count up by the batch size, which could also potentially lead to hangs.

Reported-by: Martin Raiber
Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
Fixes: 2971c35f3588 ("blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt")
Signed-off-by: Omar Sandoval
---
 lib/sbitmap.c | 35 ++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index df4e472df8a3..8f5c3b268c77 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -239,7 +239,19 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
 
 void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth)
 {
-	sbq->wake_batch = sbq_calc_wake_batch(depth);
+	unsigned int wake_batch = sbq_calc_wake_batch(depth);
+	int i;
+
+	if (sbq->wake_batch != wake_batch) {
+		WRITE_ONCE(sbq->wake_batch, wake_batch);
+		/*
+		 * Pairs with the memory barrier in sbq_wake_up() to ensure that
+		 * the batch size is updated before the wait counts.
+		 */
+		smp_mb__before_atomic();
+		for (i = 0; i < SBQ_WAIT_QUEUES; i++)
+			atomic_set(&sbq->ws[i].wait_cnt, 1);
+	}
 	sbitmap_resize(&sbq->sb, depth);
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_resize);
@@ -297,6 +309,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 static void sbq_wake_up(struct sbitmap_queue *sbq)
 {
 	struct sbq_wait_state *ws;
+	unsigned int wake_batch;
 	int wait_cnt;
 
 	/*
@@ -313,10 +326,22 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
 		return;
 
 	wait_cnt = atomic_dec_return(&ws->wait_cnt);
-	if (unlikely(wait_cnt < 0))
-		wait_cnt = atomic_inc_return(&ws->wait_cnt);
-	if (wait_cnt == 0) {
-		atomic_add(sbq->wake_batch, &ws->wait_cnt);
+	if (wait_cnt <= 0) {
+		wake_batch = READ_ONCE(sbq->wake_batch);
+		/*
+		 * Pairs with the memory barrier in sbitmap_queue_resize() to
+		 * ensure that we see the batch size update before the wait
+		 * count is reset.
+		 */
+		smp_mb__before_atomic();
+		/*
+		 * If there are concurrent callers to sbq_wake_up(), the last
+		 * one to decrement the wait count below zero will bump it back
+		 * up. If there is a concurrent resize, the count reset will
+		 * either cause the cmpxchg to fail or overwrite after the
+		 * cmpxchg.
+		 */
+		atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wait_cnt + wake_batch);
 		sbq_index_atomic_inc(&sbq->wake_index);
 		wake_up(&ws->wait);
 	}
-- 
2.11.0
Re: [LSF/MM TOPIC] Future direction of DAX
On Tue, Jan 17, 2017 at 09:25:33PM -0800, wi...@bombadil.infradead.org wrote:
> On Fri, Jan 13, 2017 at 05:20:08PM -0700, Ross Zwisler wrote:
> > We still have a lot of work to do, though, and I'd like to propose a
> > discussion around what features people would like to see enabled in
> > the coming year as well as what use cases their customers have that
> > we might not be aware of.
> 
> +1 to the discussion
> 
> > - Jan suggested [2] that we could use the radix tree as a cache to
> > service DAX faults without needing to call into the filesystem. Are
> > there any issues with this approach, and should we move forward with
> > it as an optimization?
> 
> Ahem. I believe I proposed this at last year's LSFMM. And I sent
> patches to start that work. And Dan blocked it. So I'm not terribly
> amused to see somebody else given credit for the idea.
> 
> It's not just an optimisation. It's also essential for supporting
> filesystems which don't have block devices. I'm aware of at least two
> customer demands for this in different domains.
> 
> 1. Embedded uses with NOR flash
> 2. Cloud/virt uses with multiple VMs on a single piece of hardware

Yea, I didn't mean the full move to having PFNs in the tree, just using
the sector number in the radix tree instead of calling into the
filesystem. My apologies if you feel I didn't give you proper credit.
Re: [PATCHSET v4] blk-mq-scheduling framework
On 01/18/2017 08:14 AM, Paolo Valente wrote:
> according to the function blk_mq_sched_put_request, the
> mq.completed_request hook seems to always be invoked (if set) for a
> request for which the mq.put_rq_priv is invoked (if set).

Correct, any request that came out of blk_mq_sched_get_request() will
always have completed called on it, regardless of whether it had IO
started on it or not.

-- 
Jens Axboe
Re: [PATCHSET v4] blk-mq-scheduling framework
> Il giorno 17 gen 2017, alle ore 11:49, Paolo Valente > ha scritto: > > [NEW RESEND ATTEMPT] > >> Il giorno 17 gen 2017, alle ore 03:47, Jens Axboe ha scritto: >> >> On 12/22/2016 08:28 AM, Paolo Valente wrote: >>> Il giorno 19 dic 2016, alle ore 22:05, Jens Axboe ha scritto: On 12/19/2016 11:21 AM, Paolo Valente wrote: > >> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe ha >> scritto: >> >> On 12/19/2016 04:32 AM, Paolo Valente wrote: >>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe ha scritto: This is version 4 of this patchset, version 3 was posted here: https://marc.info/?l=linux-block&m=148178513407631&w=2 From the discussion last time, I looked into the feasibility of having two sets of tags for the same request pool, to avoid having to copy some of the request fields at dispatch and completion time. To do that, we'd have to replace the driver tag map(s) with our own, and augment that with tag map(s) on the side representing the device queue depth. Queuing IO with the scheduler would allocate from the new map, and dispatching would acquire the "real" tag. We would need to change drivers to do this, or add an extra indirection table to map a real tag to the scheduler tag. We would also need a 1:1 mapping between scheduler and hardware tag pools, or additional info to track it. Unless someone can convince me otherwise, I think the current approach is cleaner. I wasn't going to post v4 so soon, but I discovered a bug that led to drastically decreased merging. Especially on rotating storage, this release should be fast, and on par with the merging that we get through the legacy schedulers. >>> >>> I'm to modifying bfq. You mentioned other missing pieces to come. Do >>> you already have an idea of what they are, so that I am somehow >>> prepared to what won't work even if my changes are right? >> >> I'm mostly talking about elevator ops hooks that aren't there in the new >> framework, but exist in the old one. 
There should be no hidden >> surprises, if that's what you are worried about. >> >> On the ops side, the only ones I can think of are the activate and >> deactivate, and those can be done in the dispatch_request hook for >> activate, and put/requeue for deactivate. >> > > You mean that there is no conceptual problem in moving the code of the > activate interface function into the dispatch function, and the code > of the deactivate into the put_request? (for a requeue it is a little > less clear to me, so one step at a time) Or am I missing > something more complex? Yes, what I mean is that there isn't a 1:1 mapping between the old ops and the new ops. So you'll have to consider the cases. >>> >>> Problem: whereas it seems easy and safe to do somewhere else the >>> simple increment that was done in activate_request, I wonder if it may >>> happen that a request is deactivate before being completed. In it may >>> happen, then, without a deactivate_request hook, the increments would >>> remain unbalanced. Or are request completions always guaranteed till >>> no hw/sw components breaks? >> >> You should be able to do it in get/put_request. But you might need some >> extra tracking, I'd need to double check. > > Exactly, AFAICT something extra is apparently needed. In particular, > get is not ok, because dispatch is a different event (but dispatch is > however an already controlled event), while put could be used, > provided that it is guaranteed to be executed only after dispatch. If > it is not, then I think that an extra flag or something should be > added to the request. I don't know whether adding this extra piece > would be worst than adding an extra hook. > >> >> I'm trying to avoid adding >> hooks that we don't truly need, the old interface had a lot of that. If >> you find that you need a hook and it isn't there, feel free to add it. >> activate/deactivate might be a good change. 
>> 
> 
> If my comments above do not trigger any proposal of a better solution,
> then I will try by adding only one extra 'deactivate' hook. Unless
> unbalanced hooks are a bad idea too.
> 

Jens, according to the function blk_mq_sched_put_request, the
mq.completed_request hook seems to always be invoked (if set) for a
request for which the mq.put_rq_priv is invoked (if set). If you don't
warn me that I'm wrong, I will base on the above assumption, and
complete bfq without any additional hook or flag.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> -- 
>> Jens Axboe
Re: [patch] block: add blktrace C events for bio-based drivers
Hi, Jens,

Jens Axboe  writes:
> I like the change, hate the naming. I'd prefer one of two things:
> 
> - Add bio_endio_complete() instead. That name sucks too, the
>   important part is flipping the __name() to have a trace
>   version instead.

ITYM a notrace version. By default, we want tracing for bio_endio. The
only callers that need the inverse of that are in the request-based
path, and there are only 2 of them.

> - Mark the bio as trace completed, and keep the naming. Since
>   it's only off the completion path, that can be just marking
>   the bi_flags non-atomically.

One issue with this is in generic_make_request_checks, where we can call
bio_endio without having called trace_block_bio_queue (so you could get
a C event with no corresponding Q). To address that, we could make the
flag indicate that trace_block_bio_queue was performed, and clear it in
bio_complete, like so:

	if (test_and_clear_bit(BIO_QUEUE_TRACED, &bio->bi_flags))
		trace_block_bio_complete(...);

That would solve the problem of duplicate completions, but requires
setting the flag in the submission path and clearing it in the
completion path. I think the former can be done with just a bio_set_flag
(i.e. non-atomic), right? Of course, where to stick that bio_set_flag
call is another bike-shedding discussion waiting to happen (i.e. does it
go in the tracepoint itself?).

Alternatively, we could set the trace_completed flag in the paths where
we end I/O without having done the trace_block_bio_queue, but that seems
way uglier to me. Can you think of any other options? If we're choosing
from the above, my preference is for adding the bio_endio_notrace(),
since it's so much simpler.

Cheers,
Jeff
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
On 01/18/2017 04:16 PM, Johannes Thumshirn wrote:
> On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
>>
>>> Hannes just spotted this:
>>> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>> 		const struct blk_mq_queue_data *bd)
>>> {
>>> [...]
>>> 	__nvme_submit_cmd(nvmeq, &cmnd);
>>> 	nvme_process_cq(nvmeq);
>>> 	spin_unlock_irq(&nvmeq->q_lock);
>>> 	return BLK_MQ_RQ_QUEUE_OK;
>>> out_cleanup_iod:
>>> 	nvme_free_iod(dev, req);
>>> out_free_cmd:
>>> 	nvme_cleanup_cmd(req);
>>> 	return ret;
>>> }
>>>
>>> So we're draining the CQ on submit. This of course makes polling for
>>> completions in the IRQ handler rather pointless as we already did in the
>>> submission path.
>>
>> I think you missed:
>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
> 
> I indeed did, thanks.

But it doesn't help. We're still having to wait for the first interrupt,
and if we're really fast that's the only completion we have to process.
Try this:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4b32e6..e2dd9e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
 	spin_unlock(&nvmeq->sq_lock);
+	disable_irq_nosync(nvmeq_irq(irq));
+	irq_poll_sched(&nvmeq->iop);
 	return BLK_MQ_RQ_QUEUE_OK;
 out_cleanup_iod:
 	nvme_free_iod(dev, req);

That should avoid the first interrupt, and with a bit of luck reduce the
number of interrupts _drastically_.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
On Wed, Jan 18, 2017 at 5:40 PM, Sagi Grimberg  wrote:
>
>> Your report provided these stats with one-completion dominance for the
>> single-threaded case. Does it also hold if you run multiple fio
>> threads per core?
>
> It's useless to run more threads on that core, it's already fully
> utilized. That single thread is already posting a fair amount of
> submissions, so I don't see how adding more fio jobs can help in any
> way.

With a single thread, your completion processing/submission is
completely serialized. From my experience, it takes fio a couple of
microseconds to process a completion and submit the next request, and
that's (much) larger than your interrupt processing time.

Regards,
Andrey
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
> 
> > Hannes just spotted this:
> > static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> > 		const struct blk_mq_queue_data *bd)
> > {
> > [...]
> > 	__nvme_submit_cmd(nvmeq, &cmnd);
> > 	nvme_process_cq(nvmeq);
> > 	spin_unlock_irq(&nvmeq->q_lock);
> > 	return BLK_MQ_RQ_QUEUE_OK;
> > out_cleanup_iod:
> > 	nvme_free_iod(dev, req);
> > out_free_cmd:
> > 	nvme_cleanup_cmd(req);
> > 	return ret;
> > }
> > 
> > So we're draining the CQ on submit. This of course makes polling for
> > completions in the IRQ handler rather pointless as we already did in
> > the submission path.
> 
> I think you missed:
> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007

I indeed did, thanks.

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
> Hannes just spotted this:
> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> 		const struct blk_mq_queue_data *bd)
> {
> [...]
> 	__nvme_submit_cmd(nvmeq, &cmnd);
> 	nvme_process_cq(nvmeq);
> 	spin_unlock_irq(&nvmeq->q_lock);
> 	return BLK_MQ_RQ_QUEUE_OK;
> out_cleanup_iod:
> 	nvme_free_iod(dev, req);
> out_free_cmd:
> 	nvme_cleanup_cmd(req);
> 	return ret;
> }
> 
> So we're draining the CQ on submit. This of course makes polling for
> completions in the IRQ handler rather pointless as we already did in
> the submission path.

I think you missed:
http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
On Wed, Jan 18, 2017 at 04:27:24PM +0200, Sagi Grimberg wrote:
> 
> > So what you say is you saw a consomed == 1 [1] most of the time?
> >
> > [1] from
> > http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836
> 
> Exactly. By processing 1 completion per interrupt it makes perfect sense
> why this performs poorly, it's not worth paying the soft-irq schedule
> for only a single completion.
> 
> What I'm curious is how consistent is this with different devices (wish
> I had some...)

Hannes just spotted this:

static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
[...]
	__nvme_submit_cmd(nvmeq, &cmnd);
	nvme_process_cq(nvmeq);
	spin_unlock_irq(&nvmeq->q_lock);
	return BLK_MQ_RQ_QUEUE_OK;
out_cleanup_iod:
	nvme_free_iod(dev, req);
out_free_cmd:
	nvme_cleanup_cmd(req);
	return ret;
}

So we're draining the CQ on submit. This of course makes polling for
completions in the IRQ handler rather pointless as we already did in the
submission path.

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
> Your report provided these stats with one-completion dominance for the
> single-threaded case. Does it also hold if you run multiple fio
> threads per core?

It's useless to run more threads on that core, it's already fully
utilized. That single thread is already posting a fair amount of
submissions, so I don't see how adding more fio jobs can help in any
way.
Re: kernel oops with blk-mq-sched latest
On 01/18/2017 03:48 AM, Hannes Reinecke wrote:
> Nearly there.
> You're missing a 'blk_mq_start_hw_queues(q)' after
> blk_mq_unfreeze_queue(); without it the queue will stall after switching
> the scheduler.

Yes indeed, forgot that. Needed after the quiesce.

> Also what's quite suspicious is this:
> 
> struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> 				    struct request_queue *q)
> {
> 	struct blkcg_gq *blkg;
> 
> 	WARN_ON_ONCE(!rcu_read_lock_held());
> 	lockdep_assert_held(q->queue_lock);
> 
> 	/*
> 	 * This could be the first entry point of blkcg implementation and
> 	 * we shouldn't allow anything to go through for a bypassing queue.
> 	 */
> 	if (unlikely(blk_queue_bypass(q)))
> 		return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
> 
> which now won't work as the respective flags aren't set anymore.
> Not sure if that's a problem, though.
> But you might want to look at that, too.

dying is still used on blk-mq, but yes, the bypass check should now be
frozen for blk-mq. Not really directly related to the above change, but
it should be fixed up.

> Nevertheless, with the mentioned modifications to your patch the crashes
> don't occur anymore.

Great

> Sad news is that it doesn't help _that_ much on spinning rust mpt3sas;
> there I still see a ~50% performance penalty on reads.
> Write's slightly better than sq performance, though.

What is the test case? Full details please, from hardware to what you
are running. As I've mentioned before, I don't necessarily think your
performance issues are related to scheduling. Would be nice to get to
the bottom of it, though. And for that, I need more details.

-- 
Jens Axboe
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
On Wed, Jan 18, 2017 at 5:27 PM, Sagi Grimberg  wrote:
>
>> So what you say is you saw a consomed == 1 [1] most of the time?
>>
>> [1] from
>> http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836
>
> Exactly. By processing 1 completion per interrupt it makes perfect sense
> why this performs poorly, it's not worth paying the soft-irq schedule
> for only a single completion.

Your report provided these stats with one-completion dominance for the
single-threaded case. Does it also hold if you run multiple fio threads
per core?

Regards,
Andrey

> What I'm curious is how consistent is this with different devices (wish
> I had some...)
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
> So what you say is you saw a consomed == 1 [1] most of the time?
> 
> [1] from
> http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

Exactly. By processing 1 completion per interrupt it makes perfect sense
why this performs poorly, it's not worth paying the soft-irq schedule
for only a single completion.

What I'm curious is how consistent is this with different devices (wish
I had some...)
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
On 01/17/2017 05:50 PM, Sagi Grimberg wrote:
> 
>> So it looks like we are super not efficient because most of the times
>> we catch 1 completion per interrupt and the whole point is that we
>> need to find more! This fio is single threaded with QD=32 so I'd
>> expect that we be somewhere in 8-31 almost all the time... I also
>> tried QD=1024, histogram is still the same.
>>
>> It looks like it takes you longer to submit an I/O than to service an
>> interrupt,
> 
> Well, with irq-poll we do practically nothing in the interrupt handler,
> only schedule irq-poll. Note that the latency measures are only from
> the point the interrupt arrives and the point we actually service it
> by polling for completions.
> 
>> so increasing queue depth in the single-threaded case doesn't
>> make much difference. You might want to try multiple threads per core
>> with QD, say, 32
> 
> This is how I ran, QD=32.

The one thing which I found _really_ curious is this:

  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued    : total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=256

(note the lines starting with 'submit' and 'complete'). They are
_always_ 4, irrespective of the hardware and/or tests which I run.

Jens, what are these numbers supposed to mean? Is this intended? ATM the
information content from those two lines is essentially 0, seeing that
they never change irrespective of the tests I'm doing. (And which fio
version I'm using ...)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
On Tue, Jan 17, 2017 at 06:38:43PM +0200, Sagi Grimberg wrote:
> 
> > Just for the record, all tests you've run are with the upper
> > irq_poll_budget of 256 [1]?
> 
> Yes, but that's the point, I never ever reach this budget because
> I'm only processing 1-2 completions per interrupt.
> 
> > We (Hannes and me) recently stumbed accross this when trying to poll
> > for more than 256 queue entries in the drivers we've been testing.

s/stumbed/stumbled/

> 
> What do you mean by stumbed? irq-poll should be agnostic to the fact
> that drivers can poll more than their given budget?

So what you say is you saw a consomed == 1 [1] most of the time?

[1] from
http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: kernel oops with blk-mq-sched latest
On 01/17/2017 02:00 PM, Jens Axboe wrote: > On 01/17/2017 04:47 AM, Jens Axboe wrote: >> On 01/17/2017 12:57 AM, Hannes Reinecke wrote: >>> Hi Jens, >>> >>> I gave your latest patchset from >>> >>> git.kernel.dk/linux-block blk-mq-sched >>> >>> I see a kernel oops when shutting down: >>> >>> [ 2132.708929] systemd-shutdown[1]: Detaching DM devices. >>> [ 2132.965107] BUG: unable to handle kernel NULL pointer dereference at >>> >>> 0001 >>> [ 2133.037182] IP: dd_merged_requests+0x6/0x60 >>> [ 2133.077816] PGD 0 >>> [ 2133.077818] >>> [ 2133.113087] Oops: [#1] SMP >>> [ list of modules removed ] >>> [ 2133.925265] CPU: 20 PID: 1 Comm: systemd-shutdow Not tainted >>> 4.10.0-rc4+ #543 >>> [ 2133.990034] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 09/18/2013 >>> [ 2134.050522] task: 88042d614040 task.stack: c9000315 >>> [ 2134.106915] RIP: 0010:dd_merged_requests+0x6/0x60 >>> [ 2134.150593] RSP: 0018:c90003153b18 EFLAGS: 00010002 >>> [ 2134.198740] RAX: 81cc6de0 RBX: 8804296d5040 RCX: >>> 0001 >>> [ 2134.262708] RDX: 0001 RSI: 0001 RDI: >>> 8804296d5040 >>> [ 2134.326987] RBP: c90003153b30 R08: R09: >>> >>> [ 2134.391054] R10: R11: 0001f8180001f815 R12: >>> >>> [ 2134.456095] R13: 8804296d5040 R14: 8804099801f0 R15: >>> 0004 >>> [ 2134.521196] FS: 7fd64d3bf840() GS:88042f90() >>> knlGS: >>> [ 2134.595178] CS: 0010 DS: ES: CR0: 80050033 >>> [ 2134.648637] CR2: 0001 CR3: 00081b892000 CR4: >>> 000406e0 >>> [ 2134.713349] Call Trace: >>> [ 2134.737168] ? 
elv_drain_elevator+0x29/0xa0 >>> [ 2134.775821] __blk_drain_queue+0x52/0x1a0 >>> [ 2134.812473] blk_queue_bypass_start+0x6e/0xa0 >>> [ 2134.854009] blkcg_deactivate_policy+0x30/0xf0 >>> [ 2134.894993] blk_throtl_exit+0x34/0x50 >>> [ 2134.929450] blkcg_exit_queue+0x35/0x40 >>> [ 2134.965089] blk_release_queue+0x33/0xe0 >>> [ 2135.001364] kobject_cleanup+0x63/0x170 >>> [ 2135.037412] kobject_put+0x25/0x50 >>> [ 2135.068622] blk_cleanup_queue+0x198/0x260 >>> [ 2135.107780] cleanup_mapped_device+0xb5/0xf0 [dm_mod] >>> [ 2135.154741] __dm_destroy+0x1a5/0x290 [dm_mod] >>> [ 2135.195009] dm_destroy+0x13/0x20 [dm_mod] >>> [ 2135.232149] dev_remove+0xde/0x120 [dm_mod] >>> [ 2135.270184] ? dev_suspend+0x210/0x210 [dm_mod] >>> [ 2135.311478] ctl_ioctl+0x20b/0x510 [dm_mod] >>> [ 2135.349680] ? terminate_walk+0xc3/0x140 >>> [ 2135.385442] ? kmem_cache_free+0x10/0x260 >>> [ 2135.422315] dm_ctl_ioctl+0x13/0x20 [dm_mod] >>> >>> Known issue? >> >> Try with this, looks like we're calling the old bypass path for the new >> path, which is now triggering because q->elevator is true. > > Should have grepped, there's one more path that uses the old bypass > path for q->mq_ops, potentially. This one handles that one, too. 
> > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > index 8ba0af780e88..2630f64bed19 100644 > --- a/block/blk-cgroup.c > +++ b/block/blk-cgroup.c > @@ -1223,7 +1223,11 @@ int blkcg_activate_policy(struct request_queue *q, > if (blkcg_policy_enabled(q, pol)) > return 0; > > - blk_queue_bypass_start(q); > + if (q->mq_ops) { > + blk_mq_freeze_queue(q); > + blk_mq_quiesce_queue(q); > + } else > + blk_queue_bypass_start(q); > pd_prealloc: > if (!pd_prealloc) { > pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q->node); > @@ -1261,7 +1265,10 @@ int blkcg_activate_policy(struct request_queue *q, > > spin_unlock_irq(q->queue_lock); > out_bypass_end: > - blk_queue_bypass_end(q); > + if (q->mq_ops) > + blk_mq_unfreeze_queue(q); > + else > + blk_queue_bypass_end(q); > if (pd_prealloc) > pol->pd_free_fn(pd_prealloc); > return ret; > @@ -1284,7 +1291,12 @@ void blkcg_deactivate_policy(struct request_queue *q, > if (!blkcg_policy_enabled(q, pol)) > return; > > - blk_queue_bypass_start(q); > + if (q->mq_ops) { > + blk_mq_freeze_queue(q); > + blk_mq_quiesce_queue(q); > + } else > + blk_queue_bypass_start(q); > + > spin_lock_irq(q->queue_lock); > > __clear_bit(pol->plid, q->blkcg_pols); > @@ -1304,7 +1316,11 @@ void blkcg_deactivate_policy(struct request_queue *q, > } > > spin_unlock_irq(q->queue_lock); > - blk_queue_bypass_end(q); > + > + if (q->mq_ops) > + blk_mq_unfreeze_queue(q); > + else > + blk_queue_bypass_end(q); > } > EXPORT_SYMBOL_GPL(blkcg_deactivate_policy); > > diff --git a/block/elevator.c b/block/elevator.c > index d14cb87e6564..464372840774 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -613,6 +613,9 @@ void elv_drain_elevator(struct request_queue *q) > { > static int prin
Re: [Lsf-pc] [LSF/MM ATTEND] Un-addressable device memory and block/fs implications
On Fri 16-12-16 08:44:11, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> 
> > I would like to discuss un-addressable device memory in the context of
> > filesystem and block device. Specifically how to handle write-back,
> > read, ... when a filesystem page is migrated to device memory that CPU
> > can not access.
> >
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this is
> > worth discussing during summit to see how people feel about such a
> > plan and if they have better ideas.
> >
> > I also like to join discussions on:
> >   - Peer-to-Peer DMAs between PCIe devices
> >   - CDM coherent device memory
> >   - PMEM
> >   - overall mm discussions
> 
> I would like to attend this discussion. I can talk about coherent device
> memory and how having HMM handle that will make it easy to have one
> interface for device driver. For Coherent device case we definitely need
> page cache migration support.

Aneesh, did you intend this as your request to attend? You posted it as
a reply to another email so it is not really clear. Note that each
attend request should be a separate email so that it does not get
lost...

								Honza
-- 
Jan Kara 
SUSE Labs, CR
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> Your note on the online repair does raise another tangentially related
> topic. Currently, if there are badblocks, writes via the bio submission
> path will clear the error (if the hardware is able to remap the bad
> locations). However, if the filesystem is mounted with DAX, even
> non-mmap operations - read() and write() will go through the dax paths
> (dax_do_io()). We haven't found a good/agreeable way to perform
> error-clearing in this case. So currently, if a dax mounted filesystem
> has badblocks, the only way to clear those badblocks is to mount it
> without DAX, and overwrite/zero the bad locations. This is a pretty
> terrible user experience, and I'm hoping this can be solved in a better
> way.

Please remind me, what is the problem with DAX code doing necessary work
to clear the error when it gets EIO from memcpy on write?

								Honza
-- 
Jan Kara 
SUSE Labs, CR
[PATCH] genhd: Do not hold event lock when scheduling workqueue elements
When scheduling workqueue elements the callback function might be called directly, so holding the event lock is potentially dangerous as it might lead to a deadlock:

[ 989.542827] INFO: task systemd-udevd:459 blocked for more than 480 seconds.
[ 989.609721]       Not tainted 4.10.0-rc4+ #546
[ 989.648545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 989.716429] systemd-udevd D13368 459 1 0x0004
[ 989.716435] Call Trace:
[ 989.716444]  __schedule+0x2f2/0xb10
[ 989.716447]  schedule+0x3d/0x90
[ 989.716449]  schedule_timeout+0x2fc/0x600
[ 989.716451]  ? wait_for_completion+0xac/0x110
[ 989.716456]  ? mark_held_locks+0x66/0x90
[ 989.716458]  ? _raw_spin_unlock_irq+0x2c/0x40
[ 989.716460]  ? trace_hardirqs_on_caller+0x111/0x1e0
[ 989.716461]  wait_for_completion+0xb4/0x110
[ 989.716464]  ? wake_up_q+0x80/0x80
[ 989.716469]  flush_work+0x1ea/0x2a0
[ 989.716470]  ? flush_work+0x24e/0x2a0
[ 989.716472]  ? destroy_worker+0xd0/0xd0
[ 989.716474]  __cancel_work_timer+0x11a/0x1e0
[ 989.716476]  ? trace_hardirqs_on_caller+0x111/0x1e0
[ 989.716477]  cancel_delayed_work_sync+0x13/0x20
[ 989.716482]  disk_block_events+0x82/0x90
[ 989.716487]  __blkdev_get+0x58/0x450
[ 989.716488]  blkdev_get+0x1ce/0x340
[ 989.716490]  ? _raw_spin_unlock+0x27/0x40
[ 989.716492]  blkdev_open+0x5b/0x70
[ 989.716501]  do_dentry_open+0x213/0x310
[ 989.716505]  ? blkdev_get_by_dev+0x50/0x50
[ 989.716507]  vfs_open+0x4f/0x80
[ 989.716518]  ? may_open+0x9b/0x100
[ 989.716521]  path_openat+0x48a/0xdc0
[ 989.716527]  ? _crng_backtrack_protect+0x30/0x80
[ 989.716530]  do_filp_open+0x7e/0xd0
[ 989.716533]  ? _raw_spin_unlock+0x27/0x40
[ 989.716537]  ? __alloc_fd+0xf7/0x210
[ 989.716539]  do_sys_open+0x115/0x1f0
[ 989.716542]  SyS_open+0x1e/0x20
[ 989.716546]  entry_SYSCALL_64_fastpath+0x23/0xc6

Signed-off-by: Hannes Reinecke

diff --git a/block/genhd.c b/block/genhd.c
index fcd6d4f..ae46caa 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1426,8 +1426,7 @@ struct disk_events {
 	struct gendisk	*disk;		/* the associated disk */
 	spinlock_t	lock;
-	struct mutex	block_mutex;	/* protects blocking */
-	int		block;		/* event blocking depth */
+	atomic_t	block;		/* event blocking depth */
 
 	unsigned int	pending;	/* events already sent out */
 	unsigned int	clearing;	/* events being cleared */
@@ -1488,26 +1487,13 @@ static unsigned long disk_events_poll_jiffies(struct gendisk *disk)
 void disk_block_events(struct gendisk *disk)
 {
 	struct disk_events *ev = disk->ev;
-	unsigned long flags;
-	bool cancel;
 
 	if (!ev)
 		return;
 
-	/*
-	 * Outer mutex ensures that the first blocker completes canceling
-	 * the event work before further blockers are allowed to finish.
-	 */
-	mutex_lock(&ev->block_mutex);
-
-	spin_lock_irqsave(&ev->lock, flags);
-	cancel = !ev->block++;
-	spin_unlock_irqrestore(&ev->lock, flags);
-
-	if (cancel)
+	if (atomic_inc_return(&ev->block) == 1)
 		cancel_delayed_work_sync(&disk->ev->dwork);
-
-	mutex_unlock(&ev->block_mutex);
 }
 
 static void __disk_unblock_events(struct gendisk *disk, bool check_now)
@@ -1516,23 +1502,18 @@ static void __disk_unblock_events(struct gendisk *disk, bool check_now)
 	unsigned long intv;
 	unsigned long flags;
 
-	spin_lock_irqsave(&ev->lock, flags);
-
-	if (WARN_ON_ONCE(ev->block <= 0))
-		goto out_unlock;
-
-	if (--ev->block)
-		goto out_unlock;
+	if (atomic_dec_return(&ev->block) > 0)
+		return;
 
+	spin_lock_irqsave(&ev->lock, flags);
 	intv = disk_events_poll_jiffies(disk);
+	spin_unlock_irqrestore(&ev->lock, flags);
 	if (check_now)
 		queue_delayed_work(system_freezable_power_efficient_wq,
 				   &ev->dwork, 0);
 	else if (intv)
 		queue_delayed_work(system_freezable_power_efficient_wq,
 				   &ev->dwork, intv);
-out_unlock:
-	spin_unlock_irqrestore(&ev->lock, flags);
 }
 
 /**
@@ -1572,10 +1553,10 @@ void disk_flush_events(struct gendisk *disk, unsigned int mask)
 	spin_lock_irq(&ev->lock);
 	ev->clearing |= mask;
-	if (!ev->block)
+	spin_unlock_irq(&ev->lock);
+	if (!atomic_read(&ev->block))
 		mod_delayed_work(system_freezable_power_efficient_wq,
 				 &ev->dwork, 0);
-	spin_unlock_irq(&ev->lock);
 }
 
 /**
@@ -1666,12 +1647,11 @@ static void disk_check_events(struct disk_events *ev,
 	*clearing_ptr &= ~clearing;
 
 	intv = disk_events_poll_jiffies(disk);
-	if (!ev->block && intv)
+	spin_unlock_irq(&ev->lock);
+	if (!atomic_read(
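The core of the patch is replacing the mutex-protected blocking counter with an atomic depth: only the first blocker (depth 0 -> 1) cancels the pending work, and only the last unblocker (depth 1 -> 0) re-arms it, with no lock held around either. The behavior can be modeled in plain C11 outside the kernel; this is a hedged sketch with illustrative names, where the counters stand in for cancel_delayed_work_sync() and queue_delayed_work():

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy model of the atomic event-blocking depth from the patch. */
struct toy_events {
	atomic_int block;	/* event blocking depth */
	int cancels;		/* times work was cancelled */
	int queues;		/* times work was (re)queued */
};

void toy_block_events(struct toy_events *ev)
{
	/* Only the first blocker (0 -> 1) cancels the pending work;
	 * atomic_fetch_add returns the previous value. */
	if (atomic_fetch_add(&ev->block, 1) == 0)
		ev->cancels++;	/* stands in for cancel_delayed_work_sync() */
}

void toy_unblock_events(struct toy_events *ev)
{
	/* Only the last unblocker (1 -> 0) re-arms the work. */
	if (atomic_fetch_sub(&ev->block, 1) - 1 > 0)
		return;
	ev->queues++;		/* stands in for queue_delayed_work() */
}
```

Nested block/unblock pairs then behave like the kernel's depth counting, without the block_mutex that the hung task above was deadlocking on.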
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
>
> At mount the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.

So I believe we have to distinguish three cases so that we are on the same page.

1) PMEM is exposed only via a block interface for legacy filesystems to use. Here, all the bad blocks handling IMO must happen in the NVDIMM driver. Looking from outside, the IO either returns with EIO or succeeds. As a result you cannot ever get rid of bad blocks handling in the NVDIMM driver.

2) PMEM is exposed for a DAX-aware filesystem. This seems to be what you are mostly interested in. We could possibly do something more efficient than what the NVDIMM driver does, however the complexity would be relatively high and frankly I'm far from convinced this is really worth it. If there are so many badblocks that this would matter, the HW has IMHO bigger problems than performance.

3) PMEM filesystem - there things are even more difficult, as was already noted elsewhere in the thread. But for now I'd like to leave those aside so as not to complicate things too much.

Now my question: Why do we bother with badblocks at all? In cases 1) and 2), if the platform can recover from MCE, we can just always access persistent memory using memcpy_mcsafe(); if that fails, return -EIO.
Actually that seems to already happen, so we just need to make sure all places handle returned errors properly (e.g. fs/dax.c does not seem to) and we are done. No need for a bad blocks list at all, no slowdown unless we hit a bad cell, and in that case who cares about performance when the data is gone...

For platforms that cannot recover from MCE - just buy better hardware ;). Seriously, I have doubts people can seriously use a machine that will unavoidably randomly reboot (as there is always a risk you hit an error that has not been uncovered by background scrub). But maybe for big cloud providers the cost savings may offset the inconvenience, I don't know. But still, for that case the bad blocks handling in NVDIMM code as we do it now looks good enough?

Honza
--
Jan Kara
SUSE Labs, CR
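The scheme Jan is proposing - skip any badblocks lookup on the access path, attempt a machine-check-safe copy, and surface a failure as -EIO - boils down to a very small control flow. Below is a hedged userspace sketch: toy_memcpy_mcsafe() is only a stand-in for the kernel's memcpy_mcsafe() (which really consumes the machine check in hardware), with an explicit "poisoned" byte range simulating an uncorrected error:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simulated poison region; illustrative names, not a kernel API. */
static const char *poison_start, *poison_end;

/* Stand-in for a machine-check-safe copy: fail instead of crashing
 * when the source range overlaps the poisoned region. */
static bool toy_memcpy_mcsafe(void *dst, const void *src, size_t n)
{
	const char *s = src;

	if (poison_start && s < poison_end && s + n > poison_start)
		return false;	/* simulated MCE: nothing copied */
	memcpy(dst, src, n);
	return true;
}

/* The read path in the scheme above: no badblocks lookup, just try the
 * copy and turn a machine-check failure into -EIO. */
int toy_dax_read(void *dst, const void *pmem, size_t n)
{
	return toy_memcpy_mcsafe(dst, pmem, n) ? 0 : -EIO;
}
```

The point of the sketch is the shape of the error handling: callers of the copy must check its result, which is exactly the "make sure all places handle returned errors properly" part above.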
Re: [PATCH 3/4] nvme: use blk_rq_payload_bytes
On Tue, Jan 17, 2017 at 10:06:51PM +0200, Sagi Grimberg wrote:
> >> @@ -1014,9 +1013,9 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue
> >> *queue,
> >> }
> >>
>
> Christoph, a little above here we still look at blk_rq_bytes(),
> shouldn't that look at blk_rq_payload_bytes() too?

The check is ok for now as it's just zero vs non-zero. It's somewhat broken for Write Zeroes, though. I've fixed it in this series which I need to submit:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/write-zeroes-take-2
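The distinction under discussion: a Write Zeroes command addresses a nonzero LBA range on the device but transfers no data payload, so a zero-vs-nonzero test on the range size and one on the payload size can disagree. The toy model below is not the block layer's structs or accessors, just an illustration of why the two accessors diverge for such commands:

```c
#include <assert.h>

/* Illustrative stand-ins for request operations and struct request. */
enum toy_op { TOY_WRITE, TOY_WRITE_ZEROES };

struct toy_rq {
	enum toy_op op;
	unsigned int bytes;	/* length of the affected LBA range */
};

/* Models blk_rq_bytes(): size of the range the command addresses. */
unsigned int toy_rq_bytes(const struct toy_rq *rq)
{
	return rq->bytes;
}

/* Models blk_rq_payload_bytes(): data actually transferred. Write
 * Zeroes addresses a range but carries no payload. */
unsigned int toy_rq_payload_bytes(const struct toy_rq *rq)
{
	return rq->op == TOY_WRITE_ZEROES ? 0 : rq->bytes;
}
```

For an ordinary write the two agree, which is why the zero-vs-non-zero check happens to work today; for Write Zeroes they differ, which is the breakage Christoph refers to.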