[PATCH] scsi_transport_fc: set scsi_target_id upon rescan
From: Hannes Reinecke

When an rport is found in the bindings array there is no guarantee that
it had been a target port, so we need to call fc_remote_port_rolechg()
here to ensure the scsi_target_id is set correctly. Otherwise the port
will never be scanned.

Signed-off-by: Hannes Reinecke
---
 drivers/scsi/scsi_transport_fc.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 3c6bc0081fcb..377961380fd7 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -2876,7 +2876,6 @@ fc_remote_port_add(struct Scsi_Host *shost, int channel,
 			memcpy(&rport->port_name, &ids->port_name,
 				sizeof(rport->port_name));
 			rport->port_id = ids->port_id;
-			rport->roles = ids->roles;
 			rport->port_state = FC_PORTSTATE_ONLINE;
 			rport->flags &= ~FC_RPORT_FAST_FAIL_TIMEDOUT;
@@ -2885,15 +2884,7 @@ fc_remote_port_add(struct Scsi_Host *shost, int channel,
 					fci->f->dd_fcrport_size);
 			spin_unlock_irqrestore(shost->host_lock, flags);
-			if (ids->roles & FC_PORT_ROLE_FCP_TARGET) {
-				scsi_target_unblock(&rport->dev, SDEV_RUNNING);
-
-				/* initiate a scan of the target */
-				spin_lock_irqsave(shost->host_lock, flags);
-				rport->flags |= FC_RPORT_SCAN_PENDING;
-				scsi_queue_work(shost, &rport->scan_work);
-				spin_unlock_irqrestore(shost->host_lock, flags);
-			}
+			fc_remote_port_rolechg(rport, ids->roles);
 			return rport;
 		}
 	}
--
2.12.3
[PATCH] aacraid: fix potential double-fetch issue
While examining the kernel source code, I found a dangerous operation
that could turn into a double-fetch situation (a race-condition bug),
where the same userspace memory region is fetched twice into the kernel,
with sanity checks after the first fetch but no checks after the second
fetch.

1. The first userspace fetch happens in
   copy_from_user(&fibsize, &user_srb->count, sizeof(u32)).

2. fibsize then undergoes a few sanity checks and is used to allocate
   user_srbcmd = kmalloc(fibsize, GFP_KERNEL).

3. The second userspace fetch happens in
   copy_from_user(user_srbcmd, user_srb, fibsize).

4. Given that the memory region pointed to by user_srb is fully
   controllable from userspace, an attacker can race to overwrite
   user_srb->count with an arbitrary value after the first fetch but
   before the second. The changed value will be copied into user_srbcmd.

The patch explicitly overrides user_srbcmd->count after the second
userspace fetch with the fibsize value from the first fetch. This
ensures that the invariant "user_srbcmd->count stores the size of the
user_srbcmd buffer" still holds after the second fetch.

Signed-off-by: Meng Xu
---
 drivers/scsi/aacraid/commctrl.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/aacraid/commctrl.c b/drivers/scsi/aacraid/commctrl.c
index 9ab0fa9..734bf11 100644
--- a/drivers/scsi/aacraid/commctrl.c
+++ b/drivers/scsi/aacraid/commctrl.c
@@ -540,6 +540,12 @@ static int aac_send_raw_srb(struct aac_dev* dev, void __user * arg)
 		goto cleanup;
 	}

+	/*
+	 * re-establish the relation that user_srbcmd->count holds the
+	 * size of user_srbcmd
+	 */
+	user_srbcmd->count = fibsize;
+
 	flags = user_srbcmd->flags; /* from user in cpu order */
 	switch (flags & (SRB_DataIn | SRB_DataOut)) {
 	case SRB_DataOut:
--
2.7.4
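The race above is easier to see in a compilable sketch. This is a userspace model, not the driver code: struct srb_cmd, fetch_size() and fetch_cmd() are invented stand-ins for the aacraid structures and the two copy_from_user() calls, and the attacker's racing write is simulated by mutating the buffer between the two fetches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cut-down stand-in for the user-supplied SRB command. */
struct srb_cmd {
    uint32_t count;
    char payload[64];
};

/* First fetch: read only the size field and validate it (returns 0 on
 * a failed sanity check). */
static uint32_t fetch_size(const struct srb_cmd *user, uint32_t max)
{
    uint32_t fibsize = user->count;      /* copy_from_user(&fibsize, ...) */
    return (fibsize > 0 && fibsize <= max) ? fibsize : 0;
}

/* Second fetch: copy the whole command, then re-establish the invariant
 * that ->count matches the validated first-fetch size, exactly as the
 * patch does with user_srbcmd->count = fibsize. */
static uint32_t fetch_cmd(struct srb_cmd *dst, const struct srb_cmd *user,
                          uint32_t fibsize)
{
    memcpy(dst, user, sizeof(*dst));     /* copy_from_user(user_srbcmd, ...) */
    dst->count = fibsize;                /* closes the TOCTOU window */
    return dst->count;
}
```

Without the `dst->count = fibsize` line, every later use of the copied count field would trust a value the validation never saw.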
[PATCH] mpt3sas: downgrade full copy_from_user to access_ok check
Since right after the user copy we are going to memset(&karg, 0,
sizeof(karg)), I guess an access_ok check is enough?

Signed-off-by: Meng Xu
---
 drivers/scsi/mpt3sas/mpt3sas_ctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_ctl.c b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
index bdffb69..b363d2d 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_ctl.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_ctl.c
@@ -1065,7 +1065,7 @@ _ctl_getiocinfo(struct MPT3SAS_ADAPTER *ioc, void __user *arg)
 {
 	struct mpt3_ioctl_iocinfo karg;

-	if (copy_from_user(&karg, arg, sizeof(karg))) {
+	if (!access_ok(VERIFY_READ, arg, sizeof(karg))) {
 		pr_err("failure at %s:%d/%s()!\n",
 		    __FILE__, __LINE__, __func__);
 		return -EFAULT;
--
2.7.4
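For contrast, here is a compilable userspace sketch of the semantic difference this patch relies on: an access_ok()-style check only validates the address range and never fills the destination, so karg's contents come entirely from the memset() that follows. range_ok() and getiocinfo() below are invented stand-ins, not the mpt3sas code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for access_ok(): validates only that the range is plausible,
 * touches nothing. */
static int range_ok(const void *p, size_t len)
{
    return p != NULL && len != 0;
}

/* Model of the patched flow: no data is copied from uarg; the output is
 * whatever the subsequent memset() leaves behind. */
static int getiocinfo(const uint32_t *uarg, uint32_t *karg_out)
{
    uint32_t karg = 0xffffffffu;         /* stack-garbage stand-in */

    if (!range_ok(uarg, sizeof(karg)))
        return -14;                      /* -EFAULT path */
    /* no copy from uarg: the patch drops the copy_from_user() */
    memset(&karg, 0, sizeof(karg));      /* what the function does next */
    *karg_out = karg;
    return 0;
}
```

The sketch shows both the upside (no user copy needed for data that is immediately zeroed) and the caveat: this is only safe when nothing between the check and the memset() reads karg.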
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19, 2017 at 04:49:15PM +, Bart Van Assche wrote:
> On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> > Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
> > to do that already.
>
> Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core to
> reexamine the software queues and the hctx dispatch list but not the requeue
> list. If a block driver returns BLK_STS_RESOURCE then requests end up on the
> requeue list. Hence the following code in scsi_end_request():
>
> 	if (scsi_target(sdev)->single_lun ||
> 	    !list_empty(&sdev->host->starved_list))
> 		kblockd_schedule_work(&sdev->requeue_work);
> 	else
> 		blk_mq_run_hw_queues(q, true);

Let's focus on dm-rq. As I have explained before, dm-rq is different
from SCSI here: we don't need to requeue the request any more in dm-rq
if blk_get_request() returns NULL in multipath_clone_and_map(), since
SCHED_RESTART definitely covers that.

Not to mention that dm_mq_delay_requeue_request() will run the queue
explicitly if that has to be done. That doesn't need SCHED_RESTART to
cover it, right?

--
Ming
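The SCHED_RESTART behaviour being argued about can be sketched in a few lines. This is a toy model, not blk-mq's real data structures: struct hctx, dispatch() and complete_rq() are invented, and free_tags stands in for whatever resource made the driver return BLK_STS_RESOURCE. The point of the mechanism is that a failed dispatch marks the queue and the completion path reruns it, so the submission side need not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy hardware-queue state: a restart mark, a rerun counter, and one
 * shared resource pool. */
struct hctx {
    bool restart;        /* SCHED_RESTART analogue */
    int runs;            /* number of completion-driven reruns */
    int free_tags;       /* resource that can run out */
};

/* Dispatch path: out of resources -> mark the queue and give up
 * (BLK_STS_RESOURCE analogue). */
static bool dispatch(struct hctx *h)
{
    if (h->free_tags == 0) {
        h->restart = true;
        return false;
    }
    h->free_tags--;
    return true;
}

/* Completion path: a resource was freed, so if the queue was marked,
 * clear the mark and rerun it. */
static void complete_rq(struct hctx *h)
{
    h->free_tags++;
    if (h->restart) {
        h->restart = false;
        h->runs++;       /* stands in for blk_mq_run_hw_queue() */
    }
}
```

As the thread notes, this covers requests stalled in the dispatch path; whether it also covers a driver's private requeue list is exactly the point in dispute.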
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
Hi Mike,

On Tue, Sep 19, 2017 at 07:50:06PM -0400, Mike Snitzer wrote:
> On Tue, Sep 19 2017 at 7:25pm -0400, Bart Van Assche wrote:
> > On Wed, 2017-09-20 at 06:44 +0800, Ming Lei wrote:
> > > For this issue, it isn't same between SCSI and dm-rq.
> > >
> > > We don't need to run queue in .end_io of dm, and the theory is
> > > simple, otherwise it isn't performance issue, and should be I/O hang.
> > >
> > > 1) every dm-rq's request is 1:1 mapped to SCSI's request
> > >
> > > 2) if there is any mapped SCSI request not finished, either
> > > in-flight or in requeue list or whatever, there will be one
> > > corresponding dm-rq's request in-flight
> > >
> > > 3) once the mapped SCSI request is completed, dm-rq's completion
> > > path will be triggered and dm-rq's queue will be rerun because of
> > > SCHED_RESTART in dm-rq
> > >
> > > So the hw queue of dm-rq has been run in dm-rq's completion path
> > > already, right? Why do we need to do it again in the hot path?
> >
> > The measurement data in the description of patch 5/5 shows a significant
> > performance regression for an important workload, namely random I/O.
> > Additionally, the performance improvement for sequential I/O was achieved
> > for an unrealistically low queue depth.
>
> So you've ignored Ming's question entirely and instead decided to focus
> on previous questions you raised to Ming that he ignored. This is
> getting tedious.

Sorry for not making it clear: I mentioned I will post a new version to
address the random I/O regression.

> Especially given that Ming said the first patch that all this fighting
> has been over isn't even required to attain the improvements.
>
> Ming, please retest both your baseline and patched setup with a
> queue_depth of >= 32. Also, please do 3 - 5 runs to get a avg and std
> dev across the runs.
Taking a bigger queue_depth won't be helpful for this issue, and it can
make the situation worse, because .cmd_per_lun won't be changed, and the
queue often becomes busy when the number of in-flight requests is bigger
than .cmd_per_lun.

I will post a new version, which will use another simple way to figure
out if the underlying queue is busy, so that random I/O performance
won't be affected, but this new version needs to depend on the following
patchset:

	https://marc.info/?t=15043655572&r=1&w=2

So it may take a while since that patchset is still under review. I will
post them all together in 'blk-mq-sched: improve SCSI-MQ performance
(V5)'.

The approach taken in patch 5 depends on q->queue_depth, but some SCSI
hosts' .cmd_per_lun differs from q->queue_depth, which causes the random
I/O regression.

--
Ming
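The cmd_per_lun point can be made concrete with a trivial model (struct lun_model and lun_busy() are invented for illustration; the numbers are arbitrary): the device saturates at .cmd_per_lun even when q->queue_depth is much larger, so queue_depth is the wrong threshold for a busy heuristic.

```c
#include <assert.h>

/* Toy per-LUN state: the host's per-LUN command limit, the queue's
 * nominal depth, and the current in-flight count. */
struct lun_model {
    int cmd_per_lun;     /* host-imposed per-LUN limit */
    int queue_depth;     /* q->queue_depth, possibly much larger */
    int in_flight;
};

/* Busy once in-flight requests reach the per-LUN limit -- comparing
 * against queue_depth instead would report "not busy" long after the
 * device stopped accepting commands. */
static int lun_busy(const struct lun_model *m)
{
    return m->in_flight >= m->cmd_per_lun;
}
```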
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19 2017 at 7:25pm -0400, Bart Van Assche wrote:
> On Wed, 2017-09-20 at 06:44 +0800, Ming Lei wrote:
> > For this issue, it isn't same between SCSI and dm-rq.
> >
> > We don't need to run queue in .end_io of dm, and the theory is
> > simple, otherwise it isn't performance issue, and should be I/O hang.
> >
> > 1) every dm-rq's request is 1:1 mapped to SCSI's request
> >
> > 2) if there is any mapped SCSI request not finished, either
> > in-flight or in requeue list or whatever, there will be one
> > corresponding dm-rq's request in-flight
> >
> > 3) once the mapped SCSI request is completed, dm-rq's completion
> > path will be triggered and dm-rq's queue will be rerun because of
> > SCHED_RESTART in dm-rq
> >
> > So the hw queue of dm-rq has been run in dm-rq's completion path
> > already, right? Why do we need to do it again in the hot path?
>
> The measurement data in the description of patch 5/5 shows a significant
> performance regression for an important workload, namely random I/O.
> Additionally, the performance improvement for sequential I/O was achieved
> for an unrealistically low queue depth.

So you've ignored Ming's question entirely and instead decided to focus
on previous questions you raised to Ming that he ignored. This is
getting tedious.

Especially given that Ming said the first patch that all this fighting
has been over isn't even required to attain the improvements.

Ming, please retest both your baseline and patched setup with a
queue_depth of >= 32. Also, please do 3 - 5 runs to get an avg and std
dev across the runs.

> Sorry but given these measurement results I don't see why I should
> spend more time on this patch series.

Bart, I've historically genuinely always appreciated your review and
insight. But if your future "review" on this patchset would take the
form shown in this thread then please don't spend more time on it.

Mike
[GIT PULL] SCSI fixes for 4.14-rc1
This is a set of five small fixes: one is a null deref fix which is pretty critical for the fc transport class and one fixes a potential security issue of sg leaking kernel information. The patch is available here: git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes The short changelog is: Arnd Bergmann (1): scsi: acornscsi: fix build error Christoph Hellwig (1): scsi: scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout Hannes Reinecke (2): scsi: sg: fixup infoleak when using SG_GET_REQUEST_TABLE scsi: sg: factor out sg_fill_request_table() Lukas Czerner (1): scsi: sd: Remove unnecessary condition in sd_read_block_limits() and the diffstat: drivers/scsi/arm/acornscsi.c | 6 ++-- drivers/scsi/scsi_transport_fc.c | 2 +- drivers/scsi/sd.c| 2 -- drivers/scsi/sg.c| 64 ++-- 4 files changed, 40 insertions(+), 34 deletions(-) With full diff below James --- diff --git a/drivers/scsi/arm/acornscsi.c b/drivers/scsi/arm/acornscsi.c index 690816f3c6af..421fe869a11e 100644 --- a/drivers/scsi/arm/acornscsi.c +++ b/drivers/scsi/arm/acornscsi.c @@ -2725,9 +2725,9 @@ int acornscsi_abort(struct scsi_cmnd *SCpnt) * Params : SCpnt - command causing reset * Returns : one of SCSI_RESET_ macros */ -int acornscsi_host_reset(struct Scsi_Host *shpnt) +int acornscsi_host_reset(struct scsi_cmnd *SCpnt) { - AS_Host *host = (AS_Host *)shpnt->hostdata; + AS_Host *host = (AS_Host *)SCpnt->device->host->hostdata; struct scsi_cmnd *SCptr; host->stats.resets += 1; @@ -2741,7 +2741,7 @@ int acornscsi_host_reset(struct Scsi_Host *shpnt) printk(KERN_WARNING "acornscsi_reset: "); print_sbic_status(asr, ssr, host->scsi.phase); - for (devidx = 0; devidx < 9; devidx ++) { + for (devidx = 0; devidx < 9; devidx++) acornscsi_dumplog(host, devidx); } #endif diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c index 3c6bc0081fcb..ba9d70f8a6a1 100644 --- a/drivers/scsi/scsi_transport_fc.c +++ b/drivers/scsi/scsi_transport_fc.c @@ -3571,7 +3571,7 @@ 
fc_vport_sched_delete(struct work_struct *work) static enum blk_eh_timer_return fc_bsg_job_timeout(struct request *req) { - struct bsg_job *job = (void *) req->special; + struct bsg_job *job = blk_mq_rq_to_pdu(req); struct Scsi_Host *shost = fc_bsg_to_shost(job); struct fc_rport *rport = fc_bsg_to_rport(job); struct fc_internal *i = to_fc_internal(shost->transportt); diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 11c1738c2100..fb9f8b5f4673 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2915,8 +2915,6 @@ static void sd_read_block_limits(struct scsi_disk *sdkp) sd_config_discard(sdkp, SD_LBP_WS16); else if (sdkp->lbpws10) sd_config_discard(sdkp, SD_LBP_WS10); - else if (sdkp->lbpu && sdkp->max_unmap_blocks) - sd_config_discard(sdkp, SD_LBP_UNMAP); else sd_config_discard(sdkp, SD_LBP_DISABLE); } diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index cf0e71db9e51..0419c2298eab 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -828,6 +828,39 @@ static int max_sectors_bytes(struct request_queue *q) return max_sectors << 9; } +static void +sg_fill_request_table(Sg_fd *sfp, sg_req_info_t *rinfo) +{ + Sg_request *srp; + int val; + unsigned int ms; + + val = 0; + list_for_each_entry(srp, &sfp->rq_list, entry) { + if (val > SG_MAX_QUEUE) + break; + rinfo[val].req_state = srp->done + 1; + rinfo[val].problem = + srp->header.masked_status & + srp->header.host_status & + srp->header.driver_status; + if (srp->done) + rinfo[val].duration = + srp->header.duration; + else { + ms = jiffies_to_msecs(jiffies); + rinfo[val].duration = + (ms > srp->header.duration) ? 
+ (ms - srp->header.duration) : 0; + } + rinfo[val].orphan = srp->orphan; + rinfo[val].sg_io_owned = srp->sg_io_owned; + rinfo[val].pack_id = srp->header.pack_id; + rinfo[val].usr_ptr = srp->header.usr_ptr; + val++; + } +} + static long sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) { @@ -1012,38 +1045,13 @@ sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) return -EFAULT; else {
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Wed, 2017-09-20 at 06:44 +0800, Ming Lei wrote:
> For this issue, it isn't same between SCSI and dm-rq.
>
> We don't need to run queue in .end_io of dm, and the theory is
> simple, otherwise it isn't performance issue, and should be I/O hang.
>
> 1) every dm-rq's request is 1:1 mapped to SCSI's request
>
> 2) if there is any mapped SCSI request not finished, either
> in-flight or in requeue list or whatever, there will be one
> corresponding dm-rq's request in-flight
>
> 3) once the mapped SCSI request is completed, dm-rq's completion
> path will be triggered and dm-rq's queue will be rerun because of
> SCHED_RESTART in dm-rq
>
> So the hw queue of dm-rq has been run in dm-rq's completion path
> already, right? Why do we need to do it again in the hot path?

The measurement data in the description of patch 5/5 shows a significant
performance regression for an important workload, namely random I/O.
Additionally, the performance improvement for sequential I/O was
achieved for an unrealistically low queue depth.

Sorry but given these measurement results I don't see why I should spend
more time on this patch series.

Bart.
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19, 2017 at 06:42:30PM +, Bart Van Assche wrote:
> On Wed, 2017-09-20 at 00:55 +0800, Ming Lei wrote:
> > On Wed, Sep 20, 2017 at 12:49 AM, Bart Van Assche wrote:
> > > On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> > > > Run queue at end_io is definitely wrong, because blk-mq has
> > > > SCHED_RESTART to do that already.
> > >
> > > Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq
> > > core to reexamine the software queues and the hctx dispatch list but
> > > not the requeue list. If a block driver returns BLK_STS_RESOURCE then
> > > requests end up on the requeue list. Hence the following code in
> > > scsi_end_request():
> >
> > That doesn't need SCHED_RESTART, because it is requeue's responsibility
> > to do that, see blk_mq_requeue_work(), which will run hw queue at the
> > end of this func.
>
> That's not what I was trying to explain. What I was trying to explain is
> that every block driver that can cause a request to end up on the requeue
> list is responsible for kicking the requeue list at a later time. Hence
> the kblockd_schedule_work(&sdev->requeue_work) call in the SCSI core and
> the blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() calls
> in the dm code. What I would like to see is measurement results for
> dm-mpath without this patch series and a call to kick the requeue list
> added to the dm-mpath end_io code.

For this issue, it isn't the same between SCSI and dm-rq.

We don't need to run the queue in .end_io of dm, and the theory is
simple; otherwise it wouldn't be a performance issue, it would be an
I/O hang:

1) every dm-rq request is 1:1 mapped to a SCSI request

2) if there is any mapped SCSI request not finished, either in-flight
or in the requeue list or whatever, there will be one corresponding
dm-rq request in-flight

3) once the mapped SCSI request is completed, dm-rq's completion path
will be triggered and dm-rq's queue will be rerun because of
SCHED_RESTART in dm-rq

So the hw queue of dm-rq has been run in dm-rq's completion path
already, right? Why do we need to do it again in the hot path?

--
Ming
Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset
On 09/19/2017 02:05 PM, James Bottomley wrote:
> Actually, the whole problem sounds like a posted write. Likely the
> write that causes the reset doesn't get flushed until the read checking
> if the reset has succeeded, which might explain the 100% initial
> failure. Why not throw away that first value if it's a failure and
> then do your polled wait and timeout on the reset success. We should
> anyway be waiting some time for a reset to be issued, so even on non-
> posted write systems we could see this problem intermittently.
>
> James

Thanks for this suggestion, James. I tried removing the sleep and doing
a dummy read of the register using readl() - the issue was reproduced.
I did expect that, since in aac_is_ctrl_up_and_running() we indeed read
a register, and if it shows the reset is not complete, we wait and read
it again. So we can think of this 1st read as a dummy one heheh

My theory here is that we're observing a failure similar to one we
already saw in some specific NVMe adapters - a readl() issued before
some delay (in NVMe it was 2s) corrupts the adapter FW procedure. It's
as if the adapter doesn't like to deal with this read while the reset
procedure is ongoing. So, we wait a bit before issuing the readl() and
everything goes fine.

Cheers,

Guilherme
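James's suggestion can be sketched as a small polling loop. This is a userspace model only: struct dev_model, read_status() and wait_ctrl_up() are invented, with read_status() standing in for the readl() of the controller status register. The idea is to issue one read purely to flush the posted reset write, discard its result, and only then poll with a timeout.

```c
#include <assert.h>

/* Toy device: reports "up" only after a fixed number of reads, which
 * models a controller still working through its reset. */
struct dev_model {
    int reads_until_up;
};

/* Stand-in for readl() on the status register: 1 == controller up. */
static int read_status(struct dev_model *d)
{
    return d->reads_until_up-- <= 0;
}

/* Throw away the first value (it may fail purely because it flushes the
 * posted write), then do the polled wait with a bounded number of
 * attempts, per James's suggestion. */
static int wait_ctrl_up(struct dev_model *d, int max_polls)
{
    (void)read_status(d);            /* dummy read: flush and discard */
    for (int i = 0; i < max_polls; i++)
        if (read_status(d))
            return 0;                /* up and running */
    return -1;                       /* reset timed out */
}
```

Whether the real adapter additionally needs a delay before the first read (Guilherme's theory) is exactly what this sketch cannot decide; it only shows the discard-then-poll structure.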
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Wed, 2017-09-20 at 00:55 +0800, Ming Lei wrote:
> On Wed, Sep 20, 2017 at 12:49 AM, Bart Van Assche wrote:
> > On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> > > Run queue at end_io is definitely wrong, because blk-mq has
> > > SCHED_RESTART to do that already.
> >
> > Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq
> > core to reexamine the software queues and the hctx dispatch list but
> > not the requeue list. If a block driver returns BLK_STS_RESOURCE then
> > requests end up on the requeue list. Hence the following code in
> > scsi_end_request():
>
> That doesn't need SCHED_RESTART, because it is requeue's responsibility
> to do that, see blk_mq_requeue_work(), which will run hw queue at the
> end of this func.

That's not what I was trying to explain. What I was trying to explain is
that every block driver that can cause a request to end up on the
requeue list is responsible for kicking the requeue list at a later
time. Hence the kblockd_schedule_work(&sdev->requeue_work) call in the
SCSI core and the blk_mq_kick_requeue_list() and
blk_mq_delay_kick_requeue_list() calls in the dm code. What I would like
to see is measurement results for dm-mpath without this patch series and
a call to kick the requeue list added to the dm-mpath end_io code.

Bart.
Re: Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?
On 2017-09-19 10:56 AM, Benjamin Block wrote:
> Hello linux-block,
>
> I wrote some tests recently to test patches against bsg.c and bsg-lib.c,
> and while writing those I noticed something strange: When you use the
> write() and read() calls on multiple file descriptors for a single
> bsg device (FC or SCSI), it is possible to get cross-talk between those
> different file descriptors.
>
> E.g.: You have two independent processes open an FD on /dev/bsg/fc_host0
> and you send commands via write() in both processes. When they both do
> read() later on - to read the responses for the commands they sent
> before - it is possible that process A gets the response (Sense data,
> FC protocol data, ...) for a command that process B wrote, and vice
> versa.
>
> I noticed this because in my tests I 'tag' each command I send with
> write() via a value I write into the usr_ptr field of sg_io_v4. When I
> later use read() to receive responses, I check for this tag in a
> hash table and so can look up the original command. When I used this
> and spawned multiple processes with the same target bsg device, I got
> responses for commands I don't have tags for in my hash table, so they
> were written by another process. This never happened when I only
> spawned one test process.
>
> This seems awfully counterproductive.. so much so that I don't see any
> point in allowing users to open more than one FD per bsg device,
> because that would inevitably lead to very hard to debug 'bugs'. And I
> wonder if this is really by design or some regression that happened
> over time.
>
> I looked at the code in bsg.c and yes, it seems this is what is coded
> right now. We have a hash table which manages all 'struct bsg_device's,
> indexed by device minors, so one hash-table entry per device node.
>
> So eh, long talk, short question: is that intended?

Hi,

About once a year I point out that major misfeature in the bsg driver
and no-one seems to care. Most recently I pointed it out in a discussion
about SCSI SG CVE-2017-0794 6 days ago:

"... Last time I looked at the bsg driver, a SCSI command could be
issued on one file descriptor and its data-in block and SCSI status
delivered to another file descriptor to the same block device
(potentially in another process). IOW chaos"

It is hard to imagine any sane application relying on this bizarre
behaviour, so fixing it should not break any applications. My guess is
that it would require a non-trivial patch set to fix. Would you like to
volunteer?

In the same post I asked (thinking it was cc-ed to this list):
"BTW Is there a NVMe pass-through?"

And the answer is: yes there is, in include/uapi/linux/nvme_ioctl.h,
which has struct nvme_user_io and struct nvme_passthru_cmd plus several
ioctl constants. See the smartmontools repository for examples of how
to use them. I also discovered there is a SCSI to NVMe translation
layer (SNTL) hiding in some kernels. Hence commands like this:

    sg_inq /dev/nvme0       ## that's a 'char' device, and ...
    sg_vpd /dev/nvme0n1     ## that's a block device, "namespace 1"

work as expected. Ah, I spoke too soon ... in another post I have been
informed that the SNTL has now been removed from the kernel. Just goes
to show that human stupidity is still infinite :-)

Doug Gilbert
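The usr_ptr tagging scheme Benjamin describes can be sketched like this (userspace side only; the fixed-size table, remember_tag() and claim_tag() are invented bookkeeping, not the bsg API). A response whose tag is not in the table must have been written through another file descriptor, which is exactly how the cross-talk was detected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One slot per outstanding command; tag is the cookie that would be
 * stored in sg_io_v4.usr_ptr at write() time. */
struct pending {
    uint64_t tag;
    int in_use;
};

#define MAX_PENDING 16
static struct pending table[MAX_PENDING];

/* Before write(): record the cookie we stamped into usr_ptr. */
static int remember_tag(uint64_t tag)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (!table[i].in_use) {
            table[i] = (struct pending){ tag, 1 };
            return 0;
        }
    return -1;                       /* too many outstanding commands */
}

/* After read(): 1 if the response's usr_ptr belongs to us; an unknown
 * tag means the response was meant for another fd. */
static int claim_tag(uint64_t tag)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (table[i].in_use && table[i].tag == tag) {
            table[i].in_use = 0;
            return 1;
        }
    return 0;
}
```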
RE: [PATCH] sd: Limit WRITE SAME / WRITE SAME(16) w/UNMAP length for certain devices
Ewan,

I like it, more generic than my patch. I never saw the other cases, so I
limited my patch to WS16.

Acked-by: Bill Kuzeja

On Tue-09-19 at 12:14 Ewan D. Milne wrote:
> Some devices do not support a WRITE SAME / WRITE SAME(16) with the UNMAP
> bit set up to the length specified in the MAXIMUM WRITE SAME LENGTH field
> in the block limits VPD page (or, the field is zero, indicating there is
> no limit). Limit the length by the MAXIMUM UNMAP LBA COUNT value.
> Otherwise the command might be rejected.
>
> Signed-off-by: Ewan D. Milne
> ---
>  drivers/scsi/sd.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 6549e5c..fa07ac2 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -693,6 +693,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
>  	struct request_queue *q = sdkp->disk->queue;
>  	unsigned int logical_block_size = sdkp->device->sector_size;
>  	unsigned int max_blocks = 0;
> +	u32 max_ws_blocks = sdkp->max_ws_blocks;
>
>  	q->limits.discard_alignment =
>  		sdkp->unmap_alignment * logical_block_size;
> @@ -701,6 +702,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
>  		sdkp->unmap_granularity * logical_block_size);
>  	sdkp->provisioning_mode = mode;
>
> +	/*
> +	 * Some devices limit WRITE SAME / WRITE SAME(16) w/UNMAP
> +	 * by MAXIMUM UNMAP LBA COUNT instead of MAXIMUM WRITE SAME LENGTH.
> +	 * Use the smaller of the two values to avoid ILLEGAL REQUEST errors.
> +	 */
> +	max_ws_blocks = min_not_zero(max_ws_blocks, sdkp->max_unmap_blocks);
> +
>  	switch (mode) {
>
>  	case SD_LBP_FULL:
> @@ -715,17 +723,17 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
>  		break;
>
>  	case SD_LBP_WS16:
> -		max_blocks = min_not_zero(sdkp->max_ws_blocks,
> +		max_blocks = min_not_zero(max_ws_blocks,
>  					  (u32)SD_MAX_WS16_BLOCKS);
>  		break;
>
>  	case SD_LBP_WS10:
> -		max_blocks = min_not_zero(sdkp->max_ws_blocks,
> +		max_blocks = min_not_zero(max_ws_blocks,
>  					  (u32)SD_MAX_WS10_BLOCKS);
>  		break;
>
>  	case SD_LBP_ZERO:
> -		max_blocks = min_not_zero(sdkp->max_ws_blocks,
> +		max_blocks = min_not_zero(max_ws_blocks,
>  					  (u32)SD_MAX_WS10_BLOCKS);
>  		break;
>  	}
> --
> 2.1.0
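The key helper in the patch above is min_not_zero(), where zero means "no limit" and therefore must not win the comparison. The kernel's macro is type-generic; this u32-only C version is just for illustration of that semantic:

```c
#include <assert.h>
#include <stdint.h>

/* min_not_zero() semantics for u32: return the smaller operand, except
 * that a zero operand (meaning "no limit") never wins. */
static uint32_t min_not_zero_u32(uint32_t a, uint32_t b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}
```

This is why the patch can unconditionally write `max_ws_blocks = min_not_zero(max_ws_blocks, sdkp->max_unmap_blocks)`: a device reporting 0 for either limit leaves the other limit in effect.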
[PATCH V3 9/9] pm80xx : corrected linkrate value.
Corrected the value defined for LINKRATE_60 (6 Gig).

Signed-off-by: Raj Dinesh
Signed-off-by: Viswas G
Acked-by: Jack Wang
---
 drivers/scsi/pm8001/pm80xx_hwi.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.h b/drivers/scsi/pm8001/pm80xx_hwi.h
index e36c5176f9a9..889e69ce3689 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.h
+++ b/drivers/scsi/pm8001/pm80xx_hwi.h
@@ -167,7 +167,7 @@
 #define LINKMODE_AUTO		(0x03 << 12)
 #define LINKRATE_15		(0x01 << 8)
 #define LINKRATE_30		(0x02 << 8)
-#define LINKRATE_60		(0x06 << 8)
+#define LINKRATE_60		(0x04 << 8)
 #define LINKRATE_120		(0x08 << 8)
 /* phy_profile */
--
2.12.3
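Why 0x04 rather than 0x06: judging from the neighbouring defines (0x01, 0x02, 0x08), the link-rate field appears to be a one-bit-per-rate mask in bits 8..11, so the old `0x06 << 8` also asserted the 3 Gb/s bit alongside 6 Gb/s. That reading is an inference from the header, not documented in the patch; the defines below are copied from the diff and rate_bits_disjoint() is mine:

```c
#include <assert.h>

/* Values from drivers/scsi/pm8001/pm80xx_hwi.h after the fix. */
#define LINKRATE_15  (0x01 << 8)
#define LINKRATE_30  (0x02 << 8)
#define LINKRATE_60  (0x04 << 8)
#define LINKRATE_120 (0x08 << 8)

/* With the fix, each rate owns exactly one bit, so the defines are
 * pairwise disjoint and can be OR-ed as a capability mask. */
static int rate_bits_disjoint(void)
{
    return (LINKRATE_15 & LINKRATE_30) == 0 &&
           (LINKRATE_30 & LINKRATE_60) == 0 &&
           (LINKRATE_60 & LINKRATE_120) == 0;
}
```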
[PATCH V3 8/9] pm80xx : panic on ncq error cleaning up the read log.
When there is an error in NCQ mode, the host has to read the NCQ error
log (10h) to clear the error state. However, the ccb that is set up for
doing this is not set up so that the previous state is cleared. If the
ccb was previously used for an I/O, n_elem is still set, and
pm8001_ccb_task_free() treats this as the signal to free a
scatter-gather list (that has already been freed).

Signed-off-by: Deepak Ukey
Signed-off-by: Viswas G
Acked-by: Jack Wang
---
 drivers/scsi/pm8001/pm80xx_hwi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index 92d2045dea68..f2c0839afbe3 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -1489,6 +1489,7 @@ static void pm80xx_send_read_log(struct pm8001_hba_info *pm8001_ha,
 	ccb->device = pm8001_ha_dev;
 	ccb->ccb_tag = ccb_tag;
 	ccb->task = task;
+	ccb->n_elem = 0;
 	pm8001_ha_dev->id |= NCQ_READ_LOG_FLAG;
 	pm8001_ha_dev->id |= NCQ_2ND_RLE_FLAG;
--
2.12.3
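The double free described above reduces to a stale-field bug on a recycled object, sketched here as a runnable toy: struct ccb is cut down, and sg_freed plus the two functions are invented bookkeeping standing in loosely for pm8001_ccb_task_free() and pm80xx_send_read_log().

```c
#include <assert.h>

/* Cut-down ccb: n_elem != 0 is taken to mean "owns a scatter-gather
 * list that must be unmapped/freed". */
struct ccb {
    int n_elem;
};

static int sg_freed;                 /* counts SG free attempts */

/* Free path: a non-zero n_elem triggers the SG teardown. On a recycled
 * ccb with a stale n_elem, this frees an already-freed list. */
static void ccb_task_free(struct ccb *c)
{
    if (c->n_elem)
        sg_freed++;
    c->n_elem = 0;
}

/* Read-log setup with the patch applied: clear n_elem before reuse,
 * since the READ LOG command carries no SG list of its own. */
static void send_read_log(struct ccb *c)
{
    c->n_elem = 0;                   /* the one-line fix */
    /* ... build and post the READ LOG (10h) command ... */
}
```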
[PATCH V3 6/9] pm80xx : modified port reset timer value for PM8006 card
Set the port reset timer value to 2000 ms for the PM8006 SATA
controller.

Signed-off-by: Deepak Ukey
Signed-off-by: Viswas G
Acked-by: Jack Wang
---
 drivers/scsi/pm8001/pm80xx_hwi.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c
index baab8a19c78e..8f1f5dc77d71 100644
--- a/drivers/scsi/pm8001/pm80xx_hwi.c
+++ b/drivers/scsi/pm8001/pm80xx_hwi.c
@@ -597,6 +597,12 @@ static void update_main_config_table(struct pm8001_hba_info *pm8001_ha)
 	pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer &= 0x;
 	pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer |=
 			PORT_RECOVERY_TIMEOUT;
+	if (pm8001_ha->chip_id == chip_8006) {
+		pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer &=
+			0x;
+		pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer |=
+			0x14;
+	}
 	pm8001_mw32(address, MAIN_PORT_RECOVERY_TIMER,
 		pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer);
--
2.12.3
[PATCH V3 7/9] pm80xx : corrected SATA abort handling sequence.
Modified SATA abort handling with following steps: 1) Set device state as recovery. 2) Send phy reset. 3) Wait for reset completion. 4) After successful reset, abort all IO's to the device. 5) After aborting all IO's to device, set device state as operational. Signed-off-by: Deepak Ukey Signed-off-by: Viswas G Acked-by: Jack Wang --- drivers/scsi/pm8001/pm8001_hwi.c | 8 - drivers/scsi/pm8001/pm8001_sas.c | 71 ++-- drivers/scsi/pm8001/pm8001_sas.h | 8 + drivers/scsi/pm8001/pm80xx_hwi.c | 36 4 files changed, 113 insertions(+), 10 deletions(-) diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c index bc4a6f649ec9..db88a8e7ee0e 100644 --- a/drivers/scsi/pm8001/pm8001_hwi.c +++ b/drivers/scsi/pm8001/pm8001_hwi.c @@ -3209,10 +3209,16 @@ int pm8001_mpi_local_phy_ctl(struct pm8001_hba_info *pm8001_ha, void *piomb) PM8001_MSG_DBG(pm8001_ha, pm8001_printk("%x phy execute %x phy op failed!\n", phy_id, phy_op)); - } else + } else { PM8001_MSG_DBG(pm8001_ha, pm8001_printk("%x phy execute %x phy op success!\n", phy_id, phy_op)); + pm8001_ha->phy[phy_id].reset_success = true; + } + if (pm8001_ha->phy[phy_id].enable_completion) { + complete(pm8001_ha->phy[phy_id].enable_completion); + pm8001_ha->phy[phy_id].enable_completion = NULL; + } pm8001_tag_free(pm8001_ha, tag); return 0; } diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c index e80b0542a67f..60d5bec5c45e 100644 --- a/drivers/scsi/pm8001/pm8001_sas.c +++ b/drivers/scsi/pm8001/pm8001_sas.c @@ -1167,13 +1167,16 @@ int pm8001_abort_task(struct sas_task *task) struct scsi_lun lun; struct pm8001_device *pm8001_dev; struct pm8001_tmf_task tmf_task; - int rc = TMF_RESP_FUNC_FAILED; + int rc = TMF_RESP_FUNC_FAILED, ret; + u32 phy_id; + struct sas_task_slow slow_task; if (unlikely(!task || !task->lldd_task || !task->dev)) return TMF_RESP_FUNC_FAILED; dev = task->dev; pm8001_dev = dev->lldd_dev; pm8001_ha = pm8001_find_ha_by_dev(dev); device_id = pm8001_dev->device_id; + 
phy_id = pm8001_dev->attached_phy; rc = pm8001_find_tag(task, &tag); if (rc == 0) { pm8001_printk("no tag for task:%p\n", task); @@ -1184,6 +1187,11 @@ int pm8001_abort_task(struct sas_task *task) spin_unlock_irqrestore(&task->task_state_lock, flags); return TMF_RESP_FUNC_COMPLETE; } + task->task_state_flags |= SAS_TASK_STATE_ABORTED; + if (task->slow_task == NULL) { + init_completion(&slow_task.completion); + task->slow_task = &slow_task; + } spin_unlock_irqrestore(&task->task_state_lock, flags); if (task->task_proto & SAS_PROTOCOL_SSP) { struct scsi_cmnd *cmnd = task->uldd_task; @@ -1195,8 +1203,61 @@ int pm8001_abort_task(struct sas_task *task) pm8001_dev->sas_device, 0, tag); } else if (task->task_proto & SAS_PROTOCOL_SATA || task->task_proto & SAS_PROTOCOL_STP) { - rc = pm8001_exec_internal_task_abort(pm8001_ha, pm8001_dev, - pm8001_dev->sas_device, 0, tag); + if (pm8001_ha->chip_id == chip_8006) { + DECLARE_COMPLETION_ONSTACK(completion_reset); + DECLARE_COMPLETION_ONSTACK(completion); + struct pm8001_phy *phy = pm8001_ha->phy + phy_id; + /* 1. Set Device state as Recovery*/ + pm8001_dev->setds_completion = &completion; + PM8001_CHIP_DISP->set_dev_state_req(pm8001_ha, + pm8001_dev, 0x03); + wait_for_completion(&completion); + /* 2. Send Phy Control Hard Reset */ + reinit_completion(&completion); + phy->reset_success = false; + phy->enable_completion = &completion; + phy->reset_completion = &completion_reset; + ret = PM8001_CHIP_DISP->phy_ctl_req(pm8001_ha, phy_id, + PHY_HARD_RESET); + if (ret) + goto out; + PM8001_MSG_DBG(pm8001_ha, + pm8001_printk("Waiting for local phy ctl\n")); + wait_for_completion(&completion); + if (!phy->reset_success) + goto out; + /* 3. Wait for Port Reset complete / Port reset TMO*/ + PM8001_MSG_DBG(pm8001_ha, +
[PATCH V3 3/9] pm80xx : Different SAS addresses for phys.
Different SAS addresses are assigned for each set of phys. Signed-off-by: Viswas G Acked-by: Jack Wang --- drivers/scsi/pm8001/pm8001_init.c | 13 + drivers/scsi/pm8001/pm80xx_hwi.c | 3 +-- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c index 0e013f76b582..7a697ca68501 100644 --- a/drivers/scsi/pm8001/pm8001_init.c +++ b/drivers/scsi/pm8001/pm8001_init.c @@ -132,7 +132,7 @@ static void pm8001_phy_init(struct pm8001_hba_info *pm8001_ha, int phy_id) sas_phy->oob_mode = OOB_NOT_CONNECTED; sas_phy->linkrate = SAS_LINK_RATE_UNKNOWN; sas_phy->id = phy_id; - sas_phy->sas_addr = &pm8001_ha->sas_addr[0]; + sas_phy->sas_addr = (u8 *)&phy->dev_sas_addr; sas_phy->frame_rcvd = &phy->frame_rcvd[0]; sas_phy->ha = (struct sas_ha_struct *)pm8001_ha->shost->hostdata; sas_phy->lldd_phy = phy; @@ -591,10 +591,12 @@ static void pm8001_post_sas_ha_init(struct Scsi_Host *shost, for (i = 0; i < chip_info->n_phy; i++) { sha->sas_phy[i] = &pm8001_ha->phy[i].sas_phy; sha->sas_port[i] = &pm8001_ha->port[i].sas_port; + sha->sas_phy[i]->sas_addr = + (u8 *)&pm8001_ha->phy[i].dev_sas_addr; } sha->sas_ha_name = DRV_NAME; sha->dev = pm8001_ha->dev; - + sha->strict_wide_ports = 1; sha->lldd_module = THIS_MODULE; sha->sas_addr = &pm8001_ha->sas_addr[0]; sha->num_phys = chip_info->n_phy; @@ -611,6 +613,7 @@ static void pm8001_post_sas_ha_init(struct Scsi_Host *shost, static void pm8001_init_sas_add(struct pm8001_hba_info *pm8001_ha) { u8 i, j; + u8 sas_add[8]; #ifdef PM8001_READ_VPD /* For new SPC controllers WWN is stored in flash vpd * For SPC/SPCve controllers WWN is stored in EEPROM @@ -672,10 +675,12 @@ static void pm8001_init_sas_add(struct pm8001_hba_info *pm8001_ha) pm8001_ha->sas_addr[j] = payload.func_specific[0x804 + i]; } - + memcpy(sas_add, pm8001_ha->sas_addr, SAS_ADDR_SIZE); for (i = 0; i < pm8001_ha->chip->n_phy; i++) { + if (i && ((i % 4) == 0)) + sas_add[7] = sas_add[7] + 4; 
memcpy(&pm8001_ha->phy[i].dev_sas_addr, - pm8001_ha->sas_addr, SAS_ADDR_SIZE); + sas_add, SAS_ADDR_SIZE); PM8001_INIT_DBG(pm8001_ha, pm8001_printk("phy %d sas_addr = %016llx\n", i, pm8001_ha->phy[i].dev_sas_addr)); diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c index 8fb5ddf08cc4..2b26445d1b97 100644 --- a/drivers/scsi/pm8001/pm80xx_hwi.c +++ b/drivers/scsi/pm8001/pm80xx_hwi.c @@ -3041,7 +3041,6 @@ hw_event_phy_down(struct pm8001_hba_info *pm8001_ha, void *piomb) port->port_state = portstate; phy->identify.device_type = 0; phy->phy_attached = 0; - memset(&phy->dev_sas_addr, 0, SAS_ADDR_SIZE); switch (portstate) { case PORT_VALID: break; @@ -4394,7 +4393,7 @@ pm80xx_chip_phy_start_req(struct pm8001_hba_info *pm8001_ha, u8 phy_id) payload.sas_identify.dev_type = SAS_END_DEVICE; payload.sas_identify.initiator_bits = SAS_PROTOCOL_ALL; memcpy(payload.sas_identify.sas_addr, - pm8001_ha->sas_addr, SAS_ADDR_SIZE); + &pm8001_ha->phy[phy_id].dev_sas_addr, SAS_ADDR_SIZE); payload.sas_identify.phy_id = phy_id; ret = pm8001_mpi_build_cmd(pm8001_ha, circularQ, opcode, &payload, 0); return ret; -- 2.12.3
[PATCH V3 0/9] pm80xx updates
This patch set includes some bug fixes and enhancements for the pm80xx driver.

Changes from V2:
- Corrected date.

Changes from V1:
- sas_identify_frame_local structure moved to pm80xx_hwi.h
- sata abort handling patch split to four patches.
- tag allocation for phy control request.
- cleanup in pm8001_abort_task function.
- modified port reset timer value for PM8006 card
- corrected SATA abort handling sequence.

Viswas G (9):
  pm80xx : redefine sas_identify_frame structure
  pm80xx : ILA and inactive firmware version through sysfs
  pm80xx : Different SAS addresses for phys.
  pm80xx : tag allocation for phy control request.
  pm80xx : cleanup in pm8001_abort_task function.
  pm80xx : modified port reset timer value for PM8006 card
  pm80xx : corrected SATA abort handling sequence.
  pm80xx : panic on ncq error cleaning up the read log.
  pm80xx : corrected linkrate value.

 drivers/scsi/pm8001/pm8001_ctl.c  |  54 +
 drivers/scsi/pm8001/pm8001_hwi.c  |  11 +++-
 drivers/scsi/pm8001/pm8001_init.c |  13 +++--
 drivers/scsi/pm8001/pm8001_sas.c  | 118 ++
 drivers/scsi/pm8001/pm8001_sas.h  |  10
 drivers/scsi/pm8001/pm80xx_hwi.c  |  61
 drivers/scsi/pm8001/pm80xx_hwi.h  | 102 +++-
 7 files changed, 313 insertions(+), 56 deletions(-)

-- 2.12.3
[PATCH V3 4/9] pm80xx : tag allocation for phy control request.
tag is taken from the tag pool instead of using the hardcoded tag value(1). Signed-off-by: Deepak Ukey Signed-off-by: Viswas G Acked-by: Jack Wang --- drivers/scsi/pm8001/pm8001_hwi.c | 3 +++ drivers/scsi/pm8001/pm80xx_hwi.c | 10 +++--- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c index 10546faac58c..bc4a6f649ec9 100644 --- a/drivers/scsi/pm8001/pm8001_hwi.c +++ b/drivers/scsi/pm8001/pm8001_hwi.c @@ -3198,11 +3198,13 @@ pm8001_mpi_get_nvmd_resp(struct pm8001_hba_info *pm8001_ha, void *piomb) int pm8001_mpi_local_phy_ctl(struct pm8001_hba_info *pm8001_ha, void *piomb) { + u32 tag; struct local_phy_ctl_resp *pPayload = (struct local_phy_ctl_resp *)(piomb + 4); u32 status = le32_to_cpu(pPayload->status); u32 phy_id = le32_to_cpu(pPayload->phyop_phyid) & ID_BITS; u32 phy_op = le32_to_cpu(pPayload->phyop_phyid) & OP_BITS; + tag = le32_to_cpu(pPayload->tag); if (status != 0) { PM8001_MSG_DBG(pm8001_ha, pm8001_printk("%x phy execute %x phy op failed!\n", @@ -3211,6 +3213,7 @@ int pm8001_mpi_local_phy_ctl(struct pm8001_hba_info *pm8001_ha, void *piomb) PM8001_MSG_DBG(pm8001_ha, pm8001_printk("%x phy execute %x phy op success!\n", phy_id, phy_op)); + pm8001_tag_free(pm8001_ha, tag); return 0; } diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c index 2b26445d1b97..baab8a19c78e 100644 --- a/drivers/scsi/pm8001/pm80xx_hwi.c +++ b/drivers/scsi/pm8001/pm80xx_hwi.c @@ -4500,17 +4500,21 @@ static int pm80xx_chip_reg_dev_req(struct pm8001_hba_info *pm8001_ha, static int pm80xx_chip_phy_ctl_req(struct pm8001_hba_info *pm8001_ha, u32 phyId, u32 phy_op) { + u32 tag; + int rc; struct local_phy_ctl_req payload; struct inbound_queue_table *circularQ; int ret; u32 opc = OPC_INB_LOCAL_PHY_CONTROL; memset(&payload, 0, sizeof(payload)); + rc = pm8001_tag_alloc(pm8001_ha, &tag); + if (rc) + return rc; circularQ = &pm8001_ha->inbnd_q_tbl[0]; - payload.tag = cpu_to_le32(1); + 
payload.tag = cpu_to_le32(tag); payload.phyop_phyid = cpu_to_le32(((phy_op & 0xFF) << 8) | (phyId & 0xFF)); - ret = pm8001_mpi_build_cmd(pm8001_ha, circularQ, opc, &payload, 0); - return ret; + return pm8001_mpi_build_cmd(pm8001_ha, circularQ, opc, &payload, 0); } static u32 pm80xx_chip_is_our_interupt(struct pm8001_hba_info *pm8001_ha) -- 2.12.3
[PATCH V3 1/9] pm80xx : redefine sas_identify_frame structure
sas_identify structure defined by pm80xx doesn't have CRC field. So added a new sas_identify structure without CRC. v2: - Since the structure changes is applicable for only pm80xx, sas_identify_frame_local structure moved to pm80xx_hwi.h. Signed-off-by: Raj Dinesh Signed-off-by: Viswas G Acked-by: Jack Wang --- drivers/scsi/pm8001/pm80xx_hwi.h | 98 +++- 1 file changed, 97 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/pm8001/pm80xx_hwi.h b/drivers/scsi/pm8001/pm80xx_hwi.h index 7a443bad6163..82b8cf581da9 100644 --- a/drivers/scsi/pm8001/pm80xx_hwi.h +++ b/drivers/scsi/pm8001/pm80xx_hwi.h @@ -229,6 +229,102 @@ #define IT_NEXUS_TIMEOUT 0x7D0 #define PORT_RECOVERY_TIMEOUT ((IT_NEXUS_TIMEOUT/100) + 30) +#ifdef __LITTLE_ENDIAN_BITFIELD +struct sas_identify_frame_local { + /* Byte 0 */ + u8 frame_type:4; + u8 dev_type:3; + u8 _un0:1; + + /* Byte 1 */ + u8 _un1; + + /* Byte 2 */ + union { + struct { + u8 _un20:1; + u8 smp_iport:1; + u8 stp_iport:1; + u8 ssp_iport:1; + u8 _un247:4; + }; + u8 initiator_bits; + }; + + /* Byte 3 */ + union { + struct { + u8 _un30:1; + u8 smp_tport:1; + u8 stp_tport:1; + u8 ssp_tport:1; + u8 _un347:4; + }; + u8 target_bits; + }; + + /* Byte 4 - 11 */ + u8 _un4_11[8]; + + /* Byte 12 - 19 */ + u8 sas_addr[SAS_ADDR_SIZE]; + + /* Byte 20 */ + u8 phy_id; + + u8 _un21_27[7]; + +} __packed; + +#elif defined(__BIG_ENDIAN_BITFIELD) +struct sas_identify_frame_local { + /* Byte 0 */ + u8 _un0:1; + u8 dev_type:3; + u8 frame_type:4; + + /* Byte 1 */ + u8 _un1; + + /* Byte 2 */ + union { + struct { + u8 _un247:4; + u8 ssp_iport:1; + u8 stp_iport:1; + u8 smp_iport:1; + u8 _un20:1; + }; + u8 initiator_bits; + }; + + /* Byte 3 */ + union { + struct { + u8 _un347:4; + u8 ssp_tport:1; + u8 stp_tport:1; + u8 smp_tport:1; + u8 _un30:1; + }; + u8 target_bits; + }; + + /* Byte 4 - 11 */ + u8 _un4_11[8]; + + /* Byte 12 - 19 */ + u8 sas_addr[SAS_ADDR_SIZE]; + + /* Byte 20 */ + u8 phy_id; + + u8 _un21_27[7]; +} __packed; +#else +#error "Bitfield order not 
defined!" +#endif + struct mpi_msg_hdr { __le32 header; /* Bits [11:0] - Message operation code */ /* Bits [15:12] - Message Category */ @@ -248,7 +344,7 @@ struct mpi_msg_hdr { struct phy_start_req { __le32 tag; __le32 ase_sh_lm_slr_phyid; - struct sas_identify_frame sas_identify; /* 28 Bytes */ + struct sas_identify_frame_local sas_identify; /* 28 Bytes */ __le32 spasti; u32 reserved[21]; } __attribute__((packed, aligned(4))); -- 2.12.3
[PATCH V3 5/9] pm80xx : cleanup in pm8001_abort_task function.
Signed-off-by: Deepak Ukey Signed-off-by: Viswas G Acked-by: Jack Wang --- drivers/scsi/pm8001/pm8001_sas.c | 49 +++- 1 file changed, 13 insertions(+), 36 deletions(-) diff --git a/drivers/scsi/pm8001/pm8001_sas.c b/drivers/scsi/pm8001/pm8001_sas.c index ce584c31d36e..e80b0542a67f 100644 --- a/drivers/scsi/pm8001/pm8001_sas.c +++ b/drivers/scsi/pm8001/pm8001_sas.c @@ -1159,40 +1159,35 @@ int pm8001_query_task(struct sas_task *task) int pm8001_abort_task(struct sas_task *task) { unsigned long flags; - u32 tag = 0xdeadbeef; + u32 tag; u32 device_id; struct domain_device *dev ; - struct pm8001_hba_info *pm8001_ha = NULL; + struct pm8001_hba_info *pm8001_ha; struct pm8001_ccb_info *ccb; struct scsi_lun lun; struct pm8001_device *pm8001_dev; struct pm8001_tmf_task tmf_task; int rc = TMF_RESP_FUNC_FAILED; if (unlikely(!task || !task->lldd_task || !task->dev)) - return rc; + return TMF_RESP_FUNC_FAILED; + dev = task->dev; + pm8001_dev = dev->lldd_dev; + pm8001_ha = pm8001_find_ha_by_dev(dev); + device_id = pm8001_dev->device_id; + rc = pm8001_find_tag(task, &tag); + if (rc == 0) { + pm8001_printk("no tag for task:%p\n", task); + return TMF_RESP_FUNC_FAILED; + } spin_lock_irqsave(&task->task_state_lock, flags); if (task->task_state_flags & SAS_TASK_STATE_DONE) { spin_unlock_irqrestore(&task->task_state_lock, flags); - rc = TMF_RESP_FUNC_COMPLETE; - goto out; + return TMF_RESP_FUNC_COMPLETE; } spin_unlock_irqrestore(&task->task_state_lock, flags); if (task->task_proto & SAS_PROTOCOL_SSP) { struct scsi_cmnd *cmnd = task->uldd_task; - dev = task->dev; - ccb = task->lldd_task; - pm8001_dev = dev->lldd_dev; - pm8001_ha = pm8001_find_ha_by_dev(dev); int_to_scsilun(cmnd->device->lun, &lun); - rc = pm8001_find_tag(task, &tag); - if (rc == 0) { - printk(KERN_INFO "No such tag in %s\n", __func__); - rc = TMF_RESP_FUNC_FAILED; - return rc; - } - device_id = pm8001_dev->device_id; - PM8001_EH_DBG(pm8001_ha, - pm8001_printk("abort io to deviceid= %d\n", device_id)); tmf_task.tmf = 
TMF_ABORT_TASK; tmf_task.tag_of_task_to_be_managed = tag; rc = pm8001_issue_ssp_tmf(dev, lun.scsi_lun, &tmf_task); @@ -1200,28 +1195,10 @@ int pm8001_abort_task(struct sas_task *task) pm8001_dev->sas_device, 0, tag); } else if (task->task_proto & SAS_PROTOCOL_SATA || task->task_proto & SAS_PROTOCOL_STP) { - dev = task->dev; - pm8001_dev = dev->lldd_dev; - pm8001_ha = pm8001_find_ha_by_dev(dev); - rc = pm8001_find_tag(task, &tag); - if (rc == 0) { - printk(KERN_INFO "No such tag in %s\n", __func__); - rc = TMF_RESP_FUNC_FAILED; - return rc; - } rc = pm8001_exec_internal_task_abort(pm8001_ha, pm8001_dev, pm8001_dev->sas_device, 0, tag); } else if (task->task_proto & SAS_PROTOCOL_SMP) { /* SMP */ - dev = task->dev; - pm8001_dev = dev->lldd_dev; - pm8001_ha = pm8001_find_ha_by_dev(dev); - rc = pm8001_find_tag(task, &tag); - if (rc == 0) { - printk(KERN_INFO "No such tag in %s\n", __func__); - rc = TMF_RESP_FUNC_FAILED; - return rc; - } rc = pm8001_exec_internal_task_abort(pm8001_ha, pm8001_dev, pm8001_dev->sas_device, 0, tag); -- 2.12.3
[PATCH V3 2/9] pm80xx : ILA and inactive firmware version through sysfs
Added support to read ILA version and inactive firmware version from MPI configuration table and export through sysfs. Signed-off-by: Deepak Ukey Signed-off-by: Viswas G Acked-by: Jack Wang --- drivers/scsi/pm8001/pm8001_ctl.c | 54 drivers/scsi/pm8001/pm8001_sas.h | 2 ++ drivers/scsi/pm8001/pm80xx_hwi.c | 5 drivers/scsi/pm8001/pm80xx_hwi.h | 2 ++ 4 files changed, 63 insertions(+) diff --git a/drivers/scsi/pm8001/pm8001_ctl.c b/drivers/scsi/pm8001/pm8001_ctl.c index be8269c8d127..596f3ff965f5 100644 --- a/drivers/scsi/pm8001/pm8001_ctl.c +++ b/drivers/scsi/pm8001/pm8001_ctl.c @@ -98,6 +98,58 @@ static ssize_t pm8001_ctl_fw_version_show(struct device *cdev, } } static DEVICE_ATTR(fw_version, S_IRUGO, pm8001_ctl_fw_version_show, NULL); + +/** + * pm8001_ctl_ila_version_show - ila version + * @cdev: pointer to embedded class device + * @buf: the buffer returned + * + * A sysfs 'read-only' shost attribute. + */ +static ssize_t pm8001_ctl_ila_version_show(struct device *cdev, + struct device_attribute *attr, char *buf) +{ + struct Scsi_Host *shost = class_to_shost(cdev); + struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost); + struct pm8001_hba_info *pm8001_ha = sha->lldd_ha; + + if (pm8001_ha->chip_id != chip_8001) { + return snprintf(buf, PAGE_SIZE, "%02x.%02x.%02x.%02x\n", + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version >> 24), + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version >> 16), + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version >> 8), + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version)); + } + return 0; +} +static DEVICE_ATTR(ila_version, 0444, pm8001_ctl_ila_version_show, NULL); + +/** + * pm8001_ctl_inactive_fw_version_show - Inacative firmware version number + * @cdev: pointer to embedded class device + * @buf: the buffer returned + * + * A sysfs 'read-only' shost attribute. 
+ */ +static ssize_t pm8001_ctl_inactive_fw_version_show(struct device *cdev, + struct device_attribute *attr, char *buf) +{ + struct Scsi_Host *shost = class_to_shost(cdev); + struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost); + struct pm8001_hba_info *pm8001_ha = sha->lldd_ha; + + if (pm8001_ha->chip_id != chip_8001) { + return snprintf(buf, PAGE_SIZE, "%02x.%02x.%02x.%02x\n", + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version >> 24), + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version >> 16), + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version >> 8), + (u8)(pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version)); + } + return 0; +} +static +DEVICE_ATTR(inc_fw_ver, 0444, pm8001_ctl_inactive_fw_version_show, NULL); + /** * pm8001_ctl_max_out_io_show - max outstanding io supported * @cdev: pointer to embedded class device @@ -748,6 +800,8 @@ struct device_attribute *pm8001_host_attrs[] = { &dev_attr_bios_version, &dev_attr_ib_log, &dev_attr_ob_log, + &dev_attr_ila_version, + &dev_attr_inc_fw_ver, NULL, }; diff --git a/drivers/scsi/pm8001/pm8001_sas.h b/drivers/scsi/pm8001/pm8001_sas.h index e81a8fa7ef1a..c75de413e062 100644 --- a/drivers/scsi/pm8001/pm8001_sas.h +++ b/drivers/scsi/pm8001/pm8001_sas.h @@ -404,6 +404,8 @@ union main_cfg_table { u32 port_recovery_timer; u32 interrupt_reassertion_delay; u32 fatal_n_non_fatal_dump; /* 0x28 */ + u32 ila_version; + u32 inc_fw_version; } pm80xx_tbl; }; diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c index eb4fee61df72..8fb5ddf08cc4 100644 --- a/drivers/scsi/pm8001/pm80xx_hwi.c +++ b/drivers/scsi/pm8001/pm80xx_hwi.c @@ -312,6 +312,11 @@ static void read_main_config_table(struct pm8001_hba_info *pm8001_ha) /* read port recover and reset timeout */ pm8001_ha->main_cfg_tbl.pm80xx_tbl.port_recovery_timer = pm8001_mr32(address, MAIN_PORT_RECOVERY_TIMER); + /* read ILA and inactive firmware version */ + pm8001_ha->main_cfg_tbl.pm80xx_tbl.ila_version = + pm8001_mr32(address, 
MAIN_MPI_ILA_RELEASE_TYPE); + pm8001_ha->main_cfg_tbl.pm80xx_tbl.inc_fw_version = + pm8001_mr32(address, MAIN_MPI_INACTIVE_FW_VERSION); } /** diff --git a/drivers/scsi/pm8001/pm80xx_hwi.h b/drivers/scsi/pm8001/pm80xx_hwi.h index 82b8cf581da9..e36c5176f9a9 100644 --- a/drivers/scsi/pm8001/pm80xx_hwi.h +++ b/drivers/scsi/pm8001/pm80xx_hwi.h @@ -1445,6 +1445,8 @@ typedef struct SASProtocolTimerConfig SASProtocolTimerConfig_t; #define MAIN_SAS_PHY_ATTR_TABLE_OFFSET 0x90 /* DWORD 0x24 */ #define MAIN_PORT_RECOVERY_TIMER 0x94 /* DWORD 0x25 */ #define MAIN_INT_REASSERTION_DELAY 0x98
Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset
On Tue, 2017-09-19 at 08:52 -0700, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
> > On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> > > On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
> > > > src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
> > > > +
> > > > + msleep(5000);
> > >
> > > src_writel is a writel, and thus a posted MMIO write. You'll need
> > > to have a read first to make it a reliable timing base.
> >
> > Just for my full understanding - you're saying a readl BEFORE
> > src_writel() or AFTER src_writel() ?
>
> AFTER.

Actually, the whole problem sounds like a posted write. Likely the write that causes the reset doesn't get flushed until the read checking if the reset has succeeded, which might explain the 100% initial failure. Why not throw away that first value if it's a failure and then do your polled wait and timeout on the reset success? We should anyway be waiting some time for a reset to be issued, so even on non-posted write systems we could see this problem intermittently.

James
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Wed, Sep 20, 2017 at 12:49 AM, Bart Van Assche wrote:
> On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> > Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
> > to do that already.
>
> Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core to
> reexamine the software queues and the hctx dispatch list but not the requeue
> list. If a block driver returns BLK_STS_RESOURCE then requests end up on the
> requeue list. Hence the following code in scsi_end_request():

That doesn't need SCHED_RESTART, because it is the requeue path's responsibility to do that; see blk_mq_requeue_work(), which runs the hw queue at the end of that function.

-- Ming Lei
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Wed, 2017-09-20 at 00:04 +0800, Ming Lei wrote:
> Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART
> to do that already.

Sorry but I disagree. If SCHED_RESTART is set that causes the blk-mq core to reexamine the software queues and the hctx dispatch list but not the requeue list. If a block driver returns BLK_STS_RESOURCE then requests end up on the requeue list. Hence the following code in scsi_end_request():

	if (scsi_target(sdev)->single_lun || !list_empty(&sdev->host->starved_list))
		kblockd_schedule_work(&sdev->requeue_work);
	else
		blk_mq_run_hw_queues(q, true);

Bart.
[PATCH] sd: Limit WRITE SAME / WRITE SAME(16) w/UNMAP length for certain devices
From: "Ewan D. Milne" Some devices do not support a WRITE SAME / WRITE SAME(16) with the UNMAP bit set up to the length specified in the MAXIMUM WRITE SAME LENGTH field in the block limits VPD page (or, the field is zero, indicating there is no limit). Limit the length by the MAXIMUM UNMAP LBA COUNT value. Otherwise the command might be rejected. Signed-off-by: Ewan D. Milne --- drivers/scsi/sd.c | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 6549e5c..fa07ac2 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -693,6 +693,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) struct request_queue *q = sdkp->disk->queue; unsigned int logical_block_size = sdkp->device->sector_size; unsigned int max_blocks = 0; + u32 max_ws_blocks = sdkp->max_ws_blocks; q->limits.discard_alignment = sdkp->unmap_alignment * logical_block_size; @@ -701,6 +702,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) sdkp->unmap_granularity * logical_block_size); sdkp->provisioning_mode = mode; + /* +* Some devices limit WRITE SAME / WRITE SAME(16) w/UNMAP +* by MAXIMUM UNMAP LBA COUNT instead of MAXIMUM WRITE SAME LENGTH. +* Use the smaller of the two values to avoid ILLEGAL REQUEST errors. +*/ + max_ws_blocks = min_not_zero(max_ws_blocks, sdkp->max_unmap_blocks); + switch (mode) { case SD_LBP_FULL: @@ -715,17 +723,17 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) break; case SD_LBP_WS16: - max_blocks = min_not_zero(sdkp->max_ws_blocks, + max_blocks = min_not_zero(max_ws_blocks, (u32)SD_MAX_WS16_BLOCKS); break; case SD_LBP_WS10: - max_blocks = min_not_zero(sdkp->max_ws_blocks, + max_blocks = min_not_zero(max_ws_blocks, (u32)SD_MAX_WS10_BLOCKS); break; case SD_LBP_ZERO: - max_blocks = min_not_zero(sdkp->max_ws_blocks, + max_blocks = min_not_zero(max_ws_blocks, (u32)SD_MAX_WS10_BLOCKS); break; } -- 2.1.0
Re: [PATCH] scsi: SSDs can timeout during FS init because of too many unmaps
On Tue, 2017-09-19 at 09:02 -0400, Bill Kuzeja wrote:
> I encountered this issue putting XFS on several brands of SSDs on my
> system. During initialization, I would see a bunch of timeouts on
> WRITE_SAME_16 commands, which would get aborted, reissued, and complete.
> The logs look like this:
>
> kernel: sd 2:0:1:0: attempting task abort! scmd(88086ca0c8c0)
> kernel: sd 1:0:1:0: attempting task abort! scmd(88086c7f2bc0)
> kernel: sd 1:0:1:0: [sds] CDB: Write same(16) 93 08 00 00 00 00 24 04 07 b8 00 7f ff ff 00 00
> kernel: sd 2:0:1:0: [sdt] CDB: Write same(16) 93 08 00 00 00 00 24 04 07 b8 00 7f ff ff 00 00
>
> And so on (there are many, many more of these)...
>
> The interesting thing to note is that these are WS16 commands with the
> unmap bit set (this drive's version of UNMAP) with length 0x7f.
> This is over 8.3 million blocks to be unmapped at once. Since there are
> many concurrent "unmaps", the drive can get overwhelmed and time out.

There is another problem as well. There are some enterprise storage arrays that are rejecting large WRITE SAME(16) commands w/UNMAP set with ILLEGAL REQUEST / INVALID FIELD IN CDB. As far as I can tell, T10 SBC says that the MAXIMUM WRITE SAME LENGTH field in the block limits VPD page should describe the limit for these commands, but the arrays appear to reject anything larger than MAXIMUM UNMAP LBA COUNT, i.e. they are treating WRITE SAME(16) w/UNMAP the same as an UNMAP CDB. I had come up with something similar, see my comment on your patch below.

> Why does this happen? Initializing the device with a filesystem (in my
> experience XFS) creates one huge discard for the SSD. This gets
> broken apart into smaller unmap segments, which get sent down to the
> drive. For the SSDs that I've been working with (lbpws is always set),
> UNMAPs always get translated to WS16 with the unmap bit set.
>
> The size of these segments is determined when the drive is set up
> initially.
> Early on, in the routine sd_read_block_limits, we read the
> Block Limits VPD page (page 0xb0). Among other things, this page gives
> us the drive's MAX UNMAP LBA count as well as the MAX WRITE SAME LENGTH.
> In my experience, every SSD returns zero for MAX WRITE SAME length but
> does have a real value for MAX UNMAP LBA count.
>
> The way the code is structured, because we read in zero for
> MAX WRITE SAME, we assume there is no limit for write same commands.
> This *may* be true, but unmap/discard commands translate into write
> same commands with the unmap bit set. Technically, this makes them
> no longer write same commands.
>
> Currently, the max discard size is actually based off of max_ws_blocks.
> When configuring max discard size later, we default to
> SD_MAX_WS16_BLOCKS (0x7f) because max_ws_blocks is currently
> always zero:
>
> max_blocks = min_not_zero(sdkp->max_ws_blocks,
>                           (u32)SD_MAX_WS16_BLOCKS);
>
> A reasonable fix for this would be to use the MAX UNMAP LBA count
> (stored as max_unmap_blocks) instead of max_ws_blocks in the case where
> we're defaulting to WS16 for unmaps.
>
> After discussing this issue with an SSD vendor's firmware team, they
> confirmed that this was a good way to proceed. That is, it made sense to
> use the max_unmap_blocks count instead of the default SD_MAX_WS16_BLOCKS
> value because 1) max_ws_blocks is always zero, 2) these are really
> unmap commands we're issuing, and 3) the SSD can handle a few unmaps
> the size of SD_MAX_WS16_BLOCKS but not necessarily a barrage of them.
>
> The largest max unmap size I've seen returned from a drive (from
> the Block Limits VPD page) is 0x27 or about 30% of SD_MAX_WS16_BLOCKS.
> Other sizes are much smaller, typically 0x8 or about 6% of the
> previous max value.
>
> I've also done performance testing for this change.
The only impact I've > seen on SSDs is during filesystem initialization time, which would be > expected since that's most likely the only time we'd be doing really large > unmaps. Even so, the impact on FS init is fairly minimal, 10% for some > models of SSDs, others no noticeable difference at all. > > > > Signed-off-by: Bill Kuzeja > --- > drivers/scsi/sd.c | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > index 11c1738..f24c4f2 100644 > --- a/drivers/scsi/sd.c > +++ b/drivers/scsi/sd.c > @@ -715,8 +715,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, > unsigned int mode) > break; > > case SD_LBP_WS16: > - max_blocks = min_not_zero(sdkp->max_ws_blocks, > - (u32)SD_MAX_WS16_BLOCKS); > + /* If no value given, use unmap limit - WS16 default too large > */ > + if (!sdkp->max_ws_blocks) > + max_blocks = min_not_zero(sdkp->max_unmap_blocks, > + (u32)
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19, 2017 at 11:48:23AM -0400, Mike Snitzer wrote:
> On Tue, Sep 19 2017 at 1:43am -0400,
> Ming Lei wrote:
>
> > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> > > On Sun, 2017-09-17 at 20:40 +0800, Ming Lei wrote:
> > > > "if no request has completed before the delay has expired" can't be a
> > > > reason to rerun the queue, because the queue can still be busy.
> > >
> > > That statement of yours shows that there are important aspects of the SCSI
> > > core and dm-mpath driver that you don't understand.
> >
> > Then can you tell me why blk-mq's SCHED_RESTART can't cover
> > the rerun when there are in-flight requests? What is the case
> > in which dm-rq can return BUSY and there aren't any in-flight
> > requests meantime?
> >
> > Also you are the author of adding 'blk_mq_delay_run_hw_queue(
> > hctx, 100/*ms*/)' in dm-rq, but you never explain in commit
> > 6077c2d706097c0 (dm rq: Avoid that request processing stalls
> > sporadically) what the root cause is for your request stall
> > and why this patch fixes your issue. You don't even explain
> > why the delay is 100 ms.
> >
> > So it is a workaround, isn't it?
> >
> > My concern is that it isn't good to add blk_mq_delay_run_hw_queue(hctx, 100/*ms*/)
> > in the hot path since it should be covered by SCHED_RESTART
> > if there are in-flight requests.
>
> This thread proves that it is definitely brittle to be relying on fixed
> delays like this:
> https://patchwork.kernel.org/patch/9703249/

I can't agree more, because no one mentioned the root cause; maybe the request stall has been fixed recently. Keeping the workaround in the hot path is a bit annoying.

-- Ming
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19, 2017 at 11:56:03AM -0400, Mike Snitzer wrote:
> On Tue, Sep 19 2017 at 11:36am -0400,
> Bart Van Assche wrote:
>
> > On Tue, 2017-09-19 at 13:43 +0800, Ming Lei wrote:
> > > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote:
> > > > If you are still looking at removing the blk_mq_delay_run_hw_queue() calls
> > > > then I think you are looking in the wrong direction. What kind of problem
> > > > are you trying to solve? Is it perhaps that there can be a delay between
> > >
> > > Actually the improvement on dm-rq IO schedule (the patch 2 ~ 5) doesn't
> > > need this patch.
> >
> > The approach of this patch series looks wrong to me and patch 5/5 is an ugly
> > hack in my opinion. That's why I asked you to drop the entire patch series and
> > to test whether inserting a queue run call into the dm-mpath end_io callback
> > yields a similar performance improvement to the patches you posted. Please do
> > not expect me to spend more time on this patch series if you do not come up
> > with measurement results for the proposed alternative.
>
> Bart, asserting that Ming's work is a hack doesn't help your apparent
> goal of delegitimizing Ming's work.
>
> Nor does it take away from the fact that your indecision on appropriate
> timeouts (let alone ability to defend and/or explain them) validates
> Ming calling them into question (which you are now dodging regularly).
>
> But please don't take this feedback and shut down. Instead please work
> together more constructively. This doesn't need to be adversarial! I
> am at a loss for why there is such animosity from your end, Bart.
>
> Please dial it back. It is just a distraction that fosters needless
> in-fighting.
>
> Believe it or not: Ming is just trying to improve the code because he
> has a testbed that is showing fairly abysmal performance with dm-mq
> multipath (on lpfc with scsi-mq).
>
> Ming, if you can: please see if what Bart has proposed (instead: run
> queue at end_io) helps. Though I don't yet see why that should be
> needed.

Run queue at end_io is definitely wrong, because blk-mq has SCHED_RESTART to do that already.

The dm-mpath performance issue has nothing to do with that; actually the issue is very similar to my improvement on SCSI-MQ, because right now dm_dispatch_clone_request() doesn't know if the underlying queue is busy or not, and dm-rq/mpath keeps trying to dispatch more requests to the underlying queue even when it is busy, so IO merge can't be done easily on dm-rq/mpath.

I am thinking of another way to improve this; since some SCSI devices' queue_depth differs from .cmd_per_lun, I will post a patchset soon.

-- Ming
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19 2017 at 11:52am -0400, Bart Van Assche wrote: > On Tue, 2017-09-19 at 11:48 -0400, Mike Snitzer wrote: > > This thread proves that it is definitely brittle to be relying on fixed > > delays like this: > > https://patchwork.kernel.org/patch/9703249/ > > Hello Mike, > > Sorry but I think that's a misinterpretation of my patch. I came up with that > patch before I had found and fixed the root cause of the high system load. > These fixes are upstream (kernel v4.13) which means that that patch is now > obsolete. OK, fair point. Though fixed magic values for delay aren't going to stand the test of time.
Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset
On 09/19/2017 12:52 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
>> On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
>>> On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
>>>> src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
>>>> +
>>>> + msleep(5000);
>>>
>>> src_writel is a writel, and thus a posted MMIO write. You'll need
>>> to have a read first to make it a reliable timing base.
>>
>> Just for my full understanding - you're saying a readl BEFORE
>> src_writel() or AFTER src_writel() ?
>
> AFTER.

Thanks!

>> I could add a read to some dummy register, but notice it was a sequence
>> of readl's on aac_is_ctrl_up_and_running() that caused the failure of
>> reset...
>
> Oh, ouch. I guess in that case we'll need to do the writel and pray,
> but that would need a big comment explaining what's going on there.

Heheh you're right! But do you remember that quirk added on nvme? Basically, it was a similar scenario in which some adapters weren't happy to poll a register in nvme_wait_ready() right after we wrote to the Controller Config register when disabling a controller.

Seems the same thing here: the controller is not able to handle a read right after some internal procedure (reset) is initiated.

If you have a suggestion to improve the commit msg, let me know :)
Thanks!
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19 2017 at 11:36am -0400, Bart Van Assche wrote: > On Tue, 2017-09-19 at 13:43 +0800, Ming Lei wrote: > > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote: > > > If you are still looking at removing the blk_mq_delay_run_hw_queue() calls > > > then I think you are looking in the wrong direction. What kind of problem > > > are you trying to solve? Is it perhaps that there can be a delay between > > > > Actually the improvement on dm-rq IO schedule(the patch 2 ~ 5) doesn't > > need this patch. > > The approach of this patch series looks wrong to me and patch 5/5 is an ugly > hack in my opinion. That's why I asked you to drop the entire patch series and > to test whether inserting a queue run call into the dm-mpath end_io callback > yields a similar performance improvement to the patches you posted. Please do > not expect me to spend more time on this patch series if you do not come up > with measurement results for the proposed alternative. Bart, asserting that Ming's work is a hack doesn't help your apparent goal of delegitimizing Ming's work. Nor does it take away from the fact that your indecision on appropriate timeouts (let alone ability to defend and/or explain them) validates Ming calling them into question (which you are now dodging regularly). But please don't take this feedback and shut down. Instead please work together more constructively. This doesn't need to be adversarial! I am at a loss for why there is such animosity from your end, Bart. Please dial it back. It is just a distraction that fosters needless in-fighting. Believe it or not: Ming is just trying to improve the code because he has a testbed that is showing fairly abysmal performance with dm-mq multipath (on lpfc with scsi-mq). Ming, if you can: please see if what Bart has proposed (instead: run queue at end_io) helps. Though I don't yet see why that should be needed. Mike
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, 2017-09-19 at 11:48 -0400, Mike Snitzer wrote: > This thread proves that it is definitely brittle to be relying on fixed > delays like this: > https://patchwork.kernel.org/patch/9703249/ Hello Mike, Sorry but I think that's a misinterpretation of my patch. I came up with that patch before I had found and fixed the root cause of the high system load. These fixes are upstream (kernel v4.13) which means that that patch is now obsolete. Bart.
Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset
On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote: > On 09/19/2017 12:37 PM, Christoph Hellwig wrote: > > On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote: > >>src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK); > >> + > >> + msleep(5000); > > > > src_writel is a writel, and thus a posted MMIO write. You'll need > > to have to a read first to make it a reliable timing base. > > > > Just for my full understanding - you're saying a readl BEFORE > src_writel() or AFTER src_writel() ? AFTER. > I could add a read to some dummy register, but notice it was a sequence > of readl's on aac_is_ctrl_up_and_running() that caused the failure of > reset... Oh, ouch. I guess in that case we'll need to do the writel and pray, but that would need a big comment explaining what's going on there.
Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset
On 09/19/2017 12:37 PM, Christoph Hellwig wrote: > On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote: >> src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK); >> + >> +msleep(5000); > > src_writel is a writel, and thus a posted MMIO write. You'll need > to have to a read first to make it a reliable timing base. > Just for my full understanding - you're saying a readl BEFORE src_writel() or AFTER src_writel() ? I could add a read to some dummy register, but notice it was a sequence of readl's on aac_is_ctrl_up_and_running() that caused the failure of reset... Thanks
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, Sep 19 2017 at 1:43am -0400, Ming Lei wrote: > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote: > > On Sun, 2017-09-17 at 20:40 +0800, Ming Lei wrote: > > > "if no request has completed before the delay has expired" can't be a > > > reason to rerun the queue, because the queue can still be busy. > > > > That statement of yours shows that there are important aspects of the SCSI > > core and dm-mpath driver that you don't understand. > > Then can you tell me why blk-mq's SCHED_RESTART can't cover > the rerun when there are in-flight requests? What is the case > in which dm-rq can return BUSY and there aren't any in-flight > requests meantime? > > Also you are the author of adding 'blk_mq_delay_run_hw_queue( > hctx, 100/*ms*/)' in dm-rq, you never explain in commit > 6077c2d706097c0 (dm rq: Avoid that request processing stalls > sporadically) what the root cause is for your request stall > and why this patch fixes your issue. You don't even explain > why the delay is 100ms. > > So it is a workaround, isn't it? > > My concern is that it isn't good to add blk_mq_delay_run_hw_queue(hctx, > 100/*ms*/) > in the hot path since it should have been covered by SCHED_RESTART > if there are in-flight requests. This thread proves that it is definitely brittle to be relying on fixed delays like this: https://patchwork.kernel.org/patch/9703249/ Mike
Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset
On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote: > src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK); > + > + msleep(5000); src_writel is a writel, and thus a posted MMIO write. You'll need to have to a read first to make it a reliable timing base.
Re: [PATCH 1/5] block: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE
On Tue, 2017-09-19 at 13:43 +0800, Ming Lei wrote: > On Mon, Sep 18, 2017 at 03:18:16PM +, Bart Van Assche wrote: > > If you are still looking at removing the blk_mq_delay_run_hw_queue() calls > > then I think you are looking in the wrong direction. What kind of problem > > are you trying to solve? Is it perhaps that there can be a delay between > > Actually the improvement on dm-rq IO schedule(the patch 2 ~ 5) doesn't > need this patch. The approach of this patch series looks wrong to me and patch 5/5 is an ugly hack in my opinion. That's why I asked you to drop the entire patch series and to test whether inserting a queue run call into the dm-mpath end_io callback yields a similar performance improvement to the patches you posted. Please do not expect me to spend more time on this patch series if you do not come up with measurement results for the proposed alternative. Bart.
[PATCH] scsi: aacraid: Add a small delay after IOP reset
Commit 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset status") changed the way the driver checks if a reset succeeded. Now, after an IOP reset, aacraid immediately starts polling a register to verify the reset is complete. This behavior causes regressions on the reset path in PowerPC (at least). Since the delay after the IOP reset was removed by the aforementioned patch, the fact that the driver starts reading a register immediately after the reset is issued (by writing to another register) "corrupts" the reset procedure, which ends up failing all the time. The issue highly impacted kdump on PowerPC, since on the kdump path we proactively issue a reset to the adapter (through the reset_devices kernel parameter). This patch (re-)adds a delay right after the IOP reset is issued. Empirically we measured that 3 seconds is enough, but for safety reasons we delay for 5s (and since it was 30s before, 5s is still a small amount). For reference, without this patch we observe the following messages on kdump kernel boot process: [ 76.294] aacraid 0003:01:00.0: IOP reset failed [ 76.294] aacraid 0003:01:00.0: ARC Reset attempt failed [ 86.524] aacraid 0003:01:00.0: adapter kernel panic'd ff. [ 86.524] aacraid 0003:01:00.0: Controller reset type is 3 [ 86.524] aacraid 0003:01:00.0: Issuing IOP reset [146.534] aacraid 0003:01:00.0: IOP reset failed [146.534] aacraid 0003:01:00.0: ARC Reset attempt failed Fixes: 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset status") Cc: sta...@vger.kernel.org # v4.13+ Signed-off-by: Guilherme G. 
Piccoli --- drivers/scsi/aacraid/src.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/scsi/aacraid/src.c b/drivers/scsi/aacraid/src.c index 48c2b2b34b72..0c9361c87ec8 100644 --- a/drivers/scsi/aacraid/src.c +++ b/drivers/scsi/aacraid/src.c @@ -740,6 +740,8 @@ static void aac_send_iop_reset(struct aac_dev *dev) aac_set_intx_mode(dev); src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK); + + msleep(5000); } static void aac_send_hardware_soft_reset(struct aac_dev *dev) -- 2.14.1
Re: [PATCH V6 2/3] dma-mapping: Rework dma_get_cache_alignment() function
> mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * > mdev->limits.mtt_seg_size, > -dma_get_cache_alignment()) / > mdev->limits.mtt_seg_size; > +dma_get_cache_alignment(NULL)) / > mdev->limits.mtt_seg_size; > > mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, > init_hca->mtt_base, Please pass the actually relevant struct device for each call.
Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?
Hello linux-block, I wrote some tests recently to test patches against bsg.c and bsg-lib.c, and while writing those I noticed something strange: When you use the write() and read() calls on multiple file-descriptors for a single bsg-device (FC or SCSI), it is possible that you get cross-talk between those different file-descriptors. E.g.: You have two independent processes open a FD on /dev/bsg/fc_host0 and you send commands via write() in both processes; when they both do read() later on - to read the responses for the commands they sent before -, it is possible that process A gets the response (Sense-Data, FC-Protocol-Data, ...) for a command that process B wrote and vice versa. I noticed this because in my tests I 'tag' each command I send with write() via a value I write into the usr_ptr field of sg_io_v4. When I later use read() to receive responses I check for this tag in a hash-table and so can look up the original command. When I used this and spawned multiple processes with the same target bsg-device, I got responses for commands I don't have tags for in my hash-table, so they were written by another process. This never happened when I only spawned one test-process. This seems awfully counter-productive.. so much that I don't see any point in allowing users to open more than one FD per bsg-device, because that would inevitably lead to very hard to debug 'bugs'. And I wonder if this is really by design or some regression that happened over time. I looked at the code in bsg.c and yes, it seems this is what is coded right now. We have a hash-table which manages all 'struct bsg_device's, which are indexed by device-minors, so one hash-table entry per device-node. So eh, long talk short question, is that intended? Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. 
AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: qla2xxx: MSI-X: Unsupported ISP 2432 SSVID/SSDID (0x103C,0x7041)
> On 08/19/2017 10:41 PM, Meelis Roos wrote: > > Hello, I just tried Linux with the latest kernel (4.13-rc5+git) on a HP > > DL360 G6 with HP branded ISP2432 HBA. The driver mentions unsupported > > model of the card: > > > > [3.868589] scsi host1: qla2xxx > > [3.871696] qla2xxx [:07:00.0]-00fb:1: QLogic HPAE312A - PCI-Express > > Dual Port 4Gb Fibre Channel HBA. > > [3.871782] qla2xxx [:07:00.0]-00fc:1: ISP2432: PCIe (2.5GT/s x4) @ > > :07:00.0 hdma+ host#=1 fw=8.03.00 (9496). > > [3.871973] qla2xxx [:07:00.1]-001d: : Found an ISP2432 irq 49 > > iobase 0xc900062c5000. > > [3.872568] qla2xxx [:07:00.1]-0034:4: MSI-X: Unsupported ISP 2432 > > SSVID/SSDID (0x103C,0x7041). > > > > Is there some information I can provide to include this card in fully > > supported list? > > > Weelll ... that statement indicates that MSI-X is unsupported, not the > card itself. Yes, something seems to be bad about MSI-X. Other hardware is using MSI, QLA is using IOAPIC. cut -c 1-10,185- < /proc/interrupts 0: IO-APIC2-edge timer 1: IO-APIC1-edge i8042 8: IO-APIC8-edge rtc0 9: IO-APIC9-fasteoi acpi 12: IO-APIC 12-edge i8042 17: IO-APIC 17-fasteoi ata_piix 20: IO-APIC 20-fasteoi ehci_hcd:usb1, uhci_hcd:usb2 21: IO-APIC 21-fasteoi ipmi_si 22: IO-APIC 22-fasteoi hpilo, uhci_hcd:usb4, uhci_hcd:usb6 23: IO-APIC 23-fasteoi uhci_hcd:usb3, uhci_hcd:usb5, radeon 24: DMAR-MSI0-edge dmar0 26: PCI-MSI 1572864-edge hpsa0-msix0 27: PCI-MSI 1572865-edge hpsa0-msix1 28: PCI-MSI 1572866-edge hpsa0-msix2 29: PCI-MSI 1572867-edge hpsa0-msix3 30: PCI-MSI 1572868-edge hpsa0-msix4 31: PCI-MSI 1572869-edge hpsa0-msix5 32: PCI-MSI 1572870-edge hpsa0-msix6 33: PCI-MSI 1572871-edge hpsa0-msix7 34: PCI-MSI 1572872-edge hpsa0-msix8 35: PCI-MSI 1572873-edge hpsa0-msix9 36: PCI-MSI 1572874-edge hpsa0-msix10 37: PCI-MSI 1572875-edge hpsa0-msix11 38: PCI-MSI 1572876-edge hpsa0-msix12 39: PCI-MSI 1572877-edge hpsa0-msix13 40: PCI-MSI 1572878-edge hpsa0-msix14 41: PCI-MSI 1572879-edge hpsa0-msix15 45: PCI-MSI 2097152-edge 
ens1f0 46: IO-APIC0-fasteoi qla2xxx 48: PCI-MSI 2099200-edge ens1f1 49: IO-APIC 10-fasteoi qla2xxx 50: PCI-MSI 1048576-edge enp2s0f0-0 51: PCI-MSI 1048577-edge enp2s0f0-1 52: PCI-MSI 1048578-edge enp2s0f0-2 53: PCI-MSI 1048579-edge enp2s0f0-3 54: PCI-MSI 1048580-edge enp2s0f0-4 55: PCI-MSI 1048581-edge enp2s0f0-5 56: PCI-MSI 1048582-edge enp2s0f0-6 57: PCI-MSI 1048583-edge enp2s0f0-7 59: PCI-MSI 1050624-edge enp2s0f1-0 60: PCI-MSI 1050625-edge enp2s0f1-1 61: PCI-MSI 1050626-edge enp2s0f1-2 62: PCI-MSI 1050627-edge enp2s0f1-3 63: PCI-MSI 1050628-edge enp2s0f1-4 64: PCI-MSI 1050629-edge enp2s0f1-5 65: PCI-MSI 1050630-edge enp2s0f1-6 66: PCI-MSI 1050631-edge enp2s0f1-7 NMI: Non-maskable interrupts LOC: Local timer interrupts SPU: Spurious interrupts PMI: Performance monitoring interrupts IWI: IRQ work interrupts RTR: APIC ICR read retries RES: Rescheduling interrupts CAL: Function call interrupts TLB: TLB shootdowns TRM: Thermal event interrupts THR: Threshold APIC interrupts MCE: Machine check exceptions MCP: Machine check polls ERR: MIS: PIN: Posted-interrupt notification event NPI: Nested posted-interrupt event PIW: Posted-interrupt wakeup event Full dmesg: [0.00] Linux version 4.14.0-rc1-00013-gebb2c2437d80 (mroos@dl360g6) (gcc version 7.2.0 (Debian 7.2.0-1)) #17 SMP Tue Sep 19 15:01:51 EEST 2017 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc1-00013-gebb2c2437d80 root=/dev/sda1 ro [0.00] x86/fpu: x87 FPU will use FXSAVE [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009f3ff] usable [0.00] BIOS-e820: [mem 0x0009f400-0x0009] reserved [0.00] BIOS-e820: [mem 0x000f-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xd761efff] usable [0.00] BIOS-e820: [mem 0xd761f000-0xd762bfff] ACPI data [0.00] BIOS-e820: [mem 0xd762c000-0xd762cfff] usable [0.00] BIOS-e820: [mem 0xd762d000-0xd7ff] reserved [0.00] BIOS-e820: [mem 0xe000-0xe3ff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfee0] reserved [0.00] BIOS-e820: [mem 0xff80-0x] res
[PATCH] scsi: SSDs can timeout during FS init because of too many unmaps
I encountered this issue putting XFS on several brands of SSDs on my system. During initialization, I would see a bunch of timeouts on WRITE_SAME_16 commands, which would get aborted, reissued, and complete. The logs look like this: kernel: sd 2:0:1:0: attempting task abort! scmd(88086ca0c8c0) kernel: sd 1:0:1:0: attempting task abort! scmd(88086c7f2bc0) kernel: sd 1:0:1:0: [sds] CDB: Write same(16) 93 08 00 00 00 00 24 04 07 b8 00 7f ff ff 00 00 kernel: sd 2:0:1:0: [sdt] CDB: Write same(16) 93 08 00 00 00 00 24 04 07 b8 00 7f ff ff 00 00 And so on (there are many, many more of these)... The interesting thing to note is that these are WS16 commands with the unmap bit set (this drive's version of UNMAP) with length 0x7fffff. This is over 8.3 million blocks to be unmapped at once. Since there are many concurrent "unmaps", the drive can get overwhelmed and time out. Why does this happen? Initializing the device with a filesystem (in my experience XFS) creates one huge discard for the SSD. This gets broken apart into smaller unmap segments, which get sent down to the drive. For the SSDs that I've been working with (lbpws is always set), UNMAPs always get translated to WS16 with the unmap bit set. The size of these segments is determined when the drive is set up initially. Early on, in the routine sd_read_block_limits, we read the Block Limits VPD page (page 0xb0). Among other things, this page gives us the drive's MAX UNMAP LBA count as well as the MAX WRITE SAME LENGTH. In my experience, every SSD returns zero for MAX WRITE SAME length but does have a real value for MAX_UNMAP LBA count. The way the code is structured, because we read in zero for MAX WRITE SAME, we assume there is no limit for write same commands. This *may* be true, but unmap/discard commands translate into write same commands with the unmap bit set. Technically, this makes them no longer write same commands. Currently, the max discard size is actually based off of max_ws_blocks. 
When configuring max discard size later, we default to SD_MAX_WS16_BLOCKS (0x7fffff) because max_ws_blocks is currently always zero: max_blocks = min_not_zero(sdkp->max_ws_blocks, (u32)SD_MAX_WS16_BLOCKS); A reasonable fix for this would be to use the MAX UNMAP LBA count (stored as max_unmap_blocks) instead of max_ws_blocks in the case where we're defaulting to WS16 for unmaps. After discussing this issue with an SSD vendor's firmware team, they confirmed that this was a good way to proceed. That is, it made sense to use the max_unmap_blocks count instead of the default SD_MAX_WS16_BLOCKS value because 1) max_ws_blocks is always zero, 2) these are really unmap commands we're issuing, and 3) the SSD can handle a few unmaps the size of SD_MAX_WS16_BLOCKS but not necessarily a barrage of them. The largest max unmap size I've seen returned from a drive (from the Block Limits VPD page) is 0x27ffff, or about 30% of SD_MAX_WS16_BLOCKS. Other sizes are much smaller, typically 0x80000, or about 6% of the previous max value. I've also done performance testing for this change. The only impact I've seen on SSDs is during filesystem initialization time, which would be expected since that's most likely the only time we'd be doing really large unmaps. Even so, the impact on FS init is fairly minimal, 10% for some models of SSDs, others no noticeable difference at all. 
Signed-off-by: Bill Kuzeja --- drivers/scsi/sd.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 11c1738..f24c4f2 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -715,8 +715,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) break; case SD_LBP_WS16: - max_blocks = min_not_zero(sdkp->max_ws_blocks, - (u32)SD_MAX_WS16_BLOCKS); + /* If no value given, use unmap limit - WS16 default too large */ + if (!sdkp->max_ws_blocks) + max_blocks = min_not_zero(sdkp->max_unmap_blocks, + (u32)SD_MAX_WS16_BLOCKS); + else + max_blocks = min_not_zero(sdkp->max_ws_blocks, + (u32)SD_MAX_WS16_BLOCKS); break; case SD_LBP_WS10: -- 1.8.3.1
Re: [PATCH] fcoe-utils: Fix get_ctlr_num() for large ctlr_* indices
On 09/18/2017 04:35 PM, Andrey Grafin wrote: > Each creation of a FCoE device increases counter which is used as a suffix > in a FCoE device name in sysfs (i.e. /sys/bus/fcoe/devices/ctlr_1). > Once this counter reaches the value of two digits (10 and larger), > get_ctlr_num() stopped working properly and a command like `fcoeadm -i` > which depends on get_ctlr_num() call doesn't show anything. > This patch solves this problem. > > Signed-off-by: Andrey Grafin > --- > lib/sysfs_hba.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/lib/sysfs_hba.c b/lib/sysfs_hba.c > index 5cb7fd3fa5d..786215440ba 100644 > --- a/lib/sysfs_hba.c > +++ b/lib/sysfs_hba.c > @@ -606,7 +606,7 @@ static int get_ctlr_num(const char *netdev) > if (!ctlr) > continue; > > - ctlr_num = atoi(&ctlr[strlen(ctlr) - 1]); > + ctlr_num = atoi(&ctlr[sizeof("ctlr_") - 1]); > break; > } > > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
[PATCH V6 3/3] scsi: Align block queue to dma_get_cache_alignment()
In non-coherent DMA mode, the kernel uses cache flushing operations to maintain I/O coherency, so scsi's block queue should be aligned to ARCH_DMA_MINALIGN. Otherwise, it will cause data corruption, at least on MIPS: Step 1, dma_map_single Step 2, cache_invalidate (no writeback) Step 3, dma_from_device Step 4, dma_unmap_single If a DMA buffer and a kernel structure share the same cache line, and if the kernel structure has dirty data, cache_invalidate (no writeback) will cause data loss. Cc: sta...@vger.kernel.org Signed-off-by: Huacai Chen --- drivers/scsi/scsi_lib.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 9cf6a80..19abc2e 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -2132,11 +2132,11 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q) q->limits.cluster = 0; /* -* set a reasonable default alignment on word boundaries: the -* host and device may alter it using +* set a reasonable default alignment on word/cacheline boundaries: +* the host and device may alter it using * blk_queue_update_dma_alignment() later. */ - blk_queue_dma_alignment(q, 0x03); + blk_queue_dma_alignment(q, max(4, dma_get_cache_alignment(dev)) - 1); } EXPORT_SYMBOL_GPL(__scsi_init_queue); -- 2.7.0
[PATCH V6 2/3] dma-mapping: Rework dma_get_cache_alignment() function
Make dma_get_cache_alignment() to accept a 'dev' argument. As a result, it can return different alignments due to different devices' I/O cache coherency. For compatibility, make all existing callers pass a NULL dev argument. Cc: sta...@vger.kernel.org Signed-off-by: Huacai Chen --- drivers/infiniband/hw/mthca/mthca_main.c | 2 +- drivers/media/v4l2-core/videobuf2-dma-contig.c | 2 +- drivers/net/ethernet/broadcom/b44.c| 2 +- drivers/net/ethernet/ibm/emac/core.h | 2 +- drivers/net/ethernet/mellanox/mlx4/main.c | 2 +- drivers/spi/spi-qup.c | 4 ++-- drivers/tty/serial/mpsc.c | 16 drivers/tty/serial/samsung.c | 14 +++--- include/linux/dma-mapping.h| 14 +- 9 files changed, 31 insertions(+), 27 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index e36a9bc..cac5fac 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -416,7 +416,7 @@ static int mthca_init_icm(struct mthca_dev *mdev, /* CPU writes to non-reserved MTTs, while HCA might DMA to reserved mtts */ mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * mdev->limits.mtt_seg_size, - dma_get_cache_alignment()) / mdev->limits.mtt_seg_size; + dma_get_cache_alignment(NULL)) / mdev->limits.mtt_seg_size; mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base, mdev->limits.mtt_seg_size, diff --git a/drivers/media/v4l2-core/videobuf2-dma-contig.c b/drivers/media/v4l2-core/videobuf2-dma-contig.c index 9f389f3..7f54739 100644 --- a/drivers/media/v4l2-core/videobuf2-dma-contig.c +++ b/drivers/media/v4l2-core/videobuf2-dma-contig.c @@ -484,7 +484,7 @@ static void *vb2_dc_get_userptr(struct device *dev, unsigned long vaddr, int ret = 0; struct sg_table *sgt; unsigned long contig_size; - unsigned long dma_align = dma_get_cache_alignment(); + unsigned long dma_align = dma_get_cache_alignment(NULL); /* Only cache aligned DMA transfers are reliable */ if (!IS_ALIGNED(vaddr | size, dma_align)) { 
diff --git a/drivers/net/ethernet/broadcom/b44.c b/drivers/net/ethernet/broadcom/b44.c index a1125d1..291d6af 100644 --- a/drivers/net/ethernet/broadcom/b44.c +++ b/drivers/net/ethernet/broadcom/b44.c @@ -2587,7 +2587,7 @@ static inline void b44_pci_exit(void) static int __init b44_init(void) { - unsigned int dma_desc_align_size = dma_get_cache_alignment(); + unsigned int dma_desc_align_size = dma_get_cache_alignment(NULL); int err; /* Setup paramaters for syncing RX/TX DMA descriptors */ diff --git a/drivers/net/ethernet/ibm/emac/core.h b/drivers/net/ethernet/ibm/emac/core.h index 369de2c..236bf37 100644 --- a/drivers/net/ethernet/ibm/emac/core.h +++ b/drivers/net/ethernet/ibm/emac/core.h @@ -68,7 +68,7 @@ static inline int emac_rx_size(int mtu) return mal_rx_size(ETH_DATA_LEN + EMAC_MTU_OVERHEAD); } -#define EMAC_DMA_ALIGN(x) ALIGN((x), dma_get_cache_alignment()) +#define EMAC_DMA_ALIGN(x) ALIGN((x), dma_get_cache_alignment(NULL)) #define EMAC_RX_SKB_HEADROOM \ EMAC_DMA_ALIGN(CONFIG_IBM_EMAC_RX_SKB_HEADROOM) diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index e61c99e..56b1449 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -1660,7 +1660,7 @@ static int mlx4_init_icm(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap, */ dev->caps.reserved_mtts = ALIGN(dev->caps.reserved_mtts * dev->caps.mtt_entry_sz, - dma_get_cache_alignment()) / dev->caps.mtt_entry_sz; + dma_get_cache_alignment(NULL)) / dev->caps.mtt_entry_sz; err = mlx4_init_icm_table(dev, &priv->mr_table.mtt_table, init_hca->mtt_base, diff --git a/drivers/spi/spi-qup.c b/drivers/spi/spi-qup.c index 974a8ce..0c698c3 100644 --- a/drivers/spi/spi-qup.c +++ b/drivers/spi/spi-qup.c @@ -862,7 +862,7 @@ static bool spi_qup_can_dma(struct spi_master *master, struct spi_device *spi, struct spi_transfer *xfer) { struct spi_qup *qup = spi_master_get_devdata(master); - size_t dma_align = 
dma_get_cache_alignment(); + size_t dma_align = dma_get_cache_alignment(NULL); int n_words; if (xfer->rx_buf) { @@ -1038,7 +1038,7 @@ static int spi_qup_probe(struct platform_device *pdev) master->transfer_one = spi_qup_transfer_one; master->dev.of_node = pdev->dev.of_node; master->auto_runtime_pm = true; -
[PATCH V6 1/3] dma-mapping: Introduce device_is_coherent() as a helper
We will use device_is_coherent() as a helper function, which will be used in the next patch. There is a MIPS-specific plat_device_is_coherent(), but we need a more generic solution, so add and use a new function pointer in dma_map_ops. Cc: sta...@vger.kernel.org Signed-off-by: Huacai Chen --- arch/mips/cavium-octeon/dma-octeon.c | 3 ++- arch/mips/include/asm/mach-generic/dma-coherence.h | 6 +++--- arch/mips/loongson64/common/dma-swiotlb.c | 1 + arch/mips/mm/dma-default.c | 3 ++- arch/mips/netlogic/common/nlm-dma.c| 3 ++- include/linux/dma-mapping.h| 10 ++ 6 files changed, 20 insertions(+), 6 deletions(-) diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c index c64bd87..cd1a133 100644 --- a/arch/mips/cavium-octeon/dma-octeon.c +++ b/arch/mips/cavium-octeon/dma-octeon.c @@ -324,7 +324,8 @@ static struct octeon_dma_map_ops _octeon_pci_dma_map_ops = { .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, .sync_sg_for_device = octeon_dma_sync_sg_for_device, .mapping_error = swiotlb_dma_mapping_error, - .dma_supported = swiotlb_dma_supported + .dma_supported = swiotlb_dma_supported, + .device_is_coherent = plat_device_is_coherent }, }; diff --git a/arch/mips/loongson64/common/dma-swiotlb.c b/arch/mips/loongson64/common/dma-swiotlb.c index 34486c1..c758d9b 100644 --- a/arch/mips/loongson64/common/dma-swiotlb.c +++ b/arch/mips/loongson64/common/dma-swiotlb.c @@ -119,6 +119,7 @@ static const struct dma_map_ops loongson_dma_map_ops = { .sync_sg_for_device = loongson_dma_sync_sg_for_device, .mapping_error = swiotlb_dma_mapping_error, .dma_supported = loongson_dma_supported, + .device_is_coherent = plat_device_is_coherent }; void __init plat_swiotlb_setup(void) diff --git a/arch/mips/mm/dma-default.c b/arch/mips/mm/dma-default.c index c01bd20..6e18301 100644 --- a/arch/mips/mm/dma-default.c +++ b/arch/mips/mm/dma-default.c @@ -407,7 +407,8 @@ static const struct dma_map_ops mips_default_dma_map_ops = { .sync_sg_for_cpu = mips_dma_sync_sg_for_cpu, 
.sync_sg_for_device = mips_dma_sync_sg_for_device, .mapping_error = mips_dma_mapping_error, - .dma_supported = mips_dma_supported + .dma_supported = mips_dma_supported, + .device_is_coherent = plat_device_is_coherent }; const struct dma_map_ops *mips_dma_map_ops = &mips_default_dma_map_ops; diff --git a/arch/mips/netlogic/common/nlm-dma.c b/arch/mips/netlogic/common/nlm-dma.c index 0ec9d9d..aa11b27 100644 --- a/arch/mips/netlogic/common/nlm-dma.c +++ b/arch/mips/netlogic/common/nlm-dma.c @@ -79,7 +79,8 @@ const struct dma_map_ops nlm_swiotlb_dma_ops = { .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, .sync_sg_for_device = swiotlb_sync_sg_for_device, .mapping_error = swiotlb_dma_mapping_error, - .dma_supported = swiotlb_dma_supported + .dma_supported = swiotlb_dma_supported, + .device_is_coherent = plat_device_is_coherent }; void __init plat_swiotlb_setup(void) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 29ce981..08da227 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -131,6 +131,7 @@ struct dma_map_ops { #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK u64 (*get_required_mask)(struct device *dev); #endif + int (*device_is_coherent)(struct device *dev); int is_phys; }; @@ -697,6 +698,15 @@ static inline void *dma_zalloc_coherent(struct device *dev, size_t size, } #ifdef CONFIG_HAS_DMA +static inline int device_is_coherent(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + if (ops && ops->device_is_coherent) + return ops->device_is_coherent(dev); + else + return 1;/* compatible behavior */ +} + static inline int dma_get_cache_alignment(void) { #ifdef ARCH_DMA_MINALIGN -- 2.7.0
[PATCH] scsi: ufs: fix a pclint warning
Signed-off-by: Zang Leigang diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c index 794a4600e952..deb77535b8c9 100644 --- a/drivers/scsi/ufs/ufshcd.c +++ b/drivers/scsi/ufs/ufshcd.c @@ -3586,7 +3586,7 @@ static int ufshcd_uic_pwr_ctrl(struct ufs_hba *hba, struct uic_command *cmd) status = ufshcd_get_upmcrs(hba); if (status != PWR_LOCAL) { dev_err(hba->dev, - "pwr ctrl cmd 0x%0x failed, host upmcrs:0x%x\n", + "pwr ctrl cmd 0x%x failed, host upmcrs:0x%x\n", cmd->command, status); ret = (status != PWR_OK) ? status : -1; } -- 2.14.1
[RESEND PATCH v4 4/6] libsas: Use new workqueue to run sas event and disco event
Now all libsas work items are queued to the scsi host workqueue, including sas event work posted by the LLDD and sas discovery work, and a sas hotplug flow may be divided into several work items. E.g. when libsas receives a PORTE_BYTES_DMAED event, we currently process it in the following steps: sas_form_port --- runs in a work item on the shost workq sas_discover_domain --- runs in another work item on the shost workq ... sas_probe_devices --- runs in a new work item on the shost workq We found that during hot-add of a device, libsas may need to run several work items on the same workqueue to add the device to the system; the process is not atomic and may be interrupted by other sas event work, like PHYE_LOSS_OF_SIGNAL. This patch is a preparation for executing libsas sas events synchronously. We need to use different workqueues to run sas event and disco event work; otherwise a work item will be blocked waiting for another chained work item on the same workqueue. Signed-off-by: Yijing Wang CC: John Garry CC: Johannes Thumshirn CC: Ewan Milne CC: Christoph Hellwig CC: Tomas Henzl CC: Dan Williams Signed-off-by: Jason Yan --- drivers/scsi/libsas/sas_discover.c | 2 +- drivers/scsi/libsas/sas_event.c| 4 ++-- drivers/scsi/libsas/sas_init.c | 18 ++ include/scsi/libsas.h | 3 +++ 4 files changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c index 60de662..14f714d 100644 --- a/drivers/scsi/libsas/sas_discover.c +++ b/drivers/scsi/libsas/sas_discover.c @@ -534,7 +534,7 @@ static void sas_chain_work(struct sas_ha_struct *ha, struct sas_work *sw) * workqueue, or known to be submitted from a context that is * not racing against draining */ - scsi_queue_work(ha->core.shost, &sw->work); + queue_work(ha->disco_q, &sw->work); } static void sas_chain_event(int event, unsigned long *pending, diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c index 6db941d..b124198 100644 --- a/drivers/scsi/libsas/sas_event.c +++ b/drivers/scsi/libsas/sas_event.c @@ -40,7 +40,7 @@ int sas_queue_work(struct sas_ha_struct 
*ha, struct sas_work *sw) if (list_empty(&sw->drain_node)) list_add_tail(&sw->drain_node, &ha->defer_q); } else - rc = scsi_queue_work(ha->core.shost, &sw->work); + rc = queue_work(ha->event_q, &sw->work); return rc; } @@ -61,7 +61,7 @@ static int sas_queue_event(int event, struct sas_work *work, void __sas_drain_work(struct sas_ha_struct *ha) { - struct workqueue_struct *wq = ha->core.shost->work_q; + struct workqueue_struct *wq = ha->event_q; struct sas_work *sw, *_sw; int ret; diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c index e2d98a8..b49c46f 100644 --- a/drivers/scsi/libsas/sas_init.c +++ b/drivers/scsi/libsas/sas_init.c @@ -109,6 +109,7 @@ void sas_hash_addr(u8 *hashed, const u8 *sas_addr) int sas_register_ha(struct sas_ha_struct *sas_ha) { + char name[64]; int error = 0; mutex_init(&sas_ha->disco_mutex); @@ -142,10 +143,24 @@ int sas_register_ha(struct sas_ha_struct *sas_ha) goto Undo_ports; } + error = -ENOMEM; + snprintf(name, sizeof(name), "%s_event_q", dev_name(sas_ha->dev)); + sas_ha->event_q = create_singlethread_workqueue(name); + if (!sas_ha->event_q) + goto Undo_ports; + + snprintf(name, sizeof(name), "%s_disco_q", dev_name(sas_ha->dev)); + sas_ha->disco_q = create_singlethread_workqueue(name); + if (!sas_ha->disco_q) + goto Undo_event_q; + INIT_LIST_HEAD(&sas_ha->eh_done_q); INIT_LIST_HEAD(&sas_ha->eh_ata_q); return 0; + +Undo_event_q: + destroy_workqueue(sas_ha->event_q); Undo_ports: sas_unregister_ports(sas_ha); Undo_phys: @@ -176,6 +191,9 @@ int sas_unregister_ha(struct sas_ha_struct *sas_ha) __sas_drain_work(sas_ha); mutex_unlock(&sas_ha->drain_mutex); + destroy_workqueue(sas_ha->disco_q); + destroy_workqueue(sas_ha->event_q); + return 0; } diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h index 08c1c32..d1ab157 100644 --- a/include/scsi/libsas.h +++ b/include/scsi/libsas.h @@ -388,6 +388,9 @@ struct sas_ha_struct { struct device *dev; /* should be set */ struct module *lldd_module; /* should be set */ + 
struct workqueue_struct *event_q; + struct workqueue_struct *disco_q; + u8 *sas_addr; /* must be set */ u8 hashed_sas_addr[HASHED_SAS_ADDR_SIZE]; -- 2.5.0
[RESEND PATCH v4 0/6] Enhance libsas hotplug feature
Thanks to Martin K. Petersen for applying some of the tidy-up patches,
so I do not have to maintain them out of tree. I will send the rest of
them in the next days if needed.

The libsas hotplug code currently has some issues. Dan Williams reported
a similar bug here before:
https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg39187.html

The issues we have found:

1. If the LLDD burst-reports lots of phy-up/phy-down sas events, some
events may be lost because an identical sas event is already pending;
in the end the libsas topology may differ from the hardware.

2. On receiving a phy-down sas event, libsas calls sas_deform_port to
remove devices. It first deletes the sas port, then puts a destruction
discovery event in a new work and queues it at the tail of the
workqueue. Once the sas port is deleted, its child devices are deleted
too, so when the destruction work runs it finds the target device
already removed and reports a sysfs warning.

3. Since a hotplug process is divided into several works, a phy-up sas
event may be inserted between the phy-down works, like:

destruction work ---> PORTE_BYTES_DMAED (sas_form_port) ---> PHYE_LOSS_OF_SIGNAL

The hot-remove flow would be broken by the PORTE_BYTES_DMAED event,
which is not what we expect, and issues would occur.

v3->v4: -use dynamically allocated works, and shut down the phy if the
         number of active events reaches the threshold
        -use flush_workqueue instead of wait-completion to process
         discovery events synchronously
        -directly call the probe and destruct functions

v2->v3: some code improvements suggested by Johannes and John; split v2
patch 2 into several small patches.

v1->v2: some code improvements suggested by John Garry

Jason Yan (6):
  libsas: Use dynamic alloced work to avoid sas event lost
  libsas: shut down the PHY if events reached the threshold
  libsas: make the event threshold configurable
  libsas: Use new workqueue to run sas event and disco event
  libsas: use flush_workqueue to process disco events synchronously
  libsas: direct call probe and destruct

 drivers/scsi/hisi_sas/hisi_sas_main.c |  6 +++
 drivers/scsi/libsas/sas_ata.c         |  1 -
 drivers/scsi/libsas/sas_discover.c    | 36 --
 drivers/scsi/libsas/sas_event.c       | 79 ++
 drivers/scsi/libsas/sas_expander.c    |  2 +-
 drivers/scsi/libsas/sas_init.c        | 91 +--
 drivers/scsi/libsas/sas_internal.h    |  7 +++
 drivers/scsi/libsas/sas_phy.c         | 73 ++--
 drivers/scsi/libsas/sas_port.c        | 25 ++
 include/scsi/libsas.h                 | 29 ---
 include/scsi/scsi_transport_sas.h     |  1 +
 11 files changed, 258 insertions(+), 92 deletions(-)
-- 
2.5.0
[RESEND PATCH v4 3/6] libsas: make the event threshold configurable
Add a sysfs attribute so that the LLDD can configure the event threshold
for every host. We made an example in hisi_sas; other LLDDs using libsas
can implement it if they want.

Suggested-by: Hannes Reinecke
Signed-off-by: Jason Yan
CC: John Garry
CC: Johannes Thumshirn
CC: Ewan Milne
CC: Christoph Hellwig
CC: Tomas Henzl
CC: Dan Williams
---
 drivers/scsi/hisi_sas/hisi_sas_main.c |  6 ++
 drivers/scsi/libsas/sas_init.c        | 27 +++
 include/scsi/libsas.h                 |  1 +
 3 files changed, 34 insertions(+)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 5e47abb..9eceed1 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -1470,6 +1470,11 @@ EXPORT_SYMBOL_GPL(hisi_sas_rescan_topology);
 struct scsi_transport_template *hisi_sas_stt;
 EXPORT_SYMBOL_GPL(hisi_sas_stt);
 
+struct device_attribute *host_attrs[] = {
+	&dev_attr_phy_event_threshold,
+	NULL,
+};
+
 static struct scsi_host_template _hisi_sas_sht = {
 	.module			= THIS_MODULE,
 	.name			= DRV_NAME,
@@ -1489,6 +1494,7 @@ static struct scsi_host_template _hisi_sas_sht = {
 	.eh_bus_reset_handler	= sas_eh_bus_reset_handler,
 	.target_destroy		= sas_target_destroy,
 	.ioctl			= sas_ioctl,
+	.shost_attrs		= host_attrs,
 };
 struct scsi_host_template *hisi_sas_sht = &_hisi_sas_sht;
 EXPORT_SYMBOL_GPL(hisi_sas_sht);
diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index b1e03d5..e2d98a8 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -537,6 +537,33 @@ static struct sas_function_template sft = {
 	.smp_handler = sas_smp_handler,
 };
 
+static inline ssize_t phy_event_threshold_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct Scsi_Host *shost = class_to_shost(dev);
+	struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", sha->event_thres);
+}
+
+static inline ssize_t phy_event_threshold_store(struct device *dev,
+			struct device_attribute *attr,
+			const char *buf, size_t count)
+{
+	struct Scsi_Host *shost = class_to_shost(dev);
+	struct sas_ha_struct *sha = SHOST_TO_SAS_HA(shost);
+
+	sha->event_thres = simple_strtol(buf, NULL, 10);
+
+	return count;
+}
+
+DEVICE_ATTR(phy_event_threshold,
+	S_IRUGO|S_IWUSR,
+	phy_event_threshold_show,
+	phy_event_threshold_store);
+EXPORT_SYMBOL_GPL(dev_attr_phy_event_threshold);
+
 struct scsi_transport_template *
 sas_domain_attach_transport(struct sas_domain_function_template *dft)
 {
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 2fa0b13..08c1c32 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -679,6 +679,7 @@ extern int sas_bios_param(struct scsi_device *, sector_t capacity, int *hsc);
 extern struct scsi_transport_template *
 sas_domain_attach_transport(struct sas_domain_function_template *);
+extern struct device_attribute dev_attr_phy_event_threshold;
 
 int sas_discover_root_expander(struct domain_device *);
-- 
2.5.0
[RESEND PATCH v4 6/6] libsas: direct call probe and destruct
Commit 87c8331f ("[SCSI] libsas: prevent domain rediscovery competing
with ata error handling") introduced the disco mutex to prevent
rediscovery from competing with ATA error handling, and put the whole
revalidation under the mutex. But the rphy add/remove needs to wait for
the error handling, which also grabs the disco mutex. This may lead to a
deadlock. So the probe and destruct events were introduced to do the
rphy add/remove asynchronously, outside the lock.

The asynchronously processed workers make the whole discovery process
non-atomic, so other events may interrupt it. For example, if a
loss-of-signal event is inserted before the probe event,
sas_deform_port() is called and the port is deleted. sas_port_delete()
may then run before the destruct event, but port-x:x is the top parent
of the end device or expander. This leads to a kernel WARNING such as:

[   82.042979] sysfs group 'power' not found for kobject 'phy-1:0:22'
[   82.042983] ------------[ cut here ]------------
[   82.042986] WARNING: CPU: 54 PID: 1714 at fs/sysfs/group.c:237 sysfs_remove_group+0x94/0xa0
[   82.043059] Call trace:
[   82.043082] [] sysfs_remove_group+0x94/0xa0
[   82.043085] [] dpm_sysfs_remove+0x60/0x70
[   82.043086] [] device_del+0x138/0x308
[   82.043089] [] sas_phy_delete+0x38/0x60
[   82.043091] [] do_sas_phy_delete+0x6c/0x80
[   82.043093] [] device_for_each_child+0x58/0xa0
[   82.043095] [] sas_remove_children+0x40/0x50
[   82.043100] [] sas_destruct_devices+0x64/0xa0
[   82.043102] [] process_one_work+0x1fc/0x4b0
[   82.043104] [] worker_thread+0x50/0x490
[   82.043105] [] kthread+0xfc/0x128
[   82.043107] [] ret_from_fork+0x10/0x50

Make probe and destruct direct calls in the disco and revalidate
functions, but keep them outside the lock. The whole discovery or
revalidation will then not be interrupted by other events, and the
DISCE_PROBE and DISCE_DESTRUCT events are removed as a result of the
direct calls. Introduce a new list to destruct the sas_port and put the
port deletion after the device destruct. This ensures the right order of
destroying the sysfs kobjects and fixes the warning above.

Signed-off-by: Jason Yan
CC: John Garry
CC: Johannes Thumshirn
CC: Ewan Milne
CC: Christoph Hellwig
CC: Tomas Henzl
CC: Dan Williams
---
 drivers/scsi/libsas/sas_ata.c      |  1 -
 drivers/scsi/libsas/sas_discover.c | 34 --
 drivers/scsi/libsas/sas_expander.c |  2 +-
 drivers/scsi/libsas/sas_internal.h |  1 +
 drivers/scsi/libsas/sas_port.c     |  3 +++
 include/scsi/libsas.h              |  3 +--
 include/scsi/scsi_transport_sas.h  |  1 +
 7 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 87f5e694..dbe8c5e 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -729,7 +729,6 @@ int sas_discover_sata(struct domain_device *dev)
 	if (res)
 		return res;
 
-	sas_discover_event(dev->port, DISCE_PROBE);
 	return 0;
 }
 
diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 14f714d..d5f5b58 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -212,13 +212,9 @@ void sas_notify_lldd_dev_gone(struct domain_device *dev)
 	}
 }
 
-static void sas_probe_devices(struct work_struct *work)
+static void sas_probe_devices(struct asd_sas_port *port)
 {
 	struct domain_device *dev, *n;
-	struct sas_discovery_event *ev = to_sas_discovery_event(work);
-	struct asd_sas_port *port = ev->port;
-
-	clear_bit(DISCE_PROBE, &port->disc.pending);
 
 	/* devices must be domain members before link recovery and probe */
 	list_for_each_entry(dev, &port->disco_list, disco_list_node) {
@@ -294,7 +290,6 @@ int sas_discover_end_dev(struct domain_device *dev)
 	res = sas_notify_lldd_dev_found(dev);
 	if (res)
 		return res;
-	sas_discover_event(dev->port, DISCE_PROBE);
 
 	return 0;
 }
@@ -353,13 +348,9 @@ static void sas_unregister_common_dev(struct asd_sas_port *port, struct domain_d
 	sas_put_device(dev);
 }
 
-static void sas_destruct_devices(struct work_struct *work)
+void sas_destruct_devices(struct asd_sas_port *port)
 {
 	struct domain_device *dev, *n;
-	struct sas_discovery_event *ev = to_sas_discovery_event(work);
-	struct asd_sas_port *port = ev->port;
-
-	clear_bit(DISCE_DESTRUCT, &port->disc.pending);
 
 	list_for_each_entry_safe(dev, n, &port->destroy_list, disco_list_node) {
 		list_del_init(&dev->disco_list_node);
@@ -370,6 +361,16 @@ static void sas_destruct_devices(struct work_struct *work)
 	}
 }
 
+void sas_destruct_ports(struct asd_sas_port *port)
+{
+	struct sas_port *sas_port, *p;
+
+	list_for_each_entry_safe(sas_port, p, &port->sas_port_del_list, del_list) {
+		list_del_init(&sas_port->del_list);
+		sas_port_delete(sas_port);
[RESEND PATCH v4 1/6] libsas: Use dynamic alloced work to avoid sas event lost
Now libsas hotplug work is static: every sas event type has its own
static work, and the LLDD queues the hotplug work onto shost->work_q.
If the LLDD bursts lots of hotplug events to libsas, the hotplug events
may be pending in the workqueue like:

shost->work_q
new work[PORTE_BYTES_DMAED] --> |[PHYE_LOSS_OF_SIGNAL][PORTE_BYTES_DMAED] -> processing
                                |<------wait worker to process------>|

In this case, when a new PORTE_BYTES_DMAED event comes in, libsas tries
to queue it to shost->work_q, but since this work is already pending it
is lost. Finally libsas deletes the related sas port and sas devices,
while the LLDD expects libsas to add them (per the last sas event).

This patch uses dynamically allocated work to avoid this issue.

Signed-off-by: Yijing Wang
CC: John Garry
CC: Johannes Thumshirn
CC: Ewan Milne
CC: Christoph Hellwig
CC: Tomas Henzl
CC: Dan Williams
Signed-off-by: Jason Yan
---
 drivers/scsi/libsas/sas_event.c    | 75 +-
 drivers/scsi/libsas/sas_init.c     | 27 --
 drivers/scsi/libsas/sas_internal.h |  6 +++
 drivers/scsi/libsas/sas_phy.c      | 44 +-
 drivers/scsi/libsas/sas_port.c     | 18 -
 include/scsi/libsas.h              | 16 +---
 6 files changed, 115 insertions(+), 71 deletions(-)

diff --git a/drivers/scsi/libsas/sas_event.c b/drivers/scsi/libsas/sas_event.c
index 0bb9eef..6db941d 100644
--- a/drivers/scsi/libsas/sas_event.c
+++ b/drivers/scsi/libsas/sas_event.c
@@ -29,7 +29,8 @@
 
 int sas_queue_work(struct sas_ha_struct *ha, struct sas_work *sw)
 {
-	int rc = 0;
+	/* it's added to the defer_q when draining so return succeed */
+	int rc = 1;
 
 	if (!test_bit(SAS_HA_REGISTERED, &ha->state))
 		return 0;
@@ -44,19 +45,15 @@ int sas_queue_work(struct sas_ha_struct *ha, struct sas_work *sw)
 	return rc;
 }
 
-static int sas_queue_event(int event, unsigned long *pending,
-			    struct sas_work *work,
+static int sas_queue_event(int event, struct sas_work *work,
 			    struct sas_ha_struct *ha)
 {
-	int rc = 0;
+	unsigned long flags;
+	int rc;
 
-	if (!test_and_set_bit(event, pending)) {
-		unsigned long flags;
-
-		spin_lock_irqsave(&ha->lock, flags);
-		rc = sas_queue_work(ha, work);
-		spin_unlock_irqrestore(&ha->lock, flags);
-	}
+	spin_lock_irqsave(&ha->lock, flags);
+	rc = sas_queue_work(ha, work);
+	spin_unlock_irqrestore(&ha->lock, flags);
 
 	return rc;
 }
@@ -66,6 +63,7 @@ void __sas_drain_work(struct sas_ha_struct *ha)
 {
 	struct workqueue_struct *wq = ha->core.shost->work_q;
 	struct sas_work *sw, *_sw;
+	int ret;
 
 	set_bit(SAS_HA_DRAINING, &ha->state);
 	/* flush submitters */
@@ -78,7 +76,11 @@ void __sas_drain_work(struct sas_ha_struct *ha)
 	clear_bit(SAS_HA_DRAINING, &ha->state);
 	list_for_each_entry_safe(sw, _sw, &ha->defer_q, drain_node) {
 		list_del_init(&sw->drain_node);
-		sas_queue_work(ha, sw);
+		ret = sas_queue_work(ha, sw);
+		if (ret != 1) {
+			struct asd_sas_event *ev = to_asd_sas_event(&sw->work);
+
+			sas_free_event(ev);
+		}
 	}
 	spin_unlock_irq(&ha->lock);
 }
@@ -119,29 +121,68 @@ void sas_enable_revalidation(struct sas_ha_struct *ha)
 		if (!test_and_clear_bit(ev, &d->pending))
 			continue;
 
-		sas_queue_event(ev, &d->pending, &d->disc_work[ev].work, ha);
+		sas_queue_event(ev, &d->disc_work[ev].work, ha);
 	}
 	mutex_unlock(&ha->disco_mutex);
 }
 
+static void sas_port_event_worker(struct work_struct *work)
+{
+	struct asd_sas_event *ev = to_asd_sas_event(work);
+
+	sas_port_event_fns[ev->event](work);
+	sas_free_event(ev);
+}
+
+static void sas_phy_event_worker(struct work_struct *work)
+{
+	struct asd_sas_event *ev = to_asd_sas_event(work);
+
+	sas_phy_event_fns[ev->event](work);
+	sas_free_event(ev);
+}
+
 static int sas_notify_port_event(struct asd_sas_phy *phy, enum port_event event)
 {
+	struct asd_sas_event *ev;
 	struct sas_ha_struct *ha = phy->ha;
+	int ret;
 
 	BUG_ON(event >= PORT_NUM_EVENTS);
 
-	return sas_queue_event(event, &phy->port_events_pending,
-			       &phy->port_events[event].work, ha);
+	ev = sas_alloc_event(phy);
+	if (!ev)
+		return -ENOMEM;
+
+	INIT_SAS_EVENT(ev, sas_port_event_worker, phy, event);
+
+	ret = sas_queue_event(event, &ev->work, ha);
+	if (ret != 1)
+		sas_free_event(ev);
+
+	return ret;
}
 
 int sas_notify_phy_event(struct asd_sas_phy *phy, enum phy_event event)
 {
+	struct asd_sas_event *ev;
 	struct sas_ha_s
[RESEND PATCH v4 5/6] libsas: use flush_workqueue to process disco events synchronously
Use flush_workqueue() to ensure the disco and revalidate events are
processed synchronously.

Signed-off-by: Jason Yan
CC: John Garry
CC: Johannes Thumshirn
CC: Ewan Milne
CC: Christoph Hellwig
CC: Tomas Henzl
CC: Dan Williams
---
 drivers/scsi/libsas/sas_port.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c
index 9326628..64722f4 100644
--- a/drivers/scsi/libsas/sas_port.c
+++ b/drivers/scsi/libsas/sas_port.c
@@ -192,6 +192,7 @@ static void sas_form_port(struct asd_sas_phy *phy)
 		si->dft->lldd_port_formed(phy);
 
 	sas_discover_event(phy->port, DISCE_DISCOVER_DOMAIN);
+	flush_workqueue(sas_ha->disco_q);
 }
 
 /**
@@ -277,6 +278,9 @@ void sas_porte_broadcast_rcvd(struct work_struct *work)
 	SAS_DPRINTK("broadcast received: %d\n", prim);
 
 	sas_discover_event(phy->port, DISCE_REVALIDATE_DOMAIN);
+
+	if (phy->port)
+		flush_workqueue(phy->port->ha->disco_q);
 }
 
 void sas_porte_link_reset_err(struct work_struct *work)
-- 
2.5.0
[RESEND PATCH v4 2/6] libsas: shut down the PHY if events reached the threshold
If the PHY bursts too many events, we will allocate a lot of events for
the worker, which may lead to memory exhaustion. Dan Williams suggested
shutting down the PHY if the events reach a threshold, because in that
case the PHY may have gone into some erroneous state. Users can
re-enable the PHY via sysfs if they want.

We cannot use a fixed memory pool, because if we run out of events the
shutdown event and the loss-of-signal event would be lost too. The
events still need to be allocated and processed in this case.

Suggested-by: Dan Williams
Signed-off-by: Jason Yan
CC: John Garry
CC: Johannes Thumshirn
CC: Ewan Milne
CC: Christoph Hellwig
CC: Tomas Henzl
---
 drivers/scsi/libsas/sas_init.c | 21 -
 drivers/scsi/libsas/sas_phy.c  | 31 ++-
 include/scsi/libsas.h          |  6 ++
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index 85c278a..b1e03d5 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -122,6 +122,8 @@ int sas_register_ha(struct sas_ha_struct *sas_ha)
 	INIT_LIST_HEAD(&sas_ha->defer_q);
 	INIT_LIST_HEAD(&sas_ha->eh_dev_q);
 
+	sas_ha->event_thres = SAS_PHY_SHUTDOWN_THRES;
+
 	error = sas_register_phys(sas_ha);
 	if (error) {
 		printk(KERN_NOTICE "couldn't register sas phys:%d\n", error);
@@ -556,14 +558,31 @@ EXPORT_SYMBOL_GPL(sas_domain_attach_transport);
 
 struct asd_sas_event *sas_alloc_event(struct asd_sas_phy *phy)
 {
+	struct asd_sas_event *event;
 	gfp_t flags = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;
 
-	return kmem_cache_zalloc(sas_event_cache, flags);
+	event = kmem_cache_zalloc(sas_event_cache, flags);
+	if (!event)
+		return NULL;
+
+	atomic_inc(&phy->event_nr);
+	if (atomic_read(&phy->event_nr) > phy->ha->event_thres &&
+	    !phy->in_shutdown) {
+		phy->in_shutdown = 1;
+		sas_printk("The phy%02d bursting events, shut it down.\n",
+			   phy->id);
+		sas_notify_phy_event(phy, PHYE_SHUTDOWN);
+	}
+
+	return event;
 }
 
 void sas_free_event(struct asd_sas_event *event)
 {
+	struct asd_sas_phy *phy = event->phy;
+
 	kmem_cache_free(sas_event_cache, event);
+	atomic_dec(&phy->event_nr);
 }
 
 /* ---------- SAS Class register/unregister ---------- */
diff --git a/drivers/scsi/libsas/sas_phy.c b/drivers/scsi/libsas/sas_phy.c
index 59f8292..3df1eec 100644
--- a/drivers/scsi/libsas/sas_phy.c
+++ b/drivers/scsi/libsas/sas_phy.c
@@ -35,6 +35,7 @@ static void sas_phye_loss_of_signal(struct work_struct *work)
 	struct asd_sas_event *ev = to_asd_sas_event(work);
 	struct asd_sas_phy *phy = ev->phy;
 
+	phy->in_shutdown = 0;
 	phy->error = 0;
 	sas_deform_port(phy, 1);
 }
@@ -44,6 +45,7 @@ static void sas_phye_oob_done(struct work_struct *work)
 	struct asd_sas_event *ev = to_asd_sas_event(work);
 	struct asd_sas_phy *phy = ev->phy;
 
+	phy->in_shutdown = 0;
 	phy->error = 0;
 }
 
@@ -105,6 +107,32 @@ static void sas_phye_resume_timeout(struct work_struct *work)
 }
 
+static void sas_phye_shutdown(struct work_struct *work)
+{
+	struct asd_sas_event *ev = to_asd_sas_event(work);
+	struct asd_sas_phy *phy = ev->phy;
+	struct sas_ha_struct *sas_ha = phy->ha;
+	struct sas_internal *i =
+		to_sas_internal(sas_ha->core.shost->transportt);
+
+	if (phy->enabled && i->dft->lldd_control_phy) {
+		int ret;
+
+		phy->error = 0;
+		phy->enabled = 0;
+		ret = i->dft->lldd_control_phy(phy, PHY_FUNC_DISABLE, NULL);
+		if (ret)
+			sas_printk("lldd disable phy%02d returned %d\n",
+				   phy->id, ret);
+	} else if (!i->dft->lldd_control_phy)
+		sas_printk("lldd does not support phy%02d control\n", phy->id);
+	else
+		sas_printk("phy%02d is not enabled, cannot shutdown\n",
+			   phy->id);
+}
+
 /* ---------- Phy class registration ---------- */
 
 int sas_register_phys(struct sas_ha_struct *sas_ha)
@@ -116,6 +144,7 @@ int sas_register_phys(struct sas_ha_struct *sas_ha)
 		struct asd_sas_phy *phy = sas_ha->sas_phy[i];
 
 		phy->error = 0;
+		atomic_set(&phy->event_nr, 0);
 		INIT_LIST_HEAD(&phy->port_phy_el);
 
 		phy->port = NULL;
@@ -151,5 +180,5 @@ const work_func_t sas_phy_event_fns[PHY_NUM_EVENTS] = {
 	[PHYE_OOB_ERROR] = sas_phye_oob_error,
 	[PHYE_SPINUP_HOLD] = sas_phye_spinup_hold,
 	[PHYE_RESUME_TIMEOUT] = sas_phye_resume_timeout,
-
+	[PHYE_SHUTDOWN] = sas_phye_shutdown,
 };
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index c80321b..2fa0b13 100644
--- a/include/scsi/libsas.h
+++