RE: [PATCH 2/6] be2iscsi: relinquishing control after processing 512 CQE
-Original Message- From: Mike Christie [mailto:micha...@cs.wisc.edu] Sent: Sunday, March 30, 2014 10:25 PM To: Jayamohan Kallickal Cc: jbottom...@parallels.com; linux-scsi@vger.kernel.org; Jayamohan Kallickal; Minh Duc Tran; Sony John-N Subject: Re: [PATCH 2/6] be2iscsi: relinquishing control after processing 512 CQE On 03/27/2014 11:39 AM, Jayamohan Kallickal wrote: > @@ -2323,14 +2319,33 @@ void beiscsi_process_all_cqs(struct > work_struct *work) > > static int be_iopoll(struct blk_iopoll *iop, int budget) { > - unsigned int ret; > + unsigned int ret, num_eq_processed; > struct beiscsi_hba *phba; > struct be_eq_obj *pbe_eq; > + struct be_eq_entry *eqe = NULL; > + struct be_queue_info *eq; > > + num_eq_processed = 0; > pbe_eq = container_of(iop, struct be_eq_obj, iopoll); > + phba = pbe_eq->phba; > + eq = &pbe_eq->q; > + eqe = queue_tail_node(eq); > + > + hwi_ring_eq_db(phba, eq->id, 1, num_eq_processed, 0, 1); >Is this right? num_eq_processed will be 0 above. Should this be moved down >below to after num_eq_processed has been incremented? I am unarming the interrupts so that we are not interrupted while in be_iopoll. So , this is fine > + > + while (eqe->dw[offsetof(struct amap_eq_entry, valid) / 32] > + & EQE_VALID_MASK) { > + > + AMAP_SET_BITS(struct amap_eq_entry, valid, eqe, 0); > + queue_tail_inc(eq); > + eqe = queue_tail_node(eq); > + num_eq_processed++; > + } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 5/6] be2iscsi: Fix TCP parameters while connection offloading.
-Original Message- From: Mike Christie [mailto:micha...@cs.wisc.edu] Sent: Sunday, March 30, 2014 9:54 PM To: Jayamohan Kallickal Cc: jbottom...@parallels.com; linux-scsi@vger.kernel.org; Jayamohan Kallickal; Minh Duc Tran; Sony John-N Subject: Re: [PATCH 5/6] be2iscsi: Fix TCP parameters while connection offloading. On 03/27/2014 11:39 AM, Jayamohan Kallickal wrote: > +int mgmt_open_connection_v1(struct beiscsi_hba *phba, > + struct sockaddr *dst_addr, > + struct beiscsi_endpoint *beiscsi_ep, > + struct be_dma_mem *nonemb_cmd) { > + struct hwi_controller *phwi_ctrlr; > + struct hwi_context_memory *phwi_context; > + struct sockaddr_in *daddr_in = (struct sockaddr_in *)dst_addr; > + struct sockaddr_in6 *daddr_in6 = (struct sockaddr_in6 *)dst_addr; > + struct be_ctrl_info *ctrl = &phba->ctrl; > + struct be_mcc_wrb *wrb; > + struct tcp_connect_and_offload_in_v1 *req; > + unsigned short def_hdr_id; > + unsigned short def_data_id; > + struct phys_addr template_address = { 0, 0 }; > + struct phys_addr *ptemplate_address; > + unsigned int tag = 0; > + unsigned int i, ulp_num; > + unsigned short cid = beiscsi_ep->ep_cid; > + struct be_sge *sge; > + > + phwi_ctrlr = phba->phwi_ctrlr; > + phwi_context = phwi_ctrlr->phwi_ctxt; > + > + ulp_num = phwi_ctrlr->wrb_context[BE_GET_CRI_FROM_CID(cid)].ulp_num; > + > + def_hdr_id = (unsigned short)HWI_GET_DEF_HDRQ_ID(phba, ulp_num); > + def_data_id = (unsigned short)HWI_GET_DEF_BUFQ_ID(phba, ulp_num); > + > + ptemplate_address = &template_address; > + ISCSI_GET_PDU_TEMPLATE_ADDRESS(phba, ptemplate_address); > + spin_lock(&ctrl->mbox_lock); > + tag = alloc_mcc_tag(phba); > + if (!tag) { > + spin_unlock(&ctrl->mbox_lock); > + return tag; > + } > + wrb = wrb_from_mccq(phba); > + sge = nonembedded_sgl(wrb); > + > + req = nonemb_cmd->va; > + memset(req, 0, sizeof(*req)); > + wrb->tag0 |= tag; > + > + be_wrb_hdr_prepare(wrb, sizeof(*req), false, 1); > + be_cmd_hdr_prepare(&req->hdr, CMD_SUBSYSTEM_ISCSI, > +OPCODE_COMMON_ISCSI_TCP_CONNECT_AND_OFFLOAD, > +sizeof(*req)); > + if (dst_addr->sa_family == PF_INET) { > + __be32 s_addr = daddr_in->sin_addr.s_addr; > + req->ip_address.ip_type = BE2_IPV4; > + req->ip_address.addr[0] = s_addr & 0x00ff; > + req->ip_address.addr[1] = (s_addr & 0xff00) >> 8; > + req->ip_address.addr[2] = (s_addr & 0x00ff) >> 16; > + req->ip_address.addr[3] = (s_addr & 0xff00) >> 24; > + req->tcp_port = ntohs(daddr_in->sin_port); > + beiscsi_ep->dst_addr = daddr_in->sin_addr.s_addr; > + beiscsi_ep->dst_tcpport = ntohs(daddr_in->sin_port); > + beiscsi_ep->ip_type = BE2_IPV4; > + } else if (dst_addr->sa_family == PF_INET6) { > + req->ip_address.ip_type = BE2_IPV6; > + memcpy(&req->ip_address.addr, > +&daddr_in6->sin6_addr.in6_u.u6_addr8, 16); > + req->tcp_port = ntohs(daddr_in6->sin6_port); > + beiscsi_ep->dst_tcpport = ntohs(daddr_in6->sin6_port); > + memcpy(&beiscsi_ep->dst6_addr, > +&daddr_in6->sin6_addr.in6_u.u6_addr8, 16); > + beiscsi_ep->ip_type = BE2_IPV6; > + } else{ > + beiscsi_log(phba, KERN_ERR, BEISCSI_LOG_CONFIG, > + "BG_%d : unknown addr family %d\n", > + dst_addr->sa_family); > + spin_unlock(&ctrl->mbox_lock); > + free_mcc_tag(&phba->ctrl, tag); > + return -EINVAL; > + > + } > + req->cid = cid; > + i = phba->nxt_cqid++; > + if (phba->nxt_cqid == phba->num_cpus) > + phba->nxt_cqid = 0; > + req->cq_id = phwi_context->be_cq[i].id; > + beiscsi_log(phba, KERN_INFO, BEISCSI_LOG_CONFIG, > + "BG_%d : i=%d cq_id=%d\n", i, req->cq_id); > + req->defq_id = def_hdr_id; > + req->hdr_ring_id = def_hdr_id; > + req->data_ring_id = def_data_id; > + req->do_offload = 1; > + req->dataout_template_pa.lo = ptemplate_address->lo; > + req->dataout_template_pa.hi = ptemplate_address->hi; > + sge->pa_hi = cpu_to_le32(upper_32_bits(nonemb_cmd->dma)); > + sge->pa_lo = cpu_to_le32(nonemb_cmd->dma & 0x); > + sge->len = cpu_to_le32(nonemb_cmd->size); > + > + req->hdr.version = 1; > + req->tcp_window_size = 0; > + req->tcp_window_scale_count = 2; >Is this approx 100 lines exactly the same as mgmt_open_connection except these >3 lines? Maybe a lib/helper function that does the common 100 lines then >some >v1 function over that which does the extra 3 lines. Agreed. Will redo this Thanks Jay -- -- To unsubscribe from this list: send the l
RE: [PATCH 3/6] be2iscsi: Fix MCCQ posting for Mbx-Cmd after driver initialization is complete
-Original Message- From: Mike Christie [mailto:micha...@cs.wisc.edu] Sent: Thursday, March 27, 2014 5:46 PM To: Jayamohan Kallickal; jbottom...@parallels.com; linux-scsi@vger.kernel.org Cc: Jayamohan Kallickal; Sony John-N Subject: Re: [PATCH 3/6] be2iscsi: Fix MCCQ posting for Mbx-Cmd after driver initialization is complete On 03/27/2014 09:39 AM, Jayamohan Kallickal wrote: > Even before probe for function was completed, iSCSI Daemon had > initiated login to target while OS was coming up. The targets which > had node.startup=automatic, login process was initiated.Since > function specific initialization was still in progress this lead to kernel > panic. > The reson was MCCQ was not yet created and login Mbx_Cmd was sent on MCCQ. > wrb_from_mccq() failed with kernel-panic access. > > In the fix checking in alloc_mcc_tag() if driver initialization is complete > or not. > Based on the result driver continues or return with an Error. > > Signed-off-by: John Soni Jose > Signed-off-by: Jayamohan Kallickal > --- > drivers/scsi/be2iscsi/be_cmds.c | 8 > drivers/scsi/be2iscsi/be_main.c | 2 ++ > drivers/scsi/be2iscsi/be_main.h | 1 + > 3 files changed, 11 insertions(+) > > diff --git a/drivers/scsi/be2iscsi/be_cmds.c > b/drivers/scsi/be2iscsi/be_cmds.c index 1432ed5..4cabf5f 100644 > --- a/drivers/scsi/be2iscsi/be_cmds.c > +++ b/drivers/scsi/be2iscsi/be_cmds.c > @@ -118,6 +118,14 @@ unsigned int alloc_mcc_tag(struct beiscsi_hba > *phba) { > unsigned int tag = 0; > > + > + if (phba->state & BE_ADAPTER_INIT_PHASE) { > + beiscsi_log(phba, KERN_ERR, > + BEISCSI_LOG_INIT | BEISCSI_LOG_CONFIG, > + "BC_%d : Driver Initialization still in > progress\n"); > + return tag; > + } > + >I think normally we do not call scsi_host_add and expose the host until the >driver can handle requests from entry points like sysfs and queuecommand. >Can you just move iscsi_host_add from beiscsi_hba_alloc to beiscsi_dev_probe >after beiscsi_setup_boot_info is called? Tried moving iscsi_add_host and it did not help the issue where the ep_connect was being called before the probe is completed. Chris Leech has a solution for this and when applied along with this patch works for us. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 02/55] scsi: Mark function as static in be2iscsi/be_cmds.c
Looks good Acked-by: Jayamohan Kallickal --- -Original Message- From: Rashika Kheria [mailto:rashika.khe...@gmail.com] Sent: Saturday, March 29, 2014 10:44 AM To: linux-ker...@vger.kernel.org Cc: Jayamohan Kallickal; James E.J. Bottomley; linux-scsi@vger.kernel.org; j...@joshtriplett.org Subject: [PATCH 02/55] scsi: Mark function as static in be2iscsi/be_cmds.c Mark function as static in be2iscsi/be_cmds.c because it is not used outside this file. This eliminates the following warning in be2iscsi/be_cmds.c: drivers/scsi/be2iscsi/be_cmds.c:401:5: warning: no previous prototype for ‘beiscsi_process_mcc’ [-Wmissing-prototypes] Signed-off-by: Rashika Kheria Reviewed-by: Josh Triplett --- drivers/scsi/be2iscsi/be_cmds.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/be2iscsi/be_cmds.c b/drivers/scsi/be2iscsi/be_cmds.c index 3338391..026c3e3 100644 --- a/drivers/scsi/be2iscsi/be_cmds.c +++ b/drivers/scsi/be2iscsi/be_cmds.c @@ -398,7 +398,7 @@ static void beiscsi_cq_notify(struct beiscsi_hba *phba, u16 qid, bool arm, } -int beiscsi_process_mcc(struct beiscsi_hba *phba) +static int beiscsi_process_mcc(struct beiscsi_hba *phba) { struct be_mcc_compl *compl; int num = 0, status = 0; -- 1.7.9.5
[Announce] sg3_utils-1.38 available
sg3_utils is a package of command line utilities for sending SCSI and some ATA commands to devices. This package targets the Linux 3, 2.6 and 2.4 kernel series. It also has ports to FreeBSD, Tru64, Solaris, and Windows (cygwin and MinGW). Users of sg_xcopy note that the default recipient of the xcopy(lid1) command has changed from the source to the destination device. An environment variable [XCOPY_TO_SRC] can re-instate the previous default. This version tracks various changes made by www.t10.org since October 2013. For an overview of sg3_utils and downloads see this page: http://sg.danny.cz/sg/sg3_utils.html The sg_ses utility (for enclosure devices) is discussed at: http://sg.danny.cz/sg/sg_ses.html A full changelog can be found at: http://sg.danny.cz/sg/p/sg3_utils.ChangeLog A release announcement will be sent to freecode.com . Changelog for sg3_utils-1.38 [20140401] [svn: r563] - sg_ses: add --dev-slot-num= and --sas-addr= - fix --data=- problem with large buffers - new --data=@FN to read hex data from file FN - error and warning message cleanup - add --maxlen= option - sg_inq: add --block=0|1 option to control opens - add --inhex=FN to read response in ASCII hex from a file; --inhex=FN --raw reads response in binary - make -HHH (- for '-p ai') output suitable for another sg_inq invocation to use --inhex to decode - add LU_CONG to standard inquiry response output - decode ASCII information VPD pages - add HAW_ZBC in block dev char. VPD page (sbc4r01) - sync version descriptors dated 20131126 - allow --page=-1 to force std INQUIRY decoding - fix overflow in encode_whitespaces - improve unit serial number display (VPD page 0x80) - sg_vpd: add LU_CONG to standard inquiry response output - add --inhex=FN to read response in ASCII hex from a file; --inhex=FN --raw reads response in binary - decode Third Party Copy (tpc) page - add HAW_ZBC in block dev char. VPD page (sbc4r01) - add LTO and DDS vendor pages - allow --page=num to restrict --enumerate output - sg_persist: add PROUT: Replace Lost Reservation (spc4r36) - add --transport-id= for SOP: 'sop,' - sg_readcap: for --16 show physical block size if different from logical block size - sg_xcopy: environment variables: XCOPY_TO_SRC and XCOPY_TO_DST indicate where xcopy command is sent - change default to send xcopy to dst (was src) - improve CL handling of short options (e.g. '-vv') - sg_luns: guard against garbage response - sg_decode_sense: with --nospace ignore spaces on command line, so multiple arguments are concatenated - sg_write_same: repeat if unit attention - sg_rtpg: fix indexing bug with --extended option - sg_logs: placeholder for pending defects lpage - sg_unmap: fix another problem with --grpnum= option - sg_lib.h: add PDT_ZBC define (spc4r36p) - sg_lib_data: sync asc/ascq codes with T10 dated 20140320 - add pdt string for ZBC (spc4r36p) - sg_lib: extensions to sg_get_num() and sg_get_llnum() - sg_cmds_extra: fix sa bug in sg_ll_3party_copy_out() - scripts/rescan-scsi-bus.sh: check if FC driver exports issue_lip before using it - man page added (Linux only) - scripts/59-scsi-sg3_utils.rules: linux specific udev rules - examples: add sg_tst_excl3 for testing O_EXCL - improve sg_tst_excl and sg_tst_excl2 - add sg_tst_context for testing file handle contexts - upgrade automake to version 1.13.3 - add suse directory and 'spec' file to facilitate builds Changelog for sg3_utils-1.37 [20131014] [svn: r522] ... Doug Gilbert -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] blk-mq: support for shared tags
On 04/02/2014 12:46 AM, Christoph Hellwig wrote: > On Tue, Apr 01, 2014 at 05:16:21PM -0700, Matias Bjorling wrote: >> Hi Christoph, >> >> Can you rebase it on top of 3.14. I have trouble applying it for testing. > > Hi Martin, > > the series is based on top of Jens' for-next branch. I've also pushed out a > git tree to the blk-mq-share-tags.2 branch of > > git://git.infradead.org/users/hch/scsi.git > > to make testing and reviewing easier. > Thanks. Regarding the tags API. I think the best approach is a struct blk_mq_tags_reg. That'll make their parameters very visible in the drivers. I'll send a patch with the change, using the nvme driver as an example. >> For nvme, there's need for two separate types of queues. The admin queue >> (before initializing blk-mq) and the actual hardware queues. >> >> Should we allow the driver to get/put tags before initializing blk-mq? >> Or let drivers implement their own framework? > > What do you mean with initializing blk-mq? We need to allocate data > structures for sure, and I don't see much else in terms of initialization > in blk-mq. > For the nvme driver, there's a single admin queue, which is outside blk-mq's control, and the X normal queues. Should we allow the shared tags structure to be used (get/put) for the admin queue, without initializing blk-mq? or should the drivers simply implement their own tags for their admin queue? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
New release of SCST/Qlogic target interface driver.
Good evening, I hope the week is going well for everyone. The weather service started mumbling last night about eight more inches of snow getting dumped on West-Central Minnesota so I figured if I was going to get a last weekend of cross-country skiing in I had better finish the week down at the lake before driving on the interstate got bad. So I loaded the pickup and pushed Izzy our Golden Retriever into the cab and headed out. On the way down he convinced me that since most of the weekend would be spent working on the chicken cruncher code we should get a new release of the SCST/Qlogic interface driver out now that the kinks seemed to be worked out of the Selective Retransmission Request (SRR) handling. So any Qlogic HBA customers or storage OEM's who are interested in running the SCST storage stack on the latest generation Qlogic HBA's and CNA's can pick up version 1.1 of the SCST Qlogic Adapter (SQA) interface driver at the following URL: ftp://ftp.enjellic.com/pub/scst/sqa_driver-1.1.tar.gz The interface driver has been subjected to enough abuse and traffic at this point that we elected to remove the 'beta' designation with this release. As with any storage product, ** ALWAYS **, evaluate the product carefully for suitability in the intended environment. This interface driver was developed for and is in production use in a large Cisco/Nexus fibre-channel/FCOE environment. It has been tested with the Qlogic 8362 CNA and 2672 HBA running in CNA personality mode as well as the 24xx and 25xx families of fibre-channel adapters. While not tested I would not anticipate any issues with 2672 HBA's running in 16 GBPS FC mode. This driver provides the interface 'glue' which allows SCST to use the in-kernel qla2xxx core and target driver code. It plays the same role for SCST that the tcm_qla2xxx code does for the in-kernel LIO target. Its now starting to snow out and despite the fact that I'm chucking wood into the stove as fast as I can it is only about 49 degress in the cabin. Izzy has commandeered the bed since its the only warm spot so I'm going to heat up a couple of pieces of pizza and wrap up in a blanket in front of the stove. Best wishes to everyone in Linux storage land for a similarily enjoyable evening/weekend. Greg and Izzy As always, Dr. Greg Wettstein, Ph.D. Enjellic Systems Development, LLC. 4206 N. 19th Ave. Specializing in information infra-structure Fargo, ND 58102development. PH: 701-281-1686 FAX: 701-281-3949 EMAIL: g...@enjellic.com -- "A computer is like a horse, it will sense weakness." -- Greg Wettstein GAUSSIAN quote. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] target/core: T10-Dif: check HW support capabilities
Regards, Quinn Tran On 4/2/14 11:20 AM, "Nicholas A. Bellinger" wrote: >On Wed, 2014-04-02 at 09:51 +0300, Sagi Grimberg wrote: >> On 4/1/2014 8:45 PM, Nicholas A. Bellinger wrote: >> > On Tue, 2014-04-01 at 20:27 +0300, sagi grimberg wrote: >> >> On 4/1/2014 8:09 PM, Martin K. Petersen wrote: >> "Sagi" == Sagi Grimberg writes: >> >>> Sagi> I originally wrote the code to support that. But it got left >> >>> Sagi> behind since I figured it is not an interesting use-case. If >>your >> >>> Sagi> beckend doesn't support T10-PI why should the target publish >>it >> >>> Sagi> support it and ask the device to strip/insert it? I suppose >>it is >> >>> Sagi> to allow the initiator to protect half-way, but I don't know >>how >> >>> Sagi> interesting it is if the data is not stored with protection... >> >>> >> >>> That depends what you do on the backend. There are several devices >>out >> >>> there that expose PI to the host but use a different protection >>scheme >> >>> internally. And then synthesize PI on the host-facing side. Some >>even do >> >>> T10 PI to an internal protection scheme and then back to T10 PI when >> >>> talking to the disk drives in the back end. >> >>> >> >> Hey Martin, >> >> >> >> I understand, but even for internal different T10-PI schemes, is >> >> stripping protection from incoming data >> >> at the fabric level (and then do whatever with it in the backend >>level) >> >> the right thing to do here? >> >> I mean we basically lose protection across the PCI with this scheme >> >> aren't we? >> >> >> > The WRITE_STRIP + READ_INSERT case would be still be useful for IBLOCK >> > backends that don't support real hw PI, so that at least the >>protection >> > can be in place for data movement between physical machines. >> > >> > Also, I think the amount of changes required to support this type of >> > configuration in target-core are quite small. >> >> So trying to understand how this will come to use. >> Target will publish the fabric T10-PI support based only on the >> transport configuration (not accounting the backing devices >>configuration). > >Yes, passing in the transport configuration for PI at >transport_init_session() time seems to make the most sense here in order >to address all fabric types. QT> Ack. Registering PI cap per IT nexus would fit all fabrics. > >> Then upon each cmd the target will look on {backstore configuration, >> PROTECT bit, transport configuration} - then will decide on protection >> operation (STRIP/INSERT/PASS). >> >> Looks right? >> > >Correct. > >I'm thinking it makes sense for target-core to perform the WRITE_INSERT >+ READ_STRIP (in software) when the transport does not directly support >PI, but the backend has PI enabled. > >--nab QT> I see, this cover the case of a transport within a Portal does support PI. > This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message. <>
Re: [PATCH 2/6] io: define an interface for IO extensions
On Wed, Apr 02, 2014 at 03:53:33PM -0700, Zach Brown wrote: > > > I'd just remove this generic teardown callback path entirely. If > > > there's PI state hanging off the iocb tear it down during iocb teardown. > > > > Hmm, I thought aio_complete /was/ iocb teardown time. > > Well, usually :). If you build up before aio_run_iocb() then you nead > to teardown in kiocb_free(), which is also called by aio_complete(). Oh, yeah. I handle that by tearing down the extensions if stuff fails, though I don't remember if that was in this version of the patchset. > > > (Isn't there some allocate-and-copy-from-userspace helper now? But..) > > > > Is there? I didn't find one when I looked, but it wasn't an > > exhaustive > > search. > > I could have sworn that I saw something.. ah, right, memdup_user(). Noted. :) > > > I don't like the rudundancy of the implicit size requirement by a > > > field's flag being set being duplicated by the explicit size argument. > > > What does that give us, exactly? > > > > Either another sanity check or another way to screw up, depending on how you > > look at it. I'd been considering shortening the size field to u32 and > > adding a > > magic number field, but I wonder if that's really necessary. Seems like it > > shouldn't be -- if userland screws up, it's not hard to kill the process. > > (Or segv it, or...) > > I don't think I'd bother. The bits should be enough and are already > necessary to have explicit indicators of fields being set. > > > Fields in the iocb As each of these are initialized I'd just > > > test the presence bits and __get_user() the userspace arguemnts > > > directly, or copy_from_user() something slightly more complicated on to > > > the stack. > > > > > > That gets rid of us having to care about the size at all. It stops us > > > from allocating a kernel copy and pinning it for the duration of the IO. > > > We'd just be sampling the present userspace arguments as we initialie > > > the iocb during submission. > > > > I like this idea. For the PI extension, nothing particularly error-prone > > happens in teardown, which allows the flexibility to copy_from_user any > > arguments required, and to copy_to_user any setup errors that happen. I can > > get rid a lot of allocate-and-copy nonsense, as you point out. > > > > Ok, I'll migrate my patches towards this strategy, and let's see how much > > code > > goes away. :) > > Cool :). > > > I've also noticed a bug where if you make one of these PI-extended calls on > > a > > file living on a filesystem, it'll extend the io request's range to be > > filesystem block-aligned, which causes all kinds of havoc with the user > > provided PI buffers, since they now need to be extended to fit the added > > blocks. Alternately, one could require PI IOs to be fs-block aligned when > > dealing with regular files. > > I think, like O_DIRECT, it just has to be aligned or fail :(. Heh. O_DIRECT is a hilarious maze of twisty unobvious requirements. Yuck. #define O_IMNAIVEENOUGHTOTHINKIKNOWWHATTHISDOES O_DIRECT --D > > - z > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] aio/dio: enable PI passthrough
On Wed, Apr 02, 2014 at 03:33:11PM -0700, Zach Brown wrote: > > One thing I'm not sure about: What's the largest IO (in terms of # of > > blocks, > > not # of struct iovecs) that I can throw at the kernel? > > Yeah, dunno. I'd guess big :). I'd hope that the PI code already has a > way to clamp the size of bios if there's a limit to the size of PI data > that can be managed downstream? I guess if we restricted the size of the PI buffer to a page's worth of pointers to struct page, that limits us to 128M on x64 with DIF and 512b sectors. That's not really a whole lot; I suppose one could (ab)use vmalloc. Yes, blk-integrity clamps the size of the bio to fit the downstream device's maximum integrity sg size. See max_integrity_segments for details, or the mostly-undocumented sg_prot_tablesize sysfs attribute that reveals it. I don't know what a practical limit is; scsi_debug sets it to 65536. --D > > - z > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] io: define an interface for IO extensions
> > I'd just remove this generic teardown callback path entirely. If > > there's PI state hanging off the iocb tear it down during iocb teardown. > > Hmm, I thought aio_complete /was/ iocb teardown time. Well, usually :). If you build up before aio_run_iocb() then you nead to teardown in kiocb_free(), which is also called by aio_complete(). > > (Isn't there some allocate-and-copy-from-userspace helper now? But..) > > Is there? I didn't find one when I looked, but it wasn't an > exhaustive > search. I could have sworn that I saw something.. ah, right, memdup_user(). > > I don't like the rudundancy of the implicit size requirement by a > > field's flag being set being duplicated by the explicit size argument. > > What does that give us, exactly? > > Either another sanity check or another way to screw up, depending on how you > look at it. I'd been considering shortening the size field to u32 and adding > a > magic number field, but I wonder if that's really necessary. Seems like it > shouldn't be -- if userland screws up, it's not hard to kill the process. > (Or segv it, or...) I don't think I'd bother. The bits should be enough and are already necessary to have explicit indicators of fields being set. > > Fields in the iocb As each of these are initialized I'd just > > test the presence bits and __get_user() the userspace arguemnts > > directly, or copy_from_user() something slightly more complicated on to > > the stack. > > > > That gets rid of us having to care about the size at all. It stops us > > from allocating a kernel copy and pinning it for the duration of the IO. > > We'd just be sampling the present userspace arguments as we initialie > > the iocb during submission. > > I like this idea. For the PI extension, nothing particularly error-prone > happens in teardown, which allows the flexibility to copy_from_user any > arguments required, and to copy_to_user any setup errors that happen. I can > get rid a lot of allocate-and-copy nonsense, as you point out. > > Ok, I'll migrate my patches towards this strategy, and let's see how much code > goes away. :) Cool :). > I've also noticed a bug where if you make one of these PI-extended calls on a > file living on a filesystem, it'll extend the io request's range to be > filesystem block-aligned, which causes all kinds of havoc with the user > provided PI buffers, since they now need to be extended to fit the added > blocks. Alternately, one could require PI IOs to be fs-block aligned when > dealing with regular files. I think, like O_DIRECT, it just has to be aligned or fail :(. - z -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] aio/dio: enable PI passthrough
> One thing I'm not sure about: What's the largest IO (in terms of # of blocks, > not # of struct iovecs) that I can throw at the kernel? Yeah, dunno. I'd guess big :). I'd hope that the PI code already has a way to clamp the size of bios if there's a limit to the size of PI data that can be managed downstream? - z -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] io: define an interface for IO extensions
On Wed, Apr 02, 2014 at 12:49:47PM -0700, Zach Brown wrote: > > @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long > > res2) > > struct io_event *ev_page, *event; > > unsigned long flags; > > unsigned tail, pos; > > + int ret; > > + > > + ret = io_teardown_extensions(iocb); > > + if (ret) { > > + if (!res) > > + res = ret; > > + else if (!res2) > > + res2 = ret; > > + else > > + pr_err("error %d tearing down aio extensions\n", ret); > > + } > > This ends up trying to copy the kernel's io_extension copy back to > userspace from interrupts, which obviously won't fly. > > And to what end? So that maybe someone can later add an 'extension' > that can fill in some field that's then copied to userspace? But by > copying the entire argument struct back? > > Let's not get ahead of ourselves. If they're going to try and give > userspace some feedback after IO completion they're going to have to try > a lot harder because they don't have acces to the submitting task > context anymore. They'd have to pin some reference to a feedback > mechanism in the in-flight io. I think we'd want that explicit in the > iocb, not hiding off on the other side of this extension interface. I think we'd want to find an extension that really needs this. PI doesn't. We can skate by without supporting the teardown errors case for now. > I'd just remove this generic teardown callback path entirely. If > there's PI state hanging off the iocb tear it down during iocb teardown. Hmm, I thought aio_complete /was/ iocb teardown time. > > +struct io_extension_type { > > + unsigned int type; > > + unsigned int extension_struct_size; > > + int (*setup_fn)(struct kiocb *, int is_write); > > + int (*destroy_fn)(struct kiocb *); > > +}; > > I'd also get rid of all of this. More below. > > > +static int io_setup_extensions(struct kiocb *req, int is_write, > > + struct io_extension __user *ioext) > > +{ > > + struct io_extension_type *iet; > > + __u64 sz, has; > > + int ret; > > + > > + /* Check size of buffer */ > > + if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz > > + return -EFAULT; > > + if (sz > PAGE_SIZE || > > + sz > sizeof(struct io_extension) || > > + sz < IO_EXT_SIZE(ie_has)) > > + return -EINVAL; > > + > > + /* Check that the buffer's big enough */ > > + if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has > > + return -EFAULT; > > + ret = io_check_bufsize(has, sz); > > + if (ret) > > + return ret; > > + > > + /* Copy from userland */ > > + req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO); > > + if (!req->ki_ioext) > > + return -ENOMEM; > > + > > + req->ki_ioext->ke_user = ioext; > > + if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) { > > + ret = -EFAULT; > > + goto out; > > + } > > (Isn't there some allocate-and-copy-from-userspace helper now? But..) Is there? I didn't find one when I looked, but it wasn't an exhaustive search. > I don't like the rudundancy of the implicit size requirement by a > field's flag being set being duplicated by the explicit size argument. > What does that give us, exactly? Either another sanity check or another way to screw up, depending on how you look at it. I'd been considering shortening the size field to u32 and adding a magic number field, but I wonder if that's really necessary. Seems like it shouldn't be -- if userland screws up, it's not hard to kill the process. (Or segv it, or...) > Our notion of the total size only seems to only matter if we're copying > the entire struct from userspace and I'm don't think we need to do that. > > For each argument, we're translating it into some kernel equivalent, > right? Yes. > Fields in the iocb As each of these are initialized I'd just > test the presence bits and __get_user() the userspace arguemnts > directly, or copy_from_user() something slightly more complicated on to > the stack. > > That gets rid of us having to care about the size at all. It stops us > from allocating a kernel copy and pinning it for the duration of the IO. > We'd just be sampling the present userspace arguments as we initialie > the iocb during submission. I like this idea. For the PI extension, nothing particularly error-prone happens in teardown, which allows the flexibility to copy_from_user any arguments required, and to copy_to_user any setup errors that happen. I can get rid a lot of allocate-and-copy nonsense, as you point out. Ok, I'll migrate my patches towards this strategy, and let's see how much code goes away. :) I've also noticed a bug where if you make one of these PI-extended calls on a file living on a filesystem, it'll extend the io request's range to be filesystem block-aligned, which causes all kinds
RE: [PATCH 0/7] Performance improvements for LSI SCSI cards
> CC few more LSI driver developers. I have reviewed the patches. It looks good. Sreekanth - Please ACK once you are done. Reviewed-by: Praveen Krishnamoorthy N�r��yb�X��ǧv�^�){.n�+{���"�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
Re: [PATCH 2/6] io: define an interface for IO extensions
On Wed, Apr 02, 2014 at 03:22:20PM -0400, Jeff Moyer wrote: > "Darrick J. Wong" writes: > > > Define a generic interface to allow userspace to attach metadata to an > > IO operation. This interface will be used initially to implement > > protection information (PI) pass through, though it ought to be usable > > by anyone else desiring to extend the IO interface. It should not be > > difficult to modify the non-AIO calls to use this mechanism. > > My main issue with this patch is determining what exactly gets returned > to userspace when there is an issue in the teardown_extensions path. > It looks like you'll get the first error propagated from > io_teardown_extensions, others are ignored. Then, in aio_complete, if > there was no error with the I/O, then you'll get the teardown error > reported in event->res, otherwise you'll get it in event->res2. So, > what are the valid errors returned by the teardown routine for > extensions? How is the userspace app supposed to determine where the > error came from, the I/O or a failure in the extension teardown? There's also the question of which extension spat out the error. One solution would be to augment struct io_extension with all the error fields that we want (an extension can declare its own if needed) as we do now, and if errors happen during setup, we can just copy_to_user them back. If nothing else fails with the IO setup, the setup routine can return -EINVAL, and userspace can look for updated error fields in the struct. Unfortunately for the teardown error case you'd have to pin the whole page in memory for the duration of the IO just to have it around. For now this isn't a problem because teardown can't fail anyway. > I think it may make sense to only use res2 for reporting io extension > teardown failures. Any new code that will use extensions can certainly > be written to check both res and res2, and this method would prevent the > ambiguity I mentioned. Hmm, doesn't look like anyone actually uses res2 except for USB gadgets. It's tempting just to shove the first ioextension error code that comes along into res2 and abort the whole thing, and let userspace guess where the res2 code came from. I think there's an additional problem with stuffing return codes: in the case of synchronous IO syscalls, we'd have to deal with how to cram error codes from (potentially) multiple sources into the single return value, while not giving userspace any help as to where the code came from. Now that I've written all that out, I don't like this idea so I'll drop it. :) > Finally, I know this is an RFC, but please add some man-page changes to > your patch set, and CC linux-man. Michael Kerrisk typically has > valuable advice on new APIs. I'll do that the next time I rev the patches. Thank you for the suggestion. --D > > Cheers, > Jeff > > > > > Signed-off-by: Darrick J. Wong > > --- > > fs/aio.c | 180 > > +- > > include/linux/aio.h |7 ++ > > include/uapi/linux/aio_abi.h | 15 +++- > > 3 files changed, 197 insertions(+), 5 deletions(-) > > > > > > diff --git a/fs/aio.c b/fs/aio.c > > index 062a5f6..0c40bdc 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt; > > static const struct file_operations aio_ring_fops; > > static const struct address_space_operations aio_ctx_aops; > > > > +static int io_teardown_extensions(struct kiocb *req); > > +static int io_setup_extensions(struct kiocb *req, int is_write, > > + struct io_extension __user *ioext); > > +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req); > > + > > static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) > > { > > struct qstr this = QSTR_INIT("[aio]", 5); > > @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long > > res2) > > struct io_event *ev_page, *event; > > unsigned long flags; > > unsigned tail, pos; > > + int ret; > > + > > + ret = io_teardown_extensions(iocb); > > + if (ret) { > > + if (!res) > > + res = ret; > > + else if (!res2) > > + res2 = ret; > > + else > > + pr_err("error %d tearing down aio extensions\n", ret); > > + } > > > > /* > > * Special case handling for sync iocbs: > > @@ -1350,15 +1366,167 @@ rw_common: > > return 0; > > } > > > > +/* IO extension code */ > > +#define REQUIRED_STRUCTURE_SIZE(type, member) \ > > + (offsetof(type, member) + sizeof(((type *)NULL)->member)) > > +#define IO_EXT_SIZE(member) \ > > + REQUIRED_STRUCTURE_SIZE(struct io_extension, member) > > + > > +struct io_extension_type { > > + unsigned int type; > > + unsigned int extension_struct_size; > > + int (*setup_fn)(struct kiocb *, int is_write); > > + int (*destroy_fn)(struct kiocb *); > > +}; > > + > > +static struct io_extensi
[PATCH RESEND] scsi_transport_sas: move bsg destructor into sas_rphy_remove
The recent change in sysfs, bcdde7e221a8750f9b62b6d0bd31b72ea4ad9309 "sysfs: make __sysfs_remove_dir() recursive" revealed an asymmetric rphy device creation/deletion sequence in scsi_transport_sas: modprobe mpt2sas sas_rphy_add device_add A rphy->dev device_add B sas_device transport class device_add C sas_end_device transport class device_add D bsg class rmmod mpt2sas sas_rphy_delete sas_rphy_remove device_del B device_del C device_del A sysfs_remove_group recursive sysfs dir removal sas_rphy_free device_del D warning where device A is the parent of B, C, and D. When sas_rphy_free tries to unregister the bsg request queue (device D above), the ensuing sysfs cleanup discovers that its sysfs group has already been removed and emits a warning, "sysfs group... not found for kobject 'end_device-X:0'". Since bsg creation is a side effect of sas_rphy_add, move its complementary removal call into sas_rphy_remove. This imposes the following tear-down order for the devices above: D, B, C, A. Note the sas_device and sas_end_device transport class devices (B and C above) are created and destroyed both via the list match traversal in attribute_container_device_trigger, so the order in which they are handled is fixed. This is fine as long as they are deleted before their parent device. Signed-off-by: Joe Lawrence Cc: "James E.J. Bottomley" Cc: Dan Williams Cc: "Reddy, Sreekanth" --- drivers/scsi/scsi_transport_sas.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c index 1b68142..c341f85 100644 --- a/drivers/scsi/scsi_transport_sas.c +++ b/drivers/scsi/scsi_transport_sas.c @@ -1621,8 +1621,6 @@ void sas_rphy_free(struct sas_rphy *rphy) list_del(&rphy->list); mutex_unlock(&sas_host->lock); - sas_bsg_remove(shost, rphy); - transport_destroy_device(dev); put_device(dev); @@ -1681,6 +1679,7 @@ sas_rphy_remove(struct sas_rphy *rphy) } sas_rphy_unlink(rphy); + sas_bsg_remove(NULL, rphy); transport_remove_device(dev); device_del(dev); } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RESEND] scsi_transport_sas: move bsg destructor into sas_rphy_remove
On Wed, 2 Apr 2014 16:40:41 -0400 Joe Lawrence wrote: > Since bsg creation is a side effect of sas_rphy_add, move its > complementary removal call into sas_rphy_remove. Hello James, This a resend of a patch posted to quiet an end-device sysfs warning in its device deletion path when removing the mpt2sas driver: (initial report) sysfs group not found for kobject on mpt2sas unload http://marc.info/?t=13849746004 (patch + ACK + comments) http://marc.info/?t=13860945551 (gentoo, LSI repro) mpt2sas driver barfs when force removing a drive on 3.13.1 http://marc.info/?t=13912235131 Dan Williams had a few other suggestions for cleanup in this area, but those could be handled in a subsequent patch. This one missed 3.14 inclusion and the warning still occurs in Linus's tree as of today (post scsi/block merge) hence the repost. Regards, -- Joe -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] aio/dio: enable PI passthrough
On Wed, Apr 02, 2014 at 01:01:33PM -0700, Zach Brown wrote: > > +static int setup_pi_ext(struct kiocb *req, int is_write) > > +{ > > + struct file *file = req->ki_filp; > > + struct io_extension *ext = &req->ki_ioext->ke_kern; > > + void *p; > > + unsigned long start, end; > > + int retval; > > + > > + if (!(file->f_flags & O_DIRECT)) { > > + pr_debug("EINVAL: can't use PI without O_DIRECT.\n"); > > + return -EINVAL; > > + } > > + > > + BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages); > > + > > + end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen + > > + PAGE_SIZE - 1) >> PAGE_SHIFT; > > + start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT; > > + req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf); > > + req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen; > > + req->ki_ioext->ke_pi_iter.pi_nrpages = end - start; > > + p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages * > > + sizeof(struct page *), > > + GFP_NOIO); > > Can userspace give us bad data and get us to generate insane allcation > attempt warnings? Easily. One of the bits I have to work on for the PI part is figuring out how to check with the PI provider that the arguments (the iovec and the pi buffer) actually make any sense, in terms of length and alignment requirements (PI tuples can't cross pages). I think it's as simple as adding a bio_integrity ops call, and then calling down to it from the kiocb level. One thing I'm not sure about: What's the largest IO (in terms of # of blocks, not # of struct iovecs) that I can throw at the kernel? > > + if (p == NULL) { > > + pr_err("%s: no room for page array?\n", __func__); > > + return -ENOMEM; > > + } > > + req->ki_ioext->ke_pi_iter.pi_userpages = p; > > + > > + retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf, > > +req->ki_ioext->ke_pi_iter.pi_nrpages, > > +is_write, > > Isn't this is_write backwards? If it's a write syscall then the PI > pages is going to be read from. Yes, I think so. Good catch! --D > > - z > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/6] fs/bio-integrity: remove duplicate code
On Wed, Apr 02, 2014 at 12:17:58PM -0700, Zach Brown wrote: > > +static int bio_integrity_generate_verify(struct bio *bio, int operate) > > { > > > + if (operate) > > + sector = bio->bi_iter.bi_sector; > > + else > > + sector = bio->bi_integrity->bip_iter.bi_sector; > > > + if (operate) { > > + bi->generate_fn(&bix); > > + } else { > > + ret = bi->verify_fn(&bix); > > + if (ret) { > > + kunmap_atomic(kaddr); > > + return ret; > > + } > > + } > > I was glad to see this replaced with explicit sector and func arguments > in later refactoring in the 6/ patch. > > But I don't think the function poiner casts in that 6/ patch are wise > (Or even safe all the time, given crazy function pointer trampolines? > Is that still a thing?). I'd have made a single walk_fn type that > returns and have the non-returning iterators just return 0. Noted. I cleaned all that crap out just yesterday, so now there's only one walk function and some context data that gets passed to the iterator function. Much less horrifying. (I really only included this patch so that I'd have less rebasing work when 3.15-rc1 comes out.) --D > > - z > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Scst-devel] OSS target - VMware SCSI reservation bug conformity.
On Mar 28, 8:53pm, Vladislav Bolkhovitin wrote: } Subject: Re: [Scst-devel] OSS target - VMware SCSI reservation bug conform > Dr. Greg Wettstein, on 03/27/2014 11:21 AM wrote: > > Hi, hope the week is going well for everyone. > > > > There appears to be evidence that VMware has an issue with exact SCSI > > standards compliance when it comes to handling corner cases with SCSI > > reservation requests. It appears as if Dell is pushing firmware hot > > fixes for the EqualLogic controllers to work around the issue. > Hi Greg, Hi Vlad, thanks for taking the time to respond. > That's interesting, but, unfortunately, your message doesn't contain > sufficient technical details to look at this issue, if it exists. Or > do you think we are magicians who can read minds and see through > walls? ;) Actually I did, but I assumed a maintenance contract would be needed for that. I had the following reasons for raising the issue: 1.) Does anyone in the open-source storage eco-system, ie. SCST/LIO whatever, have any confirmation that this is a known issue. 2.) To alert other open-source storage users/vendors that, at least from our experience, it appears as if the problem may be rare but legitimate. 3.) To determine, if the bug could be found, whether things like mode pages would make sense in an open-source stack to address issues such as this. I've had a fair amount of private feedback that there is a good chance the issue may be legitimate. I've also had feedback that there may be other issues with VMware 'corner-case' behavior. Given the nature of the VAAI extensions/primitives rolling out I would anticipate that to be a fertile area for these types of issues as well. I'm not even sure, given the nature of the issue, if it could be tracked down but everyone who is interested in the issue can now be looking for it. Here is the essence of what we have to work with, redacted due to the volume of messages, to sentinel events. SDS proxy: Mar 19 21:50:30 PROXY kernel: rport-3:0-0: blocked FC remote port time out: removing target and saving binding Mar 19 21:50:30 PROXY kernel: sd 3:0:0:0: rejecting I/O to offline device Mar 19 21:50:32 PROXY kernel: qla2xxx [:04:00.1]-8009:3: DEVICE RESET ISSUED nexus=3:0:0 cmd=da751240. .. .. Noise from Qlogic adapter doing DTB reset. .. Mar 19 21:50:40 PROXY kernel: qla2xxx [:04:00.1]-8018:3: ADAPTER RESET ISSUED nexus=3:0:0. .. .. More noise from the Qlogic adapater. .. Mar 19 21:55:43 PROXY kernel: qla2xxx [:04:00.1]-8017:3: ADAPTER RESET SUCCEEDED nexus=3:0:0. --- VMware logs: -- Mar 19 21:49:08 VMWARE1 2014-03-20T02:49:08.592Z VMWARE1 vmkernel: cpu2:8432)<6>qla2xxx :42:00.0: scsi(8:0:1): Abort command succeeded -- 1 1247096017. Mar 19 21:49:12 VMWARE1 2014-03-20T02:49:12.664Z VMWARE1 vobd: [vmfsCorrelator] 11527431227280us: [esx.problem.vmfs.heartbeat.timedout] 52f3b635-9ea13918-af49-bc305bee68bc VOLUMENAME Mar 19 21:49:14 VMWARE1 2014-03-20T02:49:12.670Z VMWARE1 Hostd: [24FDEB90 info 'Vimsvc.ha-eventmgr'] Event 466 : Lost access to volume 52f3b635-9ea13918-af49-bc305bee68bc (VOLUMENAME) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. .. .. Noise from VMWARE hosts about wanting path updates - only single path .. to proxy, heartbeat timeouts etc. .. Mar 19 21:49:32 VMWARE1 2014-03-20T02:49:32.911Z VMWARE1 vmkernel: cpu10:8202)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T0:L1, reservation state on device eui.6665356665393330 is unknown. .. .. More noise from VMWARE hosts, additional reservation failures etc. .. Mar 19 21:50:32 VMWARE2 2014-03-20T02:50:30.554Z VMWARE2 vmkwarning: cpu22:8237)WARNING: HBX: 564: Volume 52f3b635-9ea13918-af49-bc305bee68bc ("VOLUMENAME") may be damaged on disk. Corrupt heartbeat detected at offset 3653632: [HB state 0 offset 0 gen 0 stampUS 0 uuid -- --- And at that point things were pretty much over with. I would certainly be open to suggestions on how to track or obtain useful information for you. The SDS proxy was sustaining about 350 megabytes/second of I/O from seven initiators so I don't think turning on target mode debugging and cmd tracing is much of an option. > Thanks, > Vlad Have a good weekend. Greg }-- End of excerpt from Vladislav Bolkhovitin As always, Dr. G.W. Wettstein, Ph.D. Enjellic Systems Development, LLC. 4206 N. 19th Ave. Specializing in information infra-structure Fargo, ND 58102development. PH: 701-281-1686 FAX: 701-281-3949 EMAIL: g...@enjellic.com --
Re: [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO
> 's a small number of them with strong semantics because they're a part > of the syscall ABI. ("There's" a small number of them.. vim troubles :)) - z -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] aio/dio: enable PI passthrough
> +static int setup_pi_ext(struct kiocb *req, int is_write) > +{ > + struct file *file = req->ki_filp; > + struct io_extension *ext = &req->ki_ioext->ke_kern; > + void *p; > + unsigned long start, end; > + int retval; > + > + if (!(file->f_flags & O_DIRECT)) { > + pr_debug("EINVAL: can't use PI without O_DIRECT.\n"); > + return -EINVAL; > + } > + > + BUG_ON(req->ki_ioext->ke_pi_iter.pi_userpages); > + > + end = (((unsigned long)ext->ie_pi_buf) + ext->ie_pi_buflen + > + PAGE_SIZE - 1) >> PAGE_SHIFT; > + start = ((unsigned long)ext->ie_pi_buf) >> PAGE_SHIFT; > + req->ki_ioext->ke_pi_iter.pi_offset = offset_in_page(ext->ie_pi_buf); > + req->ki_ioext->ke_pi_iter.pi_len = ext->ie_pi_buflen; > + req->ki_ioext->ke_pi_iter.pi_nrpages = end - start; > + p = kzalloc(req->ki_ioext->ke_pi_iter.pi_nrpages * > + sizeof(struct page *), > + GFP_NOIO); Can userspace give us bad data and get us to generate insane allcation attempt warnings? > + if (p == NULL) { > + pr_err("%s: no room for page array?\n", __func__); > + return -ENOMEM; > + } > + req->ki_ioext->ke_pi_iter.pi_userpages = p; > + > + retval = get_user_pages_fast((unsigned long)ext->ie_pi_buf, > + req->ki_ioext->ke_pi_iter.pi_nrpages, > + is_write, Isn't this is_write backwards? If it's a write syscall then the PI pages is going to be read from. - z -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] io: define an interface for IO extensions
> @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long > res2) > struct io_event *ev_page, *event; > unsigned long flags; > unsigned tail, pos; > + int ret; > + > + ret = io_teardown_extensions(iocb); > + if (ret) { > + if (!res) > + res = ret; > + else if (!res2) > + res2 = ret; > + else > + pr_err("error %d tearing down aio extensions\n", ret); > + } This ends up trying to copy the kernel's io_extension copy back to userspace from interrupts, which obviously won't fly. And to what end? So that maybe someone can later add an 'extension' that can fill in some field that's then copied to userspace? But by copying the entire argument struct back? Let's not get ahead of ourselves. If they're going to try and give userspace some feedback after IO completion they're going to have to try a lot harder because they don't have acces to the submitting task context anymore. They'd have to pin some reference to a feedback mechanism in the in-flight io. I think we'd want that explicit in the iocb, not hiding off on the other side of this extension interface. I'd just remove this generic teardown callback path entirely. If there's PI state hanging off the iocb tear it down during iocb teardown. > +struct io_extension_type { > + unsigned int type; > + unsigned int extension_struct_size; > + int (*setup_fn)(struct kiocb *, int is_write); > + int (*destroy_fn)(struct kiocb *); > +}; I'd also get rid of all of this. More below. > +static int io_setup_extensions(struct kiocb *req, int is_write, > +struct io_extension __user *ioext) > +{ > + struct io_extension_type *iet; > + __u64 sz, has; > + int ret; > + > + /* Check size of buffer */ > + if (unlikely(copy_from_user(&sz, &ioext->ie_size, sizeof(sz > + return -EFAULT; > + if (sz > PAGE_SIZE || > + sz > sizeof(struct io_extension) || > + sz < IO_EXT_SIZE(ie_has)) > + return -EINVAL; > + > + /* Check that the buffer's big enough */ > + if (unlikely(copy_from_user(&has, &ioext->ie_has, sizeof(has > + return -EFAULT; > + ret = io_check_bufsize(has, sz); > + if (ret) > + return ret; > + > + /* Copy from userland */ > + req->ki_ioext = kzalloc(sizeof(struct kio_extension), GFP_NOIO); > + if (!req->ki_ioext) > + return -ENOMEM; > + > + req->ki_ioext->ke_user = ioext; > + if (unlikely(copy_from_user(&req->ki_ioext->ke_kern, ioext, sz))) { > + ret = -EFAULT; > + goto out; > + } (Isn't there some allocate-and-copy-from-userspace helper now? But..) I don't like the rudundancy of the implicit size requirement by a field's flag being set being duplicated by the explicit size argument. What does that give us, exactly? Our notion of the total size only seems to only matter if we're copying the entire struct from userspace and I'm don't think we need to do that. For each argument, we're translating it into some kernel equivalent, right? Fields in the iocb As each of these are initialized I'd just test the presence bits and __get_user() the userspace arguemnts directly, or copy_from_user() something slightly more complicated on to the stack. That gets rid of us having to care about the size at all. It stops us from allocating a kernel copy and pinning it for the duration of the IO. We'd just be sampling the present userspace arguments as we initialie the iocb during submission. > + /* Try to initialize all the extensions */ > + has = 0; > + for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) { > + if (!(req->ki_ioext->ke_kern.ie_has & iet->type)) > + continue; > + ret = iet->setup_fn(req, is_write); > + if (ret) { > + req->ki_ioext->ke_kern.ie_has = has; > + goto out_destroy; > + } > + has |= iet->type; > + } So instead of doing all this we'd test explicit bits and act accordingly. If they're trivial translations between userspace fields and iocb fields we could just do it inline in this helper that'd be more like iocb_parse_more_args(iocb, struct __user *ptr). For more complicated stuff, like the PI page pinning, it could call out to PI. > + user_ext = (struct io_extension __user *)iocb->aio_extension_ptr; Need a __force there? Has this been run through sparse? - z -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] io: define an interface for IO extensions
"Darrick J. Wong" writes: > Define a generic interface to allow userspace to attach metadata to an > IO operation. This interface will be used initially to implement > protection information (PI) pass through, though it ought to be usable > by anyone else desiring to extend the IO interface. It should not be > difficult to modify the non-AIO calls to use this mechanism. My main issue with this patch is determining what exactly gets returned to userspace when there is an issue in the teardown_extensions path. It looks like you'll get the first error propagated from io_teardown_extensions, others are ignored. Then, in aio_complete, if there was no error with the I/O, then you'll get the teardown error reported in event->res, otherwise you'll get it in event->res2. So, what are the valid errors returned by the teardown routine for extensions? How is the userspace app supposed to determine where the error came from, the I/O or a failure in the extension teardown? I think it may make sense to only use res2 for reporting io extension teardown failures. Any new code that will use extensions can certainly be written to check both res and res2, and this method would prevent the ambiguity I mentioned. Finally, I know this is an RFC, but please add some man-page changes to your patch set, and CC linux-man. Michael Kerrisk typically has valuable advice on new APIs. Cheers, Jeff > > Signed-off-by: Darrick J. Wong > --- > fs/aio.c | 180 > +- > include/linux/aio.h |7 ++ > include/uapi/linux/aio_abi.h | 15 +++- > 3 files changed, 197 insertions(+), 5 deletions(-) > > > diff --git a/fs/aio.c b/fs/aio.c > index 062a5f6..0c40bdc 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt; > static const struct file_operations aio_ring_fops; > static const struct address_space_operations aio_ctx_aops; > > +static int io_teardown_extensions(struct kiocb *req); > +static int io_setup_extensions(struct kiocb *req, int is_write, > +struct io_extension __user *ioext); > +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req); > + > static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) > { > struct qstr this = QSTR_INIT("[aio]", 5); > @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long > res2) > struct io_event *ev_page, *event; > unsigned long flags; > unsigned tail, pos; > + int ret; > + > + ret = io_teardown_extensions(iocb); > + if (ret) { > + if (!res) > + res = ret; > + else if (!res2) > + res2 = ret; > + else > + pr_err("error %d tearing down aio extensions\n", ret); > + } > > /* >* Special case handling for sync iocbs: > @@ -1350,15 +1366,167 @@ rw_common: > return 0; > } > > +/* IO extension code */ > +#define REQUIRED_STRUCTURE_SIZE(type, member)\ > + (offsetof(type, member) + sizeof(((type *)NULL)->member)) > +#define IO_EXT_SIZE(member) \ > + REQUIRED_STRUCTURE_SIZE(struct io_extension, member) > + > +struct io_extension_type { > + unsigned int type; > + unsigned int extension_struct_size; > + int (*setup_fn)(struct kiocb *, int is_write); > + int (*destroy_fn)(struct kiocb *); > +}; > + > +static struct io_extension_type extensions[] = { > + {IO_EXT_INVALID, 0, NULL, NULL}, > +}; > + > +static int is_write_iocb(struct iocb *iocb) > +{ > + switch (iocb->aio_lio_opcode) { > + case IOCB_CMD_PWRITE: > + case IOCB_CMD_PWRITEV: > + return 1; > + default: > + return 0; > + } > +} > + > +static int io_teardown_extensions(struct kiocb *req) > +{ > + struct io_extension_type *iet; > + int ret, ret2; > + > + if (req->ki_ioext == NULL) > + return 0; > + > + /* Shut down all the extensions */ > + ret = 0; > + for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) { > + if (!(req->ki_ioext->ke_kern.ie_has & iet->type)) > + continue; > + ret2 = iet->destroy_fn(req); > + if (ret2 && !ret) > + ret = ret2; > + } > + > + /* Copy out return values */ > + if (unlikely(copy_to_user(req->ki_ioext->ke_user, > + &req->ki_ioext->ke_kern, > + sizeof(struct io_extension { > + if (!ret) > + ret = -EFAULT; > + } > + > + kfree(req->ki_ioext); > + req->ki_ioext = NULL; > + return ret; > +} > + > +static int io_check_bufsize(__u64 has, __u64 size) > +{ > + struct io_extension_type *iet; > + __u64 all_flags = 0; > + > + for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) { > + all_flags |= iet->type; > + if (!(has &
Re: [PATCH 1/6] fs/bio-integrity: remove duplicate code
> +static int bio_integrity_generate_verify(struct bio *bio, int operate) > { > + if (operate) > + sector = bio->bi_iter.bi_sector; > + else > + sector = bio->bi_integrity->bip_iter.bi_sector; > + if (operate) { > + bi->generate_fn(&bix); > + } else { > + ret = bi->verify_fn(&bix); > + if (ret) { > + kunmap_atomic(kaddr); > + return ret; > + } > + } I was glad to see this replaced with explicit sector and func arguments in later refactoring in the 6/ patch. But I don't think the function poiner casts in that 6/ patch are wise (Or even safe all the time, given crazy function pointer trampolines? Is that still a thing?). I'd have made a single walk_fn type that returns and have the non-returning iterators just return 0. - z -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH DONOTMERGE v2 0/6] userspace PI passthrough via AIO/DIO
On Mon, Mar 24, 2014 at 09:22:31AM -0700, Darrick J. Wong wrote: > This RFC provides a rough implementation of a mechanism to allow > userspace to attach protection information (e.g. T10 DIF) data to a > disk write and to receive the information alongside a disk read. I have some comments for you! :) Mostly about the interface up in aio. I don't have all that much to say about the bio/pi bits. > Patch #2 implements a generic IO extension interface so that we can > receive a struct io_extension from userspace containing the structure > size, a flag telling us which extensions we'd like to use (ie_has), > and (eventually) extension data. There's a small framework for > mapping ie_has bits to actual extensions. I still really don't think that we should be thinking of these as generic extensions. We're talking about arguments to syscalls. 's a small number of them with strong semantics because they're a part of the syscall ABI. I don't think we should implement them by iterating over per-field ops structs. Anyway, more in reply to the patches. - z -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] target/core: T10-Dif: check HW support capabilities
On Wed, 2014-04-02 at 14:43 +0300, sagi grimberg wrote: > On 4/1/2014 8:45 PM, Nicholas A. Bellinger wrote: > > On Tue, 2014-04-01 at 20:27 +0300, sagi grimberg wrote: > >> On 4/1/2014 8:09 PM, Martin K. Petersen wrote: > "Sagi" == Sagi Grimberg writes: > >>> Sagi> I originally wrote the code to support that. But it got left > >>> Sagi> behind since I figured it is not an interesting use-case. If your > >>> Sagi> beckend doesn't support T10-PI why should the target publish it > >>> Sagi> support it and ask the device to strip/insert it? I suppose it is > >>> Sagi> to allow the initiator to protect half-way, but I don't know how > >>> Sagi> interesting it is if the data is not stored with protection... > >>> > >>> That depends what you do on the backend. There are several devices out > >>> there that expose PI to the host but use a different protection scheme > >>> internally. And then synthesize PI on the host-facing side. Some even do > >>> T10 PI to an internal protection scheme and then back to T10 PI when > >>> talking to the disk drives in the back end. > >>> > >> Hey Martin, > >> > >> I understand, but even for internal different T10-PI schemes, is > >> stripping protection from incoming data > >> at the fabric level (and then do whatever with it in the backend level) > >> the right thing to do here? > >> I mean we basically lose protection across the PCI with this scheme > >> aren't we? > >> > > The WRITE_STRIP + READ_INSERT case would be still be useful for IBLOCK > > backends that don't support real hw PI, so that at least the protection > > can be in place for data movement between physical machines. > > > > Also, I think the amount of changes required to support this type of > > configuration in target-core are quite small. > > Not sure I understood your intention here. > Do you mean that the backstore doesn't support T10-PI, the transport > support T10-PI and target publish to fabric that it support T10-PI? > This will lead to the DIN_INSERT/DOUT_STRIP you are referring to right? Correct. > Is there any value to publish the fabric PI support if your backing > device doesn't support it? So yes, I think there is value is allowing a transport to publish PI support for a device, even when the backend does not support it. In terms of the possibilities for data corruption, moving payloads between physical hosts would certainly have a higher potential given the complexity and number of overall components involved in the transaction. If all unprotected devices should report PI (by default) when the transport supports it is a separate question.. ;) > > The opposite case (backstore support, transport no support) it makes > sense to NOT publish support and do half-way protection in SW > (the backstore is formatted with PI). > Correct. --nab -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] target/core: T10-Dif: check HW support capabilities
On Wed, 2014-04-02 at 09:51 +0300, Sagi Grimberg wrote: > On 4/1/2014 8:45 PM, Nicholas A. Bellinger wrote: > > On Tue, 2014-04-01 at 20:27 +0300, sagi grimberg wrote: > >> On 4/1/2014 8:09 PM, Martin K. Petersen wrote: > "Sagi" == Sagi Grimberg writes: > >>> Sagi> I originally wrote the code to support that. But it got left > >>> Sagi> behind since I figured it is not an interesting use-case. If your > >>> Sagi> beckend doesn't support T10-PI why should the target publish it > >>> Sagi> support it and ask the device to strip/insert it? I suppose it is > >>> Sagi> to allow the initiator to protect half-way, but I don't know how > >>> Sagi> interesting it is if the data is not stored with protection... > >>> > >>> That depends what you do on the backend. There are several devices out > >>> there that expose PI to the host but use a different protection scheme > >>> internally. And then synthesize PI on the host-facing side. Some even do > >>> T10 PI to an internal protection scheme and then back to T10 PI when > >>> talking to the disk drives in the back end. > >>> > >> Hey Martin, > >> > >> I understand, but even for internal different T10-PI schemes, is > >> stripping protection from incoming data > >> at the fabric level (and then do whatever with it in the backend level) > >> the right thing to do here? > >> I mean we basically lose protection across the PCI with this scheme > >> aren't we? > >> > > The WRITE_STRIP + READ_INSERT case would be still be useful for IBLOCK > > backends that don't support real hw PI, so that at least the protection > > can be in place for data movement between physical machines. > > > > Also, I think the amount of changes required to support this type of > > configuration in target-core are quite small. > > So trying to understand how this will come to use. > Target will publish the fabric T10-PI support based only on the > transport configuration (not accounting the backing devices configuration). Yes, passing in the transport configuration for PI at transport_init_session() time seems to make the most sense here in order to address all fabric types. > Then upon each cmd the target will look on {backstore configuration, > PROTECT bit, transport configuration} - then will decide on protection > operation (STRIP/INSERT/PASS). > > Looks right? > Correct. I'm thinking it makes sense for target-core to perform the WRITE_INSERT + READ_STRIP (in software) when the transport does not directly support PI, but the backend has PI enabled. --nab -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/23] scsi: Use pci_enable_msix_range() instead of pci_enable_msix()
On Wed, Apr 2, 2014 at 5:49 AM, Alexander Gordeev wrote: > On Wed, Mar 12, 2014 at 10:04:40PM -0600, Bjorn Helgaas wrote: >> Hi James, >> >> I think Alexander sent these to linux-scsi hoping that you would handle >> them, but I know it's a hassle because they depend on f7fc32c, which went >> in after the merge window. >> >> I'd be glad to review these and apply them through my tree, unless you want >> to do it. I'd like to get these merged in the v3.15 merge window so >> Alexander can move on to something else. I haven't checked for merge >> conflicts with scsi.git yet, but I assume they'd be pretty trivial if there >> are any. > > Hi Bjorn, > > It has shifted to v3.16, right? Yep, sorry, I meant to get to these, but was too busy, so they'll probably go in v3.16. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] blk-mq: do not initialize req->special
Drivers can reach their private data easily using the blk_mq_rq_to_pdu helper and don't need req->special. By not initializing it code can be simplified nicely, and we also shave off a few more instructions from the I/O path. Signed-off-by: Christoph Hellwig --- block/blk-flush.c | 10 ++ block/blk-mq.c | 15 ++- block/blk-mq.h |1 - drivers/block/null_blk.c |4 ++-- drivers/block/virtio_blk.c |6 +++--- 5 files changed, 9 insertions(+), 27 deletions(-) diff --git a/block/blk-flush.c b/block/blk-flush.c index 43e6b47..9a0c427 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -306,22 +306,16 @@ static bool blk_kick_flush(struct request_queue *q) */ q->flush_pending_idx ^= 1; + blk_rq_init(q, q->flush_rq); if (q->mq_ops) { - struct blk_mq_ctx *ctx = first_rq->mq_ctx; - struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu); - - blk_mq_rq_init(hctx, q->flush_rq); - q->flush_rq->mq_ctx = ctx; - /* * Reuse the tag value from the fist waiting request, * with blk-mq the tag is generated during request * allocation and drivers can rely on it being inside * the range they asked for. */ + q->flush_rq->mq_ctx = first_rq->mq_ctx; q->flush_rq->tag = first_rq->tag; - } else { - blk_rq_init(q, q->flush_rq); } q->flush_rq->cmd_type = REQ_TYPE_FS; diff --git a/block/blk-mq.c b/block/blk-mq.c index 4274ee0..871acd6 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -248,24 +248,13 @@ struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw, } EXPORT_SYMBOL(blk_mq_alloc_reserved_request); -/* - * Re-init and set pdu, if we have it - */ -void blk_mq_rq_init(struct blk_mq_hw_ctx *hctx, struct request *rq) -{ - blk_rq_init(hctx->queue, rq); - - if (hctx->cmd_size) - rq->special = blk_mq_rq_to_pdu(rq); -} - static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, struct request *rq) { const int tag = rq->tag; struct request_queue *q = rq->q; - blk_mq_rq_init(hctx, rq); + blk_rq_init(hctx->queue, rq); blk_mq_put_tag(hctx->tags, tag); blk_mq_queue_exit(q); @@ -1139,7 +1128,7 @@ static int blk_mq_init_rq_map(struct blk_mq_hw_ctx *hctx, left -= to_do * rq_size; for (j = 0; j < to_do; j++) { hctx->rqs[i] = p; - blk_mq_rq_init(hctx, hctx->rqs[i]); + blk_rq_init(hctx->queue, hctx->rqs[i]); p += rq_size; i++; } diff --git a/block/blk-mq.h b/block/blk-mq.h index ebbe6ba..238379a 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -27,7 +27,6 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); void blk_mq_init_flush(struct request_queue *q); void blk_mq_drain_queue(struct request_queue *q); void blk_mq_free_queue(struct request_queue *q); -void blk_mq_rq_init(struct blk_mq_hw_ctx *hctx, struct request *rq); /* * CPU hotplug helpers diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c index 091b9ea..71df69d 100644 --- a/drivers/block/null_blk.c +++ b/drivers/block/null_blk.c @@ -226,7 +226,7 @@ static void null_cmd_end_timer(struct nullb_cmd *cmd) static void null_softirq_done_fn(struct request *rq) { - end_cmd(rq->special); + end_cmd(blk_mq_rq_to_pdu(rq)); } static inline void null_handle_cmd(struct nullb_cmd *cmd) @@ -311,7 +311,7 @@ static void null_request_fn(struct request_queue *q) static int null_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq) { - struct nullb_cmd *cmd = rq->special; + struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq); cmd->rq = rq; cmd->nq = hctx->driver_data; diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 0eace43..11e8f4b 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -112,7 +112,7 @@ static int __virtblk_add_req(struct virtqueue *vq, static inline void virtblk_request_done(struct request *req) { - struct virtblk_req *vbr = req->special; + struct virtblk_req *vbr = blk_mq_rq_to_pdu(req); int error = virtblk_result(vbr); if (req->cmd_type == REQ_TYPE_BLOCK_PC) { @@ -154,7 +154,7 @@ static void virtblk_done(struct virtqueue *vq) static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req) { struct virtio_blk *vblk = hctx->queue->queuedata; - struct virtblk_req *vbr = req->special; + struct virtblk_req *vbr = blk_mq_rq_to_pdu(req); unsigned long flags; unsigned int num; const bool last = (req->cmd_flags & R
[PATCH 1/5] blk-mq: initialize resid_len
Signed-off-by: Christoph Hellwig --- block/blk-mq.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/block/blk-mq.c b/block/blk-mq.c index b1bcc61..44f738e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -350,6 +350,8 @@ static void blk_mq_start_request(struct request *rq, bool last) trace_block_rq_issue(q, rq); + rq->resid_len = blk_rq_bytes(rq); + /* * Just mark start time and set the started bit. Due to memory * ordering, we know we'll see the correct deadline as long as -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] blk-mq: make ->flush_rq fully transparent to drivers
Drivers shouldn't have to care about the block layer setting aside a request to implement the flush state machine. We already override the mq context and tag to make it more transparent, but so far haven't deal with the driver private data in the request. Make sure to override this as well, and while we're at it add a proper helper sitting in blk-mq.c that implements the full impersonation. Signed-off-by: Christoph Hellwig --- block/blk-flush.c | 12 ++-- block/blk-mq.c| 20 block/blk-mq.h|2 ++ 3 files changed, 24 insertions(+), 10 deletions(-) diff --git a/block/blk-flush.c b/block/blk-flush.c index 9a0c427..b218f61 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -307,16 +307,8 @@ static bool blk_kick_flush(struct request_queue *q) q->flush_pending_idx ^= 1; blk_rq_init(q, q->flush_rq); - if (q->mq_ops) { - /* -* Reuse the tag value from the fist waiting request, -* with blk-mq the tag is generated during request -* allocation and drivers can rely on it being inside -* the range they asked for. -*/ - q->flush_rq->mq_ctx = first_rq->mq_ctx; - q->flush_rq->tag = first_rq->tag; - } + if (q->mq_ops) + blk_mq_clone_flush_request(q->flush_rq, first_rq); q->flush_rq->cmd_type = REQ_TYPE_FS; q->flush_rq->cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ; diff --git a/block/blk-mq.c b/block/blk-mq.c index 871acd6..289b7e5 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -272,6 +272,26 @@ void blk_mq_free_request(struct request *rq) __blk_mq_free_request(hctx, ctx, rq); } +/* + * Clone all relevant state from a request that has been put on hold in + * the flush state machine into the preallocated flush request that hangs + * off the request queue. + * + * For a driver the flush request should be invisible, that's why we are + * impersonating the original request here. + */ +void blk_mq_clone_flush_request(struct request *flush_rq, + struct request *orig_rq) +{ + struct blk_mq_hw_ctx *hctx = + orig_rq->q->mq_ops->map_queue(orig_rq->q, orig_rq->mq_ctx->cpu); + + flush_rq->mq_ctx = orig_rq->mq_ctx; + flush_rq->tag = orig_rq->tag; + memcpy(blk_mq_rq_to_pdu(flush_rq), blk_mq_rq_to_pdu(orig_rq), + hctx->cmd_size); +} + bool blk_mq_end_io_partial(struct request *rq, int error, unsigned int nr_bytes) { if (blk_update_request(rq, error, nr_bytes)) diff --git a/block/blk-mq.h b/block/blk-mq.h index 238379a..7964dad 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -27,6 +27,8 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); void blk_mq_init_flush(struct request_queue *q); void blk_mq_drain_queue(struct request_queue *q); void blk_mq_free_queue(struct request_queue *q); +void blk_mq_clone_flush_request(struct request *flush_rq, + struct request *orig_rq); /* * CPU hotplug helpers -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] blk-mq: fix blk_mq_end_io_partial
We need to pass the actual nr_bytes instead of all bytes in the request. Signed-off-by: Christoph Hellwig --- block/blk-mq.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 44f738e..4274ee0 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -285,7 +285,7 @@ void blk_mq_free_request(struct request *rq) bool blk_mq_end_io_partial(struct request *rq, int error, unsigned int nr_bytes) { - if (blk_update_request(rq, error, blk_rq_bytes(rq))) + if (blk_update_request(rq, error, nr_bytes)) return true; blk_account_io_done(rq); -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] blk-mq: add ->init_request and ->exit_request methods
The current blk_mq_init_commands/blk_mq_free_commands interface has a two problems: 1) Because only the constructor is passed to blk_mq_init_commands there is no easy way to clean up when a comman initialization failed. The current code simply leaks the allocations done in the constructor. 2) There is no good place to call blk_mq_free_commands: before blk_cleanup_queue there is no guarantee that all outstanding commands have completed, so we can't free them yet. After blk_cleanup_queue the queue has usually been freed. This can be worked around by grabbing an unconditional reference before calling blk_cleanup_queue and dropping it after blk_mq_free_commands is done, although that's not exatly pretty and driver writers are guaranteed to get it wrong sooner or later. Both issues are easily fixed by making the request constructor and destructor normal blk_mq_ops methods. Signed-off-by: Christoph Hellwig --- block/blk-mq.c | 105 ++-- drivers/block/virtio_blk.c | 23 +- include/linux/blk-mq.h | 14 +- 3 files changed, 55 insertions(+), 87 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 289b7e5..e0b65c4 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1005,74 +1005,20 @@ static void blk_mq_hctx_notify(void *data, unsigned long action, blk_mq_put_ctx(ctx); } -static int blk_mq_init_hw_commands(struct blk_mq_hw_ctx *hctx, - int (*init)(void *, struct blk_mq_hw_ctx *, - struct request *, unsigned int), - void *data) +static void blk_mq_free_rq_map(struct blk_mq_hw_ctx *hctx, void *driver_data) { - unsigned int i; - int ret = 0; - - for (i = 0; i < hctx->queue_depth; i++) { - struct request *rq = hctx->rqs[i]; - - ret = init(data, hctx, rq, i); - if (ret) - break; - } - - return ret; -} - -int blk_mq_init_commands(struct request_queue *q, -int (*init)(void *, struct blk_mq_hw_ctx *, - struct request *, unsigned int), -void *data) -{ - struct blk_mq_hw_ctx *hctx; - unsigned int i; - int ret = 0; - - queue_for_each_hw_ctx(q, hctx, i) { - ret = blk_mq_init_hw_commands(hctx, init, data); - if (ret) - break; - } - - return ret; -} -EXPORT_SYMBOL(blk_mq_init_commands); - -static void blk_mq_free_hw_commands(struct blk_mq_hw_ctx *hctx, - void (*free)(void *, struct blk_mq_hw_ctx *, - struct request *, unsigned int), - void *data) -{ - unsigned int i; + struct page *page; - for (i = 0; i < hctx->queue_depth; i++) { - struct request *rq = hctx->rqs[i]; + if (hctx->rqs && hctx->queue->mq_ops->exit_request) { + int i; - free(data, hctx, rq, i); + for (i = 0; i < hctx->queue_depth; i++) { + if (!hctx->rqs[i]) + continue; + hctx->queue->mq_ops->exit_request(driver_data, hctx, + hctx->rqs[i], i); + } } -} - -void blk_mq_free_commands(struct request_queue *q, - void (*free)(void *, struct blk_mq_hw_ctx *, - struct request *, unsigned int), - void *data) -{ - struct blk_mq_hw_ctx *hctx; - unsigned int i; - - queue_for_each_hw_ctx(q, hctx, i) - blk_mq_free_hw_commands(hctx, free, data); -} -EXPORT_SYMBOL(blk_mq_free_commands); - -static void blk_mq_free_rq_map(struct blk_mq_hw_ctx *hctx) -{ - struct page *page; while (!list_empty(&hctx->page_list)) { page = list_first_entry(&hctx->page_list, struct page, lru); @@ -1097,10 +1043,12 @@ static size_t order_to_size(unsigned int order) } static int blk_mq_init_rq_map(struct blk_mq_hw_ctx *hctx, - unsigned int reserved_tags, int node) + struct blk_mq_reg *reg, void *driver_data, int node) { + unsigned int reserved_tags = reg->reserved_tags; unsigned int i, j, entries_per_page, max_order = 4; size_t rq_size, left; + int error; INIT_LIST_HEAD(&hctx->page_list); @@ -1149,14 +1097,23 @@ static int blk_mq_init_rq_map(struct blk_mq_hw_ctx *hctx, for (j = 0; j < to_do; j++) { hctx->rqs[i] = p; blk_rq_init(hctx->queue, hctx->rqs[i]); + if (reg->ops->init_request) { + error = reg->ops->init_request(dr
blk-mq updates, V3
The first two patches are unchanged from the last two iterations of the series. Patches 3 and 4 are new: 3 is a preparation for 4 and other future work, and patch 4 makes sure that drivers don't have to implement special workarounds for the magic flush request. Patch 5 is an updated version of the earlier init/exit_request versions which should make everyone happy now as there are no special NULL hardware contexts or negative request numbers due to patch 4 sorting out the flush_rq mess. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] mpt3sas: Remove use of DEF_SCSI_QCMD
On Wed, Apr 02, 2014 at 03:41:15AM -0700, Christoph Hellwig wrote: > On Thu, Mar 27, 2014 at 04:40:31PM -0400, Matthew Wilcox wrote: > > Removing the host_lock from the I/O submission path gives a huge > > scalability improvement. > > This looks reasonable to me, but did you do a full audit that all state > is properly synchronized in the functions called from _scsih_qcmd_lck? > > If so that should be stated in the patch description. I did one for mpt2sas back in July 2011 when I first submitted this patch. http://marc.info/?l=linux-scsi&m=130980763617589&w=2 I must confess to having not done so for this iteration. The equivalent to this patch has been in RHEL since 6.2, so I hope that gives us some amount of confidence. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/23] scsi: Use pci_enable_msix_range() instead of pci_enable_msix()
On Wed, Mar 12, 2014 at 10:04:40PM -0600, Bjorn Helgaas wrote: > Hi James, > > I think Alexander sent these to linux-scsi hoping that you would handle > them, but I know it's a hassle because they depend on f7fc32c, which went > in after the merge window. > > I'd be glad to review these and apply them through my tree, unless you want > to do it. I'd like to get these merged in the v3.15 merge window so > Alexander can move on to something else. I haven't checked for merge > conflicts with scsi.git yet, but I assume they'd be pretty trivial if there > are any. Hi Bjorn, It has shifted to v3.16, right? > Bjorn -- Regards, Alexander Gordeev agord...@redhat.com -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] target/core: T10-Dif: check HW support capabilities
On 4/1/2014 8:45 PM, Nicholas A. Bellinger wrote: On Tue, 2014-04-01 at 20:27 +0300, sagi grimberg wrote: On 4/1/2014 8:09 PM, Martin K. Petersen wrote: "Sagi" == Sagi Grimberg writes: Sagi> I originally wrote the code to support that. But it got left Sagi> behind since I figured it is not an interesting use-case. If your Sagi> beckend doesn't support T10-PI why should the target publish it Sagi> support it and ask the device to strip/insert it? I suppose it is Sagi> to allow the initiator to protect half-way, but I don't know how Sagi> interesting it is if the data is not stored with protection... That depends what you do on the backend. There are several devices out there that expose PI to the host but use a different protection scheme internally. And then synthesize PI on the host-facing side. Some even do T10 PI to an internal protection scheme and then back to T10 PI when talking to the disk drives in the back end. Hey Martin, I understand, but even for internal different T10-PI schemes, is stripping protection from incoming data at the fabric level (and then do whatever with it in the backend level) the right thing to do here? I mean we basically lose protection across the PCI with this scheme aren't we? The WRITE_STRIP + READ_INSERT case would be still be useful for IBLOCK backends that don't support real hw PI, so that at least the protection can be in place for data movement between physical machines. Also, I think the amount of changes required to support this type of configuration in target-core are quite small. Not sure I understood your intention here. Do you mean that the backstore doesn't support T10-PI, the transport support T10-PI and target publish to fabric that it support T10-PI? This will lead to the DIN_INSERT/DOUT_STRIP you are referring to right? Is there any value to publish the fabric PI support if your backing device doesn't support it? The opposite case (backstore support, transport no support) it makes sense to NOT publish support and do half-way protection in SW (the backstore is formatted with PI). Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/7] fusion: Add free msg frames to the head, not tail of list
Looks good, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/7] mpt2sas: Add free smids to the head, not tail of list
Looks good, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/7] mpt2sas: Remove uses of serial_number
Looks good, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] mpt3sas: Remove use of DEF_SCSI_QCMD
On Thu, Mar 27, 2014 at 04:40:31PM -0400, Matthew Wilcox wrote: > Removing the host_lock from the I/O submission path gives a huge > scalability improvement. This looks reasonable to me, but did you do a full audit that all state is properly synchronized in the functions called from _scsih_qcmd_lck? If so that should be stated in the patch description. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] mpt3sas: Remove uses of serial_number
Looks good, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] tcm_qla2xxx: T10-Dif set harware capability
On 4/1/2014 8:40 PM, Nicholas A. Bellinger wrote: So no way to get it centralized? Yes, still working on how that might look.. I still don't understand the iscsi/iser constraint. Every other fabric aside from iscsi/iser could simply provide a TFO->get_fabric_prot(se_tpg) to query for supported PI. The reason why iscsi/iser is unique is because we can have different network portals (eg: iser protected + traditional iscsi unprotected) on the same se_tpg endpoint. I see. That seems solvable to me... perhaps the easiest thing is to just go ahead and support T10-PI in iscsi-tcp (SW)... Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/7] Performance improvements for LSI SCSI cards
CC few more LSI driver developers. > -Original Message- > From: Nicholas A. Bellinger [mailto:n...@linux-iscsi.org] > Sent: Monday, March 31, 2014 11:25 AM > To: Matthew Wilcox > Cc: linux-scsi@vger.kernel.org; wi...@linux.intel.com; Desai, Kashyap; > Saxena, Sumit; James Bottomley > Subject: Re: [PATCH 0/7] Performance improvements for LSI SCSI cards > > Hi Matthew, > > On Thu, 2014-03-27 at 16:40 -0400, Matthew Wilcox wrote: > > The host lock is a serious scalability problem on 2-socket and larger > > systems which are doing a lot of I/O. Before removing the temporary > > usgae of DEF_SCSI_QCMD, we need to remove all uses of serial_number. > > > > An unrelated performance issue is that reusing the most recent > > driver-specific data structure to track the I/O instead of the least > > recently used keeps the cache-hot lines in use, which is a nice > > performance improvement. It's already present in the mpt3sas driver, > > it just didn't make it into the fusion or mpt2sas drivers yet. > > > > Matthew Wilcox (7): > > mpt3sas: Remove uses of serial_number > > mpt3sas: Remove use of DEF_SCSI_QCMD > > mpt2sas: Remove uses of serial_number > > mpt2sas: Remove use of DEF_SCSI_QCMD > > mpt2sas: Add free smids to the head, not tail of list > > fusion: Add free msg frames to the head, not tail of list > > fusion: Remove use of DEF_SCSI_QCMD > > > > > > +1 to this long overdue series to enable host_lock-less mode with > mpt*sas + fusion. (CC'ing LSI folks + jejb) > > Reviewed-by: Nicholas Bellinger > N�r��yb�X��ǧv�^�){.n�+{���"�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
Re: [RFC] blk-mq: support for shared tags
On Tue, Apr 01, 2014 at 05:16:21PM -0700, Matias Bjorling wrote: > Hi Christoph, > > Can you rebase it on top of 3.14. I have trouble applying it for testing. Hi Martin, the series is based on top of Jens' for-next branch. I've also pushed out a git tree to the blk-mq-share-tags.2 branch of git://git.infradead.org/users/hch/scsi.git to make testing and reviewing easier. > For nvme, there's need for two separate types of queues. The admin queue > (before initializing blk-mq) and the actual hardware queues. > > Should we allow the driver to get/put tags before initializing blk-mq? > Or let drivers implement their own framework? What do you mean with initializing blk-mq? We need to allocate data structures for sure, and I don't see much else in terms of initialization in blk-mq. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html