[PATCH v3 0/13] IB SRP initiator patches for kernel 3.11
The purpose of this InfiniBand SRP initiator patch series is as follows:
- Make the SRP initiator driver better suited for use in an H.A. setup.
  Add fast_io_fail_tmo and dev_loss_tmo parameters. These can be used
  either to speed up failover or to avoid device removal when e.g.
  using initiator side mirroring.
- Make the SRP initiator better suited for use on NUMA systems by
  making the HCA completion vector configurable.

Changes since v2 of the "IB SRP initiator patches for kernel 3.11"
patch series:
- Improved documentation of the newly added sysfs parameters.
- Limit fast_io_fail_tmo to SCSI_DEVICE_BLOCK_MAX_TIMEOUT.
- Simplified the code for parsing values written into sysfs attributes.
- Fixed a potential deadlock in the code added in scsi_transport_srp
  (invoking cancel_delayed_work() with the rport mutex held for work
  that needs the rport mutex itself).
- Changed the default retry count back from 2 to 7 since there is not
  yet agreement about this change.
- Dropped the patch that silences failed SCSI commands and also the
  patch that fixes a race between srp_queuecommand() and
  srp_claim_req() since there is no agreement about these patches.

Changes since v1 of the "IB SRP initiator patches for kernel 3.11"
patch series:
- scsi_transport_srp: Allowed both fast_io_fail and dev_loss timeouts
  to be disabled.
- scsi_transport_srp, srp_reconnect_rport(): switched from
  scsi_block_requests() to scsi_target_block() for blocking SCSI
  command processing temporarily.
- scsi_transport_srp, srp_start_tl_fail_timers(): only block SCSI
  device command processing if the fast_io_fail timer is enabled.
- Changed srp_abort() such that upon transport offline the value
  FAST_IO_FAIL is returned instead of SUCCESS.
- Fixed a race condition in the maintain single connection patch: a
  new login after removal had started but before removal had finished
  still could create a duplicate connection. Fixed this by deferring
  removal from the target list until removal has finished.
- Modified the error message in the same patch for reporting that a
  duplicate connection has been rejected.
- Modified patch 2/15 such that all possible race conditions with
  srp_claim_req() are addressed.
- Documented the comp_vector and tl_retry_count login string
  parameters.
- Updated dev_loss_tmo and fast_io_fail_tmo documentation - mentioned
  "off" is a valid choice.

Changes compared to v5 of the "Make ib_srp better suited for H.A.
purposes" patch series:
- Left out patches that are already upstream.
- Made it possible to set dev_loss_tmo to "off". This is useful in a
  setup using initiator side mirroring to avoid that new /dev/sd*
  names are reassigned after a failover or cable pull and reinsert.
- Added kernel module parameters to ib_srp for configuring default
  values of the fast_io_fail_tmo and dev_loss_tmo parameters.
- Added a patch from Dotan Barak that fixes a kernel oops during rmmod
  triggered by resource allocation failure at module load time.
- Avoid duplicate connections by refusing relogins instead of dropping
  duplicate connections, as proposed by Sebastian Riemer.
- Added a patch from Sebastian Riemer for failing SCSI commands
  silently.
- Added a patch from Vu Pham to make the transport layer (IB RC) retry
  count configurable.
- Made the HCA completion vector configurable.

Changes since v4:
- Added a patch for removing SCSI devices upon a port down event.

Changes since v3:
- Restored the dev_loss_tmo and fast_io_fail_tmo sysfs attributes.
- Included a patch to fix an ib_srp crash that could be triggered by
  cable pulling.

Changes since v2:
- Addressed the v2 review comments.
- Dropped the patches that have already been merged.
- Dropped the patches for integration with multipathd.
- Dropped the micro-optimization of the IB completion handlers.
The individual patches in this series are as follows:

0001-IB-srp-Fix-remove_one-crash-due-to-resource-exhausti.patch
0002-IB-srp-Fix-race-between-srp_queuecommand-and-srp_cla.patch
0003-IB-srp-Avoid-that-srp_reset_host-is-skipped-after-a-.patch
0004-IB-srp-Fail-I-O-fast-if-target-offline.patch
0005-IB-srp-Skip-host-settle-delay.patch
0006-IB-srp-Maintain-a-single-connection-per-I_T-nexus.patch
0007-IB-srp-Keep-rport-as-long-as-the-IB-transport-layer.patch
0008-scsi_transport_srp-Add-transport-layer-error-handlin.patch
0009-IB-srp-Add-srp_terminate_io.patch
0010-IB-srp-Use-SRP-transport-layer-error-recovery.patch
0011-IB-srp-Start-timers-if-a-transport-layer-error-occur.patch
0012-IB-srp-Fail-SCSI-commands-silently.patch
0013-IB-srp-Make-HCA-completion-vector-configurable.patch
0014-IB-srp-Make-transport-layer-retry-count-configurable.patch
0015-IB-srp-Bump-driver-version-and-release-date.patch
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can not see/access devices on Marvell 88SE9485 + SiI 3726 PMP
On 01.07.2013 16:38, James Bottomley wrote:
> mv_sas is a libsas based driver. libsas doesn't have any support for
> SATA PMPs. When it was added they were left as a todo item but then
> in the field everyone deployed enterprise type SATA devices in SAS
> expander chassis, so PMP support just got forgotten.

That's a pity, so it's unlikely libsas will get support for PMPs in
the near future or at all? Sadly I'm not able to add the support
myself.

Do you know of any SAS controllers which do support SATA PMPs? I'm
stuck with the PMPs, so I'm trying a SATA controller with an 88SX7042
next, although having a good SAS controller would be better in the
long run.

Greetings,
Hajo
[PATCH v3 06/13] IB/srp: Keep rport as long as the IB transport layer
Keep the rport data structure around after srp_remove_host() has
finished until cleanup of the IB transport layer has finished
completely. This is necessary because later patches use the rport
pointer inside the queuecommand callback. Without this patch accessing
the rport from inside a queuecommand callback is racy because
srp_remove_host() must be invoked before scsi_remove_host() and
because the queuecommand callback may get invoked after
srp_remove_host() has finished.

Signed-off-by: Bart Van Assche <bvanass...@acm.org>
Cc: Roland Dreier <rol...@purestorage.com>
Cc: James Bottomley <jbottom...@parallels.com>
Cc: David Dillow <dillo...@ornl.gov>
Cc: Vu Pham <v...@mellanox.com>
Cc: Sebastian Riemer <sebastian.rie...@profitbricks.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c |  3 +++
 drivers/infiniband/ulp/srp/ib_srp.h |  1 +
 drivers/scsi/scsi_transport_srp.c   | 18 ++++++++++++++++++
 include/scsi/scsi_transport_srp.h   |  2 ++
 4 files changed, 24 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index f046e32..f65701d 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -526,11 +526,13 @@ static void srp_remove_target(struct srp_target_port *target)
 	WARN_ON_ONCE(target->state != SRP_TARGET_REMOVED);
 
 	srp_del_scsi_host_attr(target->scsi_host);
+	srp_rport_get(target->rport);
 	srp_remove_host(target->scsi_host);
 	scsi_remove_host(target->scsi_host);
 	srp_disconnect_target(target);
 	ib_destroy_cm_id(target->cm_id);
 	srp_free_target_ib(target);
+	srp_rport_put(target->rport);
 	srp_free_req_data(target);
 	scsi_host_put(target->scsi_host);
 }
@@ -1982,6 +1984,7 @@ static int srp_add_target(struct srp_host *host, struct srp_target_port *target)
 	}
 
 	rport->lld_data = target;
+	target->rport = rport;
 
 	spin_lock(&host->target_lock);
 	list_add_tail(&target->list, &host->target_list);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index 66fbedd..1817ed5 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -153,6 +153,7 @@ struct srp_target_port {
 	u16			io_class;
 	struct srp_host	       *srp_host;
 	struct Scsi_Host       *scsi_host;
+	struct srp_rport       *rport;
 	char			target_name[32];
 	unsigned int		scsi_id;
 	unsigned int		sg_tablesize;
diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c
index f379c7f..f7ba94a 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -185,6 +185,24 @@ static int srp_host_match(struct attribute_container *cont, struct device *dev)
 }
 
 /**
+ * srp_rport_get() - increment rport reference count
+ */
+void srp_rport_get(struct srp_rport *rport)
+{
+	get_device(&rport->dev);
+}
+EXPORT_SYMBOL(srp_rport_get);
+
+/**
+ * srp_rport_put() - decrement rport reference count
+ */
+void srp_rport_put(struct srp_rport *rport)
+{
+	put_device(&rport->dev);
+}
+EXPORT_SYMBOL(srp_rport_put);
+
+/**
  * srp_rport_add - add a SRP remote port to the device hierarchy
  * @shost:	scsi host the remote port is connected to.
  * @ids:	The port id for the remote port.
diff --git a/include/scsi/scsi_transport_srp.h b/include/scsi/scsi_transport_srp.h
index ff0f04a..5a2d2d1 100644
--- a/include/scsi/scsi_transport_srp.h
+++ b/include/scsi/scsi_transport_srp.h
@@ -38,6 +38,8 @@ extern struct scsi_transport_template *
 	srp_attach_transport(struct srp_function_template *);
 extern void srp_release_transport(struct scsi_transport_template *);
 
+extern void srp_rport_get(struct srp_rport *rport);
+extern void srp_rport_put(struct srp_rport *rport);
 extern struct srp_rport *srp_rport_add(struct Scsi_Host *,
 				       struct srp_rport_identifiers *);
 extern void srp_rport_del(struct srp_rport *);
-- 
1.7.10.4
[PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling
Add the necessary functions in the SRP transport module to allow an
SRP initiator driver to implement transport layer error handling
similar to the functionality already provided by the FC transport
layer. This includes:
- Support for implementing fast_io_fail_tmo, the time that should
  elapse after having detected a transport layer problem and before
  failing I/O.
- Support for implementing dev_loss_tmo, the time that should elapse
  after having detected a transport layer problem and before removing
  a remote port.

Signed-off-by: Bart Van Assche <bvanass...@acm.org>
Cc: Roland Dreier <rol...@purestorage.com>
Cc: James Bottomley <jbottom...@parallels.com>
Cc: David Dillow <dillo...@ornl.gov>
Cc: Vu Pham <v...@mellanox.com>
Cc: Sebastian Riemer <sebastian.rie...@profitbricks.com>
---
 Documentation/ABI/stable/sysfs-transport-srp |  38 +++
 drivers/scsi/scsi_transport_srp.c            | 468 +++++++++++++++++++++++++-
 include/scsi/scsi_transport_srp.h            |  62 +++-
 3 files changed, 565 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-transport-srp b/Documentation/ABI/stable/sysfs-transport-srp
index b36fb0d..52babb9 100644
--- a/Documentation/ABI/stable/sysfs-transport-srp
+++ b/Documentation/ABI/stable/sysfs-transport-srp
@@ -5,6 +5,24 @@ Contact:	linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
 Description:	Instructs an SRP initiator to disconnect from a target and to
 		remove all LUNs imported from that target.
 
+What:		/sys/class/srp_remote_ports/port-<h>:<n>/dev_loss_tmo
+Date:		October 1, 2013
+KernelVersion:	3.11
+Contact:	linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:	Number of seconds the SCSI layer will wait after a transport
+		layer error has been observed before removing a target port.
+		Zero means immediate removal. Setting this attribute to "off"
+		will disable this behavior.
+
+What:		/sys/class/srp_remote_ports/port-<h>:<n>/fast_io_fail_tmo
+Date:		October 1, 2013
+KernelVersion:	3.11
+Contact:	linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:	Number of seconds the SCSI layer will wait after a transport
+		layer error has been observed before failing I/O. Zero means
+		failing I/O immediately. Setting this attribute to "off" will
+		disable this behavior.
+
 What:		/sys/class/srp_remote_ports/port-<h>:<n>/port_id
 Date:		June 27, 2007
 KernelVersion:	2.6.24
@@ -12,8 +30,28 @@ Contact:	linux-scsi@vger.kernel.org
 Description:	16-byte local SRP port identifier in hexadecimal format. An
 		example: 4c:49:4e:55:58:20:56:49:4f:00:00:00:00:00:00:00.
 
+What:		/sys/class/srp_remote_ports/port-<h>:<n>/reconnect_delay
+Date:		October 1, 2013
+KernelVersion:	3.11
+Contact:	linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:	Number of seconds the SCSI layer will wait after a reconnect
+		attempt failed before retrying.
+
 What:		/sys/class/srp_remote_ports/port-<h>:<n>/roles
 Date:		June 27, 2007
 KernelVersion:	2.6.24
 Contact:	linux-scsi@vger.kernel.org
 Description:	Role of the remote port. Either "SRP Initiator" or "SRP Target".
+
+What:		/sys/class/srp_remote_ports/port-<h>:<n>/state
+Date:		October 1, 2013
+KernelVersion:	3.11
+Contact:	linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:	State of the transport layer used for communication with the
+		remote port. "running" if the transport layer is operational;
+		"blocked" if a transport layer error has been encountered but
+		the fast_io_fail_tmo timer has not yet fired; "fail-fast"
+		after the fast_io_fail_tmo timer has fired and before the
+		dev_loss_tmo timer has fired; "lost" after the dev_loss_tmo
+		timer has fired and before the port is finally removed.
diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c
index f7ba94a..1b9ebd5 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -24,12 +24,15 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/string.h>
+#include <linux/delay.h>
 
 #include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_transport.h>
 #include <scsi/scsi_transport_srp.h>
+#include "scsi_priv.h"
 #include "scsi_transport_srp_internal.h"
 
 struct srp_host_attrs {
@@ -38,7 +41,7 @@ struct srp_host_attrs {
 #define to_srp_host_attrs(host)	((struct srp_host_attrs *)(host)->shost_data)
 
 #define SRP_HOST_ATTRS 0
-#define SRP_RPORT_ATTRS 3
+#define SRP_RPORT_ATTRS 8
 
 struct srp_internal {
 	struct scsi_transport_template t;
@@ -54,6 +57,26 @@ struct srp_internal {
 
 #define	dev_to_rport(d)	container_of(d, struct srp_rport, dev)
Re: [PATCH v3 0/13] IB SRP initiator patches for kernel 3.11
On 03/07/2013 15:41, Bart Van Assche wrote:
[...]

Bart,

> The individual patches in this series are as follows:
> 0001-IB-srp-Fix-remove_one-crash-due-to-resource-exhausti.patch
> 0002-IB-srp-Fix-race-between-srp_queuecommand-and-srp_cla.patch
> 0003-IB-srp-Avoid-that-srp_reset_host-is-skipped-after-a-.patch
> 0004-IB-srp-Fail-I-O-fast-if-target-offline.patch
> 0005-IB-srp-Skip-host-settle-delay.patch
> 0006-IB-srp-Maintain-a-single-connection-per-I_T-nexus.patch
> 0007-IB-srp-Keep-rport-as-long-as-the-IB-transport-layer.patch
> 0008-scsi_transport_srp-Add-transport-layer-error-handlin.patch
> 0009-IB-srp-Add-srp_terminate_io.patch
> 0010-IB-srp-Use-SRP-transport-layer-error-recovery.patch
> 0011-IB-srp-Start-timers-if-a-transport-layer-error-occur.patch
> 0012-IB-srp-Fail-SCSI-commands-silently.patch
> 0013-IB-srp-Make-HCA-completion-vector-configurable.patch
> 0014-IB-srp-Make-transport-layer-retry-count-configurable.patch
> 0015-IB-srp-Bump-driver-version-and-release-date.patch

Some of these patches were already picked by Roland (SB), I would
suggest that you post V4 and drop the ones which were accepted.

e8ca413 IB/srp: Bump driver version and release date
4b5e5f4 IB/srp: Make HCA completion vector configurable
96fc248 IB/srp: Maintain a single connection per I_T nexus
99e1c13 IB/srp: Fail I/O fast if target offline
2742c1d IB/srp: Skip host settle delay
086f44f IB/srp: Avoid skipping srp_reset_host() after a transport error
1fe0cb8 IB/srp: Fix remove_one crash due to resource exhaustion

Also, it would help if you use the --cover-letter option of git
format-patch and use the resulting cover letter (patch 0/N) as it has
standard content which you can enhance with your additions.
Re: [PATCH v3 0/13] IB SRP initiator patches for kernel 3.11
On 07/03/13 15:38, Or Gerlitz wrote:
> Some of these patches were already picked by Roland (SB), I would
> suggest that you post V4 and drop the ones which were accepted.

One of the patches that is already in Roland's tree and that was in v1
of this series has been split into two patches in v2 and v3 of this
series. So I'd like to hear from Roland what he prefers himself - that
I drop the patches that are already in his tree or that Roland updates
his tree with the most recently posted patches.

Bart.
atomic write T10 standards
On 07/03/2013 11:00 AM, James Bottomley wrote:
> On Wed, 2013-07-03 at 10:56 -0400, Ric Wheeler wrote:
>> On 07/03/2013 10:38 AM, Chris Mason wrote:
>>> Quoting Ric Wheeler (2013-07-03 10:34:04)
>>>> As I was out walking Skeeter this morning, I was thinking a bit
>>>> about the new T10 atomic write proposal that Chris spoke about some
>>>> time back. Specifically, I think that we would see a value only if
>>>> the atomic write was also durable - if not, we need to always issue
>>>> a SYNCHRONIZE_CACHE command which would mean it really is not
>>>> effectively more useful than a normal write? Did I understand the
>>>> proposal correctly? If I did, should we poke the usual T10 posse to
>>>> nudge them (David Black, Fred Knight, etc?)...
>>> I don't think the atomic writes should be a special case here. We've
>>> already got the cache flush and fua machinery and should just apply
>>> it on top of the atomic constructs...
>>>
>>> -chris
>> I should have sent this to the linux-scsi list I suppose, but wanted
>> clarity before embarrassing myself :)
> Yes, it is better to have a wider audience. Adding in linux-scsi.
>> If we have to use fua/flush after an atomic write, what makes it
>> atomic? Why not just use a normal write? It does not seem to add
>> anything that write + flush/fua does?
> It adds the all or nothing that we can use to commit journal entries
> without having to worry about atomicity. The guarantee is that
> everything makes it or nothing does.

I still don't see the difference in write + SYNC_CACHE versus atomic
write + SYNC_CACHE. If the write is atomic and not durable, it is not
really usable as a hard promise until after we flush it somehow.

> In theory, if we got ordered tags working to ensure transaction vs
> data ordering, this would mean we wouldn't have to flush at all
> because the disk image would always be journal consistent ... a bit
> like the old soft update scheme.
>
> James

Why not have the atomic write actually imply that it is atomic and
durable for just that command?

Ric
Re: Can not see/access devices on Marvell 88SE9485 + SiI 3726 PMP
On Wed, 2013-07-03 at 14:52 +0200, Hajo Möller wrote:
> On 01.07.2013 16:38, James Bottomley wrote:
> [...]
> That's a pity, so it's unlikely libsas will get support for PMPs in
> the near future or at all? Sadly I'm not able to add the support
> myself.
>
> Do you know of any SAS controllers which do support SATA PMPs? I'm
> stuck with the PMPs, so I'm trying a SATA controller with an 88SX7042
> next, although having a good SAS controller would be better in the
> long run.

No, I'm afraid not. Libsas has been released for nearly seven years
and no-one has noticed in all that time, which means it's a really
rare thing to do. Everyone who uses SAS controllers usually has either
a direct connection or an expander. There's no real reason to buy
expensive kit and then plug it into a cheap port switch.

That said, it's possible that some of the firmware SAT blob based
controllers, like the MPT ones, might have PMP support ... but because
of the rarity it's likely to be buggy.

James
Re: atomic write T10 standards
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
> On 07/03/2013 11:00 AM, James Bottomley wrote:
> [...]
> I still don't see the difference in write + SYNC_CACHE versus atomic
> write + SYNC_CACHE. If the write is atomic and not durable, it is not
> really usable as a hard promise until after we flush it somehow.
>
> Why not have the atomic write actually imply that it is atomic and
> durable for just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction? That's what causes us performance problems
because we have to pause on every transaction commit. We require
durability for explicit flushes, obviously, but we could achieve far
better performance if we could just let the filesystem updates stream
to the disk and rely on atomic writes making sure the journal entries
were all correct. The reason we require durability for journal entries
today is to ensure caching effects don't cause the journal to lie or
be corrupt.

James
Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling
On Wed, 2013-07-03 at 14:54 +0200, Bart Van Assche wrote:
> +int srp_tmo_valid(int fast_io_fail_tmo, int dev_loss_tmo)
> +{
> +	return (fast_io_fail_tmo < 0 || dev_loss_tmo < 0 ||
> +		fast_io_fail_tmo < dev_loss_tmo) &&
> +		fast_io_fail_tmo <= SCSI_DEVICE_BLOCK_MAX_TIMEOUT &&
> +		dev_loss_tmo < LONG_MAX / HZ ? 0 : -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(srp_tmo_valid);

This would have been more readable:

int srp_tmo_valid(int fast_io_fail_tmo, int dev_loss_tmo)
{
	/* Fast IO fail must be off, or no greater than the max timeout */
	if (fast_io_fail_tmo > SCSI_DEVICE_BLOCK_MAX_TIMEOUT)
		return -EINVAL;

	/* Device timeout must be off, or fit into jiffies */
	if (dev_loss_tmo >= LONG_MAX / HZ)
		return -EINVAL;

	/* Fast IO must trigger before device loss, or one of the
	 * timeouts must be disabled.
	 */
	if (fast_io_fail_tmo < 0 || dev_loss_tmo < 0)
		return 0;

	if (fast_io_fail_tmo < dev_loss_tmo)
		return 0;

	return -EINVAL;
}

Though, now that I've unpacked it -- I don't think it is OK for
dev_loss_tmo to be off, but fast IO to be on? That drops another
conditional.

Also, FC caps dev_loss_tmo at SCSI_DEVICE_BLOCK_MAX_TIMEOUT if
fast_io_fail_tmo is off; I agree with your reasoning about leaving it
unlimited if fast fail is on, but does that still hold if it is off?
Re: atomic write T10 standards
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
> On 07/03/2013 11:22 AM, James Bottomley wrote:
> [...]
> Why would we use atomic writes for things that don't need to be
> durable? Avoid a torn page write seems to be the only real difference
> here if you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex
beasts. They can be megabytes big (at least on xfs). If we can
guarantee all or nothing atomicity in the entire journal entry write
it permits a more streaming design of the filesystem writeout path.

James
Re: atomic write T10 standards
On 07/03/2013 11:37 AM, James Bottomley wrote:
> It's not just about torn pages: Journal entries are big complex
> beasts. They can be megabytes big (at least on xfs). If we can
> guarantee all or nothing atomicity in the entire journal entry write
> it permits a more streaming design of the filesystem writeout path.
>
> James

Journals are normally big (128MB or so?) - I don't think that this is
unique to xfs.

If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Right?

Ric
Re: [PATCH 1/4] scsi: ufs: Fix broken task management command implementation
On 7/2/2013 9:21 PM, Santosh Y wrote:
> On Fri, Jun 28, 2013 at 5:02 PM, Sujit Reddy Thumma
> <sthu...@codeaurora.org> wrote:
>> On 6/27/2013 4:49 PM, Santosh Y wrote:
>>>> +	spin_lock_irqsave(host->host_lock, flags);
>>>>
>>>>  	task_req_descp = hba->utmrdl_base_addr;
>>>>  	task_req_descp += free_slot;
>>>> @@ -2353,38 +2387,39 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
>>>>  		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
>>>>  	task_req_upiup->header.dword_0 =
>>>>  		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
>>>> -				  lrbp->lun, lrbp->task_tag);
>>>> +				  lun_id, free_slot);
>>>
>>> Actually it still doesn't fix the problem. The *task tag* used here
>>> should be unique across the SCSI/Query and Task Management UPIUs.
>>
>> I am sorry, I didn't get that. Why should it be unique across the
>> SCSI/Query? For example, if a machine supports 32 request slots and
>> 8 task management slots, then the task management command tag can be
>> anything out of 8 slots.
>
> The spec (ufs 1.1) has the requirement under '10.5.2 Basic Header
> Format' -> 'Task Tag'. Couple of devices I came across had similar
> behavior. The tracking of UPIUs --even belonging to a separate
> group-- seemed to be based on the 'task tag' value rather than
> 'type of UPIU' -> 'task tag'.

Thanks for the clarification. So to make the task tag unique we should
do something like below:

@@ -2940,9 +2941,10 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,

 	/* Configure task request UPIU */
 	task_req_upiup =
 		(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
+	task_tag = hba->nutrs + free_slot;
 	task_req_upiup->header.dword_0 =
 		UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
-				  lun_id, free_slot);
+				  lun_id, task_tag);

Will this work for the devices you came across?

-- 
Regards,
Sujit
Re: atomic write T10 standards
Quoting Ric Wheeler (2013-07-03 11:42:38) On 07/03/2013 11:37 AM, James Bottomley wrote: On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote: On 07/03/2013 11:22 AM, James Bottomley wrote: On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote: Why not have the atomic write actually imply that it is atomic and durable for just that command? I don't understand why you think you need guaranteed durability for every journal transaction? That's what causes us performance problems because we have to pause on every transaction commit. We require durability for explicit flushes, obviously, but we could achieve far better performance if we could just let the filesystem updates stream to the disk and rely on atomic writes making sure the journal entries were all correct. The reason we require durability for journal entries today is to ensure caching effects don't cause the journal to lie or be corrupt. Why would we use atomic writes for things that don't need to be durable? Avoid a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability... It's not just about torn pages: Journal entries are big complex beasts. They can be megabytes big (at least on xfs). If we can guarantee all or nothing atomicity in the entire journal entry write it permits a more streaming design of the filesystem writeout path. James Journals are normally big (128MB or so?) - I don't think that this is unique to xfs. We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part. Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units but you can't handle them completely independently. If our existing journal commit is: * write the data blocks for a transaction * flush * write the commit block for the transaction * flush Which part of this does and atomic write help? 
We would still need at least:
* atomic write of data blocks + commit blocks
* flush

Yes. But just because we need the flush here doesn't mean we need the flush for every single atomic write.

-chris

-- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
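The difference between the two commit sequences above can be modeled in a few lines of C. This is an illustrative sketch, not filesystem code: `write_blocks()`, `device_flush()` and the counter are hypothetical stand-ins; the only point is that the classic journal commit needs two flushes per transaction while an atomic write of data plus commit block needs one.

```c
#include <assert.h>

/* Hypothetical stand-ins for real block-layer operations. */
static int flushes;                          /* flushes issued so far */
static void write_blocks(const char *what)  { (void)what; }
static void device_flush(void)              { flushes++; }

/* Classic journal commit: data, flush, commit block, flush. */
static int commit_classic(void)
{
    int before = flushes;
    write_blocks("transaction data blocks");
    device_flush();                  /* data must be stable ...      */
    write_blocks("commit block");
    device_flush();                  /* ... before the commit is     */
    return flushes - before;         /* 2 flushes per transaction    */
}

/* Atomic-write commit: data + commit block as one all-or-nothing
 * unit, then a single flush when durability is actually needed. */
static int commit_atomic(void)
{
    int before = flushes;
    write_blocks("data blocks + commit block, atomically");
    device_flush();
    return flushes - before;         /* 1 flush per transaction      */
}
```

As Chris's point goes, the single remaining flush could itself be batched across many atomic commits, which is where the streaming win comes from.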
Re: [PATCH 1/4] scsi: ufs: Fix broken task management command implementation
On Wed, Jul 3, 2013 at 9:22 PM, Sujit Reddy Thumma sthu...@codeaurora.org wrote:
On 7/2/2013 9:21 PM, Santosh Y wrote:
On Fri, Jun 28, 2013 at 5:02 PM, Sujit Reddy Thumma sthu...@codeaurora.org wrote:
On 6/27/2013 4:49 PM, Santosh Y wrote:

+ spin_lock_irqsave(host->host_lock, flags);
  task_req_descp = hba->utmrdl_base_addr;
  task_req_descp += free_slot;
@@ -2353,38 +2387,39 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
  (struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
  task_req_upiup->header.dword_0 = UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
- lrbp->lun, lrbp->task_tag);
+ lun_id, free_slot);

Actually it still doesn't fix the problem. The *task tag* used here should be unique across the SCSI/Query and Task Management UPIUs.

I am sorry, I didn't get that. Why should it be unique across the SCSI/Query? For example, if a machine supports 32 request slots and 8 task management slots, then the task management command tag can be anything out of 8 slots.

The spec (UFS 1.1) has the requirement under '10.5.2 Basic Header Format' -> 'Task Tag'. A couple of devices I came across had similar behavior. The tracking of UPIUs -- even those belonging to a separate group -- seemed to be based on the 'task tag' value rather than 'type of UPIU' + 'task tag'.

Thanks for the clarification. So to make the task tag unique we should do something like below:

@@ -2940,9 +2941,10 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
  /* Configure task request UPIU */
  task_req_upiup = (struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
+ task_tag = hba->nutrs + free_slot;

Yes, this was exactly my internal fix at the time :-)

  task_req_upiup->header.dword_0 = UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
- lun_id, free_slot);
+ lun_id, task_tag);

Will this work for the devices you came across?
-- Regards, Sujit -- ~Santosh
Re: [PATCH 1/4] scsi: ufs: Fix broken task management command implementation
On 7/3/2013 9:53 PM, Santosh Y wrote:
On Wed, Jul 3, 2013 at 9:22 PM, Sujit Reddy Thumma sthu...@codeaurora.org wrote:
On 7/2/2013 9:21 PM, Santosh Y wrote:
On Fri, Jun 28, 2013 at 5:02 PM, Sujit Reddy Thumma sthu...@codeaurora.org wrote:
On 6/27/2013 4:49 PM, Santosh Y wrote:

+ spin_lock_irqsave(host->host_lock, flags);
  task_req_descp = hba->utmrdl_base_addr;
  task_req_descp += free_slot;
@@ -2353,38 +2387,39 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
  (struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
  task_req_upiup->header.dword_0 = UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
- lrbp->lun, lrbp->task_tag);
+ lun_id, free_slot);

Actually it still doesn't fix the problem. The *task tag* used here should be unique across the SCSI/Query and Task Management UPIUs.

I am sorry, I didn't get that. Why should it be unique across the SCSI/Query? For example, if a machine supports 32 request slots and 8 task management slots, then the task management command tag can be anything out of 8 slots.

The spec (UFS 1.1) has the requirement under '10.5.2 Basic Header Format' -> 'Task Tag'. A couple of devices I came across had similar behavior. The tracking of UPIUs -- even those belonging to a separate group -- seemed to be based on the 'task tag' value rather than 'type of UPIU' + 'task tag'.

Thanks for the clarification. So to make the task tag unique we should do something like below:

@@ -2940,9 +2941,10 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba, int lun_id, int task_id,
  /* Configure task request UPIU */
  task_req_upiup = (struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
+ task_tag = hba->nutrs + free_slot;

Yes, this was exactly my internal fix at the time :-)

Okay. I will update the patch with this. Thanks.

  task_req_upiup->header.dword_0 = UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
- lun_id, free_slot);
+ lun_id, task_tag);

Will this work for the devices you came across?
-- Regards, Sujit
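The tag scheme agreed on in this thread — offsetting task-management tags by the number of transfer-request slots so every outstanding UPIU carries a unique tag — can be sketched in plain C. This is an illustrative model, not the ufshcd code; `nutrs`/`nutmrs` mirror the controller capability names used above, but the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Transfer requests use tags [0, nutrs); task management requests
 * use [nutrs, nutrs + nutmrs), so the two ranges never collide and
 * a device that tracks UPIUs purely by 'task tag' stays happy. */
static int scsi_cmd_tag(int free_slot)          { return free_slot; }
static int tm_cmd_tag(int nutrs, int free_slot) { return nutrs + free_slot; }

/* Check that no TM tag collides with any transfer-request tag. */
static bool tags_unique(int nutrs, int nutmrs)
{
    for (int t = 0; t < nutmrs; t++)
        if (tm_cmd_tag(nutrs, t) < nutrs)
            return false;
    return true;
}
```

With the 32 request slots and 8 TM slots from the example above, TM tags land in [32, 40) while SCSI/Query tags stay in [0, 32).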
Re: [PATCH V2 3/4] scsi: ufs: Fix device and host reset methods
On 7/3/2013 11:19 AM, Santosh Y wrote:

+
+/**
+ * ufshcd_eh_device_reset_handler - device reset handler registered to
+ *    scsi layer.
+ * @cmd - SCSI command pointer
+ *
+ * Returns SUCCESS/FAILED
+ */
+static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
+{
+	struct ufs_hba *hba;
+	int err;
+	unsigned long flags;
+
+	hba = shost_priv(cmd->device->host);
+
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
+		dev_warn(hba->dev, "%s: reset in progress\n", __func__);
+		err = SUCCESS;
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		goto out;

It is better to wait here until the state changes to 'operational' or 'error' before returning success.

Okay. Sounds good.

-- Regards, Sujit
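The change agreed to above — waiting until the host leaves the RESET state instead of returning SUCCESS immediately — can be modeled in userspace C. This is a sketch of the control flow only: the state names mimic ufshcd's, but the `next()` callback and polling loop stand in for what would be a wait_event() on a real state change in the driver.

```c
#include <assert.h>

enum hba_state { STATE_RESET, STATE_OPERATIONAL, STATE_ERROR };
#define EH_SUCCESS 0
#define EH_FAILED  1

/* Wait out an in-progress reset, then judge the outcome.  next()
 * stands in for whatever advances the controller state; in the
 * kernel this loop would be a sleeping wait, not a spin. */
static int eh_device_reset(enum hba_state *st, enum hba_state (*next)(void))
{
    while (*st == STATE_RESET)
        *st = next();
    return *st == STATE_OPERATIONAL ? EH_SUCCESS : EH_FAILED;
}
```

The point of the suggestion is visible here: SUCCESS is only reported once the reset has actually resolved to an operational host, and a host that ends up in the error state reports FAILED so the SCSI midlayer can escalate.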
Re: [PATCH V2 4/4] scsi: ufs: Improve UFS fatal error handling
+
+/**
+ * ufshcd_fatal_err_handler - handle fatal errors
+ * @work: pointer to work structure
+ */
 static void ufshcd_fatal_err_handler(struct work_struct *work)
 {
 	struct ufs_hba *hba;
+	unsigned long flags;
+	u32 err_xfer = 0;
+	u32 err_tm = 0;
+	int err;
+
 	hba = container_of(work, struct ufs_hba, feh_workq);
-	/* check if reset is already in progress */
-	if (hba->ufshcd_state != UFSHCD_STATE_RESET)
-		ufshcd_do_reset(hba);
+	spin_lock_irqsave(hba->host->host_lock, flags);
+	if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
+		/* complete processed requests and exit */
+		ufshcd_transfer_req_compl(hba);
+		ufshcd_tmc_handler(hba);
+		spin_unlock_irqrestore(hba->host->host_lock, flags);
+		return;

I could not go through this patch yet; please check if it needs to wait here until the state is 'operational' or 'error'. The 'reset' state might be due to the device reset also.

+	}
+
+	hba->ufshcd_state = UFSHCD_STATE_RESET;
+	ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
+	ufshcd_error_autopsy_task_req(hba, &err_tm);
+

-- ~Santosh
Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling
On 07/03/13 19:27, David Dillow wrote:
On Wed, 2013-07-03 at 18:00 +0200, Bart Van Assche wrote:

The combination of dev_loss_tmo off and reconnect_delay > 0 worked fine in my tests. An I/O failure was detected shortly after the cable to the target was pulled. I/O resumed shortly after the cable to the target was reinserted.

Perhaps I don't understand your answer -- I'm asking about dev_loss_tmo off and fast_io_fail_tmo >= 0. The other transports do not allow this scenario, and I'm asking if it makes sense for SRP to allow it. But now that you mention reconnect_delay, what is the meaning of that when it is negative? That's not in the documentation. And should it be considered in srp_tmo_valid() -- are there values of reconnect_delay that cause problems?

None of the combinations that can be configured from user space can get the kernel into trouble. If reconnect_delay <= 0, that means that the time-based reconnect mechanism is disabled.

I'm starting to get a bit concerned about this patch -- can you, Vu, and Sebastian comment on the testing you have done?

All combinations of reconnect_delay, fast_io_fail_tmo and dev_loss_tmo that result in different behavior have been tested.

Also, FC caps dev_loss_tmo at SCSI_DEVICE_BLOCK_MAX_TIMEOUT if fast_io_fail_tmo is off; I agree with your reasoning about leaving it unlimited if fast fail is on, but does that still hold if it is off?

I think setting dev_loss_tmo to a large value only makes sense if the value of reconnect_delay is not too large. Setting both to a large value would result in slow recovery after a transport layer failure has been corrected.

So you agree it should be capped? I can't tell from your response.

Not all combinations of reconnect_delay / fast_io_fail_tmo / dev_loss_tmo result in useful behavior. It is up to the user to choose a meaningful combination.

Bart.
Re: atomic write T10 standards
On 07/03/2013 11:54 AM, Chris Mason wrote:
Quoting Ric Wheeler (2013-07-03 11:42:38)
On 07/03/2013 11:37 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
On 07/03/2013 11:22 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for just that command?

I don't understand why you think you need guaranteed durability for every journal transaction. That's what causes us performance problems, because we have to pause on every transaction commit. We require durability for explicit flushes, obviously, but we could achieve far better performance if we could just let the filesystem updates stream to the disk and rely on atomic writes making sure the journal entries were all correct. The reason we require durability for journal entries today is to ensure caching effects don't cause the journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be durable? Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...

It's not just about torn pages: journal entries are big, complex beasts. They can be megabytes big (at least on xfs). If we can guarantee all-or-nothing atomicity for the entire journal entry write, it permits a more streaming design of the filesystem writeout path.

James

Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.

We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part. Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units, but you can't handle them completely independently.
If our existing journal commit is:
* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:
* atomic write of data blocks + commit blocks
* flush

Yes. But just because we need the flush here doesn't mean we need the flush for every single atomic write.

-chris

The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way. I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed. Can you give a sequence for the use case for the non-durable atomic write that would not need a sync? Can we really trust all devices to make something atomic that is not durable :) ?

thanks!

ric
Re: atomic write T10 standards
Quoting Ric Wheeler (2013-07-03 14:31:59)
On 07/03/2013 11:54 AM, Chris Mason wrote:
Quoting Ric Wheeler (2013-07-03 11:42:38)
On 07/03/2013 11:37 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
On 07/03/2013 11:22 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for just that command?

I don't understand why you think you need guaranteed durability for every journal transaction. That's what causes us performance problems, because we have to pause on every transaction commit. We require durability for explicit flushes, obviously, but we could achieve far better performance if we could just let the filesystem updates stream to the disk and rely on atomic writes making sure the journal entries were all correct. The reason we require durability for journal entries today is to ensure caching effects don't cause the journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be durable? Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...

It's not just about torn pages: journal entries are big, complex beasts. They can be megabytes big (at least on xfs). If we can guarantee all-or-nothing atomicity for the entire journal entry write, it permits a more streaming design of the filesystem writeout path.

James

Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.

We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part. Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units, but you can't handle them completely independently.
If our existing journal commit is:
* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:
* atomic write of data blocks + commit blocks
* flush

Yes. But just because we need the flush here doesn't mean we need the flush for every single atomic write.

-chris

The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.

This is only partially true, since you're extending the SATA drive model into atomics, and the devices implementing atomics (so far, anyway) are not SATA.

I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed.

In SATA speak, it could go down as atomic + FUA + NCQ. In practice this is going to be in fusionio, NVMe devices and big storage arrays, all of which we can expect to have proper knobs for lies about IO that isn't really done yet.

Can you give a sequence for the use case for the non-durable atomic write that would not need a sync? Can we really trust all devices to make something atomic that is not durable :) ?

Today's usage is mostly O_DIRECT, which really should be FUA. Long term we can hope people will find more interesting uses. Either way the point is that an atomic write is a grouping mechanism, and if the standards people want to control FUA-ness in a separate bit, that's really fine.

-chris
Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling
On Wed, 2013-07-03 at 20:24 +0200, Bart Van Assche wrote:
On 07/03/13 19:27, David Dillow wrote:
On Wed, 2013-07-03 at 18:00 +0200, Bart Van Assche wrote:

The combination of dev_loss_tmo off and reconnect_delay > 0 worked fine in my tests. An I/O failure was detected shortly after the cable to the target was pulled. I/O resumed shortly after the cable to the target was reinserted.

Perhaps I don't understand your answer -- I'm asking about dev_loss_tmo off and fast_io_fail_tmo >= 0. The other transports do not allow this scenario, and I'm asking if it makes sense for SRP to allow it. But now that you mention reconnect_delay, what is the meaning of that when it is negative? That's not in the documentation. And should it be considered in srp_tmo_valid() -- are there values of reconnect_delay that cause problems?

None of the combinations that can be configured from user space can get the kernel into trouble. If reconnect_delay <= 0, that means that the time-based reconnect mechanism is disabled.

Then it should use the same semantics as the other attributes, and have the user store "off" to turn it off. And I'm getting the strong sense that the answer to my question about fast_io_fail_tmo >= 0 when dev_loss_tmo is off is that we should not allow that combination, even if it doesn't break the kernel. If it doesn't make sense, there is no reason to create an opportunity for user confusion.
Re: atomic write T10 standards
On 07/03/2013 02:54 PM, Chris Mason wrote:
Quoting Ric Wheeler (2013-07-03 14:31:59)
On 07/03/2013 11:54 AM, Chris Mason wrote:
Quoting Ric Wheeler (2013-07-03 11:42:38)
On 07/03/2013 11:37 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
On 07/03/2013 11:22 AM, James Bottomley wrote:
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for just that command?

I don't understand why you think you need guaranteed durability for every journal transaction. That's what causes us performance problems, because we have to pause on every transaction commit. We require durability for explicit flushes, obviously, but we could achieve far better performance if we could just let the filesystem updates stream to the disk and rely on atomic writes making sure the journal entries were all correct. The reason we require durability for journal entries today is to ensure caching effects don't cause the journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be durable? Avoiding a torn page write seems to be the only real difference here if you use the atomic operations and don't have durability...

It's not just about torn pages: journal entries are big, complex beasts. They can be megabytes big (at least on xfs). If we can guarantee all-or-nothing atomicity for the entire journal entry write, it permits a more streaming design of the filesystem writeout path.

James

Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.

We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part. Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units, but you can't handle them completely independently.
If our existing journal commit is:
* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:
* atomic write of data blocks + commit blocks
* flush

Yes. But just because we need the flush here doesn't mean we need the flush for every single atomic write.

-chris

The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way.

This is only partially true, since you're extending the SATA drive model into atomics, and the devices implementing atomics (so far, anyway) are not SATA.

I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed.

In SATA speak, it could go down as atomic + FUA + NCQ. In practice this is going to be in fusionio, NVMe devices and big storage arrays, all of which we can expect to have proper knobs for lies about IO that isn't really done yet.

Can you give a sequence for the use case for the non-durable atomic write that would not need a sync? Can we really trust all devices to make something atomic that is not durable :) ?

Today's usage is mostly O_DIRECT, which really should be FUA. Long term we can hope people will find more interesting uses. Either way the point is that an atomic write is a grouping mechanism, and if the standards people want to control FUA-ness in a separate bit, that's really fine.

-chris

That makes sense to me - happy to have a bit to indicate durability in the atomic operation...

Ric
Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling
David Dillow wrote:
On Wed, 2013-07-03 at 20:24 +0200, Bart Van Assche wrote:
On 07/03/13 19:27, David Dillow wrote:
On Wed, 2013-07-03 at 18:00 +0200, Bart Van Assche wrote:

The combination of dev_loss_tmo off and reconnect_delay > 0 worked fine in my tests. An I/O failure was detected shortly after the cable to the target was pulled. I/O resumed shortly after the cable to the target was reinserted.

Perhaps I don't understand your answer -- I'm asking about dev_loss_tmo off and fast_io_fail_tmo >= 0. The other transports do not allow this scenario, and I'm asking if it makes sense for SRP to allow it. But now that you mention reconnect_delay, what is the meaning of that when it is negative? That's not in the documentation. And should it be considered in srp_tmo_valid() -- are there values of reconnect_delay that cause problems?

None of the combinations that can be configured from user space can get the kernel into trouble. If reconnect_delay <= 0, that means that the time-based reconnect mechanism is disabled.

Then it should use the same semantics as the other attributes, and have the user store "off" to turn it off. And I'm getting the strong sense that the answer to my question about fast_io_fail_tmo >= 0 when dev_loss_tmo is off is that we should not allow that combination, even if it doesn't break the kernel. If it doesn't make sense, there is no reason to create an opportunity for user confusion.

Hello Dave, when dev_loss_tmo expires, SRP not only removes the rport but also removes the associated scsi_host. One may wish to set fast_io_fail_tmo >= 0 for I/Os to fail over fast to other paths, and dev_loss_tmo off to keep the scsi_host around until the target comes back.

-vu
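The constraints debated in this thread — fast_io_fail_tmo capped at SCSI_DEVICE_BLOCK_MAX_TIMEOUT, each timeout individually allowed to be off, and fast_io_fail required to fire before dev_loss when both are enabled — can be sketched as a standalone validation function. This is an illustration of the rules under discussion, not the actual srp_tmo_valid() from scsi_transport_srp; encoding "off" as a negative value is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Cap on how long a device may stay blocked (seconds); the real
 * constant lives in the SCSI midlayer. */
#define SCSI_DEVICE_BLOCK_MAX_TIMEOUT 600

/* A negative timeout means "off" in this sketch. */
static bool srp_tmo_valid_sketch(int fast_io_fail_tmo, int dev_loss_tmo)
{
    /* fast_io_fail relies on blocking the device, which is capped. */
    if (fast_io_fail_tmo > SCSI_DEVICE_BLOCK_MAX_TIMEOUT)
        return false;
    /* If both timers run, I/O must fail before the rport is lost. */
    if (fast_io_fail_tmo >= 0 && dev_loss_tmo >= 0 &&
        fast_io_fail_tmo >= dev_loss_tmo)
        return false;
    return true;
}
```

Note that Vu's multipath case — fast_io_fail_tmo enabled with dev_loss_tmo off — passes this check, which is exactly the combination whose usefulness the thread is weighing.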
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On Thu, Jun 20, 2013 at 04:17:13PM -0400, Matthew Wilcox wrote:

A paper at FAST2012 (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed out the performance overhead of taking interrupts for low-latency block I/Os. The solution the author investigated was to spin waiting for each I/O to complete. This is inefficient as Linux submits many I/Os which are not latency-sensitive, and even when we do submit latency-sensitive I/Os (eg swap-in), we frequently submit several I/Os before waiting. This RFC takes a different approach, only spinning when we would otherwise sleep. To implement this, I add an 'io_poll' function pointer to backing_dev_info. I include a sample implementation for the NVMe driver. Next, I add an io_wait() function which will call io_poll() if it is set. It falls back to calling io_schedule() if anything goes wrong with io_poll() or the task exceeds its timeslice. Finally, all that is left is to judiciously replace calls to io_schedule() with calls to io_wait(). I think I've covered the main contenders with sleep_on_page(), sleep_on_buffer() and the DIO path. I've measured the performance benefits of this with a Chatham NVMe prototype device and a simple

# dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100

The latency of each I/O reduces by about 2.5us (from around 8.0us to around 5.5us). This matches up quite well with the performance numbers shown in the FAST2012 paper (which used a similar device).

Hi Matthew, I'm wondering where the 2.5us latency cut comes from. I did a simple test: on my 3.4GHz Xeon, one CPU can do about 2M context switches per second between applications. Assuming switching to idle is faster, switching to idle and back should take less than 1us. Does the 2.5us latency cut mostly come from deep idle state latency? If so, maybe setting a lower pm_qos value or having a better idle governor to prevent the CPU from entering a deep idle state could help too.
Thanks, Shaohua
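Matthew's io_wait() idea — try the driver's poll hook and fall back to sleeping when polling is unavailable or fails — can be modeled in userspace C. The structure and function names are illustrative stand-ins for the backing_dev_info hook described above; the "sleep" path is simulated by a counter rather than a real scheduler call.

```c
#include <assert.h>
#include <stddef.h>

/* Model of backing_dev_info with an optional polling hook. */
struct bdi_model {
    int (*io_poll)(struct bdi_model *bdi); /* 0 on completion, <0 on error */
    int sleeps;                            /* times we fell back to sleep  */
};

/* Stand-in for io_schedule(): just count how often we slept. */
static void io_schedule_model(struct bdi_model *bdi) { bdi->sleeps++; }

/* io_wait(): spin via the poll hook when one exists and succeeds,
 * otherwise fall back to the ordinary sleeping path. */
static void io_wait_model(struct bdi_model *bdi)
{
    if (bdi->io_poll && bdi->io_poll(bdi) == 0)
        return;                 /* I/O completed while polling */
    io_schedule_model(bdi);     /* no hook, or polling failed  */
}
```

The design point this captures is that drivers without an io_poll hook, and any poll that goes wrong, transparently get the existing io_schedule() behavior, so only latency-sensitive fast paths ever spin.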
Re: atomic write T10 standards
Ric Wheeler, on 07/03/2013 11:31 AM wrote:

Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.

We're mixing a bunch of concepts here. The filesystems have a lot of different requirements, and atomics are just one small part. Creating a new file often uses resources freed by past files. So deleting the old must be ordered against allocating the new. They are really separate atomic units, but you can't handle them completely independently.

If our existing journal commit is:
* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help? We would still need at least:
* atomic write of data blocks + commit blocks
* flush

Not necessary. Consider a case when you are creating many small files in a big directory. Consider that every such operation needs 3 actions: add a new directory entry, get free space and write data there. If one atomic write (scattered) command is used for each operation and you order them between each other, if needed, in some way, e.g. by using the ORDERED SCSI attribute or queue draining, you don't need any intermediate flushes. Only one final flush would be sufficient. In case of a crash, simply some of the new files would disappear, but everything would be fully consistent, so the only needed recovery would be to recreate them.

The catch is that our current flush mechanisms are still pretty brute force and act across either the whole device or in a temporal (everything flushed before this is acked) way. I still see it would be useful to have the atomic write really be atomic and durable just for that IO - no flush needed. Can you give a sequence for the use case for the non-durable atomic write that would not need a sync?

See above.

Can we really trust all devices to make something atomic that is not durable :) ?

Sure, if the application allows that and the atomicity property itself is durable, why not?

Vlad

P.S.
With atomic writes there's no need for a journal, no?