[PATCH v3 0/13] IB SRP initiator patches for kernel 3.11

2013-07-03 Thread Bart Van Assche

The purpose of this InfiniBand SRP initiator patch series is as follows:
- Make the SRP initiator driver better suited for use in an H.A. setup.
  Add fast_io_fail_tmo and dev_loss_tmo parameters. These can be used
  either to speed up failover or to avoid device removal when e.g. using
  initiator side mirroring.
- Make the SRP initiator better suited for use on NUMA systems by
  making the HCA completion vector configurable.

Changes since v2 of the "IB SRP initiator patches for kernel 3.11"
patch series:

- Improved documentation of the newly added sysfs parameters.
- Limit fast_io_fail_tmo to SCSI_DEVICE_BLOCK_MAX_TIMEOUT.
- Simplified the code for parsing values written into sysfs attributes.
- Fixed a potential deadlock in the code added in scsi_transport_srp
  (invoking cancel_delayed_work() with the rport mutex held for work
   that needs the rport mutex itself).
- Changed the default retry count back from 2 to 7 since there is not
  yet agreement about this change.
- Dropped the patch that silences failed SCSI commands and also the
  patch that fixes a race between srp_queuecommand() and
  srp_claim_req() since there is no agreement about these patches.

Changes since v1 of the "IB SRP initiator patches for kernel 3.11"
patch series:

- scsi_transport_srp: Allowed both fast_io_fail and dev_loss timeouts
  to be disabled.
- scsi_transport_srp, srp_reconnect_rport(): switched from
  scsi_block_requests() to scsi_target_block() for blocking SCSI command
  processing temporarily.
- scsi_transport_srp, srp_start_tl_fail_timers(): only block SCSI device
  command processing if the fast_io_fail timer is enabled.
- Changed srp_abort() such that upon transport offline the value
  FAST_IO_FAIL is returned instead of SUCCESS.
- Fixed a race condition in the maintain single connection patch: a
  new login after removal had started but before removal had finished
  still could create a duplicate connection. Fixed this by deferring
  removal from the target list until removal has finished.
- Modified the error message in the same patch for reporting that a
  duplicate connection has been rejected.
- Modified patch 2/15 such that all possible race conditions with
  srp_claim_req() are addressed.
- Documented the comp_vector and tl_retry_count login string parameters.
- Updated dev_loss_tmo and fast_io_fail_tmo documentation - mentioned
  "off" is a valid choice.

Changes compared to v5 of the "Make ib_srp better suited for H.A.
purposes" patch series:

- Left out patches that are already upstream.
- Made it possible to set dev_loss_tmo to "off". This is useful in a
  setup using initiator side mirroring to avoid new /dev/sd* names
  being assigned after a failover or after a cable pull and reinsert.
- Added kernel module parameters to ib_srp for configuring default
  values of the fast_io_fail_tmo and dev_loss_tmo parameters (a sketch
  of such parameters follows this list).
- Added a patch from Dotan Barak that fixes a kernel oops during rmmod
  triggered by resource allocation failure at module load time.
- Avoid duplicate connections by refusing relogins instead of dropping
  duplicate connections, as proposed by Sebastian Riemer.
- Added a patch from Sebastian Riemer for failing SCSI commands
  silently.
- Added a patch from Vu Pham to make the transport layer (IB RC) retry
  count configurable.
- Made HCA completion vector configurable.
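
As an illustration of what such module parameters can look like, a minimal
sketch follows; the names, defaults and permissions are assumptions, not
necessarily what the actual patches use:

	/* Hypothetical defaults; the real patches may differ. */
	static int srp_fast_io_fail_tmo = 15;
	module_param_named(fast_io_fail_tmo, srp_fast_io_fail_tmo, int, 0644);
	MODULE_PARM_DESC(fast_io_fail_tmo,
			 "Default time in seconds between a transport error and failing I/O");

	static int srp_dev_loss_tmo = 60;
	module_param_named(dev_loss_tmo, srp_dev_loss_tmo, int, 0644);
	MODULE_PARM_DESC(dev_loss_tmo,
			 "Default time in seconds between a transport error and port removal");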

Changes since v4:
- Added a patch for removing SCSI devices upon a port down event

Changes since v3:
- Restored the dev_loss_tmo and fast_io_fail_tmo sysfs attributes.
- Included a patch to fix an ib_srp crash that could be triggered by
  cable pulling.

Changes since v2:
- Addressed the v2 review comments.
- Dropped the patches that have already been merged.
- Dropped the patches for integration with multipathd.
- Dropped the micro-optimization of the IB completion handlers.

The individual patches in this series are as follows:
0001-IB-srp-Fix-remove_one-crash-due-to-resource-exhausti.patch
0002-IB-srp-Fix-race-between-srp_queuecommand-and-srp_cla.patch
0003-IB-srp-Avoid-that-srp_reset_host-is-skipped-after-a-.patch
0004-IB-srp-Fail-I-O-fast-if-target-offline.patch
0005-IB-srp-Skip-host-settle-delay.patch
0006-IB-srp-Maintain-a-single-connection-per-I_T-nexus.patch
0007-IB-srp-Keep-rport-as-long-as-the-IB-transport-layer.patch
0008-scsi_transport_srp-Add-transport-layer-error-handlin.patch
0009-IB-srp-Add-srp_terminate_io.patch
0010-IB-srp-Use-SRP-transport-layer-error-recovery.patch
0011-IB-srp-Start-timers-if-a-transport-layer-error-occur.patch
0012-IB-srp-Fail-SCSI-commands-silently.patch
0013-IB-srp-Make-HCA-completion-vector-configurable.patch
0014-IB-srp-Make-transport-layer-retry-count-configurable.patch
0015-IB-srp-Bump-driver-version-and-release-date.patch


Re: Can not see/access devices on Marvell 88SE9485 + SiI 3726 PMP

2013-07-03 Thread Hajo Möller
On 01.07.2013 16:38, James Bottomley wrote:
 mv_sas is a libsas based driver.  libsas doesn't have any support for
 SATA PMPs.  When it was added they were left as a todo item but then in
 the field everyone deployed enterprise type SATA devices in SAS expander
 chassis, so PMP support just got forgotten.

That's a pity, so it's unlikely libsas will get support for PMPs in the
near future or at all?  Sadly I'm not able to add the support myself.

Do you know of any SAS controllers which do support SATA PMPs?
I'm stuck with the PMPs, so I'm trying a SATA controller with an
88SX7042 next, although having a good SAS controller would be better in
the long run.

Greetings,
Hajo


[PATCH v3 06/13] IB/srp: Keep rport as long as the IB transport layer

2013-07-03 Thread Bart Van Assche
Keep the rport data structure around after srp_remove_host() has
finished until cleanup of the IB transport layer has finished
completely. This is necessary because later patches use the rport
pointer inside the queuecommand callback. Without this patch
accessing the rport from inside a queuecommand callback is racy
because srp_remove_host() must be invoked before scsi_remove_host()
and because the queuecommand callback may get invoked after
srp_remove_host() has finished.
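
For clarity, a minimal sketch of the queuecommand usage this enables
(illustrative only: host_to_target() exists in ib_srp, but the body shown
here is not part of this patch):

	static int srp_queuecommand(struct Scsi_Host *shost,
				    struct scsi_cmnd *scmnd)
	{
		struct srp_target_port *target = host_to_target(shost);
		struct srp_rport *rport = target->rport;

		/* Dereferencing rport is safe even after srp_remove_host()
		 * has finished: srp_remove_target() holds a reference on
		 * the rport across the entire teardown. */
		if (!rport)
			return SCSI_MLQUEUE_HOST_BUSY;

		/* ... normal SRP_CMD submission continues here ... */
		return 0;
	}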

Signed-off-by: Bart Van Assche bvanass...@acm.org
Cc: Roland Dreier rol...@purestorage.com
Cc: James Bottomley jbottom...@parallels.com
Cc: David Dillow dillo...@ornl.gov
Cc: Vu Pham v...@mellanox.com
Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
---
 drivers/infiniband/ulp/srp/ib_srp.c |3 +++
 drivers/infiniband/ulp/srp/ib_srp.h |1 +
 drivers/scsi/scsi_transport_srp.c   |   18 ++
 include/scsi/scsi_transport_srp.h   |2 ++
 4 files changed, 24 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index f046e32..f65701d 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -526,11 +526,13 @@ static void srp_remove_target(struct srp_target_port 
*target)
	WARN_ON_ONCE(target->state != SRP_TARGET_REMOVED);

	srp_del_scsi_host_attr(target->scsi_host);
+	srp_rport_get(target->rport);
	srp_remove_host(target->scsi_host);
	scsi_remove_host(target->scsi_host);
	srp_disconnect_target(target);
	ib_destroy_cm_id(target->cm_id);
	srp_free_target_ib(target);
+	srp_rport_put(target->rport);
	srp_free_req_data(target);
	scsi_host_put(target->scsi_host);
 }
@@ -1982,6 +1984,7 @@ static int srp_add_target(struct srp_host *host, struct 
srp_target_port *target)
}
 
	rport->lld_data = target;
+	target->rport = rport;

	spin_lock(&host->target_lock);
	list_add_tail(&target->list, &host->target_list);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h 
b/drivers/infiniband/ulp/srp/ib_srp.h
index 66fbedd..1817ed5 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -153,6 +153,7 @@ struct srp_target_port {
	u16			io_class;
	struct srp_host		*srp_host;
	struct Scsi_Host	*scsi_host;
+	struct srp_rport	*rport;
	char			target_name[32];
	unsigned int		scsi_id;
	unsigned int		sg_tablesize;
diff --git a/drivers/scsi/scsi_transport_srp.c 
b/drivers/scsi/scsi_transport_srp.c
index f379c7f..f7ba94a 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -185,6 +185,24 @@ static int srp_host_match(struct attribute_container 
*cont, struct device *dev)
 }
 
 /**
+ * srp_rport_get() - increment rport reference count
+ */
+void srp_rport_get(struct srp_rport *rport)
+{
+	get_device(&rport->dev);
+}
+EXPORT_SYMBOL(srp_rport_get);
+
+/**
+ * srp_rport_put() - decrement rport reference count
+ */
+void srp_rport_put(struct srp_rport *rport)
+{
+	put_device(&rport->dev);
+}
+EXPORT_SYMBOL(srp_rport_put);
+
+/**
  * srp_rport_add - add a SRP remote port to the device hierarchy
  * @shost: scsi host the remote port is connected to.
  * @ids:   The port id for the remote port.
diff --git a/include/scsi/scsi_transport_srp.h 
b/include/scsi/scsi_transport_srp.h
index ff0f04a..5a2d2d1 100644
--- a/include/scsi/scsi_transport_srp.h
+++ b/include/scsi/scsi_transport_srp.h
@@ -38,6 +38,8 @@ extern struct scsi_transport_template *
 srp_attach_transport(struct srp_function_template *);
 extern void srp_release_transport(struct scsi_transport_template *);
 
+extern void srp_rport_get(struct srp_rport *rport);
+extern void srp_rport_put(struct srp_rport *rport);
 extern struct srp_rport *srp_rport_add(struct Scsi_Host *,
   struct srp_rport_identifiers *);
 extern void srp_rport_del(struct srp_rport *);
-- 
1.7.10.4



[PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling

2013-07-03 Thread Bart Van Assche
Add the necessary functions in the SRP transport module to allow
an SRP initiator driver to implement transport layer error handling
similar to the functionality already provided by the FC transport
layer. This includes:
- Support for implementing fast_io_fail_tmo, the time that should
  elapse after having detected a transport layer problem and
  before failing I/O.
- Support for implementing dev_loss_tmo, the time that should
  elapse after having detected a transport layer problem and
  before removing a remote port.
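
For orientation, a minimal sketch of how an initiator driver is expected
to hook into this, based on the srp_start_tl_fail_timers() entry point
referenced in the cover letter (the surrounding error-handler context is
an assumption):

	/* Called by the initiator driver on a transport layer error. */
	static void srp_lld_handle_tl_error(struct srp_rport *rport)
	{
		/* scsi_transport_srp blocks SCSI command processing,
		 * fails I/O once fast_io_fail_tmo fires and removes the
		 * rport once dev_loss_tmo fires. */
		srp_start_tl_fail_timers(rport);
	}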

Signed-off-by: Bart Van Assche bvanass...@acm.org
Cc: Roland Dreier rol...@purestorage.com
Cc: James Bottomley jbottom...@parallels.com
Cc: David Dillow dillo...@ornl.gov
Cc: Vu Pham v...@mellanox.com
Cc: Sebastian Riemer sebastian.rie...@profitbricks.com
---
 Documentation/ABI/stable/sysfs-transport-srp |   38 +++
 drivers/scsi/scsi_transport_srp.c|  468 +-
 include/scsi/scsi_transport_srp.h|   62 +++-
 3 files changed, 565 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-transport-srp 
b/Documentation/ABI/stable/sysfs-transport-srp
index b36fb0d..52babb9 100644
--- a/Documentation/ABI/stable/sysfs-transport-srp
+++ b/Documentation/ABI/stable/sysfs-transport-srp
@@ -5,6 +5,24 @@ Contact:   linux-scsi@vger.kernel.org, 
linux-r...@vger.kernel.org
 Description:   Instructs an SRP initiator to disconnect from a target and to
remove all LUNs imported from that target.
 
+What:  /sys/class/srp_remote_ports/port-<h>:<n>/dev_loss_tmo
+Date:  October 1, 2013
+KernelVersion: 3.11
+Contact:   linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:   Number of seconds the SCSI layer will wait after a transport
+   layer error has been observed before removing a target port.
+   Zero means immediate removal. Setting this attribute to "off"
+   will disable this behavior.
+
+What:  /sys/class/srp_remote_ports/port-<h>:<n>/fast_io_fail_tmo
+Date:  October 1, 2013
+KernelVersion: 3.11
+Contact:   linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:   Number of seconds the SCSI layer will wait after a transport
+   layer error has been observed before failing I/O. Zero means
+   failing I/O immediately. Setting this attribute to "off" will
+   disable this behavior.
+
What:  /sys/class/srp_remote_ports/port-<h>:<n>/port_id
 Date:  June 27, 2007
 KernelVersion: 2.6.24
@@ -12,8 +30,28 @@ Contact: linux-scsi@vger.kernel.org
 Description:   16-byte local SRP port identifier in hexadecimal format. An
example: 4c:49:4e:55:58:20:56:49:4f:00:00:00:00:00:00:00.
 
+What:  /sys/class/srp_remote_ports/port-<h>:<n>/reconnect_delay
+Date:  October 1, 2013
+KernelVersion: 3.11
+Contact:   linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:   Number of seconds the SCSI layer will wait after a reconnect
+   attempt failed before retrying.
+
What:  /sys/class/srp_remote_ports/port-<h>:<n>/roles
 Date:  June 27, 2007
 KernelVersion: 2.6.24
 Contact:   linux-scsi@vger.kernel.org
 Description:   Role of the remote port. Either SRP Initiator or SRP Target.
+
+What:  /sys/class/srp_remote_ports/port-<h>:<n>/state
+Date:  October 1, 2013
+KernelVersion: 3.11
+Contact:   linux-scsi@vger.kernel.org, linux-r...@vger.kernel.org
+Description:   State of the transport layer used for communication with the
+   remote port. "running" if the transport layer is operational;
+   "blocked" if a transport layer error has been encountered but
+   the fast_io_fail_tmo timer has not yet fired; "fail-fast"
+   after the fast_io_fail_tmo timer has fired and before the
+   dev_loss_tmo timer has fired; "lost" after the
+   dev_loss_tmo timer has fired and before the port is finally
+   removed.
diff --git a/drivers/scsi/scsi_transport_srp.c 
b/drivers/scsi/scsi_transport_srp.c
index f7ba94a..1b9ebd5 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -24,12 +24,15 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/string.h>
+#include <linux/delay.h>
 
 #include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_transport.h>
 #include <scsi/scsi_transport_srp.h>
+#include "scsi_priv.h"
 #include "scsi_transport_srp_internal.h"
 
 struct srp_host_attrs {
@@ -38,7 +41,7 @@ struct srp_host_attrs {
#define to_srp_host_attrs(host)	((struct srp_host_attrs *)(host)->shost_data)
 
 #define SRP_HOST_ATTRS 0
-#define SRP_RPORT_ATTRS 3
+#define SRP_RPORT_ATTRS 8
 
 struct srp_internal {
struct scsi_transport_template t;
@@ -54,6 +57,26 @@ struct srp_internal {
 
#define dev_to_rport(d)	container_of(d, struct srp_rport, dev)
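
The diff breaks off here in the archive. For orientation, the four states
documented in the "state" ABI entry above map naturally onto an enum like
the following (a sketch; the names are assumptions based on the ABI text,
not quoted from the patch):

	enum srp_rport_state {
		SRP_RPORT_RUNNING,	/* "running": transport operational */
		SRP_RPORT_BLOCKED,	/* "blocked": fast_io_fail_tmo pending */
		SRP_RPORT_FAIL_FAST,	/* "fail-fast": failing I/O */
		SRP_RPORT_LOST,		/* "lost": port about to be removed */
	};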
 

Re: [PATCH v3 0/13] IB SRP initiator patches for kernel 3.11

2013-07-03 Thread Or Gerlitz

On 03/07/2013 15:41, Bart Van Assche wrote:


[...]

Bart,


The individual patches in this series are as follows:
0001-IB-srp-Fix-remove_one-crash-due-to-resource-exhausti.patch
0002-IB-srp-Fix-race-between-srp_queuecommand-and-srp_cla.patch
0003-IB-srp-Avoid-that-srp_reset_host-is-skipped-after-a-.patch
0004-IB-srp-Fail-I-O-fast-if-target-offline.patch
0005-IB-srp-Skip-host-settle-delay.patch
0006-IB-srp-Maintain-a-single-connection-per-I_T-nexus.patch
0007-IB-srp-Keep-rport-as-long-as-the-IB-transport-layer.patch
0008-scsi_transport_srp-Add-transport-layer-error-handlin.patch
0009-IB-srp-Add-srp_terminate_io.patch
0010-IB-srp-Use-SRP-transport-layer-error-recovery.patch
0011-IB-srp-Start-timers-if-a-transport-layer-error-occur.patch
0012-IB-srp-Fail-SCSI-commands-silently.patch
0013-IB-srp-Make-HCA-completion-vector-configurable.patch
0014-IB-srp-Make-transport-layer-retry-count-configurable.patch
0015-IB-srp-Bump-driver-version-and-release-date.patch


Some of these patches were already picked up by Roland (see below), so I
would suggest that you post a v4 and drop the ones which were accepted.

e8ca413 IB/srp: Bump driver version and release date
4b5e5f4 IB/srp: Make HCA completion vector configurable
96fc248 IB/srp: Maintain a single connection per I_T nexus
99e1c13 IB/srp: Fail I/O fast if target offline
2742c1d IB/srp: Skip host settle delay
086f44f IB/srp: Avoid skipping srp_reset_host() after a transport error
1fe0cb8 IB/srp: Fix remove_one crash due to resource exhaustion

Also, it would help if you used the --cover-letter option of git
format-patch and the resulting cover letter (patch 0/N), as it has
standard content which you can enhance with your additions.



Re: [PATCH v3 0/13] IB SRP initiator patches for kernel 3.11

2013-07-03 Thread Bart Van Assche

On 07/03/13 15:38, Or Gerlitz wrote:

Some of these patches were already picked by Roland (SB), I would
suggest that you post V4 and drop the ones which were accepted.


One of the patches that is already in Roland's tree and that was in v1
of this series has been split into two patches in v2 and v3 of this
series. So I'd like to hear which option Roland prefers: that I drop the
patches that are already in his tree, or that he updates his tree with
the most recently posted patches.


Bart.


atomic write T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 11:00 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 10:56 -0400, Ric Wheeler wrote:

On 07/03/2013 10:38 AM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 10:34:04)

As I was out walking Skeeter this morning, I was thinking a bit about the new
T10 atomic write proposal that Chris spoke about some time back.

Specifically, I think that we would see a value only if the atomic write was
also durable - if not, we need to always issue a SYNCHRONIZE_CACHE command which
would mean it really is not effectively more useful than a normal write?

Did I understand the proposal correctly?  If I did, should we poke the usual T10
posse to nudge them (David Black, Fred Knight, etc?)...

I don't think the atomic writes should be a special case here.  We've
already got the cache flush and fua machinery and should just apply it
on top of the atomic constructs...

-chris


I should have sent this to the linux-scsi list I suppose, but wanted clarity
before embarrassing myself :)

Yes, it is better to have a wider audience


Adding in linux-scsi




If we have to use fua/flush after an atomic write, what makes it atomic?  Why
not just use a normal write?

It does not seem to add anything that write + flush/fua does?

It adds the all or nothing that we can use to commit journal entries
without having to worry about atomicity.  The guarantee is that
everything makes it or nothing does.


I still don't see the difference in write + SYNC_CACHE versus atomic write + 
SYNC_CACHE.


If the write is atomic and not durable, it is not really usable as a hard 
promise until after we flush it somehow.


In theory, if we got ordered tags working to ensure transaction vs data
ordering, this would mean we wouldn't have to flush at all because the
disk image would always be journal consistent ... a bit like the old
soft update scheme.

James



Why not have the atomic write actually imply that it is atomic and durable for 
just that command?


Ric



Re: Can not see/access devices on Marvell 88SE9485 + SiI 3726 PMP

2013-07-03 Thread James Bottomley
On Wed, 2013-07-03 at 14:52 +0200, Hajo Möller wrote:
 On 01.07.2013 16:38, James Bottomley wrote:
  mv_sas is a libsas based driver.  libsas doesn't have any support for
  SATA PMPs.  When it was added they were left as a todo item but then in
  the field everyone deployed enterprise type SATA devices in SAS expander
  chassis, so PMP support just got forgotten.
 
 That's a pity, so it's unlikely libsas will get support for PMPs in the
 near future or at all?  Sadly I'm not able to add the support myself.
 
 Do you know of any SAS controllers which do support SATA PMPs?
 I'm stuck with the PMPs, so I'm trying a SATA controller with an
 88SX7042 next, although having a good SAS controller would be better in
 the long run.

No, I'm afraid not.  Libsas has been released for nearly seven years and
no-one has noticed in all that time, which means it's a really rare
thing to do.  Everyone who uses SAS controllers usually has either a
direct connection or an expander.  There's no real reason to buy
expensive kit and then plug it into a cheap port switch.  That said,
it's possible that some of the firmware SAT blob based controllers, like
the MPT ones, might have PMP support ... but because of the rarity it's
likely to be buggy.

James




Re: atomic write T10 standards

2013-07-03 Thread James Bottomley
On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
 On 07/03/2013 11:00 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 10:56 -0400, Ric Wheeler wrote:
  On 07/03/2013 10:38 AM, Chris Mason wrote:
  Quoting Ric Wheeler (2013-07-03 10:34:04)
  As I was out walking Skeeter this morning, I was thinking a bit about 
  the new
  T10 atomic write proposal that Chris spoke about some time back.
 
  Specifically, I think that we would see a value only if the atomic write 
  was
  also durable - if not, we need to always issue a SYNCHRONIZE_CACHE 
  command which
  would mean it really is not effectively more useful than a normal write?
 
  Did I understand the proposal correctly?  If I did, should we poke the 
  usual T10
  posse to nudge them (David Black, Fred Knight, etc?)...
  I don't think the atomic writes should be a special case here.  We've
  already got the cache flush and fua machinery and should just apply it
  on top of the atomic constructs...
 
  -chris
 
  I should have sent this to the linux-scsi list I suppose, but wanted 
  clarity
  before embarrassing myself :)
  Yes, it is better to have a wider audience
 
 Adding in linux-scsi
 
 
  If we have to use fua/flush after an atomic write, what makes it atomic?  
  Why
  not just use a normal write?
 
  It does not seem to add anything that write + flush/fua does?
  It adds the all or nothing that we can use to commit journal entries
  without having to worry about atomicity.  The guarantee is that
  everything makes it or nothing does.
 
 I still don't see the difference in write + SYNC_CACHE versus atomic write + 
 SYNC_CACHE.
 
 If the write is atomic and not durable, it is not really usable as a hard 
 promise until after we flush it somehow.
 
  In theory, if we got ordered tags working to ensure transaction vs data
  ordering, this would mean we wouldn't have to flush at all because the
  disk image would always be journal consistent ... a bit like the old
  soft update scheme.
 
  James
 
 
 Why not have the atomic write actually imply that it is atomic and durable 
 for 
 just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

James




Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling

2013-07-03 Thread David Dillow
On Wed, 2013-07-03 at 14:54 +0200, Bart Van Assche wrote:
 +int srp_tmo_valid(int fast_io_fail_tmo, int dev_loss_tmo)
 +{
 +	return (fast_io_fail_tmo < 0 || dev_loss_tmo < 0 ||
 +		fast_io_fail_tmo < dev_loss_tmo) &&
 +		fast_io_fail_tmo <= SCSI_DEVICE_BLOCK_MAX_TIMEOUT &&
 +		dev_loss_tmo < LONG_MAX / HZ ? 0 : -EINVAL;
 +}
 +EXPORT_SYMBOL_GPL(srp_tmo_valid);

This would have been more readable:

int srp_tmo_valid(int fast_io_fail_tmo, int dev_loss_tmo)
{
	/* Fast IO fail must be off, or no greater than the max timeout */
	if (fast_io_fail_tmo > SCSI_DEVICE_BLOCK_MAX_TIMEOUT)
		return -EINVAL;

	/* Device timeout must be off, or fit into jiffies */
	if (dev_loss_tmo >= LONG_MAX / HZ)
		return -EINVAL;

	/* Fast IO must trigger before device loss, or one of the
	 * timeouts must be disabled.
	 */
	if (fast_io_fail_tmo < 0 || dev_loss_tmo < 0)
		return 0;
	if (fast_io_fail_tmo < dev_loss_tmo)
		return 0;

	return -EINVAL;
}

Though, now that I've unpacked it -- I don't think it is OK for
dev_loss_tmo to be off, but fast IO to be on? That drops another
conditional.

Also, FC caps dev_loss_tmo at SCSI_DEVICE_BLOCK_MAX_TIMEOUT if
fast_io_fail_tmo is off; I agree with your reasoning about leaving it
unlimited if fast fail is on, but does that still hold if it is off?





Re: atomic write T10 standards

2013-07-03 Thread James Bottomley
On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
 On 07/03/2013 11:22 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
  Why not have the atomic write actually imply that it is atomic and durable 
  for
  just that command?
  I don't understand why you think you need guaranteed durability for
  every journal transaction?  That's what causes us performance problems
  because we have to pause on every transaction commit.
 
  We require durability for explicit flushes, obviously, but we could
  achieve far better performance if we could just let the filesystem
  updates stream to the disk and rely on atomic writes making sure the
  journal entries were all correct.  The reason we require durability for
  journal entries today is to ensure caching effects don't cause the
  journal to lie or be corrupt.
 
 Why would we use atomic writes for things that don't need to be
 durable?
 
 Avoid a torn page write seems to be the only real difference here if
 you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James
 



Re: atomic write T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 11:37 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:

On 07/03/2013 11:22 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for
just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be
durable?

Avoid a torn page write seems to be the only real difference here if
you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James
  



Journals are normally big (128MB or so?) - I don't think that this is unique to 
xfs.

If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Right?

Ric



Re: [PATCH 1/4] scsi: ufs: Fix broken task management command implementation

2013-07-03 Thread Sujit Reddy Thumma

On 7/2/2013 9:21 PM, Santosh Y wrote:

On Fri, Jun 28, 2013 at 5:02 PM, Sujit Reddy Thumma
sthu...@codeaurora.org wrote:

On 6/27/2013 4:49 PM, Santosh Y wrote:



+   spin_lock_irqsave(host->host_lock, flags);
 task_req_descp = hba->utmrdl_base_addr;
 task_req_descp += free_slot;

@@ -2353,38 +2387,39 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
 (struct utp_upiu_task_req *)
task_req_descp->task_req_upiu;
 task_req_upiup->header.dword_0 =
 UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
- lrbp->lun,
lrbp->task_tag);
+   lun_id, free_slot);


Actually it still doesn't fix the problem. The *task tag* used here

should be unique across the SCSI/Query and Task Management UPIUs.



I am sorry, I didn't get that. Why should it be unique across the
SCSI/Query? For example, if a machine supports 32 request slots and 8 task
management slots, then the task management command tag can be anything out
of 8 slots.



The spec (UFS 1.1) has the requirement under '10.5.2 Basic Header
Format' -> 'Task Tag'.
Couple of devices I came across had similar behavior. The tracking of
UPIUs --even belonging to a separate group-- seemed to be based on the
'task tag' value rather than 'type of UPIU' + 'task tag'.


Thanks for the clarification. So to make the task tag unique we should 
do something like below:


@@ -2940,9 +2941,10 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba 
*hba, int lun_id, int task_id,

/* Configure task request UPIU */
task_req_upiup =
(struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
+   task_tag = hba->nutrs + free_slot;
task_req_upiup->header.dword_0 =
UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
-   lun_id, free_slot);
+   lun_id, task_tag);

Will this work for the devices you came across?
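
For clarity, the tag partitioning this implies (a sketch; the helper
name is made up, the arithmetic is the one discussed above):

	/*
	 * Transfer requests use tags 0..hba->nutrs-1; offsetting task
	 * management tags by nutrs keeps every outstanding tag unique
	 * across UPIU types, as the spec requires.
	 */
	static u32 ufshcd_tm_task_tag(struct ufs_hba *hba, int free_slot)
	{
		return hba->nutrs + free_slot;
	}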


--
Regards,
Sujit


Re: atomic write T10 standards

2013-07-03 Thread Chris Mason
Quoting Ric Wheeler (2013-07-03 11:42:38)
 On 07/03/2013 11:37 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
  On 07/03/2013 11:22 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
  Why not have the atomic write actually imply that it is atomic and 
  durable for
  just that command?
  I don't understand why you think you need guaranteed durability for
  every journal transaction?  That's what causes us performance problems
  because we have to pause on every transaction commit.
 
  We require durability for explicit flushes, obviously, but we could
  achieve far better performance if we could just let the filesystem
  updates stream to the disk and rely on atomic writes making sure the
  journal entries were all correct.  The reason we require durability for
  journal entries today is to ensure caching effects don't cause the
  journal to lie or be corrupt.
  Why would we use atomic writes for things that don't need to be
  durable?
 
  Avoid a torn page write seems to be the only real difference here if
  you use the atomic operations and don't have durability...
  It's not just about torn pages: Journal entries are big complex beasts.
  They can be megabytes big (at least on xfs).  If we can guarantee all or
  nothing atomicity in the entire journal entry write it permits a more
  streaming design of the filesystem writeout path.
 
  James

 
 
 Journals are normally big (128MB or so?) - I don't think that this is unique 
 to xfs.

We're mixing a bunch of concepts here.  The filesystems have a lot of
different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files.  So
deleting the old must be ordered against allocating the new.  They are
really separate atomic units but you can't handle them completely
independently.

 
 If our existing journal commit is:
 
 * write the data blocks for a transaction
 * flush
 * write the commit block for the transaction
 * flush
 
  Which part of this does an atomic write help?
 
 We would still need at least:
 
  * atomic write of data blocks & commit blocks
 * flush

Yes.  But just because we need the flush here doesn't mean we need the
flush for every single atomic write.

-chris



Re: [PATCH 1/4] scsi: ufs: Fix broken task management command implementation

2013-07-03 Thread Santosh Y
On Wed, Jul 3, 2013 at 9:22 PM, Sujit Reddy Thumma
sthu...@codeaurora.org wrote:
 On 7/2/2013 9:21 PM, Santosh Y wrote:

 On Fri, Jun 28, 2013 at 5:02 PM, Sujit Reddy Thumma
 sthu...@codeaurora.org wrote:

 On 6/27/2013 4:49 PM, Santosh Y wrote:


 +   spin_lock_irqsave(host->host_lock, flags);
  task_req_descp = hba->utmrdl_base_addr;
  task_req_descp += free_slot;

 @@ -2353,38 +2387,39 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
  (struct utp_upiu_task_req *)
 task_req_descp->task_req_upiu;
  task_req_upiup->header.dword_0 =
  UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
 - lrbp->lun,
 lrbp->task_tag);
 +   lun_id, free_slot);


 Actually it still doesn't fix the problem. The *task tag* used here

 should be unique across the SCSI/Query and Task Management UPIUs.



 I am sorry, I didn't get that. Why should it be unique across the
 SCSI/Query? For example, if a machine supports 32 request slots and 8
 task
 management slots, then the task management command tag can be anything
 out
 of 8 slots.


 The spec (UFS 1.1) has the requirement under '10.5.2 Basic Header
 Format' -> 'Task Tag'.
 Couple of devices I came across had similar behavior. The tracking of
 UPIUs --even belonging to a separate group-- seemed to be based on the
 'task tag' value rather than 'type of UPIU' + 'task tag'.


 Thanks for the clarification. So to make the task tag unique we should do
 something like below:

 @@ -2940,9 +2941,10 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba,
 int lun_id, int task_id,
 /* Configure task request UPIU */
 task_req_upiup =

 (struct utp_upiu_task_req *) task_req_descp->task_req_upiu;
 +   task_tag = hba->nutrs + free_slot;

Yes, this was exactly my internal fix at the time :-)


 task_req_upiup->header.dword_0 =
 UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
 -   lun_id, free_slot);
 +   lun_id, task_tag);

 Will this work for the devices you came across?


 --
 Regards,
 Sujit



-- 
~Santosh


Re: [PATCH 1/4] scsi: ufs: Fix broken task management command implementation

2013-07-03 Thread Sujit Reddy Thumma

On 7/3/2013 9:53 PM, Santosh Y wrote:

On Wed, Jul 3, 2013 at 9:22 PM, Sujit Reddy Thumma
sthu...@codeaurora.org wrote:

On 7/2/2013 9:21 PM, Santosh Y wrote:


On Fri, Jun 28, 2013 at 5:02 PM, Sujit Reddy Thumma
sthu...@codeaurora.org wrote:


On 6/27/2013 4:49 PM, Santosh Y wrote:




+   spin_lock_irqsave(host->host_lock, flags);
  task_req_descp = hba->utmrdl_base_addr;
  task_req_descp += free_slot;

@@ -2353,38 +2387,39 @@ ufshcd_issue_tm_cmd(struct ufs_hba *hba,
  (struct utp_upiu_task_req *)
task_req_descp->task_req_upiu;
  task_req_upiup->header.dword_0 =
  UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
- lrbp->lun,
lrbp->task_tag);
+   lun_id, free_slot);



Actually it still doesn't fix the problem. The *task tag* used here

should be unique across the SCSI/Query and Task Management UPIUs.




I am sorry, I didn't get that. Why should it be unique across the
SCSI/Query? For example, if a machine supports 32 request slots and 8
task
management slots, then the task management command tag can be anything
out
of 8 slots.



The spec (UFS 1.1) has the requirement under '10.5.2 Basic Header
Format' -> 'Task Tag'.
Couple of devices I came across had similar behavior. The tracking of
UPIUs --even belonging to a separate group-- seemed to be based on the
'task tag' value rather than 'type of UPIU' + 'task tag'.



Thanks for the clarification. So to make the task tag unique we should do
something like below:

@@ -2940,9 +2941,10 @@ static int ufshcd_issue_tm_cmd(struct ufs_hba *hba,
int lun_id, int task_id,
 /* Configure task request UPIU */
 task_req_upiup =

 (struct utp_upiu_task_req *) task_req_descp-task_req_upiu;
+   task_tag = hba-nutrs + free_slot;


Yes, this was exactly my internal fix at the time :-)


Okay. I will update the patch with this. Thanks.





 task_req_upiup->header.dword_0 =
 UPIU_HEADER_DWORD(UPIU_TRANSACTION_TASK_REQ, 0,
-   lun_id, free_slot);
+   lun_id, task_tag);

Will this work for the devices you came across?




--
Regards,
Sujit


Re: [PATCH V2 3/4] scsi: ufs: Fix device and host reset methods

2013-07-03 Thread Sujit Reddy Thumma

On 7/3/2013 11:19 AM, Santosh Y wrote:

+
+/**
+ * ufshcd_eh_device_reset_handler - device reset handler registered to
+ *    scsi layer.
+ * @cmd - SCSI command pointer
+ *
+ * Returns SUCCESS/FAILED
+ */
+static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
+{
+   struct ufs_hba *hba;
+   int err;
+   unsigned long flags;
+
+   hba = shost_priv(cmd->device->host);
+
+   spin_lock_irqsave(hba->host->host_lock, flags);
+   if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
+   dev_warn(hba->dev, "%s: reset in progress\n", __func__);
+   err = SUCCESS;
+   spin_unlock_irqrestore(hba->host->host_lock, flags);
+   goto out;

It is better to wait here until the state changes to 'operational' or
'error' before returning success.


Okay. Sounds good.
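
A minimal sketch of such a wait (the wait queue and its wake-up sites are
assumptions, not existing ufshcd code):

	/* Instead of returning immediately while a reset is in progress: */
	spin_unlock_irqrestore(hba->host->host_lock, flags);
	wait_event(hba->eh_wait,	/* hypothetical wait queue */
		   hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL ||
		   hba->ufshcd_state == UFSHCD_STATE_ERROR);
	err = hba->ufshcd_state == UFSHCD_STATE_OPERATIONAL ?
		SUCCESS : FAILED;
	goto out;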

--
Regards,
Sujit


Re: [PATCH V2 4/4] scsi: ufs: Improve UFS fatal error handling

2013-07-03 Thread Santosh Y
 +
 +/**
 + * ufshcd_fatal_err_handler - handle fatal errors
 + * @work: pointer to work structure
   */
  static void ufshcd_fatal_err_handler(struct work_struct *work)
  {
 struct ufs_hba *hba;
 +   unsigned long flags;
 +   u32 err_xfer = 0;
 +   u32 err_tm = 0;
 +   int err;
 +
 hba = container_of(work, struct ufs_hba, feh_workq);

 -   /* check if reset is already in progress */
 -   if (hba->ufshcd_state != UFSHCD_STATE_RESET)
 -   ufshcd_do_reset(hba);
 +   spin_lock_irqsave(hba->host->host_lock, flags);
 +   if (hba->ufshcd_state == UFSHCD_STATE_RESET) {
 +   /* complete processed requests and exit */
 +   ufshcd_transfer_req_compl(hba);
 +   ufshcd_tmc_handler(hba);
 +   spin_unlock_irqrestore(hba->host->host_lock, flags);
 +   return;

I could not go through this patch yet, please check if it needs to
wait here until the state is 'operational' or 'error'.
The 'reset' state might be due to the device reset also.

 +   }
 +
 +   hba->ufshcd_state = UFSHCD_STATE_RESET;
 +   ufshcd_error_autopsy_transfer_req(hba, &err_xfer);
 +   ufshcd_error_autopsy_task_req(hba, &err_tm);
 +


-- 
~Santosh


Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling

2013-07-03 Thread Bart Van Assche

On 07/03/13 19:27, David Dillow wrote:

On Wed, 2013-07-03 at 18:00 +0200, Bart Van Assche wrote:

The combination of dev_loss_tmo off and reconnect_delay > 0 worked fine
in my tests. An I/O failure was detected shortly after the cable to the
target was pulled. I/O resumed shortly after the cable to the target was
reinserted.


Perhaps I don't understand your answer -- I'm asking about dev_loss_tmo
< 0, and fast_io_fail_tmo >= 0. The other transports do not allow this
scenario, and I'm asking if it makes sense for SRP to allow it.

But now that you mention reconnect_delay, what is the meaning of that
when it is negative? That's not in the documentation. And should it be
considered in srp_tmo_valid() -- are there values of reconnect_delay
that cause problems?


None of the combinations that can be configured from user space can
get the kernel into trouble. If reconnect_delay <= 0 that means that the
time-based reconnect mechanism is disabled.



I'm starting to get a bit concerned about this patch -- can you, Vu, and
Sebastian comment on the testing you have done?


All combinations of reconnect_delay, fast_io_fail_tmo and dev_loss_tmo 
that result in different behavior have been tested.



Also, FC caps dev_loss_tmo at SCSI_DEVICE_BLOCK_MAX_TIMEOUT if
fast_io_fail_tmo is off; I agree with your reasoning about leaving it
unlimited if fast fail is on, but does that still hold if it is off?


I think setting dev_loss_tmo to a large value only makes sense if the
value of reconnect_delay is not too large. Setting both to a large value
would result in slow recovery after a transport layer failure has been
corrected.


So you agree it should be capped? I can't tell from your response.


Not all combinations of reconnect_delay / fast_io_fail_tmo /
dev_loss_tmo result in useful behavior. It is up to the user to choose a 
meaningful combination.


Bart.


Re: atomic write T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 11:54 AM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 11:42:38)

On 07/03/2013 11:37 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:

On 07/03/2013 11:22 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for
just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be
durable?

Avoid a torn page write seems to be the only real difference here if
you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James
   


Journals are normally big (128MB or so?) - I don't think that this is unique to 
xfs.

We're mixing a bunch of concepts here.  The filesystems have a lot of
different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files.  So
deleting the old must be ordered against allocating the new.  They are
really separate atomic units but you can't handle them completely
independently.


If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Yes.  But just because we need the flush here doesn't mean we need the
flush for every single atomic write.

-chris



The catch is that our current flush mechanisms are still pretty brute force and 
act across either the whole device or in a temporal (everything flushed before 
this is acked) way.


I still see it would be useful to have the atomic write really be atomic and 
durable just for that IO - no flush needed.


Can you give a sequence for the use case for the non-durable atomic write that 
would not need a sync? Can we really trust all devices to make something atomic 
that is not durable :) ?


thanks!

ric




Re: atomic write T10 standards

2013-07-03 Thread Chris Mason
Quoting Ric Wheeler (2013-07-03 14:31:59)
 On 07/03/2013 11:54 AM, Chris Mason wrote:
  Quoting Ric Wheeler (2013-07-03 11:42:38)
  On 07/03/2013 11:37 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
  On 07/03/2013 11:22 AM, James Bottomley wrote:
  On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
  Why not have the atomic write actually imply that it is atomic and 
  durable for
  just that command?
  I don't understand why you think you need guaranteed durability for
  every journal transaction?  That's what causes us performance problems
  because we have to pause on every transaction commit.
 
  We require durability for explicit flushes, obviously, but we could
  achieve far better performance if we could just let the filesystem
  updates stream to the disk and rely on atomic writes making sure the
  journal entries were all correct.  The reason we require durability for
  journal entries today is to ensure caching effects don't cause the
  journal to lie or be corrupt.
  Why would we use atomic writes for things that don't need to be
  durable?
 
  Avoid a torn page write seems to be the only real difference here if
  you use the atomic operations and don't have durability...
  It's not just about torn pages: Journal entries are big complex beasts.
  They can be megabytes big (at least on xfs).  If we can guarantee all or
  nothing atomicity in the entire journal entry write it permits a more
  streaming design of the filesystem writeout path.
 
  James
 
 
  Journals are normally big (128MB or so?) - I don't think that this is 
  unique to xfs.
  We're mixing a bunch of concepts here.  The filesystems have a lot of
  different requirements, and atomics are just one small part.
 
  Creating a new file often uses resources freed by past files.  So
  deleting the old must be ordered against allocating the new.  They are
  really separate atomic units but you can't handle them completely
  independently.
 
  If our existing journal commit is:
 
  * write the data blocks for a transaction
  * flush
  * write the commit block for the transaction
  * flush
 
  Which part of this does an atomic write help?
 
  We would still need at least:
 
  * atomic write of data blocks & commit blocks
  * flush
  Yes.  But just because we need the flush here doesn't mean we need the
  flush for every single atomic write.
 
  -chris
 
 
 The catch is that our current flush mechanisms are still pretty brute force 
 and 
 act across either the whole device or in a temporal (everything flushed 
 before 
 this is acked) way.

This is only partially true, since you're extending the sata drive model
into atomics, and the devices implementing atomics (so far anyway)
are not sata.

 
 I still see it would be useful to have the atomic write really be atomic and 
 durable just for that IO - no flush needed.

In sata speak, it could go down as atomic + FUA + NCQ.  In practice this
is going to be in fusionio, nvme devices and big storage arrays, all of
which we can expect to have proper knobs for lies about IO that isn't
really done yet.

 
 Can you give a sequence for the use case for the non-durable atomic write 
 that 
 would not need a sync? Can we really trust all devices to make something 
 atomic 
 that is not durable :) ?

Today's usage is mostly O_DIRECT, which really should be FUA.  Long term
we can hope people will find more interesting uses.

Either way the point is that an atomic write is a grouping mechanism,
and if the standards people want to control fuaness in a separate bit,
that's really fine.

-chris



Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling

2013-07-03 Thread David Dillow
On Wed, 2013-07-03 at 20:24 +0200, Bart Van Assche wrote:
 On 07/03/13 19:27, David Dillow wrote:
  On Wed, 2013-07-03 at 18:00 +0200, Bart Van Assche wrote:
  The combination of dev_loss_tmo off and reconnect_delay > 0 worked fine
  in my tests. An I/O failure was detected shortly after the cable to the
  target was pulled. I/O resumed shortly after the cable to the target was
  reinserted.
 
  Perhaps I don't understand your answer -- I'm asking about dev_loss_tmo
   < 0, and fast_io_fail_tmo >= 0. The other transports do not allow this
  scenario, and I'm asking if it makes sense for SRP to allow it.
 
  But now that you mention reconnect_delay, what is the meaning of that
  when it is negative? That's not in the documentation. And should it be
  considered in srp_tmo_valid() -- are there values of reconnect_delay
  that cause problems?
 
 None of the combinations that can be configured from user space can 
 get the kernel into trouble. If reconnect_delay <= 0 that means that the 
 time-based reconnect mechanism is disabled.

Then it should use the same semantics as the other attributes, and have
the user store "off" to turn it off.
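
Something along these lines would match the other attributes (a sketch;
the function name and the reconnect_delay field are assumptions):

	static ssize_t store_srp_rport_reconnect_delay(struct device *dev,
			struct device_attribute *attr, const char *buf,
			size_t count)
	{
		struct srp_rport *rport = dev_to_rport(dev);
		int delay;

		if (strncmp(buf, "off", 3) == 0)
			delay = -1;	/* disabled, as for the *_tmo attributes */
		else if (kstrtoint(buf, 0, &delay) || delay < 0)
			return -EINVAL;
		rport->reconnect_delay = delay;	/* assumed field name */
		return count;
	}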

And I'm getting the strong sense that the answer to my question about
fast_io_fail_tmo >= 0 when dev_loss_tmo is off is that we should not allow that
combination, even if it doesn't break the kernel. If it doesn't make
sense, there is no reason to create an opportunity for user confusion.


Re: atomic write T10 standards

2013-07-03 Thread Ric Wheeler

On 07/03/2013 02:54 PM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 14:31:59)

On 07/03/2013 11:54 AM, Chris Mason wrote:

Quoting Ric Wheeler (2013-07-03 11:42:38)

On 07/03/2013 11:37 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:

On 07/03/2013 11:22 AM, James Bottomley wrote:

On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:

Why not have the atomic write actually imply that it is atomic and durable for
just that command?

I don't understand why you think you need guaranteed durability for
every journal transaction?  That's what causes us performance problems
because we have to pause on every transaction commit.

We require durability for explicit flushes, obviously, but we could
achieve far better performance if we could just let the filesystem
updates stream to the disk and rely on atomic writes making sure the
journal entries were all correct.  The reason we require durability for
journal entries today is to ensure caching effects don't cause the
journal to lie or be corrupt.

Why would we use atomic writes for things that don't need to be
durable?

Avoid a torn page write seems to be the only real difference here if
you use the atomic operations and don't have durability...

It's not just about torn pages: Journal entries are big complex beasts.
They can be megabytes big (at least on xfs).  If we can guarantee all or
nothing atomicity in the entire journal entry write it permits a more
streaming design of the filesystem writeout path.

James



Journals are normally big (128MB or so?) - I don't think that this is unique to 
xfs.

We're mixing a bunch of concepts here.  The filesystems have a lot of
different requirements, and atomics are just one small part.

Creating a new file often uses resources freed by past files.  So
deleting the old must be ordered against allocating the new.  They are
really separate atomic units but you can't handle them completely
independently.


If our existing journal commit is:

* write the data blocks for a transaction
* flush
* write the commit block for the transaction
* flush

Which part of this does an atomic write help?

We would still need at least:

* atomic write of data blocks & commit blocks
* flush

Yes.  But just because we need the flush here doesn't mean we need the
flush for every single atomic write.

-chris


The catch is that our current flush mechanisms are still pretty brute force and
act across either the whole device or in a temporal (everything flushed before
this is acked) way.

This is only partially true, since you're extending the sata drive model
into atomics, and the devices implementing atomics (so far anyway)
are not sata.


I still see it would be useful to have the atomic write really be atomic and
durable just for that IO - no flush needed.

In sata speak, it could go down as atomic + FUA + NCQ.  In practice this
is going to be in fusionio, nvme devices and big storage arrays, all of
which we can expect to have proper knobs for lies about IO that isn't
really done yet.


Can you give a sequence for the use case for the non-durable atomic write that
would not need a sync? Can we really trust all devices to make something atomic
that is not durable :) ?

Today's usage is mostly O_DIRECT, which really should be FUA.  Long term
we can hope people will find more interesting uses.

Either way the point is that an atomic write is a grouping mechanism,
and if the standards people want to control fuaness in a separate bit,
that's really fine.

-chris



That makes sense to me - happy to have a bit to indicate durability in
the atomic operation...


Ric



Re: [PATCH v3 07/13] scsi_transport_srp: Add transport layer error handling

2013-07-03 Thread Vu Pham

David Dillow wrote:

On Wed, 2013-07-03 at 20:24 +0200, Bart Van Assche wrote:
  

On 07/03/13 19:27, David Dillow wrote:


On Wed, 2013-07-03 at 18:00 +0200, Bart Van Assche wrote:
  

The combination of dev_loss_tmo off and reconnect_delay > 0 worked fine
in my tests. An I/O failure was detected shortly after the cable to the
target was pulled. I/O resumed shortly after the cable to the target was
reinserted.


Perhaps I don't understand your answer -- I'm asking about dev_loss_tmo
 < 0, and fast_io_fail_tmo >= 0. The other transports do not allow this
scenario, and I'm asking if it makes sense for SRP to allow it.

But now that you mention reconnect_delay, what is the meaning of that
when it is negative? That's not in the documentation. And should it be
considered in srp_tmo_valid() -- are there values of reconnect_delay
that cause problems?
  
None of the combinations that can be configured from user space can 
get the kernel into trouble. If reconnect_delay <= 0 that means that the 
time-based reconnect mechanism is disabled.



Then it should use the same semantics as the other attributes, and have
the user store off to turn it off.

And I'm getting the strong sense that the answer to my question about
fast_io_fail_tmo >= 0 when dev_loss_tmo is off is that we should not allow that
combination, even if it doesn't break the kernel. If it doesn't make
sense, there is no reason to create an opportunity for user confusion.
  

Hello Dave,

When dev_loss_tmo expires, srp not only removes the rport but also
removes the associated scsi_host.
One may wish to set fast_io_fail_tmo = 0 for I/O to fail over fast to
other paths, and dev_loss_tmo off to keep the scsi_host around until the
target comes back.


-vu


Re: RFC: Allow block drivers to poll for I/O instead of sleeping

2013-07-03 Thread Shaohua Li
On Thu, Jun 20, 2013 at 04:17:13PM -0400, Matthew Wilcox wrote:
 
 A paper at FAST2012
 (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
 out the performance overhead of taking interrupts for low-latency block
 I/Os.  The solution the author investigated was to spin waiting for each
 I/O to complete.  This is inefficient as Linux submits many I/Os which
 are not latency-sensitive, and even when we do submit latency-sensitive
 I/Os (eg swap-in), we frequently submit several I/Os before waiting.
 
 This RFC takes a different approach, only spinning when we would
 otherwise sleep.  To implement this, I add an 'io_poll' function pointer
 to backing_dev_info.  I include a sample implementation for the NVMe
 driver.  Next, I add an io_wait() function which will call io_poll()
 if it is set.  It falls back to calling io_schedule() if anything goes
 wrong with io_poll() or the task exceeds its timeslice.  Finally, all
 that is left is to judiciously replace calls to io_schedule() with
 calls to io_wait().  I think I've covered the main contenders with
 sleep_on_page(), sleep_on_buffer() and the DIO path.
 
 I've measured the performance benefits of this with a Chatham NVMe
 prototype device and a simple
 # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=100
 The latency of each I/O reduces by about 2.5us (from around 8.0us to
 around 5.5us).  This matches up quite well with the performance numbers
 shown in the FAST2012 paper (which used a similar device).
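
For reference, a minimal sketch of the io_wait() idea as described above
(the io_poll member and its return convention are assumptions based on
the RFC text, not the actual patch):

	static void io_wait(struct backing_dev_info *bdi)
	{
		/* Spin on the device's completion queue instead of
		 * sleeping; bail out to the scheduler on error or when
		 * our timeslice is used up. */
		if (bdi && bdi->io_poll) {
			if (bdi->io_poll(bdi) >= 0 && !need_resched())
				return;
		}
		io_schedule();	/* fall back to sleeping as before */
	}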

Hi Matthew,
I'm wondering where the 2.5us latency cut comes from. I did a simple test: on
my 3.4GHz Xeon CPU, one cpu can do about 2M application context switches per
second. Assuming switching to idle is faster, switching to idle and back should
take less than 1us. Does the 2.5us latency cut mostly come from deep idle state
exit latency? If so, maybe setting a lower pm_qos value or having a better idle
governor to prevent the cpu from entering deep idle states could help too.

Thanks,
Shaohua


Re: atomic write T10 standards

2013-07-03 Thread Vladislav Bolkhovitin
Ric Wheeler, on 07/03/2013 11:31 AM wrote:
 Journals are normally big (128MB or so?) - I don't think that this is 
 unique to xfs.
 We're mixing a bunch of concepts here.  The filesystems have a lot of
 different requirements, and atomics are just one small part.

 Creating a new file often uses resources freed by past files.  So
 deleting the old must be ordered against allocating the new.  They are
 really separate atomic units but you can't handle them completely
 independently.

 If our existing journal commit is:

 * write the data blocks for a transaction
 * flush
 * write the commit block for the transaction
 * flush

 Which part of this does an atomic write help?

 We would still need at least:

 * atomic write of data blocks & commit blocks
 * flush

Not necessary.

Consider a case when you are creating many small files in a big directory.
Consider that every such operation needs 3 actions: add a new directory entry,
get free space and write data there. If one atomic write (scattered) command is
used for each operation and you order them between each other, if needed, in
some way, e.g. by using the ORDERED SCSI attribute or queue draining, you don't
need any intermediate flushes. Only one final flush would be sufficient. In
case of a crash simply some of the new files would disappear, but everything
would be fully consistent, so the only needed recovery would be to recreate
them.

 The catch is that our current flush mechanisms are still pretty brute force 
 and 
 act across either the whole device or in a temporal (everything flushed 
 before 
 this is acked) way.
 
 I still see it would be useful to have the atomic write really be atomic and 
 durable just for that IO - no flush needed.
 
 Can you give a sequence for the use case for the non-durable atomic write 
 that 
 would not need a sync?

See above.

 Can we really trust all devices to make something atomic 
 that is not durable :) ?

Sure, if the application allows that and the atomicity property itself is
durable, why not?

Vlad

P.S. With atomic writes there's no need for a journal, no?