date:20140410

RE: [PATCH 52/55] scsi: Move prototype declaration to header file megaraid/megaraid_sas.h from megaraid/megaraid_sas_fusion.c

2014-04-10 Thread Saxena, Sumit



>-Original Message-
>From: Rashika Kheria [mailto:rashika.khe...@gmail.com]
>Sent: Sunday, March 30, 2014 12:19 AM
>To: linux-ker...@vger.kernel.org
>Cc: DL-MegaRAID Linux; James E.J. Bottomley; linux-scsi@vger.kernel.org;
>j...@joshtriplett.org
>Subject: [PATCH 52/55] scsi: Move prototype declaration to header file
>megaraid/megaraid_sas.h from megaraid/megaraid_sas_fusion.c
>
>Move prototype declaration of function to header file
>megaraid/megaraid_sas.h from megaraid/megaraid_sas_fusion.c because it is
>used by more than one file.
>
>This eliminates the following warning in megaraid/megaraid_sas_fp.c:
>drivers/scsi/megaraid/megaraid_sas_fp.c:1223:5: warning: no previous
>prototype for ‘get_updated_dev_handle’ [-Wmissing-prototypes]
>
>Signed-off-by: Rashika Kheria 
>Reviewed-by: Josh Triplett 
>---
> drivers/scsi/megaraid/megaraid_sas.h|3 +++
> drivers/scsi/megaraid/megaraid_sas_fusion.c |2 --
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/scsi/megaraid/megaraid_sas.h
>b/drivers/scsi/megaraid/megaraid_sas.h
>index 3b0afb4..17fe706 100644
>--- a/drivers/scsi/megaraid/megaraid_sas.h
>+++ b/drivers/scsi/megaraid/megaraid_sas.h
>@@ -1737,6 +1737,9 @@ megasas_check_and_restore_queue_depth(struct
>megasas_instance *instance);  void megasas_free_cmds(struct
>megasas_instance *instance);  int megasas_alloc_cmds(struct
>megasas_instance *instance);
>
>+u16 get_updated_dev_handle(struct LD_LOAD_BALANCE_INFO *lbInfo,
>+ struct IO_REQUEST_INFO *in_info);
>+
> u8
> MR_BuildRaidContext(struct megasas_instance *instance,
>   struct IO_REQUEST_INFO *io_info,
>diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>index ce6219c..b3d79f4 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>@@ -63,8 +63,6 @@ wait_and_poll(struct megasas_instance *instance, struct
>megasas_cmd *cmd);  int  megasas_clear_intr_fusion(struct
>megasas_register_set __iomem *regs);
>
>-u16 get_updated_dev_handle(struct LD_LOAD_BALANCE_INFO *lbInfo,
>- struct IO_REQUEST_INFO *in_info);
> int megasas_transition_to_ready(struct megasas_instance *instance, int ocr);
>
> extern u32 megasas_dbg_lvl;

Acked-by: Sumit Saxena 

-Sumit
>--
>1.7.9.5
>
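
The warning and the fix in this patch can be illustrated with a minimal single-file sketch. It is not the driver code: the `lb_info` and `io_info` structs, the function name and its body are hypothetical stand-ins, and the declaration that would normally live in a shared header is inlined so the example compiles on its own.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the driver's structures. */
struct lb_info { uint16_t handle[2]; };
struct io_info { uint8_t arm; };

/*
 * In a real tree this declaration would sit in a header included by every
 * file that defines or calls the function. Because it is visible before
 * the definition below, -Wmissing-prototypes has nothing to report.
 */
uint16_t get_updated_handle(const struct lb_info *lb, const struct io_info *io);

/* Non-static definition: other translation units may call it. */
uint16_t get_updated_handle(const struct lb_info *lb, const struct io_info *io)
{
	return lb->handle[io->arm & 1];
}
```

Keeping the declaration in one header also means the compiler checks the definition against the same prototype that the callers see.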


RE: [PATCH 51/55] scsi: Move prototype declaration to header file megaraid/megaraid_sas_fusion.h from megaraid/megaraid_sas_base.c

2014-04-10 Thread Saxena, Sumit



>-Original Message-
>From: Rashika Kheria [mailto:rashika.khe...@gmail.com]
>Sent: Sunday, March 30, 2014 12:18 AM
>To: linux-ker...@vger.kernel.org
>Cc: DL-MegaRAID Linux; James E.J. Bottomley; linux-scsi@vger.kernel.org;
>j...@joshtriplett.org
>Subject: [PATCH 51/55] scsi: Move prototype declaration to header file
>megaraid/megaraid_sas_fusion.h from megaraid/megaraid_sas_base.c
>
>Move prototype declarations of functions to header file
>megaraid/megaraid_sas_fusion.h from megaraid/megaraid_sas_base.c
>because they are used by more than one file.
>
>This eliminates the following type of warnings in
>megaraid/megaraid_sas_fusion.c:
>drivers/scsi/megaraid/megaraid_sas_fusion.c:2170:1: warning: no previous
>prototype for ‘megasas_release_fusion’ [-Wmissing-prototypes]
>
>Signed-off-by: Rashika Kheria 
>Reviewed-by: Josh Triplett 
>---
> drivers/scsi/megaraid/megaraid_sas_base.c   |   13 -
> drivers/scsi/megaraid/megaraid_sas_fusion.h |   14 ++
> 2 files changed, 14 insertions(+), 13 deletions(-)
>
>diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
>b/drivers/scsi/megaraid/megaraid_sas_base.c
>index 9768deee..0ad386b 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_base.c
>+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
>@@ -160,21 +160,8 @@ u32
> megasas_build_and_issue_cmd(struct megasas_instance *instance,
>   struct scsi_cmnd *scmd);
> static void megasas_complete_cmd_dpc(unsigned long instance_addr); -void
>-megasas_release_fusion(struct megasas_instance *instance); -int -
>megasas_ioc_init_fusion(struct megasas_instance *instance); -void -
>megasas_free_cmds_fusion(struct megasas_instance *instance);
>-u8
>-megasas_get_map_info(struct megasas_instance *instance); -int -
>megasas_sync_map_info(struct megasas_instance *instance);  int
>wait_and_poll(struct megasas_instance *instance, struct megasas_cmd
>*cmd); -void megasas_reset_reply_desc(struct megasas_instance *instance);
>-int megasas_reset_fusion(struct Scsi_Host *shost); -void
>megasas_fusion_ocr_wq(struct work_struct *work);
>
> static void
> megasas_issue_dcmd(struct megasas_instance *instance, struct
>megasas_cmd *cmd) diff --git
>a/drivers/scsi/megaraid/megaraid_sas_fusion.h
>b/drivers/scsi/megaraid/megaraid_sas_fusion.h
>index 35a5139..01e5ab3 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_fusion.h
>+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.h
>@@ -760,4 +760,18 @@ union desc_value {
>   } u;
> };
>
>+void
>+megasas_release_fusion(struct megasas_instance *instance); int
>+megasas_ioc_init_fusion(struct megasas_instance *instance); void
>+megasas_free_cmds_fusion(struct megasas_instance *instance);
>+u8
>+megasas_get_map_info(struct megasas_instance *instance); int
>+megasas_sync_map_info(struct megasas_instance *instance); void
>+megasas_reset_reply_desc(struct megasas_instance *instance); int
>+megasas_reset_fusion(struct Scsi_Host *shost); void
>+megasas_fusion_ocr_wq(struct work_struct *work);
>+
> #endif /* _MEGARAID_SAS_FUSION_H_ */

Acked-by: Sumit Saxena 

-Sumit
>--
>1.7.9.5
>

RE: [PATCH 30/55] scsi: Mark functions as static in megaraid/megaraid_sas_fp.c

2014-04-10 Thread Saxena, Sumit



>-Original Message-
>From: Rashika Kheria [mailto:rashika.khe...@gmail.com]
>Sent: Saturday, March 29, 2014 11:48 PM
>To: linux-ker...@vger.kernel.org
>Cc: DL-MegaRAID Linux; James E.J. Bottomley; linux-scsi@vger.kernel.org;
>j...@joshtriplett.org
>Subject: [PATCH 30/55] scsi: Mark functions as static in
>megaraid/megaraid_sas_fp.c
>
>Mark functions as static in megaraid/megaraid_sas_fp.c because they are not
>used outside this file.
>
>This eliminates the following warning in megaraid/megaraid_sas_fp.c:
>drivers/scsi/megaraid/megaraid_sas_fp.c:80:5: warning: no previous
>prototype for ‘mega_mod64’ [-Wmissing-prototypes]
>drivers/scsi/megaraid/megaraid_sas_fp.c:98:5: warning: no previous
>prototype for ‘mega_div64_32’ [-Wmissing-prototypes]
>drivers/scsi/megaraid/megaraid_sas_fp.c:206:5: warning: no previous
>prototype for ‘MR_GetSpanBlock’ [-Wmissing-prototypes]
>drivers/scsi/megaraid/megaraid_sas_fp.c:341:5: warning: no previous
>prototype for ‘mr_spanset_get_span_block’ [-Wmissing-prototypes]
>drivers/scsi/megaraid/megaraid_sas_fp.c:582:4: warning: no previous
>prototype for ‘get_arm’ [-Wmissing-prototypes]
>drivers/scsi/megaraid/megaraid_sas_fp.c:705:4: warning: no previous
>prototype for ‘MR_GetPhyParams’ [-Wmissing-prototypes]
>drivers/scsi/megaraid/megaraid_sas_fp.c:1196:4: warning: no previous
>prototype for ‘megasas_get_best_arm’ [-Wmissing-prototypes]
>
>Signed-off-by: Rashika Kheria 
>Reviewed-by: Josh Triplett 
>---
> drivers/scsi/megaraid/megaraid_sas_fp.c |   26 ++
> 1 file changed, 14 insertions(+), 12 deletions(-)
>
>diff --git a/drivers/scsi/megaraid/megaraid_sas_fp.c
>b/drivers/scsi/megaraid/megaraid_sas_fp.c
>index e24b6eb..83d5f74 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_fp.c
>+++ b/drivers/scsi/megaraid/megaraid_sas_fp.c
>@@ -77,7 +77,7 @@ static u8 mr_spanset_get_phy_params(struct
>megasas_instance *instance, u32 ld,  static u64 get_row_from_strip(struct
>megasas_instance *instance, u32 ld,
>   u64 strip, struct MR_FW_RAID_MAP_ALL *map);
>
>-u32 mega_mod64(u64 dividend, u32 divisor)
>+static u32 mega_mod64(u64 dividend, u32 divisor)
> {
>   u64 d;
>   u32 remainder;
>@@ -95,7 +95,7 @@ u32 mega_mod64(u64 dividend, u32 divisor)
>  *
>  * @return quotient
>  **/
>-u64 mega_div64_32(uint64_t dividend, uint32_t divisor)
>+static u64 mega_div64_32(uint64_t dividend, uint32_t divisor)
> {
>   u32 remainder;
>   u64 d;
>@@ -203,7 +203,7 @@ u8 MR_ValidateMapInfo(struct megasas_instance
>*instance)
>   return 1;
> }
>
>-u32 MR_GetSpanBlock(u32 ld, u64 row, u64 *span_blk,
>+static u32 MR_GetSpanBlock(u32 ld, u64 row, u64 *span_blk,
>   struct MR_FW_RAID_MAP_ALL *map)
> {
>   struct MR_SPAN_BLOCK_INFO *pSpanBlock = MR_LdSpanInfoGet(ld,
>map); @@ -338,7 +338,7 @@ static int getSpanInfo(struct
>MR_FW_RAID_MAP_ALL *map, PLD_SPAN_INFO ldSpanInfo)
> *div_error   - Devide error code.
> */
>
>-u32 mr_spanset_get_span_block(struct megasas_instance *instance,
>+static u32 mr_spanset_get_span_block(struct megasas_instance *instance,
>   u32 ld, u64 row, u64 *span_blk, struct
>MR_FW_RAID_MAP_ALL *map)  {
>   struct fusion_context *fusion = instance->ctrl_context; @@ -579,8
>+579,8 @@ static u32 get_arm_from_strip(struct megasas_instance
>*instance,  }
>
> /* This Function will return Phys arm */
>-u8 get_arm(struct megasas_instance *instance, u32 ld, u8 span, u64 stripe,
>-  struct MR_FW_RAID_MAP_ALL *map)
>+static u8 get_arm(struct megasas_instance *instance, u32 ld, u8 span,
>+u64 stripe, struct MR_FW_RAID_MAP_ALL *map)
> {
>   struct MR_LD_RAID  *raid = MR_LdRaidGet(ld, map);
>   /* Need to check correct default value */ @@ -702,10 +702,11 @@
>static u8 mr_spanset_get_phy_params(struct megasas_instance *instance,
>u32 ld,
> *span  - Span number
> *block - Absolute Block number in the physical disk
> */
>-u8 MR_GetPhyParams(struct megasas_instance *instance, u32 ld, u64
>stripRow,
>-  u16 stripRef, struct IO_REQUEST_INFO *io_info,
>-  struct RAID_CONTEXT *pRAID_Context,
>-  struct MR_FW_RAID_MAP_ALL *map)
>+static u8 MR_GetPhyParams(struct megasas_instance *instance, u32 ld,
>+u64 stripRow, u16 stripRef,
>+struct IO_REQUEST_INFO *io_info,
>+struct RAID_CONTEXT *pRAID_Context,
>+struct MR_FW_RAID_MAP_ALL *map)
> {
>   struct MR_LD_RAID  *raid = MR_LdRaidGet(ld, map);
>   u32 pd, arRef;
>@@ -1193,8 +1194,9 @@ mr_update_load_balance_params(struct
>MR_FW_RAID_MAP_ALL *map,
>   }
> }
>
>-u8 megasas_get_best_arm(struct LD_LOAD_BALANCE_INFO *lbInfo, u8 arm,
>u64 block,
>-  u32 count)
>+static u8 megasas_get_best_arm(struct LD_LOAD_BALANCE_INFO *lbInfo,
>u8 arm,
>+ u64 block,
>+ u32 count)
> {
>   u16 pend0, pe
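
The other way to silence the same warning, used in this patch, is internal linkage. The following is a small sketch under stated assumptions: `mod64_sketch` and `stripe_index` are hypothetical names chosen for illustration, not functions from megaraid_sas_fp.c.

```c
#include <stdint.h>

/*
 * Hypothetical helper used only inside this file. Marking it static gives
 * it internal linkage, so the compiler does not expect a prior prototype
 * and -Wmissing-prototypes stays quiet.
 */
static uint32_t mod64_sketch(uint64_t dividend, uint32_t divisor)
{
	return (uint32_t)(dividend % divisor);
}

/* Exported entry point: declared first, as a shared header would do. */
uint32_t stripe_index(uint64_t block, uint32_t stripes);

uint32_t stripe_index(uint64_t block, uint32_t stripes)
{
	return mod64_sketch(block, stripes);
}
```

A static function is also invisible to other object files, so a name such as `get_arm` cannot clash with an unrelated symbol elsewhere in the kernel.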

RE: [PATCH 29/55] scsi: Mark functions as static in megaraid/megaraid_sas_fusion.c

2014-04-10 Thread Saxena, Sumit



>-Original Message-
>From: Rashika Kheria [mailto:rashika.khe...@gmail.com]
>Sent: Saturday, March 29, 2014 11:47 PM
>To: linux-ker...@vger.kernel.org
>Cc: DL-MegaRAID Linux; James E.J. Bottomley; linux-scsi@vger.kernel.org;
>j...@joshtriplett.org
>Subject: [PATCH 29/55] scsi: Mark functions as static in
>megaraid/megaraid_sas_fusion.c
>
>Mark functions as static in megaraid/megaraid_sas_fusion.c because they are
>not used outside this file.
>
>This eliminates the warnings of following type in
>megaraid/megaraid_sas_fusion.c:
>drivers/scsi/megaraid/megaraid_sas_fusion.c:91:1: warning: no previous
>prototype for ‘megasas_enable_intr_fusion’ [-Wmissing-prototypes]
>
>Signed-off-by: Rashika Kheria 
>Reviewed-by: Josh Triplett 
>---
> drivers/scsi/megaraid/megaraid_sas_fusion.c |   39 ++-
>
> 1 file changed, 20 insertions(+), 19 deletions(-)
>
>diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>index 2806d6d..ce6219c 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>@@ -74,7 +74,7 @@ extern int resetwaittime;
>  * megasas_enable_intr_fusion -   Enables interrupts
>  * @regs: MFI register set
>  */
>-void
>+static void
> megasas_enable_intr_fusion(struct megasas_instance *instance)  {
>   struct megasas_register_set __iomem *regs; @@ -94,7 +94,7 @@
>megasas_enable_intr_fusion(struct megasas_instance *instance)
>  * megasas_disable_intr_fusion - Disables interrupt
>  * @regs:  MFI register set
>  */
>-void
>+static void
> megasas_disable_intr_fusion(struct megasas_instance *instance)  {
>   u32 mask = 0x;
>@@ -134,8 +134,8 @@ megasas_clear_intr_fusion(struct
>megasas_register_set __iomem *regs)
>  *
>  * Returns a free command from the pool
>  */
>-struct megasas_cmd_fusion *megasas_get_cmd_fusion(struct
>megasas_instance
>-*instance)
>+static struct megasas_cmd_fusion *megasas_get_cmd_fusion(
>+struct megasas_instance *instance)
> {
>   unsigned long flags;
>   struct fusion_context *fusion =
>@@ -363,7 +363,7 @@ static int megasas_create_frame_pool_fusion(struct
>megasas_instance *instance)
>  * and is used as SMID of the cmd.
>  * SMID value range is from 1 to max_fw_cmds.
>  */
>-int
>+static int
> megasas_alloc_cmds_fusion(struct megasas_instance *instance)  {
>   int i, j, count;
>@@ -919,7 +919,7 @@ megasas_display_intel_branding(struct
>megasas_instance *instance)
>  *
>  * This is the main function for initializing firmware.
>  */
>-u32
>+static u32
> megasas_init_adapter_fusion(struct megasas_instance *instance)  {
>   struct megasas_register_set __iomem *reg_set; @@ -1037,7 +1037,7
>@@ fail_alloc_mfi_cmds:
>  * @frame_count : Number of frames for the command
>  * @regs :MFI register set
>  */
>-void
>+static void
> megasas_fire_cmd_fusion(struct megasas_instance *instance,
>   dma_addr_t req_desc_lo,
>   u32 req_desc_hi,
>@@ -1059,7 +1059,7 @@ megasas_fire_cmd_fusion(struct megasas_instance
>*instance,
>  * @ext_status :  ext status of cmd returned by FW
>  */
>
>-void
>+static void
> map_cmd_status(struct megasas_cmd_fusion *cmd, u8 status, u8
>ext_status)  {
>
>@@ -1199,7 +1199,7 @@ megasas_make_sgl_fusion(struct megasas_instance
>*instance,
>  *
>  * Used to set the PD LBA in CDB for FP IOs
>  */
>-void
>+static void
> megasas_set_pd_lba(struct MPI2_RAID_SCSI_IO_REQUEST *io_request, u8
>cdb_len,
>  struct IO_REQUEST_INFO *io_info, struct scsi_cmnd *scp,
>  struct MR_FW_RAID_MAP_ALL *local_map_ptr, u32
>ref_tag) @@ -1376,7 +1376,7 @@ megasas_set_pd_lba(struct
>MPI2_RAID_SCSI_IO_REQUEST *io_request, u8 cdb_len,
>  * Prepares the io_request and chain elements (sg_frame) for IO
>  * The IO can be for PD (Fast Path) or LD
>  */
>-void
>+static void
> megasas_build_ldio_fusion(struct megasas_instance *instance,
> struct scsi_cmnd *scp,
> struct megasas_cmd_fusion *cmd)
>@@ -1678,7 +1678,7 @@ NonFastPath:
>  * Invokes helper functions to prepare request frames
>  * and sets flags appropriate for IO/Non-IO cmd
>  */
>-int
>+static int
> megasas_build_io_fusion(struct megasas_instance *instance,
>   struct scsi_cmnd *scp,
>   struct megasas_cmd_fusion *cmd)
>@@ -1749,7 +1749,7 @@ megasas_build_io_fusion(struct megasas_instance
>*instance,
>   return 0;
> }
>
>-union MEGASAS_REQUEST_DESCRIPTOR_UNION *
>+static union MEGASAS_REQUEST_DESCRIPTOR_UNION *
> megasas_get_request_descriptor(struct megasas_instance *instance, u16
>index)  {
>   u8 *p;
>@@ -1829,7 +1829,7 @@ megasas_build_and_issue_cmd_fusion(struct
>megasas_instance *instance,
>  * @instance: Adapter soft state
>  * Completes all comman

RE: [PATCH trivial 1/3] megaraid_sas: Spelling s/intance/instance/

2014-04-10 Thread Saxena, Sumit



>-Original Message-
>From: Geert Uytterhoeven [mailto:ge...@linux-m68k.org]
>Sent: Tuesday, March 25, 2014 2:07 AM
>To: Jiri Kosina
>Cc: linux-ker...@vger.kernel.org; Geert Uytterhoeven; DL-MegaRAID Linux;
>linux-scsi@vger.kernel.org
>Subject: [PATCH trivial 1/3] megaraid_sas: Spelling s/intance/instance/
>
>From: Geert Uytterhoeven 
>
>Signed-off-by: Geert Uytterhoeven 
>Cc: Neela Syam Kolli 
>Cc: linux-scsi@vger.kernel.org
>---
> drivers/scsi/megaraid/megaraid_sas_base.c   |2 +-
> drivers/scsi/megaraid/megaraid_sas_fusion.c |2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
>b/drivers/scsi/megaraid/megaraid_sas_base.c
>index 3b7ad10497fe..10082678077c 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_base.c
>+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
>@@ -3865,7 +3865,7 @@ fail_ready_state:
>
> /**
>  * megasas_release_mfi -  Reverses the FW initialization
>- * @intance:  Adapter soft state
>+ * @instance: Adapter soft state
>  */
> static void megasas_release_mfi(struct megasas_instance *instance)  { diff --
>git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>index f6555921fd7a..f7d68f65f974 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>@@ -2164,7 +2164,7 @@ megasas_issue_dcmd_fusion(struct
>megasas_instance *instance,
>
> /**
>  * megasas_release_fusion -   Reverses the FW initialization
>- * @intance:  Adapter soft state
>+ * @instance: Adapter soft state
>  */
> void
> megasas_release_fusion(struct megasas_instance *instance)

Acked-by: Sumit Saxena 

-Sumit
>--
>1.7.9.5
>

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Hannes Reinecke

On 04/10/2014 10:36 PM, James Bottomley wrote:
> On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote:
>> On 04/10/2014 05:31 PM, Alan Stern wrote:
>>> On Thu, 10 Apr 2014, Hannes Reinecke wrote:
>>>
>>>> On 04/10/2014 12:58 PM, Andreas Reis wrote:
>>>>> That patch appears to work in preventing the crashes, judged on one
>>>>> repeated appearance of the bug.
>>>>>
>>>>> dmesg had the usual
>>>>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
>>>>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
>>>>> xhci_hcd
>>>>> [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
>>>>> with disabled ep 880427b829c0
>>>>> [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
>>>>> with disabled ep 880427b82a08
>>>>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
>>>>>
>>>>> repeated five times, followed by one
>>>>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
>>>>> recovery
>>>>>
>>>>> and then as often as something tried to read from it:
>>>>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
>>>>>
>>>>> The stick could then be properly un- and remounted (the latter if it
>>>>> had been physically replugged) without issue - for the bug to
>>>>> reoccur after one to three minutes. I tried this three times, no
>>>>> dmesg difference except the ep addresses varied on two of that.
>>>>>
>>>> Was this just that patch you've tested with or the entire patch series?
>>>>
>>>> If the latter, Alan, is this the expected outcome?
>>>
>>> Yes, it is.  The same thing should happen with the entire patch series.
>>>
>>>> I would've thought the error recover should _not_ run into
>>>> offlining devices here, but rather the device should be recovered
>>>> eventually.
>>>
>>> The command times out, it is aborted, and the command is retried.  The
>>> same thing happens, and we repeat five times.  Eventually the SCSI core
>>> gives up and declares the device to be offline.
>>>
>> Hmm. Ok. If you are fine with it who am I to argue here.
>> James, shall I resend the patch series?
> 
> You mean the one patch?  No, it's OK, I have it.
> 
> It's still not complete, though, as I've said a couple of times.  The
> problem is that we have abort memory on any eh command as well, which
> this doesn't fix.
> 
> The scenario is abort command, set flag, abort completes, send TUR, TUR
> doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd()
> will skip the abort because the flag is set and move straight to reset.
> 
> The fix is this, I can just add it as well.
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 771c16b..7516e2c 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct 
> scsi_eh_save *ses,
>   ses->prot_op = scmd->prot_op;
>  
>   scmd->prot_op = SCSI_PROT_NORMAL;
> + scmd->eh_eflags = 0;
>   scmd->cmnd = ses->eh_cmnd;
>   memset(scmd->cmnd, 0, BLK_MAX_CDB);
>   memset(&scmd->sdb, 0, sizeof(scmd->sdb));
> 
> 
Oh yes, that is correct.

Acked-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
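
The "abort memory" scenario James describes can be modelled in a few lines. This is only a sketch under stated assumptions: `cmd_sketch`, `SKETCH_ABORT_DONE` and the helper names are hypothetical and do not mirror the real structures or functions in scsi_error.c; only the idea of clearing the flag word when the command is reused comes from the patch.

```c
#include <stdbool.h>

#define SKETCH_ABORT_DONE 0x1u	/* hypothetical flag bit */

/* Hypothetical, stripped-down command: only the flag word matters here. */
struct cmd_sketch { unsigned int eh_eflags; };

/*
 * Abort the command unless it is already marked as aborted.
 * Returns true if an abort was actually issued.
 */
static bool try_abort(struct cmd_sketch *c)
{
	if (c->eh_eflags & SKETCH_ABORT_DONE)
		return false;	/* skipped: the caller escalates to reset */
	c->eh_eflags |= SKETCH_ABORT_DONE;
	return true;
}

/*
 * Reuse the command for an EH request such as TEST UNIT READY.
 * clear_flags models the one-line fix in the patch above.
 */
static void prep_eh_cmd(struct cmd_sketch *c, bool clear_flags)
{
	if (clear_flags)
		c->eh_eflags = 0;
}

/* Returns true if the timed-out EH command itself gets aborted. */
static bool eh_cmd_gets_abort(bool clear_flags)
{
	struct cmd_sketch c = { 0 };

	try_abort(&c);			/* original command aborted, flag set */
	prep_eh_cmd(&c, clear_flags);	/* same struct reused for the TUR */
	return try_abort(&c);		/* TUR timed out: try to abort it */
}
```

In this model `eh_cmd_gets_abort(false)` reports that the abort is skipped, which corresponds to moving straight to reset, while `eh_cmd_gets_abort(true)` lets the abort run.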

aic94xx: maybe uninitialized variable in asd_process_ctrl_a_user

2014-04-10 Thread Filipe Brandenburger

Hi James,

While building a recent kernel with -Werror I found this warning:

drivers/scsi/aic94xx/aic94xx_sds.c: In function 'asd_read_flash':
drivers/scsi/aic94xx/aic94xx_sds.c:597:21: error: 'offs' may be used
uninitialized in this function [-Werror=maybe-uninitialized]
drivers/scsi/aic94xx/aic94xx_sds.c:985:6: note: 'offs' was declared here

This looks like a valid complaint from the compiler: in
asd_process_ctrl_a_user, if the call to asd_find_flash_de fails (and
returns -ENOENT), offs is never set, yet nothing prevents it from
being passed to asd_read_flash_seg later in that same function.

Would you please have a look at it? Let me know if there's a more
appropriate way to report these issues (e.g. bug tracker.)

Thanks,
Filipe
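
The pattern behind this report can be sketched as follows. The code is a hypothetical model, not the aic94xx source: `entry_sketch`, `find_entry` and `read_segment_offset` are made-up names, and whether the real function should return early or fall back to a default is for the maintainers to decide.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical directory entry; not the aic94xx layout. */
struct entry_sketch { uint32_t type; uint32_t offs; };

/*
 * On success stores the offset and returns 0. On failure returns -ENOENT
 * and leaves *offs untouched, which is the case the compiler warns about.
 */
static int find_entry(const struct entry_sketch *tbl, size_t n,
		      uint32_t type, uint32_t *offs)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].type == type) {
			*offs = tbl[i].offs;
			return 0;
		}
	}
	return -ENOENT;
}

/* Caller that bails out before 'offs' can be read uninitialized. */
static int read_segment_offset(const struct entry_sketch *tbl, size_t n,
			       uint32_t type, uint32_t *result)
{
	uint32_t offs;
	int err = find_entry(tbl, n, type, &offs);

	if (err)
		return err;	/* never touch offs on failure */
	*result = offs;
	return 0;
}
```

Checking the return value before the first read of `offs` is usually enough for the compiler to see that every path initializes the variable.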

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Baoquan He

On 04/10/14 at 04:34pm, Jiang Liu wrote:
> Hi Baoquan,
>   Could you please help to give output of "lspci -"?
> Is device "hpsa :03:00.0" a legacy PCI device(non-PCIe)?
> It may have relationship with IOMMU driver.
> Thanks!
> Gerry

Well, the machine the bug was reported on is an AMD machine, and it doesn't
have the IOMMU problem. David saw some DMAR errors, so his should
be an Intel machine which uses VT-d.

> 
> On 2014/4/10 12:03, Bjorn Helgaas wrote:
> > [+cc Joerg, iommu list]
> > 
> > On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso  wrote:
> >> On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> >>> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
>  On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> >> [+linux-scsi]
> >> On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> >>> On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
>  Hi,
> 
>  The kernel is 3.14.0+ which is pulled just now.
> >>>
> >>> Cc'ing more people.
> >>>
> >>> While the hpsa driver appears to be involved in some way, I'm not sure if
> >>> this is a related issue, but as of today's pull I'm getting another
> >>> problem that causes my DL980 not to come up.
> >>>
> >>> *Massive* amounts of:
> >>>
> >>> DMAR:[fault reason 02] Present bit in context entry is clear
> >>> dmar: DRHD: handling fault status reg 602
> >>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> >>>
> >>> Then:
> >>>
> >>> hpsa :03:00.0: Controller lockup detected: 0x
> >>> ...
> >>> Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> >>> ...
> >>>
> >>> Screenshot of the actual LOCKUP:
> >>> http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> >>>
> >>> While I haven't bisected, things worked fine at least until commit
> >>> 39de65aa2c3e (April 2nd).
> >>>
> >>> Any ideas?
> >>
> >> Well, it's either a DMA remapping issue or a hpsa one.  Your assertion
> >> that everything worked fine until 39de65aa2c3e would tend to vindicate
> >> hpsa,
> 
>  Hmm here you mean DMA, right?
> >>>
> >>> No, it vindicates the hpsa changes ... they don't seem to be causing
> >>> problems until something goes wrong with dma remapping.
> >>>
> > because all the hpsa changes went in before that under
> > Missing crucial info:
> >
> > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> >
> >> Merge: 3e75c6d b2bff6c
> >> Author: Linus Torvalds 
> >> Date:   Tue Apr 1 18:49:04 2014 -0700
> >>
> >> Merge tag 'scsi-misc' of
> >> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> >>
> >> can you revalidate that this commit works OK just to make sure?
> 
>  Ok so I don't see those DMA messages and system starts just fine. I'm
>  thinking perhaps something broke after the IO mmu stuff in commit
>  3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
>  causing the CPU stalls and just blame hpsa in the path as a side effect?
> 
>  /me goes out to try the commit.
> >>>
> >>> That's my guess.  The DMAR messages are DMA remapping issues caused in
> >>> the IOMMU.  If I had to guess, I'd say the DMAR fault message is
> >>> indicating the IOMMU is calling for a mapping address before it can
> >>> satisfy the driver read request, which is causing the hang apparently in
> >>> the hpsa driver.
> >>>
> >>> I've added linux-pci to the cc; I think they deal with iommu issues on
> >>> x86.
> >>
> >> So that merge commit appears to be the culprit, I see both the DMA
> >> messages and the lockup blaming hpsa...
> > 
> > My understanding so far (please correct me if I'm wrong):
> > 
> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 

Re: [PATCH] hpsa: fix uninitialized trans_support in hpsa_put_ctlr_into_performant_mode()

2014-04-10 Thread Baoquan He

This patch works for me.

Tested-by: Baoquan He 

Thanks
Baoquan

On 04/10/14 at 05:17pm, scame...@beardog.cce.hp.com wrote:
> 
> Without this, you'll see a null pointer dereference in
> hpsa_enter_performant_mode().
> 
> Signed-off-by: Stephen M. Cameron 
> ---
>  drivers/scsi/hpsa.c |4 
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 8cf4a0c..ef4dfdd 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -7463,6 +7463,10 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
> ctlr_info *h)
>   if (hpsa_simple_mode)
>   return;
>  
> + trans_support = readl(&(h->cfgtable->TransportSupport));
> + if (!(trans_support & PERFORMANT_MODE))
> + return;
> +
>   /* Check for I/O accelerator mode support */
>   if (trans_support & CFGTBL_Trans_io_accel1) {
>   transMethod |= CFGTBL_Trans_io_accel1 |
> -- 
> 1.7.1
> 
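
The ordering that the patch restores, reading the capability word before testing any bit in it, can be sketched like this. The bit values, `cfgtable_sketch` and `pick_trans_method` are hypothetical; the real definitions live in the hpsa headers, and the real function reads the register through readl().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones live in the driver headers. */
#define SK_PERFORMANT_MODE	(1u << 2)
#define SK_IO_ACCEL1		(1u << 7)

/* Stand-in for the controller's config table register. */
struct cfgtable_sketch { uint32_t transport_support; };

/*
 * Returns the transport method bits to request, or 0 if performant
 * mode must not be entered.
 */
static uint32_t pick_trans_method(const struct cfgtable_sketch *cfg,
				  bool simple_mode)
{
	uint32_t trans_support;
	uint32_t method = SK_PERFORMANT_MODE;

	if (simple_mode)
		return 0;

	/* Read the capability word before testing any bit in it. */
	trans_support = cfg->transport_support;
	if (!(trans_support & SK_PERFORMANT_MODE))
		return 0;

	/* Only now is it safe to look at the optional accelerator bit. */
	if (trans_support & SK_IO_ACCEL1)
		method |= SK_IO_ACCEL1;
	return method;
}
```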

Re: [PATCH 3/4] blk-mq: move request structures into struct blk_mq_tags

2014-04-10 Thread Jens Axboe


On 2014-04-10 04:01, Christoph Hellwig wrote:

On Wed, Apr 09, 2014 at 10:23:32AM -0600, Jens Axboe wrote:

This should go into block/blk-mq-tag.h.


Ok.


We might as well leave this, the mtip32xx conversion ends up using it. So
if we pull it now, it'll just be reintroduced shortly.


It's back in the latest revision of the patch, just taking a
struct blk_mq_tag pointer now so that it can be used by SCSI as well.

I've also changed an open-coded variant of it to use the helper.


Great. Will you send out an updated patchset?

--
Jens Axboe


Re: esp_scsi QTAG in FAS216

2014-04-10 Thread Michael Schmitz

Hello Kars,


>> > I've never seen a formula for any ESP or FAS chip for the timeout
>> > other than the one mentioned in huge comment in
>> > esp_set_clock_params(), although I do see the 7668 instead of 8192
>> > factor being used in the old NCR53C9x driver.
>>
>> I haven't gone far enough back in the 53C9x revision history to be
>> certain, but it would seem to me that Kars de Jong added that FAS
>> special case.
>>
>> Can you confirm that, Kars? Any recollection as to the reason?
>
> That is the value that's in the data manual of the Symbios Logic
> SYM53CF94/96-2 (the actual chip that's in my Amiga SCSI controller).
>
> Funny, according to the QLogic FAS2x6 manual the value should be 7682
> for FAS216/216U/236/236U chips...
>
> I don't think it's all that important. It only means that the actual
> selection timeout used by the chip will be slightly shorter than it is
> supposed to be.

Thanks for checking that - I agree that it might not amount to much.

The more important issue is the one about the one-byte reconnect
message. Does the manual speak to that particular issue? Any hint on
how we could enable SCSI-2 features on chip init?

Can you point me to a source for the manuals if possible?

Cheers,

  Michael
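
The effect of the differing factors can be worked through numerically. The sketch below assumes the register value is computed as timeout * clock / (factor * clock conversion factor), and that the driver computes with 8192 while the chip divides by 7668; both the formula and the sample numbers (250 ms, 40 MHz, conversion factor 8) are assumptions for illustration, not values taken from a data manual.

```c
#include <stdint.h>

/*
 * Register value for a desired timeout, assuming
 *   reg = (timeout * clock) / (factor * ccf)
 * with timeout in microseconds and clock in MHz (cycles per microsecond).
 */
static uint32_t timeout_reg(uint32_t timeout_us, uint32_t clk_mhz,
			    uint32_t factor, uint32_t ccf)
{
	return (uint32_t)(((uint64_t)timeout_us * clk_mhz) /
			  ((uint64_t)factor * ccf));
}

/*
 * Timeout the chip actually applies for a given register value,
 * if the chip divides by chip_factor.
 */
static uint32_t effective_timeout_us(uint32_t reg, uint32_t clk_mhz,
				     uint32_t chip_factor, uint32_t ccf)
{
	return (uint32_t)(((uint64_t)reg * chip_factor * ccf) / clk_mhz);
}
```

With these sample numbers, `timeout_reg(250000, 40, 8192, 8)` gives 152, and `effective_timeout_us(152, 40, 7668, 8)` comes out at roughly 233 ms rather than the intended 250 ms, which matches the "slightly shorter" selection timeout mentioned above.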

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Baoquan He

On 04/10/14 at 04:34pm, Jiang Liu wrote:
> Hi Baoquan,
>   Could you please help to give output of "lspci -"?
> Is device "hpsa :03:00.0" a legacy PCI device(non-PCIe)?
> It may have relationship with IOMMU driver.
> Thanks!
> Gerry

Hi,

I just saw your mail now. Do you still need the output of "lspci -"
on my test machine?

In fact, I didn't see any DMAR errors related to Intel VT-d issues.

If the output is helpful, I can make a fresh build to do this.

Thanks
Baoquan

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GIT PULL] async scsi resume for 3.15

2014-04-10 Thread Dan Williams

Hi Linus,

James might still be in the process of sending this your way.  However,
given the proximity to -rc1, my reasoning for sending this directly is:

1/ It provides a tangible speed up for a non-esoteric use case (laptop
resume):

  
https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach

2/ You already pulled the first half of this enabling from Tejun.
Quoting Tejun's ATA pull request:

  "Dan finishes the patchset to make libata PM operations
   asynchronous.  Combined with one patch being routed through scsi,
   this should speed resume measurably."

3/ As far as I can tell it is acceptable to James:

  http://marc.info/?l=linux-scsi&m=139499409510791&w=2
  http://marc.info/?l=linux-scsi&m=139508044602605&w=2
  http://marc.info/?l=linux-scsi&m=139536062515216&w=2

4/ I promised Todd I would get it upstream before he returns from
vacation.

Please pull, thank you.

--
Dan

The following changes since commit 455c6fdbd219161bd09b1165f11699d6d73de11c:

  Linux 3.14 (2014-03-30 20:40:15 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/djbw/isci async-scsi-resume

for you to fetch changes up to 3c31b52f96f7b559d950b16113c0f68c72a1985e:

  scsi: async sd resume (2014-04-10 15:30:35 -0700)


Dan Williams (1):
  scsi: async sd resume

 drivers/scsi/Kconfig |   3 ++
 drivers/scsi/scsi.c  |   9 
 drivers/scsi/scsi_pm.c   | 128 ---
 drivers/scsi/scsi_priv.h |   2 +
 drivers/scsi/scsi_scan.c |   2 +-
 drivers/scsi/sd.c|   1 +
 6 files changed, 115 insertions(+), 30 deletions(-)


8<-
From 3c31b52f96f7b559d950b16113c0f68c72a1985e Mon Sep 17 00:00:00 2001
From: Dan Williams 
Date: Thu, 10 Apr 2014 15:30:35 -0700
Subject: [PATCH] scsi: async sd resume

async_schedule() sd resume work to allow disks and other devices to
resume in parallel.

This moves the entirety of scsi_device resume to an async context to
ensure that scsi_device_resume() remains ordered with respect to the
completion of the start/stop command.  For the duration of the resume,
new command submissions (that do not originate from the scsi-core) will
be deferred (BLKPREP_DEFER).

It adds a new ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain) as a container
of these operations.  Like scsi_sd_probe_domain it is flushed at
sd_remove() time to ensure async ops do not continue past the
end-of-life of the sdev.  The implementation explicitly refrains from
reusing scsi_sd_probe_domain directly for this purpose as it is flushed
at the end of dpm_resume(), potentially defeating some of the benefit.
Given sdevs are quiesced it is permissible for these resume operations
to bleed past the async_synchronize_full() calls made by the driver
core.

We defer the resolution of which pm callback to call until
scsi_dev_type_{suspend|resume} time and guarantee that the callback
parameter is never NULL.  With this in place the type of resume
operation is encoded in the async function identifier.

There is a concern that async resume could trigger PSU overload.  In the
enterprise, storage enclosures enforce staggered spin-up regardless of
what the kernel does making async scanning safe by default.  Outside of
that context a user can disable asynchronous scanning via a kernel
command line or CONFIG_SCSI_SCAN_ASYNC.  Honor that setting when
deciding whether to do resume asynchronously.

Inspired by Todd's analysis and initial proposal [2]:
https://01.org/suspendresume/blogs/tebrandt/2013/hard-disk-resume-optimization-simpler-approach

Cc: Len Brown 
Cc: Phillip Susi 
[alan: bug fix and clean up suggestion]
Acked-by: Alan Stern 
Suggested-by: Todd Brandt 
[djbw: kick all resume work to the async queue]
Signed-off-by: Dan Williams 
---
 drivers/scsi/Kconfig |   3 ++
 drivers/scsi/scsi.c  |   9 
 drivers/scsi/scsi_pm.c   | 128 ---
 drivers/scsi/scsi_priv.h |   2 +
 drivers/scsi/scsi_scan.c |   2 +-
 drivers/scsi/sd.c|   1 +
 6 files changed, 115 insertions(+), 30 deletions(-)

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index c8bd092..02832d6 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -263,6 +263,9 @@ config SCSI_SCAN_ASYNC
  You can override this choice by specifying "scsi_mod.scan=sync"
  or async on the kernel's command line.
 
+ Note that this setting also affects whether resuming from
+ system suspend will be performed asynchronously.
+
 menu "SCSI Transports"
depends on SCSI
 
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index d8afec8..1b345bf 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -91,6 +91,15 @@ EXPORT_SYMBOL(scsi_logging_level);
 ASYNC_DOMAIN(scsi_sd_probe_domain);
 EXPORT_SYMBOL(scsi_sd_probe_domain);
 
+/*
+ * Separate domain (from scsi_sd_probe_domain) to maximize the benef

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Shuah Khan

On Thu, Apr 10, 2014 at 2:45 PM,   wrote:
>> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
>>
>> Yes, specifically (finally done bisecting):
>>
>> commit 2e45528930388658603ea24d49cf52867b928d3e
>> Author: Jiang Liu 
>> Date:   Wed Feb 19 14:07:36 2014 +0800
>>
>> iommu/vt-d: Unify the way to process DMAR device scope array
>>
>> Now we have a PCI bus notification based mechanism to update DMAR
>> device scope array, we could extend the mechanism to support boot
>> time initialization too, which will help to unify and simplify
>> the implementation.
>>
>> Signed-off-by: Jiang Liu 
>> Signed-off-by: Joerg Roedel 
>
> My git bisect appears to be converging on something else, something
> within the hpsa patches that I sent up recently, unfortunately for
> me.  Will let you all know when it converges.
>

This smells very much like the problem that was solved a couple of years
ago for the SI domain. It is likely that this path was broken by the DMAR
device scope array change. Please check whether the following check
still takes effect. It looks like the BIOS could be expecting this RMRR
to still be mapped.

	/*
	 * We want to prevent any device associated with an RMRR from
	 * getting placed into the SI Domain. This is done because
	 * problems exist when devices are moved in and out of domains
	 * and their respective RMRR info is lost. We exempt USB devices
	 * from this process due to their usage of RMRRs that are known
	 * to not be needed after BIOS hand-off to OS.
	 */
	if (device_has_rmrr(dev) &&
	    (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
		return 0;
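
To make the effect of that check concrete, here is a minimal user-space
sketch of just this one condition. It is not the intel-iommu code: the
struct fake_pci_dev type, its has_rmrr field (standing in for
device_has_rmrr()) and the function names are hypothetical. The only
things taken from the kernel are the 0x0c03 value of PCI_CLASS_SERIAL_USB
and the layout of the class word (base class, subclass, prog-if).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Base class 0x0c (serial bus), subclass 0x03 (USB). */
#define PCI_CLASS_SERIAL_USB 0x0c03

/* Hypothetical stand-in for the two things the check looks at. */
struct fake_pci_dev {
	uint32_t class;	/* base class << 16 | subclass << 8 | prog-if */
	bool has_rmrr;	/* assumption: models device_has_rmrr(dev) */
};

/*
 * Returns false when the quoted check would keep the device out of the
 * SI (identity-mapped) domain, true otherwise. Any further checks the
 * real function performs are deliberately left out.
 */
bool may_use_si_domain(const struct fake_pci_dev *pdev)
{
	if (pdev->has_rmrr &&
	    (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
		return false;
	return true;
}

/* Small usage example; call it from main() to run the checks. */
void example_si_domain_check(void)
{
	struct fake_pci_dev raid = { .class = 0x010400, .has_rmrr = true };
	struct fake_pci_dev usb  = { .class = 0x0c0330, .has_rmrr = true };
	struct fake_pci_dev nic  = { .class = 0x020000, .has_rmrr = false };

	assert(!may_use_si_domain(&raid));	/* RMRR, not USB: kept out */
	assert(may_use_si_domain(&usb));	/* RMRR but USB: exempt */
	assert(may_use_si_domain(&nic));	/* no RMRR: unaffected */
}
```

In this model a storage controller with an RMRR stays out of the SI
domain, which is the behaviour the comment above describes.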

-- Shuah

Re: [PATCH] hpsa: fix uninitialized trans_support in hpsa_put_ctlr_into_performant_mode()

2014-04-10 Thread Davidlohr Bueso

On Thu, 2014-04-10 at 17:17 -0500, scame...@beardog.cce.hp.com wrote:
> Without this, you'll see a null pointer dereference in
> hpsa_enter_performant_mode().

So I'm not surprised that this patch doesn't solve the problem I am
seeing with DMAR and the hpsa driver hard lockup.

In any case it should address Baoquan's original report, so it confirms
that it is in fact two different sets of issues.


[PATCH] hpsa: fix uninitialized trans_support in hpsa_put_ctlr_into_performant_mode()

2014-04-10 Thread scameron


Without this, you'll see a null pointer dereference in
hpsa_enter_performant_mode().

Signed-off-by: Stephen M. Cameron 
---
 drivers/scsi/hpsa.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 8cf4a0c..ef4dfdd 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -7463,6 +7463,10 @@ static void hpsa_put_ctlr_into_performant_mode(struct ctlr_info *h)
 	if (hpsa_simple_mode)
 		return;
 
+	trans_support = readl(&(h->cfgtable->TransportSupport));
+	if (!(trans_support & PERFORMANT_MODE))
+		return;
+
 	/* Check for I/O accelerator mode support */
 	if (trans_support & CFGTBL_Trans_io_accel1) {
 		transMethod |= CFGTBL_Trans_io_accel1 |
-- 
1.7.1
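
For readers following along, a minimal user-space sketch of the ordering
this patch restores: read the capability word first, then test bits in
it. This is not driver code. The struct fake_cfgtable type, the CAP_*
bit values and choose_transport() are hypothetical, and a plain field
read stands in for readl().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bits; the values are illustrative only. */
#define CAP_PERFORMANT	0x04u
#define CAP_IO_ACCEL1	0x80u

/* Hypothetical stand-in for the controller's config table. */
struct fake_cfgtable {
	uint32_t transport_support;
};

enum transport {
	STAY_SIMPLE,
	PERFORMANT,
	PERFORMANT_WITH_ACCEL,
};

/*
 * Decide which transport mode to enter. The capability word is read
 * before any bit in it is tested, and the function bails out early
 * when performant mode is not advertised at all.
 */
enum transport choose_transport(const struct fake_cfgtable *cfg,
				int simple_mode_forced)
{
	uint32_t trans_support;

	if (simple_mode_forced)
		return STAY_SIMPLE;

	trans_support = cfg->transport_support;	/* models readl() */
	if (!(trans_support & CAP_PERFORMANT))
		return STAY_SIMPLE;

	if (trans_support & CAP_IO_ACCEL1)
		return PERFORMANT_WITH_ACCEL;
	return PERFORMANT;
}

/* Small usage example; call it from main() to run the checks. */
void example_choose_transport(void)
{
	struct fake_cfgtable plain = { .transport_support = 0x02u };
	struct fake_cfgtable perf  = { .transport_support = CAP_PERFORMANT };
	struct fake_cfgtable accel = {
		.transport_support = CAP_PERFORMANT | CAP_IO_ACCEL1,
	};

	assert(choose_transport(&plain, 0) == STAY_SIMPLE);
	assert(choose_transport(&perf, 0) == PERFORMANT);
	assert(choose_transport(&accel, 0) == PERFORMANT_WITH_ACCEL);
	assert(choose_transport(&accel, 1) == STAY_SIMPLE);
}
```

Without the read, the bit tests would run on whatever happened to be in
the local variable, which is the situation the subject line describes.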


Re: hpsa NULL pointer in hpsa_enter_performant_mode()

2014-04-10 Thread scameron

On Thu, Apr 10, 2014 at 04:20:46PM -0500, scame...@beardog.cce.hp.com wrote:
> On Thu, Apr 10, 2014 at 02:53:30PM -0600, Bjorn Helgaas wrote:
> > [subject changed]
> > 
> > On Thu, Apr 10, 2014 at 2:45 PM,   wrote:
> > > On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> > >> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> > >> > My understanding so far (please correct me if I'm wrong):
> > >> >
> > >> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > >> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> > >
> > > ^^^ this one, 1a0b6abaea78, did not work for me, crashing in
> > > hpsa_enter_performant_mode() which was surprising to me as I am
> > > pretty sure I tried on this very same machine I'm using now
> > > (DL360p wi

Re: hpsa NULL pointer in hpsa_enter_performant_mode()

2014-04-10 Thread scameron

On Thu, Apr 10, 2014 at 02:53:30PM -0600, Bjorn Helgaas wrote:
> [subject changed]
> 
> On Thu, Apr 10, 2014 at 2:45 PM,   wrote:
> > On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> >> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> >> > My understanding so far (please correct me if I'm wrong):
> >> >
> >> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> >> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> >
> > ^^^ this one, 1a0b6abaea78, did not work for me, crashing in
> > hpsa_enter_performant_mode() which was surprising to me as I am
> > pretty sure I tried on this very same machine I'm using now
> > (DL360p with P420, P430 and P420i) with 3.14-rc-something plus
> > all the hpsa patches that I thought were merged in.
> 
> I think we have two completely different issues mixed together in this
> thread, so I changed the subject here.

Thanks.

> 
> The reports above for 39de65aa2c3e, 1a0b6abaea78, were for a DMA fault.
> 
> The original message from Baoquan He was for

hpsa NULL pointer in hpsa_enter_performant_mode()

2014-04-10 Thread Bjorn Helgaas

[subject changed]

On Thu, Apr 10, 2014 at 2:45 PM,   wrote:
> On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
>> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
>> > My understanding so far (please correct me if I'm wrong):
>> >
>> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
>> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
>
> ^^^ this one, 1a0b6abaea78, did not work for me, crashing in
> hpsa_enter_performant_mode() which was surprising to me as I am
> pretty sure I tried on this very same machine I'm using now
> (DL360p with P420, P430 and P420i) with 3.14-rc-something plus
> all the hpsa patches that I thought were merged in.

I think we have two completely different issues mixed together in this
thread, so I changed the subject here.

The reports above for 39de65aa2c3e, 1a0b6abaea78, were for a DMA fault.

The original message from Baoquan He was for a NULL pointer
dereference in hpsa_enter_performant_mode(), which is very likely the
same problem you're seeing, Steve.

I changed the subject to "hpsa NULL pointer in
hpsa_enter_performant_mode()", so hopefully we can chase the NULL
pointer issue there and leave the original, already long thread, for
the DMA fault iss

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread scameron

On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> > My understanding so far (please correct me if I'm wrong):
> > 
> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")

^^^ this one, 1a0b6abaea78, did not work for me, crashing in
hpsa_enter_performant_mode() which was surprising to me as I am
pretty sure I tried on this very same machine I'm using now
(DL360p with P420, P430 and P420i) with 3.14-rc-something plus
all the hpsa patches that I thought were merged in.

But now I am seeing:

 [] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa]
RSP: 0018:88042c515a78  EFLAGS: 00010297
RAX:  RBX: 88042c65 RCX: 0004
RDX:  RSI: 0001 RDI: 
RBP: 88042c515b48 R08:  R09: 8af03cc0
R10:  R11: 0001 R12: 88042c515a98
R13: 6104 R14: 88042c515ad8 R15: a0001630
FS:  7f86f7a38700() GS:88043f56() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
usb 1-1.6: new low-speed USB device number 3 using ehci-pci
CR2:  CR3: 00042c4c3000 CR4: 000407e0
Stack:
 00

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread James Bottomley

On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote:
> On 04/10/2014 05:31 PM, Alan Stern wrote:
> > On Thu, 10 Apr 2014, Hannes Reinecke wrote:
> >
> >> On 04/10/2014 12:58 PM, Andreas Reis wrote:
> >>> That patch appears to work in preventing the crashes, judged on one
> >>> repeated appearance of the bug.
> >>>
> >>> dmesg had the usual
> >>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
> >>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
> >>> xhci_hcd
> >>> [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
> >>> with disabled ep 880427b829c0
> >>> [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
> >>> with disabled ep 880427b82a08
> >>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
> >>>
> >>> repeated five times, followed by one
> >>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
> >>> recovery
> >>>
> >>> and then as often as something tried to read from it:
> >>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
> >>>
> >>> The stick could then be properly un- and remounted (the latter if it
> >>> had been physically replugged) without issue, only for the bug to
> >>> reoccur after one to three minutes. I tried this three times; there was
> >>> no dmesg difference except that the ep addresses varied in two of them.
> >>>
> >> Was this just that patch you've tested with or the entire patch series?
> >>
> >> If the latter, Alan, is this the expected outcome?
> >
> > Yes, it is.  The same thing should happen with the entire patch series.
> >
> >> I would've thought the error recovery should _not_ run into
> >> offlining devices here, but rather the device should be recovered
> >> eventually.
> >
> > The command times out, it is aborted, and the command is retried.  The
> > same thing happens, and we repeat five times.  Eventually the SCSI core
> > gives up and declares the device to be offline.
> >
> Hmm. Ok. If you are fine with it who am I to argue here.
> James, shall I resend the patch series?

You mean the one patch?  No, it's OK, I have it.

It's still not complete, though, as I've said a couple of times.  The
problem is that we have abort memory on any eh command as well, which
this doesn't fix.

The scenario is abort command, set flag, abort completes, send TUR, TUR
doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd()
will skip the abort because the flag is set and move straight to reset.

The fix is this, I can just add it as well.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..7516e2c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
 	ses->prot_op = scmd->prot_op;
 
 	scmd->prot_op = SCSI_PROT_NORMAL;
+	scmd->eh_eflags = 0;
 	scmd->cmnd = ses->eh_cmnd;
 	memset(scmd->cmnd, 0, BLK_MAX_CDB);
 	memset(&scmd->sdb, 0, sizeof(scmd->sdb));

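The scenario described above can be modelled in a few lines of userspace
C. This is only a sketch, not the kernel code: the struct, the flag name
MODEL_EH_ABORT_SCHEDULED and all model_* helpers are hypothetical
stand-ins. The only thing taken from the mail is the idea that a flag
left over from the first abort makes the abort of the follow-up TUR get
skipped, unless the flags are cleared when the eh command is prepared.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "abort already tried" bit in eh_eflags. */
#define MODEL_EH_ABORT_SCHEDULED 0x0002u

struct model_cmnd {
	unsigned int eh_eflags;
};

/* Model of the first abort: remember that an abort was issued. */
static void model_abort(struct model_cmnd *cmd)
{
	cmd->eh_eflags |= MODEL_EH_ABORT_SCHEDULED;
}

/*
 * Model of preparing the command for reuse as an eh command (e.g. a TUR).
 * clear_flags == true corresponds to the one-line fix in the patch.
 */
static void model_prep_eh_cmnd(struct model_cmnd *cmd, bool clear_flags)
{
	if (clear_flags)
		cmd->eh_eflags = 0;
}

/* Returns true if an abort would be attempted, false if it is skipped. */
static bool model_would_abort(const struct model_cmnd *cmd)
{
	return !(cmd->eh_eflags & MODEL_EH_ABORT_SCHEDULED);
}

/* Walk the scenario: abort, reuse the command for a TUR, TUR hangs. */
static bool model_tur_gets_aborted(bool with_fix)
{
	struct model_cmnd cmd = { 0 };

	model_abort(&cmd);
	model_prep_eh_cmnd(&cmd, with_fix);
	return model_would_abort(&cmd);
}
```

In this model, model_tur_gets_aborted(false) reports that the abort is
skipped, and model_tur_gets_aborted(true) reports that it is attempted.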

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Update Maintainers for IBM Power 842, vscsi, and vfc drivers

2014-04-10 Thread Brian King

On 04/09/2014 01:32 PM, Nathan Fontenot wrote:
> Update the MAINTAINERS file to indicate the current maintainers
> for the IBM Power 842 Compression driver, IBM Power Virtual SCSI
> driver and the IBM Power Virtual FC Driver.
> 
> Signed-off-by: Nathan Fontenot 
> ---

Acked-by: Brian King 

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Hannes Reinecke


On 04/10/2014 05:31 PM, Alan Stern wrote:
> On Thu, 10 Apr 2014, Hannes Reinecke wrote:
>
>> On 04/10/2014 12:58 PM, Andreas Reis wrote:
>>> That patch appears to work in preventing the crashes, judged on one
>>> repeated appearance of the bug.
>>>
>>> dmesg had the usual
>>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
>>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
>>> xhci_hcd
>>> [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
>>> with disabled ep 880427b829c0
>>> [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
>>> with disabled ep 880427b82a08
>>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
>>>
>>> repeated five times, followed by one
>>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
>>> recovery
>>>
>>> and then as often as something tried to read from it:
>>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
>>>
>>> The stick could then be properly un- and remounted (the latter if it
>>> had been physically replugged) without issue - for the bug to
>>> reoccur after one to three minutes. I tried this three times, no
>>> dmesg difference except the ep addresses varied on two of that.
>>>
>> Was this just that patch you've tested with or the entire patch series?
>>
>> If the latter, Alan, is this the expected outcome?
>
> Yes, it is.  The same thing should happen with the entire patch series.
>
>> I would've thought the error recover should _not_ run into
>> offlining devices here, but rather the device should be recovered
>> eventually.
>
> The command times out, it is aborted, and the command is retried.  The
> same thing happens, and we repeat five times.  Eventually the SCSI core
> gives up and declares the device to be offline.
>
Hmm. Ok. If you are fine with it who am I to argue here.
James, shall I resend the patch series?

Cheers,

Hannes

--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Woodhouse, David

On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
> > > > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 
> > > > > >> > > > > 7f61e000
> > 
> > That "Present bit in context entry is clear" fault means that we have
> > not set up *any* mappings for this PCI device… on this IOMMU.
> > 
> > > > Yes, specifically (finally done bisecting):
> > > > 
> > > > commit 2e45528930388658603ea24d49cf52867b928d3e
> > > > Author: Jiang Liu 
> > > > Date:   Wed Feb 19 14:07:36 2014 +0800
> > > > 
> > > > iommu/vt-d: Unify the way to process DMAR device scope array
> > 
> > This commit is about how we decide which IOMMU a given PCI device is
> > attached to.
> > 
> > Thus, my first guess would be that we are quite happily setting up the
> > requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> > device actually tries to do DMA.
> > 
> > However, I'm not 100% convinced of that. The fault address looks
> > suspiciously like a true physical address, not a virtual bus address of
> > the type that we'd normally allocate for a dma_map_* operation. Those
> > would start at 0xf000 and work downwards, typically.
> > 
> > Do you have 'iommu=pt' on the kernel command line? 
> 
> No.
> 
> > Can I see the full
> > dmesg as this system boots, and also a copy of the DMAR table?
> 
> Attaching a dmesg from one of the kernels that boots. It doesn't appear
> to have much of the related information... 

It shows us that the address 0x7f61e000 is in an E820-reserved region,
and that there's an RMRR covering that region for an unspecified PCI
device, but that's going to be the hpsa.

So if it isn't just a simple case of us assigning this device to the wrong
IOMMU, *perhaps* it's that we lose the RMRR when the driver takes
control of the device. RMRRs are generally expected to be a boot-time
thing, for things like legacy keyboard/mouse emulation via USB. Using
them while the system is *active* is... horrid. We've often not quite
handled that right.

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation



Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Davidlohr Bueso

On Thu, 2014-04-10 at 16:34 +0800, Jiang Liu wrote:
> Hi Baoquan,
>   Could you please help to give output of "lspci -"?

Attached.

> Is device "hpsa :03:00.0" a legacy PCI device(non-PCIe)?
> It may have relationship with IOMMU driver.

I honestly don't know. PCI is way out of my area of knowledge.

00:00.0 Host bridge: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port (rev 
22)
Subsystem: Hewlett-Packard Company Device 330b
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- 

00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 1 (rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 2 (rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 3 (rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 
(rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 
(rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:06.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 6 
(rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 7 (rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 8 (rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: 
Kernel driver in use: pcieport
Kernel modules: shpchp

00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express 
Root Port 9 (rev 22) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- Dis

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Woodhouse, David

On Thu, 2014-04-10 at 09:14 -0600, Bjorn Helgaas wrote:
> > Thus, my first guess would be that we are quite happily setting up the
> > requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> > device actually tries to do DMA.
> >
> I like the "wrong IOMMU (or no IOMMU at all)" theory.  If we didn't
> connect the device with an IOMMU at all, that would explain the device
> DMAing directly to a physical address, wouldn't it?

An unlikely failure mode. We're much more likely to see *wrong* IOMMU
than no IOMMU. And thus we'd still see the distinctive virtual addresses
just below 4GiB.

However, Rob's answer may solve that puzzle. If this is one of those
abominations where the device continues to do DMA to system memory even
after the OS is up and running and *thinks* it has control of the
hardware, then the offending address will be listed in an RMRR entry
(which tells the OS to set up a 1:1 mapping for access to certain memory
ranges for a given device). And will be inside an E820 reserved region.
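
As a rough illustration of what a 1:1 (identity) mapping for such a range
means, here is a minimal userspace sketch. It is not the VT-d code; the
struct and the model_* functions are hypothetical, and the thread does not
give the actual RMRR bounds, so any range passed in is the caller's
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One reserved range that must stay identity-mapped for a device. */
struct model_rmrr {
	uint64_t base;	/* first byte of the range */
	uint64_t end;	/* last byte of the range, inclusive */
};

/* True if a DMA address falls inside the given range. */
static bool model_rmrr_covers(const struct model_rmrr *r, uint64_t addr)
{
	return addr >= r->base && addr <= r->end;
}

/*
 * With a 1:1 mapping the bus address the device uses equals the physical
 * address, so "translation" is the identity inside the range.
 */
static bool model_identity_translate(const struct model_rmrr *r,
				     uint64_t bus_addr, uint64_t *phys)
{
	if (!model_rmrr_covers(r, bus_addr))
		return false;	/* no mapping: this is where a fault is raised */
	*phys = bus_addr;
	return true;
}
```

If the mapping for the range is missing or dropped, a device that keeps
DMAing to an address such as 7f61e000 would hit the "no mapping" branch,
which is the kind of fault being discussed here.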

A little odd that such an error would trigger only when we're actually
trying to initialise the device from the Linux driver, not as soon as we
enable the IOMMU. But all things are possible.

But the DMAR table and dmesg that I asked for would give us a bit more
information and hopefully let us stop speculating...

> > We should also rate-limit DMA faults, which would avoid the lockup
> > failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> > that a device is creating an endless stream of DMA faults and isn't
> > aborting the transaction?
> 
> You mentioned that POWER with EEH does something intelligent in this
> case, but I'm not familiar with that code.  We have AER support, which
> can result in resetting a device, but I think DMA faults are reported
> differently, and I don't think there's any nice existing way for PCI
> to deal with them.  Maybe there should be, though.

Quite frankly, I don't care how *you* deal with them, or even if you
can. All I want to know is how I tell you about the problem, because *I*
sure as hell don't want to be trying to deal with it in the IOMMU code.
That's a generic PCI layer thing. :)

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation


Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Linda Knippers

On 4/10/2014 11:14 AM, Bjorn Helgaas wrote:
> On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David
>  wrote:
> 
>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>> dmar: DRHD: handling fault status reg 602
>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>
>> That "Present bit in context entry is clear" fault means that we have
>> not set up *any* mappings for this PCI device… on this IOMMU.
>>
>>>> Yes, specifically (finally done bisecting):
>>>>
>>>> commit 2e45528930388658603ea24d49cf52867b928d3e
>>>> Author: Jiang Liu 
>>>> Date:   Wed Feb 19 14:07:36 2014 +0800
>>>>
>>>> iommu/vt-d: Unify the way to process DMAR device scope array
>>
>> This commit is about how we decide which IOMMU a given PCI device is
>> attached to.
>>
>> Thus, my first guess would be that we are quite happily setting up the
>> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
>> device actually tries to do DMA.
>>
>> However, I'm not 100% convinced of that. The fault address looks
>> suspiciously like a true physical address, not a virtual bus address of
>> the type that we'd normally allocate for a dma_map_* operation. Those
>> would start at 0xf000 and work downwards, typically.
> 
> I like the "wrong IOMMU (or no IOMMU at all)" theory.  If we didn't
> connect the device with an IOMMU at all, that would explain the device
> DMAing directly to a physical address, wouldn't it?
> 
>> Do you have 'iommu=pt' on the kernel command line? Can I see the full
>> dmesg as this system boots, and also a copy of the DMAR table?

This will be really helpful information.  This box has devices with
RMRR records and if they're not set up correctly, DMAR faults can occur.

>>
>> We should also rate-limit DMA faults, which would avoid the lockup
>> failure mode. Bjorn, what should an IOMMU driver *do* when it detects
>> that a device is creating an endless stream of DMA faults and isn't
>> aborting the transaction?
> 
> You mentioned that POWER with EEH does something intelligent in this
> case, but I'm not familiar with that code.  We have AER support, which
> can result in resetting a device, but I think DMA faults are reported
> differently, and I don't think there's any nice existing way for PCI
> to deal with them.  Maybe there should be, though.
> 
> Bjorn
> ___
> iommu mailing list
> io...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> 


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Alan Stern

On Thu, 10 Apr 2014, Hannes Reinecke wrote:

> On 04/10/2014 12:58 PM, Andreas Reis wrote:
> > That patch appears to work in preventing the crashes, judged on one
> > repeated appearance of the bug.
> > 
> > dmesg had the usual
> > [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
> > [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
> > xhci_hcd
> > [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
> > with disabled ep 880427b829c0
> > [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
> > with disabled ep 880427b82a08
> > [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
> > 
> > repeated five times, followed by one
> > [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
> > recovery
> > 
> > and then as often as something tried to read from it:
> > [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
> > 
> > The stick could then be properly un- and remounted (the latter if it
> > had been physically replugged) without issue - for the bug to
> > reoccur after one to three minutes. I tried this three times, no
> > dmesg difference except the ep addresses varied on two of that.
> > 
> Was this just that patch you've tested with or the entire patch series?
> 
> If the latter, Alan, is this the expected outcome?

Yes, it is.  The same thing should happen with the entire patch series.

> I would've thought the error recover should _not_ run into
> offlining devices here, but rather the device should be recovered
> eventually.

The command times out, it is aborted, and the command is retried.  The
same thing happens, and we repeat five times.  Eventually the SCSI core
gives up and declares the device to be offline.
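
The sequence just described can be sketched as a small userspace model.
Everything here is hypothetical (names, the callback, the enum); only the
"retry five times, then offline" shape comes from the paragraph above, and
the real SCSI core logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ATTEMPTS 5	/* "we repeat five times" in the text above */

enum model_state { MODEL_RUNNING, MODEL_OFFLINE };

/*
 * Issue a command up to MODEL_MAX_ATTEMPTS times. cmd_times_out() is a
 * caller-supplied stand-in for "the command timed out and was aborted".
 */
static enum model_state model_run_command(bool (*cmd_times_out)(int attempt),
					  int *attempts_used)
{
	int attempt;

	for (attempt = 1; attempt <= MODEL_MAX_ATTEMPTS; attempt++) {
		*attempts_used = attempt;
		if (!cmd_times_out(attempt))
			return MODEL_RUNNING;	/* command completed */
		/* timed out: abort it and fall through to the retry */
	}
	return MODEL_OFFLINE;	/* give up and offline the device */
}

/* Two sample device behaviours for trying the model out. */
static bool model_always_times_out(int attempt)
{
	(void)attempt;
	return true;
}

static bool model_recovers_on_third(int attempt)
{
	return attempt < 3;
}
```

A device that never answers ends up offline after the fifth attempt; one
that starts answering on the third attempt stays online.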

Alan Stern


Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Bjorn Helgaas

On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David
 wrote:

>> > > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
>> > > >> > > > > dmar: DRHD: handling fault status reg 602
>> > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 
>> > > >> > > > > 7f61e000
>
> That "Present bit in context entry is clear" fault means that we have
> not set up *any* mappings for this PCI device… on this IOMMU.
>
>> > Yes, specifically (finally done bisecting):
>> >
>> > commit 2e45528930388658603ea24d49cf52867b928d3e
>> > Author: Jiang Liu 
>> > Date:   Wed Feb 19 14:07:36 2014 +0800
>> >
>> > iommu/vt-d: Unify the way to process DMAR device scope array
>
> This commit is about how we decide which IOMMU a given PCI device is
> attached to.
>
> Thus, my first guess would be that we are quite happily setting up the
> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> device actually tries to do DMA.
>
> However, I'm not 100% convinced of that. The fault address looks
> suspiciously like a true physical address, not a virtual bus address of
> the type that we'd normally allocate for a dma_map_* operation. Those
> would start at 0xf000 and work downwards, typically.

I like the "wrong IOMMU (or no IOMMU at all)" theory.  If we didn't
connect the device with an IOMMU at all, that would explain the device
DMAing directly to a physical address, wouldn't it?

> Do you have 'iommu=pt' on the kernel command line? Can I see the full
> dmesg as this system boots, and also a copy of the DMAR table?
>
> We should also rate-limit DMA faults, which would avoid the lockup
> failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> that a device is creating an endless stream of DMA faults and isn't
> aborting the transaction?

You mentioned that POWER with EEH does something intelligent in this
case, but I'm not familiar with that code.  We have AER support, which
can result in resetting a device, but I think DMA faults are reported
differently, and I don't think there's any nice existing way for PCI
to deal with them.  Maybe there should be, though.

Bjorn

[PATCH] sd: medium access timeout counter fails to reset

2014-04-10 Thread David Jeffery

There is an error with the medium access timeout feature of the sd driver. The
sdkp->medium_access_timed_out value is reset to zero in sd_done() in the wrong
place.  Currently it is reset to zero only when a command returns sense data.
This can result in cases where the medium access check falsely triggers from
timed out commands which are hours or days apart.

For example, an I/O command times out and is aborted.  It then retries and
succeeds.  But with no sense data generated and returned, the
medium_access_timed_out value is not reset.  If no sd command returns sense
data, then the next command to time out (however far in time from the first
failure) will trigger the medium access timeout and put the device offline.

The resetting of sdkp->medium_access_timed_out should occur before the check
for sense data.

Signed-off-by: David Jeffery 

---

To reproduce using scsi_debug, use SCSI_DEBUG_OPT_TIMEOUT or
SCSI_DEBUG_OPT_MAC_TIMEOUT to force an I/O command to time out.  Then, remove
the opt value so the I/O will succeed on retry.  Perform more I/O as desired.
Finally, repeat the process to make a new I/O command time out.  Without the
patch, the device will be marked offline even though many I/O commands have
succeeded between the 2 instances of timed out commands.
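
The effect of where the reset sits can be shown with a small userspace
model. This is a sketch under assumptions, not the sd driver: the struct,
the model_* functions and the threshold of 2 are illustrative choices; only
the counter name and the two reset placements follow the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_MEDIUM_TIMEOUTS 2	/* assumed threshold for this sketch */

struct model_disk {
	unsigned int medium_access_timed_out;
	bool offline;
};

/* A command timed out: count it and offline the disk at the threshold. */
static void model_timeout(struct model_disk *d)
{
	if (++d->medium_access_timed_out >= MODEL_MAX_MEDIUM_TIMEOUTS)
		d->offline = true;
}

/*
 * A command completed. reset_before_sense_check selects the patched
 * ordering; has_sense says whether the command returned sense data.
 */
static void model_done(struct model_disk *d, bool has_sense,
		       bool reset_before_sense_check)
{
	if (reset_before_sense_check)
		d->medium_access_timed_out = 0;	/* patched placement */

	if (!has_sense)
		return;				/* the "goto out" path */

	if (!reset_before_sense_check)
		d->medium_access_timed_out = 0;	/* original placement */
}

/* Timeout, successful retry without sense data, then a later timeout. */
static bool model_goes_offline(bool patched)
{
	struct model_disk d = { 0, false };

	model_timeout(&d);
	model_done(&d, false, patched);
	model_timeout(&d);
	return d.offline;
}
```

With the original placement the second, unrelated timeout takes the disk
offline in this model; with the patched placement it does not.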


diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 470954a..a41e68e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1689,12 +1689,12 @@ static int sd_done(struct scsi_cmnd *SCpnt)
   sshdr.ascq));
 	}
 #endif
+	sdkp->medium_access_timed_out = 0;
+
 	if (driver_byte(result) != DRIVER_SENSE &&
 	    (!sense_valid || sense_deferred))
 		goto out;
 
-	sdkp->medium_access_timed_out = 0;
-
 	switch (sshdr.sense_key) {
 	case HARDWARE_ERROR:
 	case MEDIUM_ERROR:

Re: esp_scsi QTAG in FAS216

2014-04-10 Thread Kars de Jong

2014-04-06 22:33 GMT+02:00 Michael Schmitz :
>
> Hello Dave, Tuomas,
>
> >> Also, looking at the timeout formulae in the old NCR53C9x.c driver,
> >> the values would be different for FAS216. Why was this dropped from
> >> the modern esp_scsi?
> >
> > I've never seen a formula for any ESP or FAS chip for the timeout
> > other than the one mentioned in huge comment in
> > esp_set_clock_params(), although I do see the 7668 instead of 8192
> > factor being used in the old NCR53C9x driver.
>
> I haven't gone far enough back in the 53C9x revision history to be
> certain, but it would seem to me that Kars de Jong added that FAS
> special case.
>
> Can you confirm that, Kars? Any recollection as to the reason?

That is the value that's in the data manual of the Symbios Logic
SYM53CF94/96-2 (the actual chip that's in my Amiga SCSI controller).

Funny, according to the QLogic FAS2x6 manual the value should be 7682
for FAS216/216U/236/236U chips...

I don't think it's all that important. It only means that the actual
selection timeout used by the chip will be slightly shorter than it is
supposed to be.
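
To put a rough number on "slightly shorter", here is a sketch that assumes
the timeout register is computed as period * clock / (factor * ccf), which
is how I read the comment referred to earlier in the thread; that formula,
the function names and the floating-point treatment (the real register is
an integer, so rounding is ignored) are all assumptions of this example.

```c
#include <assert.h>

/*
 * Assumed relation: reg = period * clock / (factor * ccf).
 * period in seconds, clock in Hz, ccf = clock conversion factor.
 */
static double model_sto_register(double period, double clock_hz,
				 double factor, double ccf)
{
	return period * clock_hz / (factor * ccf);
}

/* Inverse: the period the chip actually waits for a register value. */
static double model_sto_period(double reg, double clock_hz,
			       double factor, double ccf)
{
	return reg * factor * ccf / clock_hz;
}

/*
 * Program the register with driver_factor, then ask how long a chip
 * whose real factor is chip_factor would wait.
 */
static double model_effective_period(double period, double clock_hz,
				     double ccf, double driver_factor,
				     double chip_factor)
{
	double reg = model_sto_register(period, clock_hz, driver_factor, ccf);

	return model_sto_period(reg, clock_hz, chip_factor, ccf);
}
```

Under these assumptions the clock and ccf cancel out, and the effective
timeout is simply the intended one scaled by chip_factor / driver_factor,
so a driver using 8192 on a chip that behaves per 7668 or 7682 would time
out roughly 6% early.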


Kind regards,

Kars.

Re: [PATCH] ch: add refcounting

2014-04-10 Thread Hannes Reinecke

On 04/10/2014 01:51 PM, Christoph Hellwig wrote:
>>  static int
>>  ch_release(struct inode *inode, struct file *file)
>>  {
>>  scsi_changer *ch = file->private_data;
>>  
>>  scsi_device_put(ch->device);
>> +ch->device = NULL;
>>  file->private_data = NULL;
>> +kref_put(&ch->ref, ch_destroy);
> 
> Any reason you need to put the scsi_device here already?  Deferring
> this would give you much easier lifetime rules, and no need to
> deal with a NULL ch->device ever.
> 
Sure. But this would require a far more in-depth analysis of the
lifetime of the ch object, and most likely a far more intrusive
patch. You're welcome to do so :-)

This patch is just a minimal fix; I didn't dare to change too much
of the internals.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Hannes Reinecke

On 04/10/2014 02:26 PM, Andreas Reis wrote:
> Only your 0/3 patch to which Alan linked, along with two other
> patches by Mathias Nyman ("disable usb3 on intel hosts" and "disable
> all lpm related control transfers", one of which is the source of
> the "do nothing"s).
> 
> I'll revert the latter two and apply the rest of the set. Which I'm
> guessing currently consists of said 0/3 patch —
> http://www.spinics.net/lists/linux-scsi/msg73502.html
> — plus 2/3 and 3/3?
> 
Yes, that is correct.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Andreas Reis

Only your 0/3 patch to which Alan linked, along with two other patches 
by Mathias Nyman ("disable usb3 on intel hosts" and "disable all lpm 
related control transfers", one of which is the source of the "do 
nothing"s).


I'll revert the latter two and apply the rest of the set. Which I'm 
guessing currently consists of said 0/3 patch —

http://www.spinics.net/lists/linux-scsi/msg73502.html
— plus 2/3 and 3/3?

Or should I just omit 0/3 and try whichever of the two in 1/3 "works 
best"? Rather confusing ATM.


Anyway, for whatever reason the bug is happening rather frequently now. 
I've spotted the following occurring after the "Device offlined" line 
two times now:


[  206.901385] sd 11:0:0:0: [sdg] Unhandled error code
[  206.901394] sd 11:0:0:0: [sdg]
[  206.901397] Result: hostbyte=0x01 driverbyte=0x00
[  206.901400] sd 11:0:0:0: [sdg] CDB:
[  206.901403] cdb[0]=0x2a: 2a 00 02 25 1b 50 00 00 08 00
[  206.901419] end_request: I/O error, dev sdg, sector 35986256

The second time had "sd 12:0:0:0", "cdb[0]=0x28: 28 00 03 94 77 20 00 00 
08 00" and a different sector.
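
For reading those CDB dumps: both are 10-byte commands (0x2a is WRITE(10),
0x28 is READ(10)), with the logical block address in bytes 2..5 and the
transfer length in bytes 7..8, both big-endian. A minimal decoder sketch
follows; the struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of a 10-byte READ(10)/WRITE(10) CDB. */
struct rw10 {
	uint8_t opcode;		/* 0x28 = READ(10), 0x2a = WRITE(10) */
	uint32_t lba;		/* bytes 2..5, big-endian */
	uint16_t blocks;	/* bytes 7..8, big-endian */
};

static struct rw10 decode_rw10(const uint8_t cdb[10])
{
	struct rw10 r;

	r.opcode = cdb[0];
	r.lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
		((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	r.blocks = (uint16_t)(((uint16_t)cdb[7] << 8) | cdb[8]);
	return r;
}
```

For the first dump, bytes 02 25 1b 50 decode to LBA 35986256, which is the
sector number in the end_request line, and 00 08 is a transfer of 8 blocks.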


Andreas Reis

On 10.04.2014 13:37, Hannes Reinecke wrote:
> On 04/10/2014 12:58 PM, Andreas Reis wrote:
>> That patch appears to work in preventing the crashes, judged on one
>> repeated appearance of the bug.
>>
>> dmesg had the usual
>> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
>> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
>> xhci_hcd
>> [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
>> with disabled ep 880427b829c0
>> [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
>> with disabled ep 880427b82a08
>> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
>>
>> repeated five times, followed by one
>> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
>> recovery
>>
>> and then as often as something tried to read from it:
>> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
>>
>> The stick could then be properly un- and remounted (the latter if it
>> had been physically replugged) without issue - for the bug to
>> reoccur after one to three minutes. I tried this three times, no
>> dmesg difference except the ep addresses varied on two of that.
>>
> Was this just that patch you've tested with or the entire patch series?
>
> If the latter, Alan, is this the expected outcome?
> I would've thought the error recover should _not_ run into
> offlining devices here, but rather the device should be recovered
> eventually.
>
> Andreas, can you test with the entire patch series and enable
> 'scsi_logging_level -s -E 5' prior to running the tests?
>
> THX.
>
> Cheers,
>
> Hannes




Re: [PATCH] ch: add refcounting

2014-04-10 Thread Christoph Hellwig

>  static int
>  ch_release(struct inode *inode, struct file *file)
>  {
>   scsi_changer *ch = file->private_data;
>  
>   scsi_device_put(ch->device);
> + ch->device = NULL;
>   file->private_data = NULL;
> + kref_put(&ch->ref, ch_destroy);

Any reason you need to put the scsi_device here already?  Deferring
this would give you much easier lifetime rules, and no need to
deal with a NULL ch->device ever.
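
For readers unfamiliar with the pattern under discussion, here is a
minimal userspace sketch of kref-style reference counting, where the
object is freed only when the last holder drops its reference. It is not
the kernel kref API (no atomics, hypothetical model_* names), and the
"destroyed" pointer exists only so the example can observe the release.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal userspace stand-in for a kref-style counter (not the kernel API). */
struct model_ref {
	int count;
};

struct model_changer {
	struct model_ref ref;	/* must stay the first member, see below */
	int *destroyed;		/* set to 1 by the release callback */
};

static void model_ref_init(struct model_ref *r) { r->count = 1; }
static void model_ref_get(struct model_ref *r)  { r->count++; }

/* Drop one reference; call release() when the last one goes away. */
static int model_ref_put(struct model_ref *r,
			 void (*release)(struct model_ref *))
{
	if (--r->count == 0) {
		release(r);
		return 1;
	}
	return 0;
}

/* Release callback: ref is the first member, so the cast recovers ch. */
static void model_changer_destroy(struct model_ref *r)
{
	struct model_changer *ch = (struct model_changer *)r;

	*ch->destroyed = 1;
	free(ch);
}

static struct model_changer *model_changer_new(int *destroyed)
{
	struct model_changer *ch = malloc(sizeof(*ch));

	if (!ch)
		return NULL;
	model_ref_init(&ch->ref);
	ch->destroyed = destroyed;
	*destroyed = 0;
	return ch;
}
```

In this model an open file would hold one extra reference, so removing
the device while the file is open only drops the count to one, and the
object is destroyed on the final put at release time.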


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Hannes Reinecke

On 04/10/2014 12:58 PM, Andreas Reis wrote:
> That patch appears to work in preventing the crashes, judged on one
> repeated appearance of the bug.
> 
> dmesg had the usual
> [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
> [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
> xhci_hcd
> [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
> with disabled ep 880427b829c0
> [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
> with disabled ep 880427b82a08
> [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
> 
> repeated five times, followed by one
> [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
> recovery
> 
> and then as often as something tried to read from it:
> [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
> 
> The stick could then be properly un- and remounted (the latter if it
> had been physically replugged) without issue — for the bug to
> reoccur after one to three minutes. I tried this three times, no
> dmesg difference except the ep addresses varied on two of that.
> 
Was this just that patch you've tested with or the entire patch series?

If the latter, Alan, is this the expected outcome?
I would've thought the error recover should _not_ run into
offlining devices here, but rather the device should be recovered
eventually.

Andreas, can you test with the entire patch series and enable
'scsi_logging_level -s -E 5' prior to running the tests?

THX.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Andreas Reis

That patch appears to work in preventing the crashes, judged on one 
repeated appearance of the bug.


dmesg had the usual
[  215.229903] usb 4-2: usb_disable_lpm called, do nothing
[  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd
[  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b829c0
[  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b82a08
[  215.350621] usb 4-2: usb_enable_lpm called, do nothing

repeated five times, followed by one
[  282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery

and then as often as something tried to read from it:
[  295.585472] sd 8:0:0:0: rejecting I/O to offline device

The stick could then be properly unmounted and remounted (the latter once
it had been physically replugged) without issue, only for the bug to recur
after one to three minutes. I tried this three times; dmesg showed no
difference except that the ep addresses varied in two of the runs.


Andreas Reis

On 09.04.2014 20:02, Alan Stern wrote:
> On Wed, 9 Apr 2014, Hannes Reinecke wrote:
>
>>> I finally got a chance to try it out.  It does seem to do what we want.
>>> I didn't track the flow of control in complete detail, but the command
>>> definitely got aborted both times it was issued.
>>
>> Good, so it is as I thought. James, can we include this patch instead of
>> your prior solution?
>
> First, we should have the original bug reporter try it out.
>
> Andreas, the patch in question can be found here:
>
> http://marc.info/?l=linux-usb&m=13962706597&w=2
>
> Can you try this in place of the 1/3 patch posted by James?  It should
> have the same effect, of preventing your system from crashing when the
> READ command fails.
>
> Alan Stern



Re: [PATCH 3/4] blk-mq: move request structures into struct blk_mq_tags

2014-04-10 Thread Christoph Hellwig

On Wed, Apr 09, 2014 at 10:23:32AM -0600, Jens Axboe wrote:
> This should go into block/blk-mq-tag.h.

Ok.

> We might as well leave this, the mtip32xx conversion ends up using it. So 
> if we pull it now, it'll just be reintroduced shortly.

It's back in the latest revision of the patch, just taking a
struct blk_mq_tag pointer now so that it can be used by SCSI as well.

I've also changed an open-coded variant of it to use the helper.

Pointer: 
http://git.infradead.org/users/hch/scsi.git/commitdiff/b0f1ed35bbeb6d0177fc0cc0bf5c880c3c5d1817


Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Woodhouse, David

On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
> [+ David, VT-d maintainer ]
> 
> Jiang, David, can you please have a look into this issue?
> 

> > > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > > >> > > > > dmar: DRHD: handling fault status reg 602
> > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 
> > > >> > > > > 7f61e000

That "Present bit in context entry is clear" fault means that we have
not set up *any* mappings for this PCI device… on this IOMMU.
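
As a rough illustration of that fault, here is a simplified model of the
lookup an IOMMU performs for an incoming DMA request: the bus number
selects a root entry, the device/function number selects a context entry,
and the request faults if the context entry's Present bit is clear. The
types and the names requester_id() and dma_check() are hypothetical; this
is a sketch under those assumptions, not the intel-iommu driver's code or
the exact hardware layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum dma_fault {
    FAULT_NONE = 0,
    FAULT_ROOT_NOT_PRESENT = 1,     /* simplified: no context table for the bus */
    FAULT_CONTEXT_NOT_PRESENT = 2,  /* corresponds to "fault reason 02" in the log */
};

/* Simplified context entry: bit 0 of lo models the Present bit. */
struct context_entry {
    uint64_t lo;
};

/* Simplified root entry: a NULL table models a non-present root entry. */
struct root_entry {
    struct context_entry *ctx_table;  /* 256 entries, indexed by devfn */
};

/* PCI requester ID layout: bus[15:8], device[7:3], function[2:0]. */
static uint16_t requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* Walk the tables of one IOMMU unit for the given requester ID. */
static enum dma_fault dma_check(const struct root_entry *roots, uint16_t rid)
{
    const struct context_entry *table = roots[rid >> 8].ctx_table;

    if (table == NULL)
        return FAULT_ROOT_NOT_PRESENT;
    if (!(table[rid & 0xff].lo & 1))
        return FAULT_CONTEXT_NOT_PRESENT;
    return FAULT_NONE;
}
```

In this model each IOMMU unit has its own tables, so a context entry
installed in one unit's tables does nothing for a device whose requests
arrive at another unit.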

> > Yes, specifically (finally done bisecting):
> > 
> > commit 2e45528930388658603ea24d49cf52867b928d3e
> > Author: Jiang Liu 
> > Date:   Wed Feb 19 14:07:36 2014 +0800
> > 
> > iommu/vt-d: Unify the way to process DMAR device scope array

This commit is about how we decide which IOMMU a given PCI device is
attached to.

Thus, my first guess would be that we are quite happily setting up the
requested DMA maps on the *wrong* IOMMU, and then taking faults when the
device actually tries to do DMA.

However, I'm not 100% convinced of that. The fault address looks
suspiciously like a true physical address, not a virtual bus address of
the type that we'd normally allocate for a dma_map_* operation. Those
would start at 0xf000 and work downwards, typically.

Do you have 'iommu=pt' on the kernel command line? Can I see the full
dmesg as this system boots, and also a copy of the DMAR table?

We should also rate-limit DMA faults, which would avoid the lockup
failure mode. Bjorn, what should an IOMMU driver *do* when it detects
that a device is creating an endless stream of DMA faults and isn't
aborting the transaction?

I can set it to silent so that it just stops *reporting* the DMA faults
for that device... and I suppose I can re-enable them when I next see a
DMA mapping for it (although actually it'd be better to have a hook to
do that on FLR or something like that). But there must be a better
answer than that, surely? And I don't want to hack it up locally in
*one* specific IOMMU driver, any more than I have to.
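
For the reporting side only, a minimal sketch of what rate-limiting the
fault messages could look like, assuming a simple interval-plus-burst
scheme similar in spirit to the kernel's printk_ratelimited(). The
struct fault_ratelimit and fault_should_report() names are hypothetical,
and time is passed in explicitly so the logic can be exercised outside the
kernel. It does not address the open question of what to do about the
device itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device limiter state. */
struct fault_ratelimit {
    uint64_t interval;      /* window length, in arbitrary time units */
    unsigned burst;         /* reports allowed per window */
    uint64_t window_start;  /* start time of the current window */
    unsigned printed;       /* reports emitted in the current window */
    unsigned missed;        /* total reports suppressed so far */
};

/* Returns 1 if this fault should be reported, 0 if it is suppressed. */
static int fault_should_report(struct fault_ratelimit *rl, uint64_t now)
{
    /* Open a new window once the current one has expired. */
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return 1;
    }
    rl->missed++;
    return 0;
}
```

The missed counter lets a later message say how many faults were dropped,
so the information that the device is still faulting is not lost entirely.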

On a POWER system with EEH, the kernel would end up isolating the
offending device completely, and subsequently resetting it...

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation


Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Jiang Liu

Hi Baoquan,
Could you please provide the output of "lspci -"?
Is device "hpsa :03:00.0" a legacy PCI device (non-PCIe)?
This may be related to the IOMMU driver.
Thanks!
Gerry

On 2014/4/10 12:03, Bjorn Helgaas wrote:
> [+cc Joerg, iommu list]
> 
> On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso  wrote:
>> On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
>>> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
>>>> On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
>>>>> On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
>>>>>> [+linux-scsi]
>>>>>> On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
>>>>>>> On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The kernel is 3.14.0+ which is pulled just now.
>>>>>>>
>>>>>>> Cc'ing more people.
>>>>>>>
>>>>>>> While the hpsa driver appears to be involved in some way, I'm sure if
>>>>>>> this is a related issue, but as of today's pull I'm getting another
>>>>>>> problem that causes my DL980 not to come up.
>>>>>>>
>>>>>>> *Massive* amounts of:
>>>>>>>
>>>>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>>> dmar: DRHD: handling fault status reg 602
>>>>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>>>>>>
>>>>>>> Then:
>>>>>>>
>>>>>>> hpsa :03:00.0: Controller lockup detected: 0x
>>>>>>> ...
>>>>>>> Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
>>>>>>> ...
>>>>>>>
>>>>>>> Screenshot of the actual LOCKUP:
>>>>>>> http://stgolabs.net/hpsa-hard-lockup-3.14+.png
>>>>>>>
>>>>>>> While I haven't bisected, things worked fine until at least until commit
>>>>>>> 39de65aa2c3e (April 2nd).
>>>>>>>
>>>>>>> Any ideas?
>>>>>>
>>>>>> Well, it's either a DMA remapping issue or a hpsa one.  Your assertion
>>>>>> that everything worked fine until 39de65aa2c3e would tend to vindicate
>>>>>> hpsa,
>>>>
>>>> Hmm here you mean DMA, right?
>>>
>>> No, it vindicates the hpsa changes ... they don't seem to be causing
>>> problems until something goes wrong with dma remapping.
>>>
>>>>> because all the hpsa changes went in before that under
>>>>> Missing crucial info:
>>>>>
>>>>> commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
>>>>>
>>>>>> Merge: 3e75c6d b2bff6c
>>>>>> Author: Linus Torvalds
>>>>>> Date:   Tue Apr 1 18:49:04 2014 -0700
>>>>>>
>>>>>> Merge tag 'scsi-misc' of
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>>>>>>
>>>>>> can you revalidate that this commit works OK just to make sure?
>>>>
>>>> Ok so I don't see those DMA messages and system starts just fine. I'm
>>>> thinking perhaps something broke after the IO mmu stuff in commit
>>>> 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
>>>> causing the CPU stalls and just blame hpsa in the path as a side effect?
>>>>
>>>> /me goes out to try the commit.
>>>
>>> That's my guess.  The DMAR messages are DMA remapping issues caused in
>>> the IOMMU.  If I had to guess, I'd say the DMAR fault message is
>>> indicating the IOMMU is calling for a mapping address before it can
>>> satisfy the driver read request, which is causing the hang apparently in
>>> the hpsa driver.
>>>
>>> I've added linux-pci to the cc; I think they deal with iommu issues on
>>> x86.
>>
>> So that merge commit appears to be the culprit, I see both the DMA
>> messages and the lockup blaming hpsa...
> 
> My understanding so far (please correct me if I'm wrong):
> 
> 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")

Re: hpsa driver bug crack kernel down!

2014-04-10 Thread Joerg Roedel

[+ David, VT-d maintainer ]

Jiang, David, can you please have a look into this issue?

Thanks,

Joerg

On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> > [+cc Joerg, iommu list]
> > 
> > On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso  wrote:
> > > On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> > >> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> > >> > On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> > >> > > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> > >> > > > [+linux-scsi]
> > >> > > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > >> > > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > The kernel is 3.14.0+ which is pulled just now.
> > >> > > > >
> > >> > > > > Cc'ing more people.
> > >> > > > >
> > >> > > > > While the hpsa driver appears to be involved in some way, I'm 
> > >> > > > > sure if
> > >> > > > > this is a related issue, but as of today's pull I'm getting 
> > >> > > > > another
> > >> > > > > problem that causes my DL980 not to come up.
> > >> > > > >
> > >> > > > > *Massive* amounts of:
> > >> > > > >
> > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > >> > > > > dmar: DRHD: handling fault status reg 602
> > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 
> > >> > > > > 7f61e000
> > >> > > > >
> > >> > > > > Then:
> > >> > > > >
> > >> > > > > hpsa :03:00.0: Controller lockup detected: 0x
> > >> > > > > ...
> > >> > > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > >> > > > > ...
> > >> > > > >
> > >> > > > > Screenshot of the actual LOCKUP:
> > >> > > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> > >> > > > >
> > >> > > > > While I haven't bisected, things worked fine until at least 
> > >> > > > > until commit
> > >> > > > > 39de65aa2c3e (April 2nd).
> > >> > > > >
> > >> > > > > Any ideas?
> > >> > > >
> > >> > > > Well, it's either a DMA remapping issue or a hpsa one.  Your 
> > >> > > > assertion
> > >> > > > that everything worked fine until 39de65aa2c3e would tend to 
> > >> > > > vindicate
> > >> > > > hpsa,
> > >> >
> > >> > Hmm here you mean DMA, right?
> > >>
> > >> No, it vindicates the hpsa changes ... they don't seem to be causing
> > >> problems until something goes wrong with dma remapping.
> > >>
> > >> > > because all the hpsa changes went in before that under
> > >> > > Missing crucial info:
> > >> > >
> > >> > > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> > >> > >
> > >> > > > Merge: 3e75c6d b2bff6c
> > >> > > > Author: Linus Torvalds 
> > >> > > > Date:   Tue Apr 1 18:49:04 2014 -0700
> > >> > > >
> > >> > > > Merge tag 'scsi-misc' of
> > >> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > >> > > >
> > >> > > > can you revalidate that this commit works OK just to make sure?
> > >> >
> > >> > Ok so I don't see those DMA messages and system starts just fine. I'm
> > >> > thinking perhaps something broke after the IO mmu stuff in commit
> > >> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> > >> > causing the CPU stalls and just blame hpsa in the path as a side 
> > >> > effect?
> > >> >
> > >> > /me goes out to try the commit.
> > >>
> > >> That's my guess.  The DMAR messages are DMA remapping issues caused in
> > >> the IOMMU.  If I had to guess, I'd say the DMAR fault message is
> > >> indicating the IOMMU is calling for a mapping address before it can
> > >> satisfy the driver read request, which is causing the hang apparently in
> > >> the hpsa driver.
> > >>
> > >> I've added linux-pci to the cc; I think they deal with iommu issues on
> > >> x86.
> > >
> > > So that merge commit appears to be the culprit, I see both the DMA
> > > messages and the lockup blaming hpsa...
> > 
> > My understanding so far (please correct me if I'm wrong):
> > 
> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
> 
> Yes, specifically (finally done bisecting):
> 
> commit 2e45528930388658603ea24d49cf52867b928d3e
> Author: Jiang Liu 
> Date:   Wed Feb 19 14:07:36 2014 +0800
> 
> iommu/vt-d: Unify the way to process DMAR device scope array
> 
> Now we have a PCI bus notification based mechanism to update DMAR
> device scope array, we could extend the mechanism to support boot
> time initialization too, which will help to unify and simplify
> the implementation.
> 
> Signed-off-by: Jiang Liu 
> Signed-off-by: Joerg Roedel 
> 

