Re: hpsa driver bug crack kernel down!
On 04/10/14 at 04:34pm, Jiang Liu wrote:
Hi Baoquan,
Could you please help to give output of lspci -? Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)? It may have relationship with IOMMU driver.
Thanks!
Gerry

Hi,
I just saw your mail now. Do you still need the output of lspci - on my test machine? In fact, I didn't see the DMAR error related to intel vt-d issues. If the output is helpful, I can make a latest build to do this.

Thanks
Baoquan

On 2014/4/10 12:03, Bjorn Helgaas wrote:
[+cc Joerg, iommu list]

On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso davidl...@hp.com wrote:
On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
[+linux-scsi]

On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
Hi,
The kernel is 3.14.0+ which is pulled just now.

Cc'ing more people.

While the hpsa driver appears to be involved in some way, I'm sure if this is a related issue, but as of today's pull I'm getting another problem that causes my DL980 not to come up.

*Massive* amounts of:

DMAR:[fault reason 02] Present bit in context entry is clear
dmar: DRHD: handling fault status reg 602
dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

Then:

hpsa :03:00.0: Controller lockup detected: 0x
...
Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
...

Screenshot of the actual LOCKUP: http://stgolabs.net/hpsa-hard-lockup-3.14+.png

While I haven't bisected, things worked fine until at least until commit 39de65aa2c3e (April 2nd).

Any ideas?

Well, it's either a DMA remapping issue or a hpsa one. Your assertion that everything worked fine until 39de65aa2c3e would tend to vindicate hpsa,

Hmm here you mean DMA, right?

No, it vindicates the hpsa changes ... they don't seem to be causing problems until something goes wrong with dma remapping.

because all the hpsa changes went in before that under

Missing crucial info:

commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
Merge: 3e75c6d b2bff6c
Author: Linus Torvalds torva...@linux-foundation.org
Date: Tue Apr 1 18:49:04 2014 -0700

    Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

can you revalidate that this commit works OK just to make sure?

Ok so I don't see those DMA messages and system starts just fine. I'm thinking perhaps something broke after the IO mmu stuff in commit 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly causing the CPU stalls and just blame hpsa in the path as a side effect?

/me goes out to try the commit.

That's my guess. The DMAR messages are DMA remapping issues caused in the IOMMU. If I had to guess, I'd say the DMAR fault message is indicating the IOMMU is calling for a mapping address before it can satisfy the driver read request, which is causing the hang apparently in the hpsa driver. I've added linux-pci to the cc; I think they deal with iommu issues on x86.

So that merge commit appears to be the culprit, I see both the DMA messages and the lockup blaming hpsa...

My understanding so far (please correct me if I'm wrong):

39de65aa2c3e OK (Merge branch 'i2c/for-next')
1a0b6abaea78 OK (Merge tag 'scsi-misc')
3f583bc21977 BAD (Merge tag 'iommu-updates-v3.15')

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] hpsa: fix uninitialized trans_support in hpsa_put_ctlr_into_performant_mode()
This patch works for me.

Tested-by: Baoquan He b...@redhat.com

Thanks
Baoquan

On 04/10/14 at 05:17pm, scame...@beardog.cce.hp.com wrote:
Without this, you'll see a null pointer dereference in
hpsa_enter_performant_mode().

Signed-off-by: Stephen M. Cameron scame...@beardog.cce.hp.com
---
 drivers/scsi/hpsa.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 8cf4a0c..ef4dfdd 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -7463,6 +7463,10 @@ static void hpsa_put_ctlr_into_performant_mode(struct ctlr_info *h)
 	if (hpsa_simple_mode)
 		return;
 
+	trans_support = readl(&(h->cfgtable->TransportSupport));
+	if (!(trans_support & PERFORMANT_MODE))
+		return;
+
 	/* Check for I/O accelerator mode support */
 	if (trans_support & CFGTBL_Trans_io_accel1) {
 		transMethod |= CFGTBL_Trans_io_accel1 |
-- 
1.7.1
Re: hpsa driver bug crack kernel down!
On 04/10/14 at 04:34pm, Jiang Liu wrote:
Hi Baoquan,
Could you please help to give output of lspci -? Is device hpsa :03:00.0 a legacy PCI device(non-PCIe)? It may have relationship with IOMMU driver.
Thanks!
Gerry

Well, the machine the bug was reported on is an AMD machine, and it doesn't have the IOMMU problem. David saw some DMAR errors, so his should be an Intel machine that uses VT-d.