Re: [PATCH] qla2xxx: Fix crash in qla2xxx_eh_abort on bad ptr

2017-03-13 Thread Martin K. Petersen
> "Bill" == Bill Kuzeja  writes:

Hi Bill,

Bill> I've seen this issue only after a Qlogic card breaks upon
Bill> initialization (one of my test cases). After the break,
Bill> qla2x00_abort_all_cmds gets invoked. This routine has a relatively
Bill> new section introduced by these commits:

[...]

Please resubmit a v2 of this fix with a concise (~ one paragraph) patch
description. The extensive debugging write-up is fine but it should be
placed after a --- separator so it doesn't end up in the git log.

Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


RE: [PATCH] qla2xxx: Fix crash in qla2xxx_eh_abort on bad ptr

2017-03-09 Thread Madhani, Himanshu


> -Original Message-
> From: Bill Kuzeja [mailto:william.kuz...@stratus.com]
> Sent: Thursday, March 09, 2017 8:47 AM
> To: linux-scsi@vger.kernel.org
> Cc: qla2xxx-upstr...@qlogic.com; Bill Kuzeja 
> Subject: [PATCH] qla2xxx: Fix crash in qla2xxx_eh_abort on bad ptr
> 
> I've seen this issue only after a Qlogic card breaks upon initialization (one 
> of
> my test cases). After the break, qla2x00_abort_all_cmds gets invoked. This
> routine has a relatively new section introduced by these
> commits:
> 
> commit 1535aa75a3d8 ("scsi: qla2xxx: fix invalid DMA access after command
> aborts in PCI device remove") commit c733ab351243 ("scsi: qla2xxx: do not
> abort all commands in the adapter during EEH recovery") commit
> 2780f3c8f0233 ("scsi: qla2xxx: Avoid that issuing a LIP triggers a kernel 
> crash")
> 
> The section is problematic in certain cases. Here's the if statement in
> question:
> 
> if (GET_CMD_SP(sp) && !ha->flags.eeh_busy) {
>/* Get a reference to the sp and drop the lock.
> * The reference ensures this sp->done() call
> * - and not the call in qla2xxx_eh_abort() -
> * ends the SCSI command (with result 'res').
> */
> sp_get(sp);
> spin_unlock_irqrestore(>hardware_lock,flags);
> qla2xxx_eh_abort(GET_CMD_SP(sp));
> spin_lock_irqsave(>hardware_lock, flags);
> }
> 
> Now here's the panic my test provokes:
> 
> [  927.823858] Call Trace:
> [  927.581661]  qla2xxx_eh_abort+0x19/0x2b0 [qla2xxx] [  927.829269]
> [] qla2x00_abort_all_cmds+0xf6/0x14 [qla2xxx] [
> 927.845014]  []
> qla2x00_disable_board_on_pci_error+0x8f/0x160 [qla2xxx] [  927.863054]
> [] process_one_work+0x17b/0x470 [  927.875916]
> [] worker_thread+0x126/0x410 [  927.888203]
> [] ? rescuer_thread+0x460/0x460 [  927.901067]
> [] kthread+0xcf/0xe0 [  927.911823]  []
> ?kthread_create_on_node+0x140/0x140
> [  927.926224]  [] ret_from_fork+0x58/0x90 [  927.938132]
> [] ?kthread_create_on_node+0x140/0x140
> 
> We crash in qla2xxx_eh_abort trying to get the vha:
> 
>   scsi_qla_host_t *vha = shost_priv(cmd->device->host);
> 
> Closer examination of the crash shows that the value of "cmd" is 2, certainly
> not a valid pointer.
> 
> What's happening here?
> 
> The check at the top of the if-block GET_CMD_SP(sp) implicitly != NULL
> would have prevented this sort of thing. However, since sp->u.scmd.cmd is
> not *quite* null (in my crashes it's usually 2), we fall into the if-block 
> and call
> qla2xxx_eh_abort - and crash trying to get cmd.
> 
> Note that the GET_CMD_SP(sp) is doing this:
> 
> #define GET_CMD_SP(sp) (sp->u.scmd.cmd)
> 
> and is acting upon a union:
> 
> union {
> struct srb_iocb iocb_cmd;
> struct fc_bsg_job *bsg_job;
> struct srb_cmd scmd;
>   }
> 
> The address it's getting is the first element in this structure:
> 
> struct srb_cmd {
>   struct scsi_cmnd *cmd;  /* Linux SCSI command pkt */
> 
> }
> 
> However, since this is a union, the same memory can be used this way
> instead:
> 
> struct srb_iocb {
>   union {
>   struct {
>   uint16_t flags;
>   uint16_t data[2];
>   u32 iop[2];
>   } logio;
>   
> 
> Turns out, in the kernel panics I have gotten, the iocb type is logio 
> (verified
> by srb->type = SRB_LOGIN_CMD).
> 
> To further check, I looked at the logio iocb in the crash:
> 
> logio = {
>   flags = 0x2,
>   data = {0x0, 0x0}
> 
> 
> which follows since:
> 
>lio->u.logio.flags |= SRB_LOGIN_COND_PLOGI;
> 
> and
> 
> #define SRB_LOGIN_COND_PLOGI  BIT_1
> 
> In order to eliminate this crash, this patch adds an extra check to the top of
> the if statement to check whether or not sp->type == SRB_SCSI_CMD.
> If not, then don't bother doing the rest of the if-block. It doesn't look 
> like we
> should be invoking qla2xxx_eh_abort for anything other than srb_cmds
> anyways.
> 
> I thought about changing the definition of GET_CMD_SP to include this type
> check and return NULL if sp is not type SRB_SCSI_CMD - like this:
> 
> #define GET_CMD_SP(sp) ((sp->type == SRB_SCSI_CMD) ? sp->u.scmd.cmd
> : NULL)
> 
> I decided against it as there are multiple places in the code that do not 
> check
> for NULL. If you're calling GET_CMD_SP you should be dealing with an
> SRB_SCSI_CMDbut we aren't in this case. So, for this patch I went with the
> more contained and safer change.
> 
> This problem is hard to hit, but I have run into it doing negative testing 
> many
> times now (with a test specifically designed to bring it out), so I can verify
> that this fix works. My testing has been against a RHEL7 driver variant, but
> the bug and patch are equally relevant to to the upstream driver.
> 
> Fixes: 1535aa75a3d8 ("scsi: qla2xxx: fix invalid DMA access after command
> aborts in PCI device remove")