On 2017-05-17 17:01:53 [+0200], To Chad Dupuis wrote:
> On 2017-05-12 11:55:52 [-0400], Chad Dupuis wrote:
> > Ok, I believe I've found the issue here.  The machine that the test was
> > performed on had many more possible CPUs than active CPUs.  We calculate
> > which CPU to queue the work item on in bnx2fc_process_new_cqes() like this:
> > 
> > unsigned int cpu = wqe % num_possible_cpus();
> > 
> > Since not all CPUs are active, we were trying to schedule work on 
> > non-active CPUs which meant that the upper layers were never notified of 
> > the completion.  With this change:
> > 
> > diff --git a/drivers/scsi/bnx2fc/bnx2fc_hwi.c b/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > index c2288d6..6f08e43 100644
> > --- a/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > +++ b/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > @@ -1042,7 +1042,12 @@ static int bnx2fc_process_new_cqes(struct bnx2fc_rport *tgt)
> >                         /* Pending work request completion */
> >                         struct bnx2fc_work *work = NULL;
> >                         struct bnx2fc_percpu_s *fps = NULL;
> > -                       unsigned int cpu = wqe % num_possible_cpus();
> > +                       unsigned int cpu = wqe % num_active_cpus();
> > +
> > +                       /* Sanity check cpu to make sure it's online */
> > +                       if (!cpu_active(cpu))
> > +                               /* Default to CPU 0 */
> > +                               cpu = 0;
> >  
> >                         work = bnx2fc_alloc_work(tgt, wqe);
> >                         if (work) {
> > 
> > The issue is fixed.
> > 
> > Sebastian, can you add this change to your patch set?
> 
> Are you sure that you can reliably reproduce the issue and fix it with the
> patch above? Because this patch:

Oh. Okay. Now it clicked. It can fix the issue, but it is still possible
that CPU0 goes down between your check for it and schedule_work_on()
returning. Let me think of something…
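
A side note on the quoted hunk: wqe % num_active_cpus() yields an index in
0..n-1, not a CPU number, so with a sparse active mask it can name a CPU
that is not active (hence the fallback to CPU 0 in the patch). A minimal
userspace sketch of picking the n-th active CPU instead, assuming a plain
64-bit mask as a stand-in for the kernel cpumask. cpu_mask_t, mask_weight(),
mask_nth_cpu() and pick_cpu() are made-up names for illustration, not kernel
API, and the sketch does nothing about the hotplug race described above.

```c
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit n set means CPU n is active. */
typedef uint64_t cpu_mask_t;

/* Number of active CPUs in the mask (count of set bits). */
static unsigned int mask_weight(cpu_mask_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* CPU number of the idx-th active CPU (0-based), or -1 if there is none. */
static int mask_nth_cpu(cpu_mask_t mask, unsigned int idx)
{
	int cpu;

	for (cpu = 0; cpu < 64; cpu++) {
		if (!(mask & ((cpu_mask_t)1 << cpu)))
			continue;	/* CPU not active, skip it */
		if (idx == 0)
			return cpu;
		idx--;
	}
	return -1;
}

/* Map a wqe value onto an active CPU number; -1 if no CPU is active. */
static int pick_cpu(cpu_mask_t active, unsigned int wqe)
{
	unsigned int n = mask_weight(active);

	if (n == 0)
		return -1;
	return mask_nth_cpu(active, wqe % n);
}
```

With an active mask of 0x15 (CPUs 0, 2 and 4), pick_cpu(0x15, 1) returns 2,
whereas the plain modulo 1 % 3 would name CPU 1, which is not in the mask.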

Sebastian
