The problem that I see is related to a STREAMS driver calling a kernel function that, in turn, calls schedule(). The LiS queue scheduler holds a spin lock on the queue while it calls the service procedure, so the call to schedule() can switch to some other process that also wants to use that queue, and that process then spins on the lock. If there are more such contenders than CPUs, you can end up with all CPUs spinning on the lock while the thread that would release the lock sits in the schedule queue.
I am in the process of fixing that one by using a semaphore rather than a spin lock to single-thread entry to service procedures.
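To illustrate the idea, here is a rough user-space sketch, not the actual LiS change. It assumes pthreads as stand-ins for kernel threads, a POSIX semaphore as a stand-in for the kernel semaphore, and sched_yield() as a stand-in for schedule(). The names struct queue, service_proc, contender and run_contenders are made up for this example. The point is that a contender that finds the semaphore busy goes to sleep instead of spinning, so the holder can always get a CPU back and release it.

```c
/* User-space analogy only; none of these names come from LiS. */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>

#define MAX_CONTENDERS 64

struct queue {              /* hypothetical stand-in for a STREAMS queue */
    sem_t svc_sem;          /* serializes entry to the service procedure */
    long  svc_runs;         /* only touched while svc_sem is held */
};

struct worker_arg {
    struct queue *q;
    int iterations;
};

/* Hypothetical service procedure that gives up the CPU part way through,
 * the way a driver does when it calls something that ends in schedule(). */
static void service_proc(struct queue *q)
{
    long seen = q->svc_runs;
    sched_yield();              /* stand-in for schedule() */
    q->svc_runs = seen + 1;     /* would lose updates if entry were not serialized */
}

static void *contender(void *p)
{
    struct worker_arg *a = p;
    assert(a != NULL && a->q != NULL);
    for (int i = 0; i < a->iterations; i++) {
        /* Sleeps while another thread is in the service procedure;
         * retry if the wait is interrupted by a signal. */
        while (sem_wait(&a->q->svc_sem) != 0)
            continue;
        service_proc(a->q);
        sem_post(&a->q->svc_sem);
    }
    return NULL;
}

/* Starts nthreads contenders on one queue and returns the number of
 * completed service runs, or -1 on bad arguments or setup failure. */
long run_contenders(int nthreads, int iterations)
{
    pthread_t tids[MAX_CONTENDERS];
    struct queue q = { .svc_runs = 0 };
    struct worker_arg arg = { &q, iterations };
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_CONTENDERS || iterations < 0)
        return -1;
    if (sem_init(&q.svc_sem, 0, 1) != 0)    /* binary semaphore, initially free */
        return -1;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, contender, &arg) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    sem_destroy(&q.svc_sem);
    return started == nthreads ? q.svc_runs : -1;
}
```

Calling run_contenders() with more threads than CPUs still completes, because waiters block rather than spin. On older C libraries this may need to be built with -pthread.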
In your case it sounds like a little bit of KGDB would go a long way towards figuring it out.
-- Dave
At 08:19 PM 2/12/2004, Matthew Gierlach wrote:
Hello Dave:
This suggested SMP patch does not appear to help with
our SMP problem. After we got through the new major/minor
clone driver issues we started the driver running and we
see the Qhead / Qtail assertion messages. Shortly thereafter
the system (RH EL 3.0) PANICs. It does not write the PANIC info
into /var/log/messages, and the screen cannot be scrolled back
to see the chain of events that precipitates the PANIC.
We're running on a SUN branded IA P4 architecture (SunFire v60x) with
hyperthreading enabled. This means Linux will see 4 CPUs, and top
displays 4 CPUs.
When we disable hyperthreading and Linux only sees 2 CPUs, the driver
and LiS run without incident.
Is there any tracing or debugging I can provide to help further
diagnose the root cause?
Thanks, Matt
On Tue, 10 Feb 2004, Dave Grothe wrote:
> I have been testing on a 4 CPU IBM x335 running Red Hat 9. I don't have a
> copy of RH EL to test with.
>
> I found a problem having to do with assignment of queue runners to
> CPUs. The following patch takes care of that problem. You might try it to
> see if it helps.
>
> The message that you saw was essentially an assertion failure. I am not
> sure that there is any way for that condition to occur unless RH EL has
> busted spin lock code. LiS might recover from the assertion failure better
> by returning from the function instead of proceeding.
>
> -- Dave
>
> Version diff for linux-mdep.c, version 2.123
> --- /tmp/sccsdiff.25102/linux-mdep.c 2004-02-10 09:41:25.000000000 -0600
> +++ /rsys/linux/LiS-2.17/head/linux-mdep.c 2004-02-09 16:24:55.000000000 -0600
> @@ -402,11 +402,11 @@
>
>
> #if defined(CONFIG_DEV)
> -#define FLF , const char *file, int line, const char *fn
> -#define FLFV FLF
> +#define FLFV const char *file, int line, const char *fn
> +#define FLF , FLFV
> #else
> -#define FLF /* nothing */
> #define FLFV void
> +#define FLF /* nothing */
> #endif
>
> static
> @@ -3847,15 +3847,7 @@
> current->policy = SCHED_FIFO ; /* real-time: run when ready */
> current->rt_priority = 50 ; /* middle value real-time priority */
> sigaddset(&MY_BLKS, SIGTERM) ; /* inhibit SIGTERM */
> -#if defined(KERNEL_2_5)
> -# if defined(__SMP__)
> set_cpus_allowed(current, 1 << cpu_id) ;
> -# else
> - /* of course this symbol is not defined unless the kernel was built w/SMP */
> -# endif
> -#elif !defined(_PPC_LIS_)
> - current->cpus_allowed = (1 << cpu_id) ; /* bind to a CPU */
> -#endif
>
> #if defined(KERNEL_2_5)
> yield() ; /* reschedule our thread */
> @@ -3900,8 +3892,9 @@
> static int msg_cnt ;
>
> if (++msg_cnt < 5)
> - printk("%s woke up running on CPU %d\n",
> - current->comm, smp_processor_id()) ;
> + printk("%s woke up running on CPU %d -- cpu_id=%d mask=0x%x\n",
> + current->comm, smp_processor_id(), cpu_id,
> + current->cpus_allowed) ;
> }
> /*
> * If there are characters queued up in need of printing, print them if
>
> At 07:41 PM 2/9/2004, Matthew Gierlach wrote:
>
> >Hi Dave:
> >
> > After the
> >
> > LiS:qenable before Qhead error:lis_qhead=c941bb80 lis_qtail=0.
> >
> > there is a kernel panic (yep, panic. RH EL PANICs instead of
> > Oopsing).
> >
> > Matt
> >
> >On Mon, 9 Feb 2004, Matthew Gierlach wrote:
> >
> > > Hi Dave:
> > >
> > > We're performing some testing of RedHat Enterprise Linux AS 3.0
> > > and LiS is failing. We're testing on SUN repackaged Intel Hardware
> > > (SunFire v60x) that appears to Linux as four CPUs: two chips with
> > > two Xeon processors inside each chip.
> > >
> > > The LiS symptom is:
> > >
> > > LiS:qenable before Qhead error:lis_qhead=c941bb80 lis_qtail=0.
> > >
> > > > This occurs when all 4 CPUs are enabled and does not occur when
> > > > only two CPUs are enabled. When hyperthreading in the BIOS is
> > > > disabled, this message is not issued by LiS. Also "noapic" has been
> > > > set in the vmlinuz image.
> > >
> > > > We see the same messages with both RH EL WS 3.0 and RH EL AS 3.0 with
> > > > 4 CPUs enabled. We thought that compiling LiS on WS did not work because
> > > > WS only supports 2 CPUs and was not providing LiS support to handle
> > > > the 3rd and 4th CPUs gracefully. So we compiled LiS on AS (which
> > > > supports up to 16 CPUs) expecting LiS to inherit the >2 SMP support
> > > > from the AS, and that does not appear to have occurred.
> > >
> > > Should LiS be compatible with > 2 CPU SMP environments?
> > >
> > > Thanks, Matt Gierlach
> > >
> > >
> > > WS Enterprise 3.0 SMP Kernel with Hyperthreading Enabled in BIOS;
> > >
> > > the system (SunFire v60x) will panic with a
> > >
> > > LiS:qenable before Qhead error:lis_qhead=c941bb80 lis_qtail=0.
> > >
> > > This happens on both WS and AS versions of RedHat Linux Enterprise 3.0.
> > >
> > >
> > >
> > >
> > >
> >
> >
> >---
> >Incoming mail is certified Virus Free.
> >Checked by AVG anti-virus system (http://www.grisoft.com).
> >Version: 6.0.577 / Virus Database: 366 - Release Date: 2/3/2004
>
>
