Date: Wed, 6 Sep 2000 12:31:55 +1200
   From: Chris Wedgwood <[EMAIL PROTECTED]>

   Face it Dave -- you are just smarter than many of the rest of us.

I would actually assert that I am not, and that I know the things I do
for reasons other than "talent", and I think the best way to describe
it is hard work.

   You might not need certain debugging aids, some people _right now_
   do (or at the very least will benefit from them).

True.

   Maybe debugging aids should be excluded from the kernel for various
   reasons, I'm not commenting on that, but expecting the rest of the
   world to get smarter all of a sudden isn't very realistic.

I'm not expecting them to get smarter, I am expecting them to put in a
little bit of hard work and become more familiar with how the kernel
works.  I want people to do a little bit of this, instead of stepping
over a few instructions and examining a few variables from a debugger.

That process does not increase familiarity; it is the study of
behaviorology and _nothing_ more.  You don't come away from the
debugger experience having learned how something works; debugging
using your brain, however, does have this effect.

   Perhaps you would like to describe how you do debug the kernel? I
   ask this because I use printf more often than anything else when
   debugging userland code and I often use printk when debugging the
   kernel.

Sure, no problem, I'll describe how it usually goes.

85% of cases requiring probing the kernel for info go something like:

1) Some evidence of kernel being in an incorrect state is made visible
   to me.  Either this comes from an OOPS/etc. dump in an email, or
   someone provides a reproducible test case with which I can
   capture the dump on one of my systems.

2) Once dump is decoded, I determine what is most immediately wrong in
   the kernel at the moment the dump triggered.  I.e. I decide that
   some part of some data structure is corrupt, or that some page
   cache page had ended up being used for an inode structure, etc.

3) Once I know the nature of the immediately incorrect state, I sit and
   think about how it would be possible to arrive at that state.  I try
   to walk backwards from the point of corruption back to where it may
   have originated.

4) Once I've got a decent idea of the ways such a corruption could
   possibly happen, I begin studying the kernel looking for places
   where the necessary set of conditions might be allowed.  I check
   these specific pieces of code (and usually the surrounding parts)
   for correctness.

5) If I am truly stumped I strategically place debugging checks,
   starting at the point of corruption and working gradually
   "upwards" through the event chain.

   Because I have taken the thinking time required in step #4 I
   will not spend much time rebooting over and over; 2 or 3 reboots
   and debugging check changes should be sufficient to capture the
   information I need to pinpoint the source of the problem.

   Actually, usually the phases are "run step 5, learn something from
   what is printed, iterate redoing step 4+5 until problem is spotted"

And step 6 is reached when the true root cause of the bug is
discovered :-)
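
As a rough illustration of the "debugging checks" in step 5, here is a
minimal userspace sketch.  It is not kernel code and not taken from the
mail above: struct node, NODE_MAGIC, list_check() and the deliberately
buggy list_del_buggy() are all hypothetical names invented for the
example.  The idea is only that one consistency check, called with a
label at several points along the chain of events, tells you which
step first left the structure in a bad state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NODE_MAGIC 0x4e4f4445u   /* arbitrary marker value (assumption) */

/* Hypothetical stand-in for the structure the dump showed as corrupt. */
struct node {
    struct node *next;
    struct node *prev;
    unsigned int magic;          /* NODE_MAGIC while the node is live */
};

/*
 * Debugging check for a circular doubly linked list.
 * Returns 0 if consistent, else a code for the first violation found:
 * 1 = bad magic, 2 = broken next/prev pairing, 3 = walk did not end.
 * "where" names the call site, so the output shows how far up the
 * event chain the corruption is already present.
 */
int list_check(const struct node *head, const char *where)
{
    const struct node *n = head;
    size_t limit = 1u << 20;     /* give up instead of looping forever */

    assert(head != NULL);
    do {
        if (n->magic != NODE_MAGIC) {
            fprintf(stderr, "%s: bad magic in node %p\n",
                    where, (const void *)n);
            return 1;
        }
        if (n->next == NULL || n->next->prev != n) {
            fprintf(stderr, "%s: broken link after node %p\n",
                    where, (const void *)n);
            return 2;
        }
        n = n->next;
    } while (n != head && --limit);

    return limit ? 0 : 3;
}

/* Make n an empty circular list consisting only of itself. */
void node_init(struct node *n)
{
    n->next = n;
    n->prev = n;
    n->magic = NODE_MAGIC;
}

/* Insert n right after head. */
void list_add(struct node *head, struct node *n)
{
    n->magic = NODE_MAGIC;
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Deliberately buggy removal: forgets to update the successor's prev. */
void list_del_buggy(struct node *n)
{
    n->prev->next = n->next;
    /* missing: n->next->prev = n->prev; */
}
```

Calling list_check(&head, "after add") following each list_add() and
list_check(&head, "after del") following list_del_buggy() reports a
clean list for the first label and a broken link for the second, which
is the kind of narrowing down that a couple of reboots should give you.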

The other 15% entails situations where I code up specialized debugging
code because capturing the specific set of conditions is nontrivial
(for example, userspace seeing stale corrupt TLB translations; I have
written code for sparc64 which validates the TLB to find such bugs).
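
To give a feel for what such a validator does, here is a toy userspace
model.  It is emphatically not the sparc64 code mentioned above, which
is not shown in this mail; struct pte, struct tlb_entry and
tlb_validate() are hypothetical names, and a real validator would have
to read the hardware TLB through architecture-specific means.  The
sketch only shows the principle: every valid cached translation is
compared against the authoritative page table, and any disagreement is
reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NR_PAGES 16   /* size of the toy address space (assumption) */
#define TLB_SIZE 4    /* number of cached translations (assumption) */

/* Authoritative mapping: virtual page number -> physical frame. */
struct pte {
    int present;
    uint32_t pfn;
};

/* Cached copy of one translation. */
struct tlb_entry {
    int valid;
    uint32_t vpn;
    uint32_t pfn;
};

/*
 * Compare every valid TLB entry against the page table.
 * Returns the number of entries that are stale or corrupt, i.e. that
 * name an out-of-range page, a page that is no longer present, or a
 * frame that differs from what the page table says.
 */
size_t tlb_validate(const struct tlb_entry *tlb, size_t tlb_size,
                    const struct pte *pt, size_t nr_pages)
{
    size_t bad = 0;

    assert(tlb != NULL && pt != NULL);
    for (size_t i = 0; i < tlb_size; i++) {
        const struct tlb_entry *e = &tlb[i];

        if (!e->valid)
            continue;
        if (e->vpn >= nr_pages ||
            !pt[e->vpn].present ||
            pt[e->vpn].pfn != e->pfn) {
            fprintf(stderr, "tlb[%zu]: bad entry vpn=%u pfn=%u\n",
                    i, (unsigned)e->vpn, (unsigned)e->pfn);
            bad++;
        }
    }
    return bad;
}
```

Running a check like this at chosen points (for instance after every
unmap in the toy model) catches the bad translation at the moment it
becomes wrong, rather than much later when userspace trips over it.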

   Only, with the former, I get to restart the application every time it
   croaks, with the latter (modules excluded) I have to reboot. This is
   much more time consuming and means you really have to be much smarter
   about what checks and printk statements you put in where... the hope
   is with more intelligent debugging aids I can glean more information
   for each reboot.

While you're rebooting you can come up with a game plan for steps 4
and 5 above; this is what I do.  fsck time is "time to think".

Ever have a situation where you were totally stumped on something, you
go out and get a hamburger or something, and halfway through eating
that juicy thing you're working out in your head the problem you're
working on, and the solution just comes to you?  This is the kind of
process the above steps are meant to encourage.  Discovery is IMHO
one of the most fantastic parts of the human experience.

Now keep in mind, long ago I spent some inordinate amounts of time
sprinkling printk's all over the place.  And you can do this arbitrary
poking to find bugs regardless of how much you know about the code,
but it takes a long time.  The goal is to learn how to sit, think, and
study, instead of running immediately to the editor and trying to find
an arbitrary new spot to stick a printk.

Here's a suggestion: buy a pair of Chinese balls, and upon reboot put
some relevant source files into an editor on your screen and use your
hands to play with the Chinese balls.  This is meant to fight the urge
to just start typing, and to instead look at the code and think about
what it is doing.

After each such debugging session your brain will contain new
associations and understandings.  It's like a free kernel debugging
database that you control the power of.  Over time more bugs become of
the form "oh duh, yeah putting debugging at x or y will show me
exactly who is to blame for this".

Moreover, at the end of such a debugging session you will very
likely have discovered that the problem is a fundamental one and
can be found in other places.  Often, for the most difficult of
bugs to find, their discovery leads to the solving of _many_ bugs at
once.  This is because such bugs usually unearth an understanding
which was not apparent to anyone beforehand.

It is doing this for more than 5 years which lets me almost
immediately know where a bug is when looking at many dumps.  Very
often, I can skip step 5, and immediately work on implementing and
testing a fix.

I can also guarantee you that if you put forth the effort necessary
to debug this way, you will be almost immune to causing the same types
of bugs yourself in your own kernel code from that point forward.
While coding your brain will begin to say things to you like "oh crap,
remember that nasty XXX problem we debugged the other month, we better
watch out for that in this new stuff".

One final note.  Much like for aviro, it disturbs me to see someone
point out a problem in its immediate form, try to fix it or suggest a
fix locally, and never consider things on a larger scale.  Let me give
a concrete example: remember all the discussions about the scheduler
run queue not scaling?  The initial reactions were to make changes to
the scheduler, but _NOT ONE_ of these people said to themselves "hey
wait a second, why are so many tasks in run state at one time
anyways?".  The discovery of the answer to that question showed where
the real problem was: we were doing spurious wakeups in the TCP socket
code, and once that was fixed the "run queue scalability" issue went
away for the benchmarks the original reporter was using.
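
The mail does not show what the TCP fix looked like, so the following
is only a toy model of the general idea of a spurious wakeup, with
hypothetical names throughout (struct waiter, wake_all(),
wake_ready()).  Each sleeper needs a certain amount of data before it
can make progress.  Waking all of them whenever anything arrives puts
many tasks on the run queue that will immediately go back to sleep;
waking only those whose condition is actually met keeps the number of
runnable tasks small.

```c
#include <assert.h>
#include <stddef.h>

enum task_state { TASK_SLEEPING, TASK_RUNNING };

/* Hypothetical sleeper waiting for data to become available. */
struct waiter {
    enum task_state state;
    size_t needed;     /* bytes required before it can make progress */
};

/*
 * Wake every sleeper regardless of whether it can make progress.
 * Returns how many tasks were made runnable.
 */
size_t wake_all(struct waiter *w, size_t n)
{
    size_t woken = 0;

    assert(w != NULL);
    for (size_t i = 0; i < n; i++) {
        if (w[i].state == TASK_SLEEPING) {
            w[i].state = TASK_RUNNING;
            woken++;
        }
    }
    return woken;
}

/*
 * Wake only the sleepers whose requirement is satisfied by the
 * amount of data currently available.
 * Returns how many tasks were made runnable.
 */
size_t wake_ready(struct waiter *w, size_t n, size_t available)
{
    size_t woken = 0;

    assert(w != NULL);
    for (size_t i = 0; i < n; i++) {
        if (w[i].state == TASK_SLEEPING && w[i].needed <= available) {
            w[i].state = TASK_RUNNING;
            woken++;
        }
    }
    return woken;
}
```

With three sleepers needing 100, 4000 and 8000 bytes and 512 bytes
available, wake_all() makes all three runnable while wake_ready()
makes only the first one runnable.  In the toy model, the run queue
length is a symptom of the wakeup policy, not of the scheduler.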

Now imagine if we had just taken the report at face value and said
"yeah, fix the complexity of run queue handling in the scheduler" and
ignored the real cause of the performance problem.  I don't want this
to start happening, and automated debugging/profiling tools tend to
encourage people to operate in such a way.  I need to take
painkillers when I see it :-)

Later,
David S. Miller
[EMAIL PROTECTED]