Ted Unangst wrote:
On 8/11/08, Daniel Ouellet <[EMAIL PROTECTED]> wrote:
 Any idea on how it might be possible to boot the system step by step to get
an idea of where this bug might be isolated?

The real bug is looking at load average and pretending it means anything.

Also, Ted, here is another idea. It may be totally stupid, so just give me a chance to explain it if I may. (;>

And if I don't get it right, feel free to correct me so that I understand it better; it would be welcome.


So, looking at the scheduler, it takes a lock in the kernel, something like SCHED_LOCK(s), then does things that, yes, I don't fully understand. However, while that same lock is held, it also calculates and updates the load average, and then later on does SCHED_UNLOCK(s);

That's inside schedcpu() in sched_bsd.c.
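
To make the pattern concrete, here is a minimal userland sketch in C of what I mean. It is not the actual kernel code. It assumes a pthread mutex as a stand-in for SCHED_LOCK(s)/SCHED_UNLOCK(s), a hypothetical loadavg_update() function, and a 5-second sampling interval with the matching 1-minute decay constant exp(-1/12). All of those names and numbers are assumptions for illustration only.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins; none of these names come from sched_bsd.c. */
static pthread_mutex_t sched_mtx = PTHREAD_MUTEX_INITIALIZER;
static double loadavg_1min = 0.0;	/* shared state guarded by sched_mtx */

/* exp(-5/60): decay of a 1-minute average sampled every 5 seconds (assumed). */
#define DECAY_1MIN 0.9200444146293232

/*
 * Fold one sample of the runnable-process count into the 1-minute
 * exponentially damped average, and return the updated value.
 */
static double
loadavg_update(int nrun)
{
	double snapshot;

	assert(nrun >= 0);

	pthread_mutex_lock(&sched_mtx);		/* plays the role of SCHED_LOCK(s) */
	/* Anything else touched here shares the same critical section. */
	loadavg_1min = loadavg_1min * DECAY_1MIN + nrun * (1.0 - DECAY_1MIN);
	snapshot = loadavg_1min;
	pthread_mutex_unlock(&sched_mtx);	/* plays the role of SCHED_UNLOCK(s) */

	return snapshot;
}
```

Calling loadavg_update(1) once per sample with a single runnable process moves the average toward 1.0. The point of the sketch is only that the average is updated in the same critical section as whatever else runs while the lock is held.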

Now, all else being equal, if a calculation done while the lock is held is wrong, isn't it also fair to assume that other things done during that time could be wrong too, and have other impacts?

In short, I am not saying this is a huge deal for the load average. However, because of where it is done, and because other things are done while the same lock is held, isn't it fair to say that it's possible (I am not saying there is more) that other strange things happen in that part as well, and that a lock/unlock problem in the kernel at startup causes weird problems on some systems? So finding this might well correct other issues too?

That's one way to think about it, I guess. Unless I am totally wrong, which I grant you I sure can be, fixing this error, which is not a huge deal by itself, may well correct some other issues that are not obvious and that come from something in that part of the code not being right?

In the case of FreeBSD, when this bug (let's say, for discussion's sake, a very similar issue) was fixed:

http://www.freebsd.org/cgi/query-pr.cgi?pr=65857

it also only affected the load average calculation in the kernel there, and yes, the kernels are probably too different for that to mean anything, I guess.

In the present case, however, it's done between a lock and an unlock. If something at that stage leads to corruption of data, isn't it also fair to assume that some other data may be corrupted in that same process?

Yes, this is total speculation on my part, but looking at the code, I still think it's a possibility.

Am I totally wrong to think that?

I respect your knowledge of the system WAY more than my own for sure, not even a question there, but logically, isn't what I suggest at least possible?

Something to think about, especially considering where the calculation of the load average is done, more than what the number means in this case.

Doesn't this have a valid basis to be considered?

Regards,

Daniel
