David Gwynne wrote:
what is the bug you're able to reproduce?

I posted it on misc@, replied to an email about the same hardware problem on tech@, and opened a bug report on it as well.

But the short story is that, using the amd64.mp kernel on a Sun X4100 M2, I can crash the box at will by writing to the SAS drive at high speed, and ONLY when writing to it at high speed.

The box will crash and reboot with no debug output or anything, just a crash and reboot, and I can reproduce that at will.

Again, that ONLY happens with amd64.mp. The single-processor amd64 kernel does not suffer from this problem, and neither the i386 single-processor nor the i386 mp kernel has it either.

A very simple way to crash the box right away, no matter what, is to do what I put in the bug report:

dd if=/dev/zero of=/var/test bs=1m count=1000

and just watch it crash in less than a second.

Now if you disable the USB virtual SCSI CD-ROM in the BIOS, you may get maybe 3 seconds before the crash. I haven't tracked down why that is yet, only that there has to be something in common between the USB code and the driver for the SAS drive in the amd64.mp kernel, and I am trying to isolate that as much as possible.

Right now, I am actually trying to isolate the exact transfer speed that will crash the box.

The way I do this is very simple.

scp -l xxx /tmp/test [EMAIL PROTECTED]:/var/test

For the test I use a 10GB file just to be sure I don't run out of data.

Depending on how high I set the -l limit in scp, it will either crash right away, meaning in less than 5 seconds, or go to the end of the transfer without a problem.

I do this at will.

At what exact speed, or interrupt level, that happens is what I am trying to isolate.
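
The search over -l values described above could be automated as a bisection between a limit known to complete and one known to crash. This is only a rough sketch, not something taken from the bug report: the function names, the TARGET, SRC and REBOOT_WAIT variables, and the keepalive settings are all assumptions for illustration. scp -l takes its limit in Kbit/s.

```shell
# Hypothetical helper: returns 0 if a rate-limited transfer completes,
# non-zero if it fails (for example because the target crashed).
# TARGET and SRC are placeholders, not values from the original report.
# The ServerAlive options make scp give up soon after the peer goes away.
try_limit() {
    scp -o ServerAliveInterval=5 -o ServerAliveCountMax=2 \
        -l "$1" "${SRC:-/tmp/test}" "${TARGET:?set TARGET to user@host}:/var/test"
}

# Bisect between a limit that completes (lo) and one that crashes (hi).
# Both are in Kbit/s. Stops once the bracket is no wider than step and
# prints the final "lo hi" pair.
find_threshold() {
    lo=$1 hi=$2 step=${3:-1000}
    while [ $((hi - lo)) -gt "$step" ]; do
        mid=$(( (lo + hi) / 2 ))
        if try_limit "$mid"; then
            lo=$mid     # transfer survived: the threshold is above mid
        else
            hi=$mid     # transfer died: the threshold is at or below mid
            # assumed pause so the box can reboot before the next attempt
            sleep "${REBOOT_WAIT:-300}"
        fi
    done
    echo "$lo $hi"
}
```

With TARGET set, something like `find_threshold 10000 100000` would narrow the crashing rate down to a bracket about 1000 Kbit/s wide. It assumes the crash depends only on the rate, so a run that fails for an unrelated reason (a network glitch, say) would skew the result.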

I can only say for sure that it has to be in the driver for the SAS drive and related to the mp kernel. That's why I asked what differences there could be, so that I can look in the code and try to isolate this for good.

It really annoys me so badly that I want to find it. Will I be able to? Not sure, as the kernel is over my head, but I am giving it a shot anyway.

Right now I am continuing to narrow it down as much as I can.

I am much closer than I was two months ago. Progress is slow, but I think I am getting much closer now.

I would just love a few more ideas to try and, if at all possible, some pointers to where in the code it might be, though I may not yet have pinpointed it closely enough to be sure which section to look at in more detail.

I guess it is a battle between the server and me, and hell, I don't want the server to win, not yet anyway.

I gave up three months ago on using amd64 on these Sun servers, as they are simply not reliable one bit under load, yet very good when i386 is used. However, it still bugs me so badly that I sure as hell would love to find the bug. I have been digging into this for so long, and done so many different tests and kernel compiles, that not getting the final word would make me pretty mad in the end. I just have to recognize the limits of my understanding of the kernel code at this point. It's not like it's only a few thousand lines of code, for sure.

But I am not willing to give up yet!

Best,

Daniel
