[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-28 Thread Gavin Heavyside
I've just reproduced this crash using the stock 3.2.0-24-39 kernel on VirtualBox on OS X (Lion). I created a 2-CPU VM using the latest VirtualBox (4.1.16 r78094), for Ubuntu 64-bit, default 8GB disk. The steps I followed were: * Install 64-bit 12.04 Server LTS, minimal install from ISO downloaded

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-25 Thread Stefan Bader
It hopefully prints out the hex content of the data structure I want to check. But that should all be in the crash console output. Apart from that just the xen version is still the same 3.4.3-2.6.18 (preserve-AD) as before. -- You received this bug notification because you are a member of Ubuntu

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-25 Thread Karl Matthias
We have the debugging kernel installed on a box that has been happily crashing for the last few days. We'll see if we can get a good debug for you from it. What would you like us to gather the next time it crashes?

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-24 Thread Gavin Heavyside
We've also seen this on the -24.38 and -24.39 kernels now: [56843.390534] BUG: unable to handle kernel NULL pointer dereference at 0010 [56843.390551] IP: [] rb_next+0x1/0x50 [56843.390566] PGD 1d20a7067 PUD 1d29a2067 PMD 0 [56843.390575] Oops: [#1] SMP [56843.390583] CPU 1 [5

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-24 Thread Karl Matthias
Excellent, thanks for that. I'll drop it on a box tomorrow (UK standard time).

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-24 Thread Stefan Bader
A first attempt to get more information can be found at http://people.canonical.com/~smb/lp999755/ (I took the older -24.37 version as that at least is confirmed to be broken). Up to now I was not able to cause a similar crash (or in fact any crash) locally. But then I am not sure how to get the

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-23 Thread Karl Matthias
Stefan, if you have a kernel build with memory dumps to help debug this, I'm happy to try it.

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-22 Thread Stefan Bader
So from the disassembly and the registers of the crash, it is clear that both variations at some point do a schedule which calls on to pick_next_task_fair(). As that calls into pick_next_entity() it can be assumed that (struct cfs_rq *)->nr_running is not 0. But then (struct cfs_rq *)->rb_leftmo

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-22 Thread Karl Matthias
Thanks, we're running tests on the 3.2.0-24.38 kernel now to see if we can get it to crash in the same way. Yes, this affected both the previous versions: -24.37 and -23.36.

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-21 Thread Stefan Bader
Writing that I noticed that the traces differ based on different kernels. One set has -24.37 and the other -23.36. So that may just cause a slightly different timing or location of certain things...

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-21 Thread Stefan Bader
Also, I see that a new kernel has just been pushed out. It cannot hurt to run tests on that one (3.2.0-24.38).

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-21 Thread Stefan Bader
The Xen version is mostly to check for correlations. So far, both times those have been reported it seemed to be a 3.4.3 version (not sure whether one can still hit other versions, but if so, it would be great to know whether those are affected the same way). I know it's not possible to know what on

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-20 Thread Gavin Heavyside
Triggered this again by running ohai in a continuous loop, took about 24 hours to occur: [18438803.627371] BUG: unable to handle kernel NULL pointer dereference at 0010 [18438803.627388] IP: [] rb_next+0x1/0x50 [18438803.627402] PGD 1d0efa067 PUD 1d232d067 PMD 0 [18438803.627411] Oo

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-19 Thread Gavin Heavyside
BTW Xen version from dmesg is: Xen version: 3.4.3-2.6.18 (preserve-AD) This is on EC2 so we have no control over this.

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-19 Thread Gavin Heavyside
I've reproduced this by running the OHAI command from the OpsCode Chef ohai gem (0.6.12) in a loop, although it took nearly 2 days before it triggered. Basically I ran `gem install ohai; while true; do ohai; done` in a screen session. The stack trace is: [18362917.357055] BUG: unable to handle k
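The reproduction loop described above can be sketched as a small shell function. This is only an illustration under stated assumptions: `repro_loop` and its iteration limit are hypothetical and not part of the original report, which ran an unbounded `while true` loop in a screen session.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): run a command
# repeatedly, up to a fixed number of iterations, and report the count.
# Usage: repro_loop MAX_ITERATIONS COMMAND [ARGS...]
repro_loop() {
    max="$1"
    shift
    count=0
    while [ "$count" -lt "$max" ]; do
        # Discard the command's output and ignore its exit status;
        # only the repeated execution matters for triggering the crash.
        "$@" >/dev/null 2>&1 || true
        count=$((count + 1))
    done
    echo "completed $count iterations"
}
```

With the ohai gem installed, `repro_loop 100000 ohai` would approximate the reported loop. Going by the reports in this thread, a crash may take anywhere from about a day to nearly two days of continuous running to appear, so a small limit is unlikely to trigger it.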

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-17 Thread Karl Matthias
Forgot to post this earlier. Re: Brad's request for crash logs. apport-cli says "No pending crash reports. Try --help for more information." ** Changed in: linux (Ubuntu) Status: Incomplete => Confirmed

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-17 Thread Karl Matthias
Here's another one if it helps: [848423.023156] BUG: unable to handle kernel NULL pointer dereference at 0010 [848423.023180] IP: [] rb_next+0x1/0x50 [848423.023194] PGD 18ad83067 PUD 18ad82067 PMD 0 [848423.023203] Oops: [#1] SMP [848423.023210] CPU 1 [848423.023213] Modules

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-17 Thread Karl Matthias
Hi Stefan, OK. Here's another stack trace: [47708.053788] BUG: unable to handle kernel NULL pointer dereference at 0010 [47708.053810] IP: [] rb_next+0x1/0x50 [47708.053824] PGD 1d0b83067 PUD 1d0e64067 PMD 0 [47708.053833] Oops: [#1] SMP [47708.053840] CPU 1 [47708.053843] Mo

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-16 Thread Stefan Bader
The crash itself seems to point at a broken task structure tree while scheduling during a pipe read. Though this is rather the proximate cause, not the underlying reason. Right now, nothing obvious stands out. There were a few Xen related patches waiting to come in through stable but I am not sure those would help here. Meanwh

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-15 Thread Joseph Salisbury
** Changed in: linux (Ubuntu) Importance: Undecided => Medium

[Bug 999755] Re: Kernel crash on EC2 m1.large instances

2012-05-15 Thread Karl Matthias
-- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/999755 Title: Kernel crash on EC2 m1.large instances To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/