[Xen-devel] GPF Heisenbug with rumprun-xen

2015-02-05 Thread Ian Jackson
Ian Campbell writes (Re: [Xen-devel] [xen-4.5-testing test] 34157: regressions - FAIL):
http://www.chiark.greenend.org.uk/~xensrcts/logs/34157/ 
...
 test-amd64-amd64-rumpuserxen-amd64 11 rumpuserxen-demo-xenstorels/xenstorels fail REGR. vs. 34088

Guest console contains:

   device/vbd/768/protocol = x86_64-abi   (n2,r0)
   device/vbd/832:
   rumpxenstack:
   could not access permissions for 832: Bad file descriptor

   rumpxenstack: xs_directory (device/vbd/832): Bad file descriptor

   === ERROR: _exit(1) called ===

 It looks to be some sort of Heisenbug in the rump kernel stuff.

I agree.  We had a failure on the 16th of January which looked like
some kind of race:
 (Subject: Re: [Xen-devel] [rumpuserxen test] 33416: regressions - FAIL)


 This 
 http://www.chiark.greenend.org.uk/~xensrcts/results/history.test-amd64-amd64-rumpuserxen-amd64.html
 shows a history of random failures at the xenstorels step.

 At least the ones as far back as 33830 (the last one with logs still
 available) all show signs of what looks like memory corruption of some
 sort.

Thanks for the digging.  (I have left the quoted text in for the
benefit of rumpkernel-users.)

The first failure in that history that looks like part of this is
flight 33690.  We don't have logs for that any more but it used
  rumpuserxen 598ceb54916b
  xen 49de0b57b853
  netbsdsrc   17a547ca2943
Failure probability since then seems to be about 20%.  If I go back 10
passes from 33690 I get to 33611 which used
  rumpuserxen ffcd777f8062
  xen 0d2879062076
  netbsdsrc   a7c6b12e1752

It seems unlikely that the difference is going to be due to changes
in the versions of linux, linuxfirmware, ovmf, qemu[u] or seabios.
buildrump.sh has been 47b1a5eef43c throughout.
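
As a rough sanity check on how much weight the 10-pass window can bear,
here is a minimal sketch.  It assumes the runs are independent and that
the ~20% failure rate quoted above is about right; both are assumptions,
and the function names are made up for illustration.

```python
import math


def prob_all_pass(fail_rate: float, runs: int) -> float:
    """Chance that `runs` independent runs all pass although the bug is present."""
    return (1.0 - fail_rate) ** runs


def runs_needed(fail_rate: float, confidence: float) -> int:
    """Smallest number of consecutive passes that would occur by luck
    with probability at most (1 - confidence)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_rate))


if __name__ == "__main__":
    # Assumed inputs: 20% failure rate, 10 consecutive passes before 33690.
    print(f"P(10 passes by luck) = {prob_all_pass(0.2, 10):.3f}")
    print(f"passes needed for 95% confidence = {runs_needed(0.2, 0.95)}")
```

Under those assumptions, ten passes in a row would still happen by luck
roughly one time in nine, so 33611 is a plausible but not certain
known-good baseline.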

Ian.


 http://www.chiark.greenend.org.uk/~xensrcts/logs/33830/test-amd64-amd64-rumpuserxen-amd64/11.ts-rumpuserxen-demo-xenstorels.log
 GPF rip: 0x46ea1c, error_code=0
 Page fault at linear address 0x0, rip 0x13563, regs 0x469ff8, sp 0x46a0a8, our_sp 0x469fe0, code 0
 Page fault in pagetable walk (access to invalid memory?).
 
 http://www.chiark.greenend.org.uk/~xensrcts/logs/33846/test-amd64-amd64-rumpuserxen-amd64/11.ts-rumpuserxen-demo-xenstorels.log
 $VAR1 = {
   'theirs' = 'STUB ``__sigaction14\'\' called
 device/G:
 rumpxenstack: 
 could not access permissions for G: Invalid argument
 
 rumpxenstack: xs_directory (device/G): Invalid argument
 
 http://www.chiark.greenend.org.uk/~xensrcts/logs/33925/test-amd64-amd64-rumpuserxen-amd64/11.ts-rumpuserxen-demo-xenstorels.log
 Similar to 33846
 
 
 http://www.chiark.greenend.org.uk/~xensrcts/logs/34086/test-amd64-amd64-rumpuserxen-amd64/11.ts-rumpuserxen-demo-xenstorels.log
 rumpxenstack: 
 could not access permissions for mac: Invalid argument
 device/vif/0/mac = 5a:36:0e:26:00:05 
 rumpxenstack: xs_directory (device/vif/0/mac): Bad file descriptor
 
 http://www.chiark.greenend.org.uk/~xensrcts/logs/34127/test-amd64-amd64-rumpuserxen-amd64/11.ts-rumpuserxen-demo-xenstorels.log
 Page fault at linear address 0x256, rip 0x33e1c, regs 0x53f458, sp 0x53f500, our_sp 0x53f440, code 0
 Thread: main
 RIP: e030:[00033e1c] 
 RSP: e02b:0053f500  EFLAGS: 00010202
 RAX: 0025a927 RBX: 0246 RCX: 0053fe88
 RDX:  RSI:  RDI: 0025a8b0
 RBP: 001d55df R08: 00454091 R09: 
 R10:  R11:  R12: 0053fe88
 R13:  R14: 0025a8b0 R15: 
 base is 0x1d55df caller is 0x6e69616d20676e69
 base is 0x6c6c6163203d3d3d GPF rip: 0x13e00, error_code=0
 
 http://www.chiark.greenend.org.uk/~xensrcts/logs/34157/test-amd64-amd64-rumpuserxen-amd64/11.ts-rumpuserxen-demo-xenstorels.log
 Another bad fd, on device/vbd but otherwise similar to 34086.
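
One detail in the 34127 dump may be worth a closer look: the "caller"
and second "base" values do not look like addresses.  The sketch below
decodes them as little-endian ASCII, which is an assumption about how
they ended up in memory; decode_le is a made-up helper name.

```python
def decode_le(value: int, width: int = 8) -> str:
    """Render a 64-bit value as the ASCII text of its little-endian bytes.

    Non-printable bytes are shown as '.' so that stray binary data
    does not garble the output.
    """
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Values copied from the 34127 log excerpt above.
    for label, value in (("base", 0x6C6C6163203D3D3D),
                         ("caller", 0x6E69616D20676E69)):
        print(f"{label:6s} 0x{value:016x} -> {decode_le(value)!r}")
```

The two values decode to "=== call" and "ing main".  That would fit
console text having been written over the saved frame pointer and
return address, which is consistent with the memory-corruption reading,
though it does not by itself say what did the overwriting.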

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] GPF Heisenbug with rumprun-xen

2015-02-05 Thread Antti Kantee

On 05/02/15 15:51, Ian Jackson wrote:

It looks to be some sort of Heisenbug in the rump kernel stuff.


I agree.  We had a failure on the 16th of January which looked like
some kind of race:
  (Subject: Re: [Xen-devel] [rumpuserxen test] 33416: regressions - FAIL)


Aha!  I told you I don't believe in cosmic rays ;)




This
http://www.chiark.greenend.org.uk/~xensrcts/results/history.test-amd64-amd64-rumpuserxen-amd64.html
shows a history of random failures at the xenstorels step.

At least the ones as far back as 33830 (the last one with logs still
available) all show signs of what looks like memory corruption of some
sort.


Thanks for the digging.  (I have left the quoted text in for the
benefit of rumpkernel-users.)

The first failure in that history that looks like part of this is
flight 33690.  We don't have logs for that any more but it used
   rumpuserxen 598ceb54916b
   xen 49de0b57b853
   netbsdsrc   17a547ca2943
Failure probability since then seems to be about 20%.  If I go back 10
passes from 33690 I get to 33611 which used
   rumpuserxen ffcd777f8062
   xen 0d2879062076
   netbsdsrc   a7c6b12e1752

It seems unlikely that the difference is going to be due to changes
in the versions of linux, linuxfirmware, ovmf, qemu[u] or seabios.
buildrump.sh has been 47b1a5eef43c throughout.


The diffs for rumpuserxen and netbsdsrc between those revisions are 
luckily small.  I couldn't spot anything in there which would 
immediately look suspicious.  The most suspicious change is calling 
sched_yield() as part of the bootstrap process, but that's not very 
dramatic as far as suspicious goes.  TLS support was added, but I'm not 
sure how that would affect threads which do not use TLS.  That said, TLS 
did work right off the bat, so it is a bit suspicious ...


Is it possible that some change in xen is tickling the bug?  That would 
explain why attempts to reproduce the bug in other setups have failed. 
Is it easy to fire off runs with arbitrary revisions of each repo?
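
To make the question concrete, here is a minimal sketch of the test
matrix such runs would cover, using the revisions quoted above.  It
only enumerates the combinations; how they would actually be submitted
to the test system is not modelled, and revision_matrix is a made-up
name.

```python
from itertools import product

# (last known-good, first known-bad) revisions quoted in this thread.
REVISIONS = {
    "rumpuserxen": ("ffcd777f8062", "598ceb54916b"),
    "xen": ("0d2879062076", "49de0b57b853"),
    "netbsdsrc": ("a7c6b12e1752", "17a547ca2943"),
}


def revision_matrix(revs):
    """Return every old/new combination as a list of {repo: revision} dicts."""
    names = list(revs)
    return [dict(zip(names, combo))
            for combo in product(*(revs[name] for name in names))]


if __name__ == "__main__":
    for i, combo in enumerate(revision_matrix(REVISIONS), start=1):
        print(i, " ".join(f"{repo}={rev}" for repo, rev in combo.items()))
```

That gives 8 combinations.  With a failure rate of around 20%, each one
would presumably need a fair number of repeat runs before a clean
result means much.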

