Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-06-06 Thread M A Young
On Mon, 1 Jun 2015, M A Young wrote:

> On Mon, 1 Jun 2015, Jan Beulich wrote:
> 
> > >>> On 31.05.15 at 00:43,  wrote:
> > > On 30/05/2015 23:07, M A Young wrote:
> > >> On Fri, 29 May 2015, Andrew Cooper wrote:
> > >>> FC22 is miscompiling the C to:
> > >>>
> > >>> struct page_info *page = mfn_to_page(mfn);
> > >>> struct domain *owner = page_get_owner_and_reference(page);
> > >>> if ( owner )
> > >>> put_page(mfn_to_page(0));
> > >>>
> > >>> which is wrong, and why free_domheap_pages() does legitimately complain
> > >>> about the wonky refcount.
> > >> With a bit of experimentation I have found that compiling with the 
> > >> -fno-caller-saves flag gets this code segment back to the Fedora 21 
> > >> version, thus avoiding the bug.
> > > 
> > > After sending this email, I wondered whether the optimiser was assuming
> > > that %rdi was preserved.  Indeed, it turns out that the generated code
> > > for page_get_owner_and_reference leaves %rdi unmodified, and safe for
> > > reuse after return.
> > > 
> > > If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
> > > still contains the correct result of the original calculation.
> > 
> > And %r8 is known to be preserved too?
> > 
> > > Therefore, I suspect that the bug is in the -fcaller-saves optimisation
> > > code.
> > 
> > I suppose together with us allowing it to do such for global functions
> > by marking everything hidden (i.e. something possibly not seeing much
> > testing).
> > 
> > Questions now are:
> > 1) Was a bug against gcc opened already?
> > 2) What do we do about it? Working around the issue by setting
> > -fno-caller-saves seems awkward, as we'd likely have nothing but
> > the gcc version to tie this to. And considering distros carry their
> > own patch sets, the version alone may not even be enough. (I
> > didn't see any reports against our tip facing a similar issue despite
> > it being built with gcc 5 now too afaik.)
> 
> There is a Fedora bug on this
> https://bugzilla.redhat.com/show_bug.cgi?id=1219197
> which I updated and reassigned to gcc yesterday.

The Fedora gcc maintainer has now filed an upstream bug which is 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=66444

Michael Young

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-06-01 Thread M A Young
On Mon, 1 Jun 2015, Jan Beulich wrote:

> >>> On 31.05.15 at 00:43,  wrote:
> > On 30/05/2015 23:07, M A Young wrote:
> >> On Fri, 29 May 2015, Andrew Cooper wrote:
> >>> FC22 is miscompiling the C to:
> >>>
> >>> struct page_info *page = mfn_to_page(mfn);
> >>> struct domain *owner = page_get_owner_and_reference(page);
> >>> if ( owner )
> >>> put_page(mfn_to_page(0));
> >>>
> >>> which is wrong, and why free_domheap_pages() does legitimately complain
> >>> about the wonky refcount.
> >> With a bit of experimentation I have found that compiling with the 
> >> -fno-caller-saves flag gets this code segment back to the Fedora 21 
> >> version, thus avoiding the bug.
> > 
> > After sending this email, I wondered whether the optimiser was assuming
> > that %rdi was preserved.  Indeed, it turns out that the generated code
> > for page_get_owner_and_reference leaves %rdi unmodified, and safe for
> > reuse after return.
> > 
> > If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
> > still contains the correct result of the original calculation.
> 
> And %r8 is known to be preserved too?
> 
> > Therefore, I suspect that the bug is in the -fcaller-saves optimisation
> > code.
> 
> I suppose together with us allowing it to do such for global functions
> by marking everything hidden (i.e. something possibly not seeing much
> testing).
> 
> Questions now are:
> 1) Was a bug against gcc opened already?
> 2) What do we do about it? Working around the issue by setting
> -fno-caller-saves seems awkward, as we'd likely have nothing but
> the gcc version to tie this to. And considering distros carry their
> own patch sets, the version alone may not even be enough. (I
> didn't see any reports against our tip facing a similar issue despite
> it being built with gcc 5 now too afaik.)

There is a Fedora bug on this
https://bugzilla.redhat.com/show_bug.cgi?id=1219197
which I updated and reassigned to gcc yesterday.

Michael Young



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-06-01 Thread Jan Beulich
>>> On 31.05.15 at 00:43,  wrote:
> On 30/05/2015 23:07, M A Young wrote:
>> On Fri, 29 May 2015, Andrew Cooper wrote:
>>> FC22 is miscompiling the C to:
>>>
>>> struct page_info *page = mfn_to_page(mfn);
>>> struct domain *owner = page_get_owner_and_reference(page);
>>> if ( owner )
>>> put_page(mfn_to_page(0));
>>>
>>> which is wrong, and why free_domheap_pages() does legitimately complain
>>> about the wonky refcount.
>> With a bit of experimentation I have found that compiling with the 
>> -fno-caller-saves flag gets this code segment back to the Fedora 21 
>> version, thus avoiding the bug.
> 
> After sending this email, I wondered whether the optimiser was assuming
> that %rdi was preserved.  Indeed, it turns out that the generated code
> for page_get_owner_and_reference leaves %rdi unmodified, and safe for
> reuse after return.
> 
> If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
> still contains the correct result of the original calculation.

And %r8 is known to be preserved too?

> Therefore, I suspect that the bug is in the -fcaller-saves optimisation
> code.

I suppose together with us allowing it to do such for global functions
by marking everything hidden (i.e. something possibly not seeing much
testing).

Questions now are:
1) Was a bug against gcc opened already?
2) What do we do about it? Working around the issue by setting
-fno-caller-saves seems awkward, as we'd likely have nothing but
the gcc version to tie this to. And considering distros carry their
own patch sets, the version alone may not even be enough. (I
didn't see any reports against our tip facing a similar issue despite
it being built with gcc 5 now too afaik.)
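
To make the second point concrete: the only version information the compiler itself exposes at build time is the __GNUC__ family of predefined macros. A minimal sketch follows, with hypothetical helper names, and assuming (the thread does not establish this) that every 5.x release is treated as suspect:

```c
/* Encode a GCC version as major*10000 + minor*100 + patchlevel. */
static int encode_version(int major, int minor, int patch)
{
    return major * 10000 + minor * 100 + patch;
}

/* Version of the compiler building this file, or 0 if it is not GCC. */
static int building_compiler_version(void)
{
#if defined(__GNUC__) && !defined(__clang__)
    return encode_version(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#else
    return 0;
#endif
}

/*
 * Assumption for illustration only: treat any 5.x compiler as suspect.
 * Which releases are really affected is not known from this thread.
 */
static int version_is_suspect(int version)
{
    return version >= 50000 && version < 60000;
}
```

A distro compiler carrying its own patches reports the same numbers as an unpatched one, so a check of this kind cannot tell the two apart.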

Jan




Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-30 Thread Andrew Cooper
On 30/05/2015 23:07, M A Young wrote:
> On Fri, 29 May 2015, Andrew Cooper wrote:
>
>> On 29/05/15 12:17, M A Young wrote:
>>>>> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't
>>>>> boot for me, but if I replace xen.gz with one from the same code built on
>>>>> Fedora 21 (gcc4) then it does boot. There are rpms and build logs
>>>>> available via
>>>>> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
>>>>> if anyone else wants to do some testing.
>>>>>
>>>>> Michael Young
>>>> Do you have easy access to xen-syms from each build?
>>> Yes.
>>>
>> Thank you very much.
>>
>> GCC 5 is indeed miscompiling the code. Comparing the fc21 vs fc22 builds:
>>
>> The C snippet from mmio_ro_do_page_fault():
>>
>> struct page_info *page = mfn_to_page(mfn);
>> struct domain *owner = page_get_owner_and_reference(page);
>> if ( owner )
>> put_page(page);
>>
>> In fc21 is:
>>
>> movabs $0x82e0,%rbp
>> shr    %cl,%rax
>> or     %rdx,%rax
>> shl    $0x5,%rax
>> add    %rax,%rbp
>> mov    %rbp,%rdi
>> callq  82d080186900
>> test   %rax,%rax
>> mov    %rax,%r12
>> je     82d080189c4e
>> mov    %rbp,%rdi
>> callq  82d080188ec0
>>
>> and in fc22 is:
>>
>> movabs $0x82e0,%r8
>> shr    %cl,%rax
>> or     %rdx,%rax
>> shl    $0x5,%rax
>> lea    (%r8,%rax,1),%rdi
>> callq  82d0801874f0
>> test   %rax,%rax
>> mov    %rax,%rbp
>> je     82d08018ca14
>> mov    %r8,%rdi
>> callq  82d080189a90
>>
>> "lea (%r8,%rax,1),%rdi" in FC22 is slightly shorter than "add %rax,%rbp;
>> mov %rbp,%rdi" in FC21.  In both cases %rdi is now 'page' from the C
>> snippet.
>>
>> In FC21, the result is stored in %rbp, then reloaded from %rbp into %rdi
>> for call to put_page().
>>
>> However, in FC22, the result of the calculation is only held in %rdi,
>> and clobbered by the call to page_get_owner_and_reference().  When it
>> comes to call put_page(), %r8 is reloaded, which is still a pointer to
>> the base of the frametable, not the page we actually took a reference on.
>>
>> FC22 is miscompiling the C to:
>>
>> struct page_info *page = mfn_to_page(mfn);
>> struct domain *owner = page_get_owner_and_reference(page);
>> if ( owner )
>> put_page(mfn_to_page(0));
>>
>> which is wrong, and why free_domheap_pages() does legitimately complain
>> about the wonky refcount.
> With a bit of experimentation I have found that compiling with the 
> -fno-caller-saves flag gets this code segment back to the Fedora 21 
> version, thus avoiding the bug.

After sending this email, I wondered whether the optimiser was assuming
that %rdi was preserved.  Indeed, it turns out that the generated code
for page_get_owner_and_reference leaves %rdi unmodified, and safe for
reuse after return.

If the 'mov %r8,%rdi' were simply omitted, the code would work, as %rdi
still contains the correct result of the original calculation.

Therefore, I suspect that the bug is in the -fcaller-saves optimisation
code.

~Andrew



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-30 Thread M A Young
On Fri, 29 May 2015, Andrew Cooper wrote:

> On 29/05/15 12:17, M A Young wrote:
> >
> >>> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
> >>> boot for me, but if I replace xen.gz with one from the same code built on 
> >>> Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
> >>> available via 
> >>> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
> >>> if anyone else wants to do some testing.
> >>>
> >>>   Michael Young
> >> Do you have easy access to xen-syms from each build?
> > Yes.
> >
> 
> Thank you very much.
> 
> GCC 5 is indeed miscompiling the code. Comparing the fc21 vs fc22 builds:
> 
> The C snippet from mmio_ro_do_page_fault():
> 
> struct page_info *page = mfn_to_page(mfn);
> struct domain *owner = page_get_owner_and_reference(page);
> if ( owner )
> put_page(page);
> 
> In fc21 is:
> 
> movabs $0x82e0,%rbp
> shr    %cl,%rax
> or     %rdx,%rax
> shl    $0x5,%rax
> add    %rax,%rbp
> mov    %rbp,%rdi
> callq  82d080186900
> test   %rax,%rax
> mov    %rax,%r12
> je     82d080189c4e
> mov    %rbp,%rdi
> callq  82d080188ec0
> 
> and in fc22 is:
> 
> movabs $0x82e0,%r8
> shr    %cl,%rax
> or     %rdx,%rax
> shl    $0x5,%rax
> lea    (%r8,%rax,1),%rdi
> callq  82d0801874f0
> test   %rax,%rax
> mov    %rax,%rbp
> je     82d08018ca14
> mov    %r8,%rdi
> callq  82d080189a90
> 
> "lea (%r8,%rax,1),%rdi" in FC22 is slightly shorter than "add %rax,%rbp;
> mov %rbp,%rdi" in FC21.  In both cases %rdi is now 'page' from the C
> snippet.
> 
> In FC21, the result is stored in %rbp, then reloaded from %rbp into %rdi
> for call to put_page().
> 
> However, in FC22, the result of the calculation is only held in %rdi,
> and clobbered by the call to page_get_owner_and_reference().  When it
> comes to call put_page(), %r8 is reloaded, which is still a pointer to
> the base of the frametable, not the page we actually took a reference on.
> 
> FC22 is miscompiling the C to:
> 
> struct page_info *page = mfn_to_page(mfn);
> struct domain *owner = page_get_owner_and_reference(page);
> if ( owner )
> put_page(mfn_to_page(0));
> 
> which is wrong, and why free_domheap_pages() does legitimately complain
> about the wonky refcount.

With a bit of experimentation I have found that compiling with the 
-fno-caller-saves flag gets this code segment back to the Fedora 21 
version, thus avoiding the bug.
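
One narrower way to experiment with the same flag is GCC's optimize function attribute, which applies -fno-caller-saves to a single function rather than the whole build. This was not tested in this thread; it is sketched here only as an assumption. The macro and function names below are hypothetical, and the GCC manual cautions that this attribute is meant for debugging rather than production use.

```c
/* Apply -fno-caller-saves per function on GCC; expand to nothing elsewhere. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_CALLER_SAVES __attribute__((optimize("no-caller-saves")))
#else
#define NO_CALLER_SAVES
#endif

/* Hypothetical callee standing in for a "take a reference" helper. */
static int bump(int *counter)
{
    return ++*counter;
}

/* The pointer 'slot' must stay live across the call to bump(). */
NO_CALLER_SAVES
static int get_then_put(int *counters, int idx)
{
    int *slot = &counters[idx];

    if (bump(slot))
        --*slot;        /* must hit the same slot that was bumped */
    return *slot;
}
```

Comparing the disassembly of such a function with and without the attribute is a quick way to see whether the flag changes how the value is kept across the call.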

Michael Young



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-29 Thread Andrew Cooper
On 29/05/15 12:17, M A Young wrote:
>
>>> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
>>> boot for me, but if I replace xen.gz with one from the same code built on 
>>> Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
>>> available via 
>>> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
>>> if anyone else wants to do some testing.
>>>
>>> Michael Young
>> Do you have easy access to xen-syms from each build?
> Yes.
>

Thank you very much.

GCC 5 is indeed miscompiling the code. Comparing the fc21 vs fc22 builds:

The C snippet from mmio_ro_do_page_fault():

struct page_info *page = mfn_to_page(mfn);
struct domain *owner = page_get_owner_and_reference(page);
if ( owner )
put_page(page);

In fc21 is:

movabs $0x82e0,%rbp
shr    %cl,%rax
or     %rdx,%rax
shl    $0x5,%rax
add    %rax,%rbp
mov    %rbp,%rdi
callq  82d080186900
test   %rax,%rax
mov    %rax,%r12
je     82d080189c4e
mov    %rbp,%rdi
callq  82d080188ec0

and in fc22 is:

movabs $0x82e0,%r8
shr    %cl,%rax
or     %rdx,%rax
shl    $0x5,%rax
lea    (%r8,%rax,1),%rdi
callq  82d0801874f0
test   %rax,%rax
mov    %rax,%rbp
je     82d08018ca14
mov    %r8,%rdi
callq  82d080189a90

"lea (%r8,%rax,1),%rdi" in FC22 is slightly shorter than "add %rax,%rbp;
mov %rbp,%rdi" in FC21.  In both cases %rdi is now 'page' from the C
snippet.
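
The shl $0x5 followed by the add or lea amounts to indexing an array of 32-byte elements from a base pointer. A minimal sketch of that arithmetic is below. The struct is a hypothetical 32-byte stand-in, not Xen's real struct page_info, and page_from_index is an illustrative name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: only the 32-byte size matters for this sketch. */
struct page_info {
    uint64_t words[4];
};

/* A shift by 5 is only a valid array index scale for 32-byte elements. */
static_assert(sizeof(struct page_info) == 32, "element size must match shl $0x5");

/* Equivalent of "shl $0x5,%rax; lea (%r8,%rax,1),%rdi": base + (idx << 5). */
static struct page_info *page_from_index(struct page_info *frametable,
                                         uint64_t idx)
{
    return (struct page_info *)((uintptr_t)frametable + (idx << 5));
}
```

With idx equal to 0 this returns the base pointer itself, which is the value left in %r8 in the FC22 code.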

In FC21, the result is stored in %rbp, then reloaded from %rbp into %rdi
for call to put_page().

However, in FC22, the result of the calculation is only held in %rdi,
and clobbered by the call to page_get_owner_and_reference().  When it
comes to call put_page(), %r8 is reloaded, which is still a pointer to
the base of the frametable, not the page we actually took a reference on.

FC22 is miscompiling the C to:

struct page_info *page = mfn_to_page(mfn);
struct domain *owner = page_get_owner_and_reference(page);
if ( owner )
put_page(mfn_to_page(0));

which is wrong, and why free_domheap_pages() does legitimately complain
about the wonky refcount.
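
The effect on the reference counts can be modelled with a small sketch. It is a heavily simplified assumption: one counter per page, hypothetical names throughout, and none of Xen's real count_info encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a page: an owner flag and one counter. */
struct page_model {
    int owned;     /* non-zero if the page has an owner */
    int refcount;  /* general reference count */
};

/* Take a reference only if the page is owned and in use; 1 on success. */
static int get_ref(struct page_model *pg)
{
    if (!pg->owned || pg->refcount == 0)
        return 0;
    pg->refcount++;
    return 1;
}

static void put_ref(struct page_model *pg)
{
    assert(pg->refcount > 0);
    pg->refcount--;
}

/* What the C source asks for: get and put on the same page. */
static void balanced(struct page_model *table, size_t idx)
{
    struct page_model *page = &table[idx];

    if (get_ref(page))
        put_ref(page);
}

/* What the FC22 binary effectively does: get on table[idx], put on table[0]. */
static void miscompiled(struct page_model *table, size_t idx)
{
    struct page_model *page = &table[idx];

    if (get_ref(page))
        put_ref(&table[0]);
}
```

Each pass through miscompiled() leaks one reference on the faulting page and drops one from the first frametable entry, so repeated faults eventually drive that entry's count to zero while it is still in use.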

~Andrew



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-29 Thread M A Young
On Fri, 29 May 2015, Andrew Cooper wrote:

> On 29/05/15 11:50, M A Young wrote:
> > On Fri, 29 May 2015, Andrew Cooper wrote:
> >
> > >> Are you in a position to compile identical Xen 4.5 source with two
> > >> different versions of gcc?  (The current staging-4.5 branch even has the
> > >> gcc5 build fix in.)
> > >>
> > >> If it is a gcc compiler bug, we would expect the version compiled with gcc
> > >> 4.9 to work fine, but the one compiled with 5 to fail in the identified
> > >> manner.
> > I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
> > boot for me, but if I replace xen.gz with one from the same code built on 
> > Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
> > available via 
> > http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
> > if anyone else wants to do some testing.
> >
> > Michael Young
> 
> Do you have easy access to xen-syms from each build?

Yes.

Michael Young



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-29 Thread Andrew Cooper
On 29/05/15 11:50, M A Young wrote:
> On Fri, 29 May 2015, Andrew Cooper wrote:
>
>> Are you in a position to compile identical Xen 4.5 source with two different
>> versions of gcc?  (The current staging-4.5 branch even has the gcc5 build fix
>> in.)
>>
>> If it is a gcc compiler bug, we would expect the version compiled with gcc
>> 4.9 to work fine, but the one compiled with 5 to fail in the identified
>> manner.
> I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
> boot for me, but if I replace xen.gz with one from the same code built on 
> Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
> available via 
> http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
> if anyone else wants to do some testing.
>
>   Michael Young

Do you have easy access to xen-syms from each build?

~Andrew



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-29 Thread M A Young
On Fri, 29 May 2015, Andrew Cooper wrote:

> Are you in a position to compile identical Xen 4.5 source with two different
> versions of gcc?  (The current staging-4.5 branch even has the gcc5 build fix
> in.)
> 
> If it is a gcc compiler bug, we would expect the version compiled with gcc
> 4.9 to work fine, but the one compiled with 5 to fail in the identified
> manner.

I did a bit of testing - xen-4.5.1-rc1 built on Fedora 22 (gcc5) doesn't 
boot for me, but if I replace xen.gz with one from the same code built on 
Fedora 21 (gcc4) then it does boot. There are rpms and build logs 
available via 
http://copr.fedoraproject.org/coprs/myoung/xentest/build/93366/
if anyone else wants to do some testing.

Michael Young


Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-29 Thread Andrew Cooper
On 29/05/15 07:24, Jason Fritcher wrote:
> On Wed, 20 May 2015, Major Hayden wrote:
>
> > On 05/20/2015 05:41 AM, Jan Beulich wrote:
> > > Considering that no-one else is seeing this - is this perhaps connected
> > > to you building Xen with pre-release gcc 5.0.1? This is also because in
> > > order for the above to indeed occur, mmio_ro_do_page_fault()'s
> > > put_page() would need to drop the last reference of a page, yet
> > > page_get_owner_and_reference() doesn't obtain a reference when
> > > a page is unallocated (and hence unowned), i.e. normally a page
> > > would have a refcount of at least 2 here. Hence this would be
> > > possible only due to a race, but the exact same race to be observed
> > > on different hardware _and_ under an emulator is extremely unlikely.
>
> You could try with the xen.gz file from
> https://copr-be.cloud.fedoraproject.org/results/myoung/xentest/fedora-21-x86_64/xen-4.5.1-0.rc1.fc21/xen-hypervisor-4.5.1-0.rc1.fc21.x86_64.rpm
> It is roughly the same version of xen but built against Fedora 21 and gcc
> 4.9.2. If that works then it probably is gcc 5.
>
> Greetings,
>
> I have run into pretty much the same issue as the original poster.
>
> I am running a recently updated Arch Linux system, with GCC 5.1.0,
> using UEFI and gummiboot to boot. I currently have a build of Xen
> 4.4.1, built with GCC 4.9.2 from before my last update, that is
> functioning correctly on this machine. But the builds of Xen 4.5.0,
> using GCC 5 and mingw64-binutils for the EFI binary, are all failing
> when Xen starts the Linux kernel, with the same error mentioned in the
> subject. Below is the boot log I captured via the serial port.
>
> http://pastebin.com/bBC78306
>
> Wondering if my specific toolchain was the issue, I downloaded the
> Fedora 22 version of xen-hypervisor and installed its EFI Xen binary
> over my compiled binary and received an identical error message, with
> slightly different addresses in the panic dump. The Fedora version was
> compiled with GCC 5.0.1. Below is the boot log I captured from that
> binary.
>
> http://pastebin.com/jvg1JazC
>
> After finding this thread, and specifically, the quoted message above,
> I downloaded that xen-hypervisor package and installed its EFI Xen
> binary. That binary boots successfully, as seen by the captured boot
> log below.
>
> http://pastebin.com/DKxwaU2U
>
> So, while I’m not familiar enough with Xen to begin to have an idea of
> what could possibly be wrong with Xen or GCC 5 to be causing this bug,
> I’d like to do what I can to track down the issue so I can get a
> working build of Xen 4.5. :)

Are you in a position to compile identical Xen 4.5 source with two
different versions of gcc?  (The current staging-4.5 branch even has the
gcc5 build fix in.)

If it is a gcc compiler bug, we would expect the version compiled with
gcc 4.9 to work fine, but the one compiled with 5 to fail in the
identified manner.

~Andrew


Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-28 Thread Jason Fritcher
On Wed, 20 May 2015, Major Hayden wrote:

> On 05/20/2015 05:41 AM, Jan Beulich wrote:
> > Considering that no-one else is seeing this - is this perhaps connected
> > to you building Xen with pre-release gcc 5.0.1? This is also because in
> > order for the above to indeed occur, mmio_ro_do_page_fault()'s
> > put_page() would need to drop the last reference of a page, yet
> > page_get_owner_and_reference() doesn't obtain a reference when
> > a page is unallocated (and hence unowned), i.e. normally a page
> > would have a refcount of at least 2 here. Hence this would be
> > possible only due to a race, but the exact same race to be observed
> > on different hardware _and_ under an emulator is extremely unlikely.

You could try with the xen.gz file from 
https://copr-be.cloud.fedoraproject.org/results/myoung/xentest/fedora-21-x86_64/xen-4.5.1-0.rc1.fc21/xen-hypervisor-4.5.1-0.rc1.fc21.x86_64.rpm
It is roughly the same version of xen but built against Fedora 21 and gcc 
4.9.2. If that works then it probably is gcc 5.

Greetings,

I have run into pretty much the same issue as the original poster.

I am running a recently updated Arch Linux system, with GCC 5.1.0, using UEFI 
and gummiboot to boot. I currently have a build of Xen 4.4.1, built with GCC 
4.9.2 from before my last update, that is functioning correctly on this 
machine. But the builds of Xen 4.5.0, using GCC 5 and mingw64-binutils for the 
EFI binary, are all failing when Xen starts the Linux kernel, with the same 
error mentioned in the subject. Below is the boot log I captured via the serial 
port.

http://pastebin.com/bBC78306

Wondering if my specific toolchain was the issue, I downloaded the Fedora 22 
version of xen-hypervisor and installed its EFI Xen binary over my compiled 
binary and received an identical error message, with slightly different 
addresses in the panic dump. The Fedora version was compiled with GCC 5.0.1. 
Below is the boot log I captured from that binary.

http://pastebin.com/jvg1JazC 

After finding this thread, and specifically, the quoted message above, I 
downloaded that xen-hypervisor package and installed its EFI Xen binary. That 
binary boots successfully, as seen by the captured boot log below.

http://pastebin.com/DKxwaU2U

So, while I’m not familiar enough with Xen to begin to have an idea of what 
could possibly be wrong with Xen or GCC 5 to be causing this bug, I’d like to 
do what I can to track down the issue so I can get a working build of Xen 4.5. 
:)

Thanks!

—
Jason Fritcher





Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-20 Thread M A Young
On Wed, 20 May 2015, Major Hayden wrote:

> On 05/20/2015 05:41 AM, Jan Beulich wrote:
> > Considering that no-one else is seeing this - is this perhaps connected
> > to you building Xen with pre-release gcc 5.0.1? This is also because in
> > order for the above to indeed occur, mmio_ro_do_page_fault()'s
> > put_page() would need to drop the last reference of a page, yet
> > page_get_owner_and_reference() doesn't obtain a reference when
> > a page is unallocated (and hence unowned), i.e. normally a page
> > would have a refcount of at least 2 here. Hence this would be
> > possible only due to a race, but the exact same race to be observed
> > on different hardware _and_ under an emulator is extremely unlikely.

You could try with the xen.gz file from 
https://copr-be.cloud.fedoraproject.org/results/myoung/xentest/fedora-21-x86_64/xen-4.5.1-0.rc1.fc21/xen-hypervisor-4.5.1-0.rc1.fc21.x86_64.rpm
It is roughly the same version of xen but built against Fedora 21 and gcc 
4.9.2. If that works then it probably is gcc 5.

Michael Young



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-20 Thread Major Hayden
On 05/20/2015 05:41 AM, Jan Beulich wrote:
> Considering that no-one else is seeing this - is this perhaps connected
> to you building Xen with pre-release gcc 5.0.1? This is also because in
> order for the above to indeed occur, mmio_ro_do_page_fault()'s
> put_page() would need to drop the last reference of a page, yet
> page_get_owner_and_reference() doesn't obtain a reference when
> a page is unallocated (and hence unowned), i.e. normally a page
> would have a refcount of at least 2 here. Hence this would be
> possible only due to a race, but the exact same race to be observed
> on different hardware _and_ under an emulator is extremely unlikely.

That could be a possibility.  There is one Fedora patch[1] to fix a GCC 5 
compile error but that's probably unrelated to the crash.

I'm still hunting around to see what I can figure out.

[1] http://pkgs.fedoraproject.org/cgit/xen.git/tree/?h=f22

--
Major Hayden



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-20 Thread Jan Beulich
>>> On 19.05.15 at 20:06,  wrote:
> I've been doing some testing of Xen 4.5 on Fedora 22 (due out within a week) 
> and I have an error that prevents the server from booting in the very early 
> boot process:
> 
>> (XEN) Xen call trace:
>> (XEN)[] free_domheap_pages+0x240/0x430
>> (XEN)[] mmio_ro_do_page_fault+0x114/0x160
>> (XEN)[] do_page_fault+0x1a0/0x4f0
>> (XEN)[] handle_exception_saved+0x2e/0x6c
>> (XEN) 
>> (XEN) 
>> (XEN) 
>> (XEN) Panic on CPU 0:
>> (XEN) Xen BUG at page_alloc.c:1738
>> (XEN) 
> 
> The full output is over in a Github Gist[1].
> 
> I've tested this on some physical machines (Dell, HP, and SuperMicro 
> servers) as well as within a KVM virtual machine but I get the same boot 
> error each time.

Considering that no-one else is seeing this - is this perhaps connected
to you building Xen with pre-release gcc 5.0.1? This is also because in
order for the above to indeed occur, mmio_ro_do_page_fault()'s
put_page() would need to drop the last reference of a page, yet
page_get_owner_and_reference() doesn't obtain a reference when
a page is unallocated (and hence unowned), i.e. normally a page
would have a refcount of at least 2 here. Hence this would be
possible only due to a race, but the exact same race to be observed
on different hardware _and_ under an emulator is extremely unlikely.

Jan




Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-19 Thread Major Hayden
I compiled Xen with debugging enabled and it appears to pass the
initial boot but then fails later in the boot process.  I'm working
through that now.  Here's what my source of page_alloc.c looks like
around line 1738:

1731     if ( likely(d) && likely(d != dom_cow) )
1732     {
1733         /* NB. May recursively lock from relinquish_memory(). */
1734         spin_lock_recursive(&d->page_alloc_lock);
1735
1736         for ( i = 0; i < (1 << order); i++ )
1737         {
1738             BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);
1739             page_list_del2(&pg[i], &d->page_list, &d->arch.relmem_list);
1740         }
1741
1742         drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
1743
1744         spin_unlock_recursive(&d->page_alloc_lock);
1745
1746         /*
1747          * Normally we expect a domain to clear pages before freeing them,
1748          * if it cares about the secrecy of their contents. However, after
1749          * a domain has died we assume responsibility for erasure.
1750          */
1751         scrub = !!d->is_dying;
1752     }

On Tue, May 19, 2015 at 1:16 PM, Andrew Cooper wrote:
>
> Can you try a debug hypervisor and rerun, to confirm the stack trace and
> see whether any assertions fire?
>
> Can you identify exactly which line xen/common/page_alloc.c:1738 is in
> your source?

-- 
Major Hayden



Re: [Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-19 Thread Andrew Cooper
On 19/05/15 19:06, Major Hayden wrote:
> Hello there,
>
> I've been doing some testing of Xen 4.5 on Fedora 22 (due out within a week) 
> and I have an error that prevents the server from booting in the very early 
> boot process:
>
>> (XEN) Xen call trace:
>> (XEN)[] free_domheap_pages+0x240/0x430
>> (XEN)[] mmio_ro_do_page_fault+0x114/0x160
>> (XEN)[] do_page_fault+0x1a0/0x4f0
>> (XEN)[] handle_exception_saved+0x2e/0x6c
>> (XEN) 
>> (XEN) 
>> (XEN) 
>> (XEN) Panic on CPU 0:
>> (XEN) Xen BUG at page_alloc.c:1738
>> (XEN) 
> The full output is over in a Github Gist[1].
>
> I've tested this on some physical machines (Dell, HP, and SuperMicro servers) 
> as well as within a KVM virtual machine but I get the same boot error each 
> time.  It occurs with Xen 4.5 and Linux 3.17-4.0.x.  Xen 4.5.1-rc1 fails in 
> the same way.  I've opened a Red Hat Bug[2] as well as a Xen bug[3] on it.
>
> The code within free_domheap_pages() hasn't changed much since late 2014 so 
> I'm not sure if that's the culprit.  Does anyone have any suggestions on how 
> to debug it further?
>
> Thanks in advance!
>
> [1] https://gist.github.com/major/baa0e2eee7de51a2bcd1
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1219197
> [3] http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1908

Can you try a debug hypervisor and rerun, to confirm the stack trace and
see whether any assertions fire?

Can you identify exactly which line xen/common/page_alloc.c:1738 is in
your source?

~Andrew



[Xen-devel] Xen BUG at page_alloc.c:1738 (Xen 4.5)

2015-05-19 Thread Major Hayden
Hello there,

I've been doing some testing of Xen 4.5 on Fedora 22 (due out within a week) 
and I have an error that prevents the server from booting in the very early 
boot process:

> (XEN) Xen call trace:
> (XEN)[] free_domheap_pages+0x240/0x430
> (XEN)[] mmio_ro_do_page_fault+0x114/0x160
> (XEN)[] do_page_fault+0x1a0/0x4f0
> (XEN)[] handle_exception_saved+0x2e/0x6c
> (XEN) 
> (XEN) 
> (XEN) 
> (XEN) Panic on CPU 0:
> (XEN) Xen BUG at page_alloc.c:1738
> (XEN) 

The full output is over in a Github Gist[1].

I've tested this on some physical machines (Dell, HP, and SuperMicro servers) 
as well as within a KVM virtual machine but I get the same boot error each 
time.  It occurs with Xen 4.5 and Linux 3.17-4.0.x.  Xen 4.5.1-rc1 fails in the 
same way.  I've opened a Red Hat Bug[2] as well as a Xen bug[3] on it.

The code within free_domheap_pages() hasn't changed much since late 2014 so I'm 
not sure if that's the culprit.  Does anyone have any suggestions on how to 
debug it further?

Thanks in advance!

[1] https://gist.github.com/major/baa0e2eee7de51a2bcd1
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1219197
[3] http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1908

--
Major Hayden
