Re: [Xen-devel] xenstored crashes with SIGSEGV

2015-03-12 Thread Philipp Hahn
Hello,
On 12.03.2015 19:17, Oleg Nesterov wrote:
> On 03/12, Philipp Hahn wrote:
>>
>> Have you seen any other corruption
> 
> No,
> 
>> or is one of your patches likely to
>> fix something like the issue mentioned above:
> 
> I am not sure I even understand the problem above ;) I mean, after a quick
> look I do not see how this connects to FPU. $rdi == 2 looks obviously wrong.

In December we found some strange crashes of a Xen daemon, but other
processes crashed as well. One strange pattern Ian found was some
0x..00.ff pattern, which seems to have come from some SSE register
corruption.
That is why we upgraded to 3.10.62, which contains some fixes for saving
the FPU state. If my memory is correct, the FPU registers share space
with the MMX/SSE registers, so that seemed a good candidate.
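
For reference, roughly how the 512-byte FXSAVE area is laid out (from
memory of the architecture manuals, so treat this as a sketch rather than
something taken from this thread); it shows why x87/MMX and SSE state are
saved and restored together:

#include <stdint.h>

/* Sketch of the 64-bit FXSAVE/FXRSTOR save area: the x87/MMX state and the
 * XMM registers live in one block, so a missed or partial FPU state restore
 * can corrupt the SSE registers as well. */
struct fxsave_area {
	uint16_t fcw, fsw;          /* x87 control/status words */
	uint8_t  ftw, pad0;
	uint16_t fop;
	uint64_t fip, fdp;          /* last instruction/operand pointers */
	uint32_t mxcsr, mxcsr_mask; /* SSE control/status */
	uint8_t  st_mm[8][16];      /* st0-st7 alias mm0-mm7 in these slots */
	uint8_t  xmm[16][16];       /* xmm0-xmm15 */
	uint8_t  reserved[96];
};                                  /* 512 bytes total */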

You might want to take a look at
,
where you find the mail thread from December.

>>> $ git l1 --grep fpu v3.10.. -- arch/x86
>>> c7b228a Merge branch 'x86-fpu-for-linus' of 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>>> dc56c0f x86, fpu: Shift "fpu_counter = 0" from copy_thread() to 
>>> arch_dup_task_struct()
>>> 5e23fee x86, fpu: copy_process: Sanitize fpu->last_cpu initialization
>>> f185350 x86, fpu: copy_process: Avoid fpu_alloc/copy if !used_math()
>>> 31d9633 x86, fpu: Change __thread_fpu_begin() to use use_eager_fpu()
> 
> These are only cleanups... I do not think this series can fix anything.

That was my guess too from reading your description, but thanks anyway for your help.

Philipp



Re: [Xen-devel] xenstored crashes with SIGSEGV

2015-03-12 Thread Oleg Nesterov
On 03/12, Philipp Hahn wrote:
>
> Have you seen any other corruption

No,

> or is one of your patches likely to
> fix something like the issue mentioned above:

I am not sure I even understand the problem above ;) I mean, after a quick
look I do not see how this connects to FPU. $rdi == 2 looks obviously wrong.

> > $ git l1 --grep fpu v3.10.. -- arch/x86
> > c7b228a Merge branch 'x86-fpu-for-linus' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > dc56c0f x86, fpu: Shift "fpu_counter = 0" from copy_thread() to 
> > arch_dup_task_struct()
> > 5e23fee x86, fpu: copy_process: Sanitize fpu->last_cpu initialization
> > f185350 x86, fpu: copy_process: Avoid fpu_alloc/copy if !used_math()
> > 31d9633 x86, fpu: Change __thread_fpu_begin() to use use_eager_fpu()

These are only cleanups... I do not think this series can fix anything.

Oleg.




Re: [Xen-devel] xenstored crashes with SIGSEGV

2015-03-12 Thread Philipp Hahn
Hello,

On 06.01.2015 08:19, Philipp Hahn wrote:
> On 19.12.2014 13:36, Philipp Hahn wrote:
>> On 18.12.2014 11:17, Ian Campbell wrote:
>>> On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
 Do we have a bug in Xen that affect SSE instructions (possibly already
 fixed after Philipp version) ?
>>>
>>> I've had a niggling feeling of Deja Vu over this which I'd been putting
>>> down to an old Xen on ARM bug in the area of FPU register switching.
>>>
>>> But it seems at some point (possibly even still) there was a similar
>>> issue with pvops kernels on x86, see:
>>> http://bugs.xenproject.org/xen/bug/40
...
>>> Philipp, what kernel are you guys using?
>>
>> The crash "2014-12-06 01:26:21 xenstored[4337]" happened on linux-3.10.46.
> 
> I looked through the changes of v3.10.46..v3.10.63 and found the
> following patches:
> | fb5b6e7 x86, fpu: shift drop_init_fpu() from save_xstate_sig() to
> handle_signal()
> | b888e3d x86, fpu: __restore_xstate_sig()->math_state_restore() needs
> preempt_disable()
> 
> They look interesting enough that they may have fixed the bug, which could
> explain the strange bit pattern caused by not restoring the FPU state
> correctly.
...
> we're now working on upgrading the dom0 kernel, which should give us
> usable core dumps again and may also fix the underlying problem. If that
> bug ever happens again I'll keep you informed.

We're now running 3.10.62 and the situation seems to have improved, but
yesterday and today we got two crashes on different hosts - this time
both again in vsnprintf():

> [304534.173707] xenstored[3731]: segfault at 2 ip 7f6da00805ad sp 
> 7fff544a2b80 error 4 in libc-2.11.3.so[7f6da003b000+158000]

> (gdb) where
> #0  0x7f6da00805ad in _IO_vfprintf_internal (s=0x7fff544a3230, 
> format=, ap=0x7fff544a3790) at vfprintf.c:1617
> #1  0x7f6da00a2452 in _IO_vsnprintf (string=0x7fff544a3390 "%%p 
> 4249828122762082015 03:11:04 9JT\377\177", maxlen=, 
> format=0x40da48 "%s %p %04d%02d%02d %02d:%02d:%02d %s (", 
> args=0x7fff544a3790) at vsnprintf.c:120
> #2  0x004029ad in trace (fmt=0x40da48 "%s %p %04d%02d%02d 
> %02d:%02d:%02d %s (") at xenstored_core.c:140
> #3  0x00402c67 in trace_io (conn=0xbb51f0, data=0xbf1fe0, out=0) at 
> xenstored_core.c:174
> #4  0x004041cd in handle_input (conn=0xbb51f0) at 
> xenstored_core.c:1307
> #5  0x00405170 in main (argc=, argv= optimized out>) at xenstored_core.c:1964

The SSE registers again contain the 00..ff.. pattern, but accessing
%es:(%rdi) = 0x0:0x2 looks very broken.

> (gdb) info all-registers 
> rax            0x0      0
> rbx            0x40da48 4250184
> rcx            0x       -1
> rdx            0x7fff544a3890   140734607538320
> rsi            0x40da69 4250217
> rdi            0x2      2
> rbp            0x7fff544a3790   0x7fff544a3790
> rsp            0x7fff544a3390   0x7fff544a3390
> r8             0x1      1
> r9             0x2      2
> r10            0x2      2
> r11            0x10     16
> r12            0x0      0
> r13            0x7fff544a3950   140734607538512
> r14            0x7fff544a39d0   140734607538640
> r15            0xc      12
> rip            0x4029ad 0x4029ad 
> eflags         0x10286  [ PF SF IF RF ]
> cs             0xe033   57395
> ss             0xe02b   57387
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> st0            0        (raw 0x)
> st1            0        (raw 0x)
> st2            0        (raw 0x)
> st3            0        (raw 0x)
> st4            0        (raw 0x)
> st5            0        (raw 0x)
> st6            0        (raw 0x)
> st7            0        (raw 0x)
> fctrl          0x37f    895
> fstat          0x0      0
> ftag           0x       65535
> fiseg          0x0      0
> fioff          0x0      0
> foseg          0x0      0
> fooff          0x0      0
> fop            0x0      0
> xmm0   {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, 
> v16_int8 = {0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0x0, 0xff, 0x0, 
> 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0xff, 0x0, 0xff00, 0x0, 0x0, 0xff, 0x0, 
> 0x0}, v4_int32 = {0xff, 0xff00, 0xff, 0x0}, v2_int64 = {0xffff, 
> 0xff}, uint128 = 0x00ffffff}
> xmm1   {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, 
> v16_int8 = {0x25 }, v8_int16 = {0x2525, 0x2525, 0x2525, 
> 0x2525, 0x2525, 0x2525, 0x2525, 0x2525}, v4_int32 = {0x25252525, 0x25252525, 
> 0x25252525, 0x25252525}, v2_int64 = {0x2525252525252525, 0x2525252525252525}, 
> uint128 = 0x25252525252525252525252525252525}
> xmm2   {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, 
> v16_int8 = {0x0 }, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 
> 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 

Re: [Xen-devel] xenstored crashes with SIGSEGV

2015-01-05 Thread Philipp Hahn
Hello,

happy new year to everyone.

On 19.12.2014 13:36, Philipp Hahn wrote:
> On 18.12.2014 11:17, Ian Campbell wrote:
>> On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
>>> Do we have a bug in Xen that affect SSE instructions (possibly already
>>> fixed after Philipp version) ?
>>
>> I've had a niggling feeling of Deja Vu over this which I'd been putting
>> down to an old Xen on ARM bug in the area of FPU register switching.
>>
>> But it seems at some point (possibly even still) there was a similar
>> issue with pvops kernels on x86, see:
>> http://bugs.xenproject.org/xen/bug/40
> 
> That definitely looks interesting.
> 
>> Philipp, what kernel are you guys using?
> 
> The crash "2014-12-06 01:26:21 xenstored[4337]" happened on linux-3.10.46.

I looked through the changes of v3.10.46..v3.10.63 and found the
following patches:
| fb5b6e7 x86, fpu: shift drop_init_fpu() from save_xstate_sig() to
handle_signal()
| b888e3d x86, fpu: __restore_xstate_sig()->math_state_restore() needs
preempt_disable()

They look interesting enough that they may have fixed the bug, which could
explain the strange bit pattern caused by not restoring the FPU state
correctly. Because of that, and because of the missing

>> commit d1cc001905146d58c17ac8452eb96f226767819d
>> Author: Silesh C V 
>> Date:   Wed Jul 23 13:59:59 2014 -0700
>>
>> coredump: fix the setting of PF_DUMPCORE
>> commit aed8adb7688d5744cb484226820163af31d2499a upstream.

we're now working on upgrading the dom0 kernel, which should give us
usable core dumps again and may also fix the underlying problem. If that
bug ever happens again I'll keep you informed.

Thanks so far to everybody for the excellent support.

Sincerely
Philipp Hahn



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-19 Thread Konrad Rzeszutek Wilk
On Thu, Dec 18, 2014 at 10:25:15AM +, David Vrabel wrote:
> On 18/12/14 10:17, Ian Campbell wrote:
> > On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
> >> Do we have a bug in Xen that affect SSE instructions (possibly already
> >> fixed after Philipp version) ?
> > 
> > I've had a niggling feeling of Deja Vu over this which I'd been putting
> > down to an old Xen on ARM bug in the area of FPU register switching.
> > 
> > But it seems at some point (possibly even still) there was a similar
> > issue with pvops kernels on x86, see:
> > http://bugs.xenproject.org/xen/bug/40
> > 
> > Philipp, what kernel are you guys using?
> > 
> > CCing Jan and the x86 kernel guys (and George since he registered the
> > bug). I'm not seeing anything in the kernel logs which looks like a fix
> > (there's some PVH related cr0 frobbing, but I don't think that's it).
> > 
> > I also can't quite shake the feeling that there was another much older
> > issue relating to FPU context switch on x86, but I think that was truly
> > ancient history (2.6.18 era stuff)
> 
> http://marc.info/?l=linux-kernel&m=139132566024357

More up-to-date: http://lkml.iu.edu/hypermail/linux/kernel/1409.0/01057.html

And boy did that mess up Oracle DB! There had been a P0 bug affecting
databases and this patch solved it.

> 
> David



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-19 Thread Philipp Hahn
Hello Ian,

On 18.12.2014 11:17, Ian Campbell wrote:
> On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
> 
> I've had a niggling feeling of Deja Vu over this which I'd been putting
> down to an old Xen on ARM bug in the area of FPU register switching.
> 
> But it seems at some point (possibly even still) there was a similar
> issue with pvops kernels on x86, see:
> http://bugs.xenproject.org/xen/bug/40

That definitely looks interesting.

> Philipp, what kernel are you guys using?

The crash "2014-12-06 01:26:21 xenstored[4337]" happened on linux-3.10.46.

That kernel is missing v3.10.50-13-gd1cc001:
> commit d1cc001905146d58c17ac8452eb96f226767819d
> Author: Silesh C V 
> Date:   Wed Jul 23 13:59:59 2014 -0700
>
> coredump: fix the setting of PF_DUMPCORE
> commit aed8adb7688d5744cb484226820163af31d2499a upstream.
which explains why the xmm* registers are not included in the core file.

> I also can't quite shake the feeling that there was another much older
> issue relating to FPU context switch on x86, but I think that was truly
> ancient history (2.6.18 era stuff)

Some of those hosts might still use 3.2, most use 3.10.x, but definitely
no 2.6 kernels.

Xen-Hypervisor is 4.1.3

If you need anything more, just ask. It might take me some time to
answer as I'm on vacation for the next 2 weeks.

Thanks again for your help.
Philipp



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-18 Thread Ian Campbell
On Thu, 2014-12-18 at 10:49 +, Jan Beulich wrote:
> >>> On 18.12.14 at 11:17,  wrote:
> > On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
> >> Do we have a bug in Xen that affect SSE instructions (possibly already
> >> fixed after Philipp version) ?
> > 
> > I've had a niggling feeling of Deja Vu over this which I'd been putting
> > down to an old Xen on ARM bug in the area of FPU register switching.
> > 
> > But it seems at some point (possibly even still) there was a similar
> > issue with pvops kernels on x86, see:
> > http://bugs.xenproject.org/xen/bug/40 
> > 
> > Philipp, what kernel are you guys using?
> > 
> > CCing Jan and the x86 kernel guys (and George since he registered the
> > bug). I'm not seeing anything in the kernel logs which looks like a fix
> > (there's some PVH related cr0 frobbing, but I don't think that's it).
> 
> I just went through the thread again and didn't find where kernel/
> hypervisor logs were posted.

I don't think they were yet -- until recently it seemed like a xenstored
bug. Philipp, can you post them now?

Also the patch linked to by David seems like a good thing to try if you
are indeed running a kernel which is susceptible to this issue.

> You mentioning PVH made me want to take a look - said FPU related bug
> would be exposed only by PV kernels.

It's (almost certainly) not a PVH issue; I was saying that the patch
which touched PVH probably isn't relevant to this particular issue.

Ian.




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-18 Thread Jan Beulich
>>> On 18.12.14 at 11:17,  wrote:
> On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
> 
> I've had a niggling feeling of Deja Vu over this which I'd been putting
> down to an old Xen on ARM bug in the area of FPU register switching.
> 
> But it seems at some point (possibly even still) there was a similar
> issue with pvops kernels on x86, see:
> http://bugs.xenproject.org/xen/bug/40 
> 
> Philipp, what kernel are you guys using?
> 
> CCing Jan and the x86 kernel guys (and George since he registered the
> bug). I'm not seeing anything in the kernel logs which looks like a fix
> (there's some PVH related cr0 frobbing, but I don't think that's it).

I just went through the thread again and didn't find where kernel/
hypervisor logs were posted. You mentioning PVH made me want to
take a look - said FPU related bug would be exposed only by PV
kernels.

Jan




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-18 Thread David Vrabel
On 18/12/14 10:17, Ian Campbell wrote:
> On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
> 
> I've had a niggling feeling of Deja Vu over this which I'd been putting
> down to an old Xen on ARM bug in the area of FPU register switching.
> 
> But it seems at some point (possibly even still) there was a similar
> issue with pvops kernels on x86, see:
> http://bugs.xenproject.org/xen/bug/40
> 
> Philipp, what kernel are you guys using?
> 
> CCing Jan and the x86 kernel guys (and George since he registered the
> bug). I'm not seeing anything in the kernel logs which looks like a fix
> (there's some PVH related cr0 frobbing, but I don't think that's it).
> 
> I also can't quite shake the feeling that there was another much older
> issue relating to FPU context switch on x86, but I think that was truly
> ancient history (2.6.18 era stuff)

http://marc.info/?l=linux-kernel&m=139132566024357

David



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-18 Thread Philipp Hahn
Hello,

On 17.12.2014 10:14, Frediano Ziglio wrote:
> 2014-12-16 16:44 GMT+00:00 Frediano Ziglio :
>> 2014-12-16 16:23 GMT+00:00 Ian Campbell :
...
>> First we (I'll try when I reach home) can check if memset in glibc (or
>> the version called from talloc_zero) can use SSE. A possible dmesg
>> output and /proc/cpuinfo content could help too but I think SSE are
>> now quite common.
> 
> I have access to some core dumps. glibc memset is using SSE,
> specifically xmm0 register.
> 
> Unfortunately it seems that core dumps contain only the standard
> registers, so all registers appear zeroed. If you try with a newer gdb
> version it shows that the registers are not available.

I had another look myself and I'm confused now:

Using "info float" or "info vector" with gdb-7.0.1 shows the FP and MMX
registers to be all zero.
A newer gdb-7.2 shows the registers as "unavailable".

"eu-readelf --notes core" doesn't show a NT_FPREGSET note, so to me it
looks like at least the FP-registers were not dumped.
But is that also used for the MMX registers? If my memory is right, the
FP and MMX registers are "shared" in the CPU, but that might be old
knowledge.

I wrote a small SSE-using program, which dumps core. If I run that
locally and do a "readelf --notes core", I get:
  CORE  0x0200  NT_FPREGSET (floating point registers)

If I do the same in dom0, I don't get that note and gdb doesn't show the
register content.
SSE seems to be available in the dom0, as the program would crash with
SIGILL otherwise:
# grep ^flags /proc/cpuinfo
flags   : fpu de tsc msr pae mce cx8 apic sep mca cmov pat
clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good
nopl nonstop_tsc pni est ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor
lahf_lm ida dtherm
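
For reference, a minimal SSE-touching test program along these lines might
look like the following (a rough sketch assuming GCC and SSE2 intrinsics,
not the exact source used here):

/* sse-core.c: touch an XMM register, then abort() so the kernel writes a
 * core file.  Check the result with:  eu-readelf --notes core */
#include <emmintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	__m128i v = _mm_set1_epi8(0x41);     /* put a pattern into an xmm reg */
	char buf[16];

	_mm_storeu_si128((__m128i *)buf, v);
	printf("%.16s\n", buf);              /* keep the store alive */
	abort();                             /* SIGABRT -> core dump */
}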

Looks like that got fixed with a newer 3.10.61 kernel, so I'll urge our
admins to update to a later kernel (again), so we'll get more useful
core dumps for future crashes.

I'm still investigating the core files of the other programs, but it
takes some time. I don't know if I will be able to finish that in time,
as the Christmas holiday season starts tomorrow and I will be
unavailable for nearly two weeks.

So happy Christmas to everybody and thanks again for your help.

Philipp



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-18 Thread Ian Campbell
On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
> Do we have a bug in Xen that affect SSE instructions (possibly already
> fixed after Philipp version) ?

I've had a niggling feeling of Deja Vu over this which I'd been putting
down to an old Xen on ARM bug in the area of FPU register switching.

But it seems at some point (possibly even still) there was a similar
issue with pvops kernels on x86, see:
http://bugs.xenproject.org/xen/bug/40

Philipp, what kernel are you guys using?

CCing Jan and the x86 kernel guys (and George since he registered the
bug). I'm not seeing anything in the kernel logs which looks like a fix
(there's some PVH related cr0 frobbing, but I don't think that's it).

I also can't quite shake the feeling that there was another much older
issue relating to FPU context switch on x86, but I think that was truly
ancient history (2.6.18 era stuff)

Ian.





Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-17 Thread Frediano Ziglio
2014-12-16 16:44 GMT+00:00 Frediano Ziglio :
> 2014-12-16 16:23 GMT+00:00 Ian Campbell :
>> On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
>>> What does "info all-registers" gdb command say about SSE registers?
>>
>> All zeroes. No ffs anywhere.
>>
>
> Could it be that the core does not dump these registers for some reason? On
> my machine I got some FFs even just before main is reached.
>
>>> Do we have a bug in Xen that affect SSE instructions (possibly already
>>> fixed after Philipp version) ?
>>
>> Possibly. When this was thought to be xenstored (which doesn't change
>> all that much) debugging 4.1 seemed plausible, but since it could be
>> anywhere else I think we either need a plausible reproduction, or a
>> repro on a newer hypervisor (or possibly kernel) I'm afraid.
>>
>> Ian.
>>
>
> I found these
>
> 1) https://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.2.8
> 2) https://sourceware.org/bugzilla/show_bug.cgi?id=16064
>
> 1 seems to indicate a problem with kernel 3.2. Second with glibc 2.18.
>
> First we (I'll try when I reach home) can check if memset in glibc (or
> the version called from talloc_zero) can use SSE. A possible dmesg
> output and /proc/cpuinfo content could help too but I think SSE are
> now quite common.
>

I have access to some core dumps. glibc memset is using SSE,
specifically xmm0 register.

Unfortunately it seems that core dumps contain only the standard
registers, so all registers appear zeroed. If you try with a newer gdb
version it shows that the registers are not available.

> For the reproduction, it could be that a program doing some memset(0)
> continuously while another fills SSE registers with garbage could
> help... at least if they execute on the same CPU (so it could help to
> limit Xen to one CPU). Also doing some FPU operations which could lead
> to exceptions could help too.
>

Frediano



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Frediano Ziglio
2014-12-16 16:23 GMT+00:00 Ian Campbell :
> On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
>> What does "info all-registers" gdb command say about SSE registers?
>
> All zeroes. No ffs anywhere.
>

Could it be that the core does not dump these registers for some reason? On
my machine I got some FFs even just before main is reached.

>> Do we have a bug in Xen that affect SSE instructions (possibly already
>> fixed after Philipp version) ?
>
> Possibly. When this was thought to be xenstored (which doesn't change
> all that much) debugging 4.1 seemed plausible, but since it could be
> anywhere else I think we either need a plausible reproduction, or a
> repro on a newer hypervisor (or possibly kernel) I'm afraid.
>
> Ian.
>

I found these

1) https://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.2.8
2) https://sourceware.org/bugzilla/show_bug.cgi?id=16064

The first seems to indicate a problem with kernel 3.2, the second with glibc 2.18.

First we (I'll try when I reach home) can check if memset in glibc (or
the version called from talloc_zero) can use SSE. A possible dmesg
output and /proc/cpuinfo content could help too but I think SSE are
now quite common.

For the reproduction, it could be that a program doing some memset(0)
continuously while another fills SSE registers with garbage could
help... at least if they execute on the same CPU (so it could help to
limit Xen to one CPU). Also doing some FPU operations which could lead
to exceptions could help too.
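
Something along these lines, perhaps (a rough, untested sketch):

/* repro.c: pin two processes to the same CPU; the child keeps loading
 * garbage into an XMM register while the parent memset()s a buffer to
 * zero and checks that it really stays zero. */
#define _GNU_SOURCE
#include <emmintrin.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[4096];
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set) < 0)
		perror("sched_setaffinity");

	if (fork() == 0) {
		for (;;) {                      /* child: fill an xmm reg with 0xff */
			volatile __m128i v = _mm_set1_epi32(-1);
			(void)v;
		}
	}

	for (unsigned long iter = 0;; iter++) { /* parent: memset and verify */
		memset(buf, 0, sizeof(buf));
		for (size_t i = 0; i < sizeof(buf); i++)
			if (buf[i] != 0) {
				fprintf(stderr, "corruption at iteration %lu, offset %zu: 0x%02x\n",
					iter, i, (unsigned char)buf[i]);
				abort();
			}
	}
}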

Frediano



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Ian Campbell
On Tue, 2014-12-16 at 16:13 +, Frediano Ziglio wrote:
> What does "info all-registers" gdb command say about SSE registers?

All zeroes. No ffs anywhere.

> Do we have a bug in Xen that affect SSE instructions (possibly already
> fixed after Philipp version) ?

Possibly. When this was thought to be xenstored (which doesn't change
all that much) debugging 4.1 seemed plausible, but since it could be
anywhere else I think we either need a plausible reproduction, or a
repro on a newer hypervisor (or possibly kernel) I'm afraid.

Ian.




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Frediano Ziglio
2014-12-16 12:23 GMT+00:00 Ian Campbell :
> On Tue, 2014-12-16 at 11:30 +, Frediano Ziglio wrote:
>> 2014-12-16 11:06 GMT+00:00 Ian Campbell :
>> > On Tue, 2014-12-16 at 10:45 +, Ian Campbell wrote:
>> >> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>> >> > > I notice in your bugzilla (for a different occurrence, I think):
>> >> > >> [2090451.721705] univention-conf[2512]: segfault at ff ip 
>> >> > >> 0045e238 sp 768dfa30 error 6 in 
>> >> > >> python2.6[40+21e000]
>> >> > >
>> >> > > Which appears to have faulted access 0xff0 too. It looks like
>> >> > > this process is a python thing, it's nothing to do with xenstored I
>> >> > > assume?
>> >> >
>> >> > Yes, that's one univention-config, which is completely independent of
>> >> > xen(stored).
>> >> >
>> >> > > It seems rather coincidental that it should be accessing the
>> >> > > same sort of address and be faulting.
>> >> >
>> >> > Yes, good catch. I'll have another look at those core dumps.
>> >>
>> >> With this in mind, please can you confirm what model of machines you've
>> >> seen this on, and in particular whether they are all the same class of
>> >> machine or whether they are significantly different.
>> >>
>> >> The reason being that randomly placed 0xff values in a field of 0x00
>> >> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
>> >> memory pages.
>> >
>> > Thanks for giving me access to the core files. This is very suspicious:
>> > (gdb) frame 2
>> > #2  0x0040a348 in tdb_open_ex (name=0x1941fb0 
>> > "/var/lib/xenstored/tdb.0x1935bb0", hash_size=, 
>> > tdb_flags=0, open_flags=, mode=,
>> > log_fn=0x4093b0 , hash_fn=) at 
>> > tdb.c:1958
>> > 1958SAFE_FREE(tdb->locked);
>> >
>> > (gdb) x/96x tdb
>> > 0x1921270:  0x  0x  0x  0x
>> > 0x1921280:  0x001f  0x00ff  0xff00  0x00ff
>> > 0x1921290:  0x  0x00ff  0xff00  0x00ff
>> > 0x19212a0:  0x  0x00ff  0xff00  0x00ff
>> > 0x19212b0:  0x  0x00ff  0xff00  0x00ff
>> > 0x19212c0:  0x  0x00ff  0xff00  0x00ff
>> > 0x19212d0:  0x  0x00ff  0xff00  0x00ff
>> > 0x19212e0:  0x  0x00ff  0xff00  0x00ff
>> > 0x19212f0:  0x  0x00ff  0xff00  0x00ff
>> > 0x1921300:  0x  0x00ff  0xff00  0x00ff
>> > 0x1921310:  0x  0x00ff  0xff00  0x00ff
>> > 0x1921320:  0x  0x00ff  0xff00  0x00ff
>> > 0x1921330:  0x  0x00ff  0xff00  0x00ff
>> > 0x1921340:  0x  0x  0xff00  0x00ff
>> > 0x1921350:  0x  0x00ff  0xff00  0x00ff
>> > 0x1921360:  0x  0x00ff  0xff00  0x00ff
>> > 0x1921370:  0x004093b0  0x  0x004092f0  0x
>> > 0x1921380:  0x0002  0x  0x0091  0x
>> > 0x1921390:  0x0193de70  0x  0x01963600  0x
>> > 0x19213a0:  0x  0x  0x0193fbb0  0x
>> > 0x19213b0:  0x  0x  0x  0x
>> > 0x19213c0:  0x00405870  0x  0x0040e3e0  0x
>> > 0x19213d0:  0x0038  0x  0xe814ec70  0x6f2f6567
>> > 0x19213e0:  0x01963650  0x  0x0193dec0  0x
>> >
>> > Something has clearly done a number on the ram of this process.
>> > 0x1921270 through 0x192136f is 256 bytes...
>> >
>> > Since it appears to be happening to other processes too I would hazard
>> > that this is not a xenstored issue.
>> >
>> > Ian.
>> >
>>
>> Good catch Ian!
>>
>> Strange corruption. Probably not related to xenstored, as you
>> suggested. I would be curious to see what's before the tdb pointer and
>> where the corruption starts.
>
> (gdb) print tdb
> $2 = (TDB_CONTEXT *) 0x1921270
> (gdb) x/64x 0x1921200
> 0x1921200:  0x01921174  0x  0x  0x
> 0x1921210:  0x01921174  0x  0x0171  0x
> 0x1921220:  0x  0x  0x  0x

0x0 next (u64)
0x0 prev (u64)

> 0x1921230:  0x01941f60  0x  0x  0x

0x01941f60 parent (u64), make sense is not NULL
0x0 child (u64)

> 0x1921240:  0x  0x  0x  0x6f630065

0x0 refs (u64)
0x0 null_refs (u32)
0x6f630065 pad, garbage (u32)

> 0x1921250:  0x  0x  0x0040e8a7  0x

0x0 destructor (u64)
0x0040e8a7 name (u64)

> 0x1921260:  0x0118  0x  0xe814ec70  0x000
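
Putting the annotations together, the 0x50 bytes in front of the tdb
pointer are the talloc chunk header, roughly as follows (field names from
memory of that era's talloc.c, so treat this as a sketch rather than a
quote):

/* The header that precedes every talloc()ed pointer. */
struct talloc_chunk {
	struct talloc_chunk *next, *prev;      /* sibling list            */
	struct talloc_chunk *parent, *child;   /* allocation tree         */
	struct talloc_reference_handle *refs;
	unsigned int null_refs;                /* followed by 4 bytes pad */
	int (*destructor)(void *);
	const char *name;
	size_t size;
	unsigned int flags;                    /* TALLOC_MAGIC | flag bits */
};
/* sizeof == 0x50 on x86-64, matching the "lea -0x50(%rdi),%rax" in the
 * disassembly of talloc_chunk_from_ptr() elsewhere in this thread; the
 * 0xe814ec70 at the end of the dump above is TALLOC_MAGIC in 'flags'. */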

Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Ian Campbell
On Tue, 2014-12-16 at 11:30 +, Frediano Ziglio wrote:
> 2014-12-16 11:06 GMT+00:00 Ian Campbell :
> > On Tue, 2014-12-16 at 10:45 +, Ian Campbell wrote:
> >> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> >> > > I notice in your bugzilla (for a different occurrence, I think):
> >> > >> [2090451.721705] univention-conf[2512]: segfault at ff ip 
> >> > >> 0045e238 sp 768dfa30 error 6 in 
> >> > >> python2.6[40+21e000]
> >> > >
> >> > > Which appears to have faulted access 0xff0 too. It looks like
> >> > > this process is a python thing, it's nothing to do with xenstored I
> >> > > assume?
> >> >
> >> > Yes, that's one univention-config, which is completely independent of
> >> > xen(stored).
> >> >
> >> > > It seems rather coincidental that it should be accessing the
> >> > > same sort of address and be faulting.
> >> >
> >> > Yes, good catch. I'll have another look at those core dumps.
> >>
> >> With this in mind, please can you confirm what model of machines you've
> >> seen this on, and in particular whether they are all the same class of
> >> machine or whether they are significantly different.
> >>
> >> The reason being that randomly placed 0xff values in a field of 0x00
> >> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
> >> memory pages.
> >
> > Thanks for giving me access to the core files. This is very suspicious:
> > (gdb) frame 2
> > #2  0x0040a348 in tdb_open_ex (name=0x1941fb0 
> > "/var/lib/xenstored/tdb.0x1935bb0", hash_size=, 
> > tdb_flags=0, open_flags=, mode=,
> > log_fn=0x4093b0 , hash_fn=) at 
> > tdb.c:1958
> > 1958SAFE_FREE(tdb->locked);
> >
> > (gdb) x/96x tdb
> > 0x1921270:  0x  0x  0x  0x
> > 0x1921280:  0x001f  0x00ff  0xff00  0x00ff
> > 0x1921290:  0x  0x00ff  0xff00  0x00ff
> > 0x19212a0:  0x  0x00ff  0xff00  0x00ff
> > 0x19212b0:  0x  0x00ff  0xff00  0x00ff
> > 0x19212c0:  0x  0x00ff  0xff00  0x00ff
> > 0x19212d0:  0x  0x00ff  0xff00  0x00ff
> > 0x19212e0:  0x  0x00ff  0xff00  0x00ff
> > 0x19212f0:  0x  0x00ff  0xff00  0x00ff
> > 0x1921300:  0x  0x00ff  0xff00  0x00ff
> > 0x1921310:  0x  0x00ff  0xff00  0x00ff
> > 0x1921320:  0x  0x00ff  0xff00  0x00ff
> > 0x1921330:  0x  0x00ff  0xff00  0x00ff
> > 0x1921340:  0x  0x  0xff00  0x00ff
> > 0x1921350:  0x  0x00ff  0xff00  0x00ff
> > 0x1921360:  0x  0x00ff  0xff00  0x00ff
> > 0x1921370:  0x004093b0  0x  0x004092f0  0x
> > 0x1921380:  0x0002  0x  0x0091  0x
> > 0x1921390:  0x0193de70  0x  0x01963600  0x
> > 0x19213a0:  0x  0x  0x0193fbb0  0x
> > 0x19213b0:  0x  0x  0x  0x
> > 0x19213c0:  0x00405870  0x  0x0040e3e0  0x
> > 0x19213d0:  0x0038  0x  0xe814ec70  0x6f2f6567
> > 0x19213e0:  0x01963650  0x  0x0193dec0  0x
> >
> > Something has clearly done a number on the ram of this process.
> > 0x1921270 through 0x192136f is 256 bytes...
> >
> > Since it appears to be happening to other processes too I would hazard
> > that this is not a xenstored issue.
> >
> > Ian.
> >
> 
> Good catch Ian!
> 
> Strange corruption. Probably not related to xenstored, as you
> suggested. I would be curious to see what's before the tdb pointer and
> where the corruption starts.

(gdb) print tdb
$2 = (TDB_CONTEXT *) 0x1921270
(gdb) x/64x 0x1921200
0x1921200:  0x01921174  0x  0x  0x
0x1921210:  0x01921174  0x  0x0171  0x
0x1921220:  0x  0x  0x  0x
0x1921230:  0x01941f60  0x  0x  0x
0x1921240:  0x  0x  0x  0x6f630065
0x1921250:  0x  0x  0x0040e8a7  0x
0x1921260:  0x0118  0x  0xe814ec70  0x
0x1921270:  0x  0x  0x  0x
0x1921280:  0x001f  0x00ff  0xff00  0x00ff
0x1921290:  0x  0x00ff  0xff00  0x00ff
0x19212a0:  0x  0x00ff  0xff00  0x00ff
0x19212b0:  0x  0x00ff  

Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Philipp Hahn
Hello,

On 16.12.2014 11:45, Ian Campbell wrote:
> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>>> I notice in your bugzilla (for a different occurrence, I think):
 [2090451.721705] univention-conf[2512]: segfault at ff ip 
 0045e238 sp 768dfa30 error 6 in python2.6[40+21e000]
>>>
>>> Which appears to have faulted access 0xff0 too. It looks like
>>> this process is a python thing, it's nothing to do with xenstored I
>>> assume?
>>
>> Yes, that's one univention-config, which is completely independent of
>> xen(stored).
>>
>>> It seems rather coincidental that it should be accessing the 
>>> same sort of address and be faulting.
>>
>> Yes, good catch. I'll have another look at those core dumps.
> 
> With this in mind, please can you confirm what model of machines you've
> seen this on, and in particular whether they are all the same class of
> machine or whether they are significantly different.

They are all from the same vendor, but I have to check the individual
models and firmware versions, which might take some time.

> The reason being that randomly placed 0xff values in a field of 0x00
> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
> memory pages.

Good catch: that would explain why it only happens for us and no one
else has seen that strange bug before.

Thank you again.
Philipp Hahn




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Frediano Ziglio
2014-12-16 11:06 GMT+00:00 Ian Campbell :
> On Tue, 2014-12-16 at 10:45 +, Ian Campbell wrote:
>> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
>> > > I notice in your bugzilla (for a different occurrence, I think):
>> > >> [2090451.721705] univention-conf[2512]: segfault at ff ip 
>> > >> 0045e238 sp 768dfa30 error 6 in python2.6[40+21e000]
>> > >
>> > > Which appears to have faulted access 0xff0 too. It looks like
>> > > this process is a python thing, it's nothing to do with xenstored I
>> > > assume?
>> >
>> > Yes, that's one univention-config, which is completely independent of
>> > xen(stored).
>> >
>> > > It seems rather coincidental that it should be accessing the
>> > > same sort of address and be faulting.
>> >
>> > Yes, good catch. I'll have another look at those core dumps.
>>
>> With this in mind, please can you confirm what model of machines you've
>> seen this on, and in particular whether they are all the same class of
>> machine or whether they are significantly different.
>>
>> The reason being that randomly placed 0xff values in a field of 0x00
>> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
>> memory pages.
>
> Thanks for giving me access to the core files. This is very suspicious:
> (gdb) frame 2
> #2  0x0040a348 in tdb_open_ex (name=0x1941fb0 
> "/var/lib/xenstored/tdb.0x1935bb0", hash_size=, 
> tdb_flags=0, open_flags=, mode=,
> log_fn=0x4093b0 , hash_fn=) at 
> tdb.c:1958
> 1958SAFE_FREE(tdb->locked);
>
> (gdb) x/96x tdb
> 0x1921270:  0x  0x  0x  0x
> 0x1921280:  0x001f  0x00ff  0xff00  0x00ff
> 0x1921290:  0x  0x00ff  0xff00  0x00ff
> 0x19212a0:  0x  0x00ff  0xff00  0x00ff
> 0x19212b0:  0x  0x00ff  0xff00  0x00ff
> 0x19212c0:  0x  0x00ff  0xff00  0x00ff
> 0x19212d0:  0x  0x00ff  0xff00  0x00ff
> 0x19212e0:  0x  0x00ff  0xff00  0x00ff
> 0x19212f0:  0x  0x00ff  0xff00  0x00ff
> 0x1921300:  0x  0x00ff  0xff00  0x00ff
> 0x1921310:  0x  0x00ff  0xff00  0x00ff
> 0x1921320:  0x  0x00ff  0xff00  0x00ff
> 0x1921330:  0x  0x00ff  0xff00  0x00ff
> 0x1921340:  0x  0x  0xff00  0x00ff
> 0x1921350:  0x  0x00ff  0xff00  0x00ff
> 0x1921360:  0x  0x00ff  0xff00  0x00ff
> 0x1921370:  0x004093b0  0x  0x004092f0  0x
> 0x1921380:  0x0002  0x  0x0091  0x
> 0x1921390:  0x0193de70  0x  0x01963600  0x
> 0x19213a0:  0x  0x  0x0193fbb0  0x
> 0x19213b0:  0x  0x  0x  0x
> 0x19213c0:  0x00405870  0x  0x0040e3e0  0x
> 0x19213d0:  0x0038  0x  0xe814ec70  0x6f2f6567
> 0x19213e0:  0x01963650  0x  0x0193dec0  0x
>
> Something has clearly done a number on the ram of this process.
> 0x1921270 through 0x192136f is 256 bytes...
>
> Since it appears to be happening to other processes too I would hazard
> that this is not a xenstored issue.
>
> Ian.
>

Good catch Ian!

Strange corruption. Probably not related to xenstored, as you
suggested. I would be curious to see what's before the tdb pointer and
where the corruption starts. I also don't understand where the
"fd = 47" from a previous mail came from: 0x1f is 31, not 47 (which is
0x2f).

I would not be surprised about a strange bug in libc or the kernel.

Frediano



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Ian Campbell
On Tue, 2014-12-16 at 10:45 +, Ian Campbell wrote:
> On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> > > I notice in your bugzilla (for a different occurrence, I think):
> > >> [2090451.721705] univention-conf[2512]: segfault at ff ip 
> > >> 0045e238 sp 768dfa30 error 6 in python2.6[40+21e000]
> > > 
> > > Which appears to have faulted access 0xff0 too. It looks like
> > > this process is a python thing, it's nothing to do with xenstored I
> > > assume?
> > 
> > Yes, that's one univention-config, which is completely independent of
> > xen(stored).
> > 
> > > It seems rather coincidental that it should be accessing the 
> > > same sort of address and be faulting.
> > 
> > Yes, good catch. I'll have another look at those core dumps.
> 
> With this in mind, please can you confirm what model of machines you've
> seen this on, and in particular whether they are all the same class of
> machine or whether they are significantly different.
> 
> The reason being that randomly placed 0xff values in a field of 0x00
> could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
> memory pages.

Thanks for giving me access to the core files. This is very suspicious:
(gdb) frame 2
#2  0x0040a348 in tdb_open_ex (name=0x1941fb0 
"/var/lib/xenstored/tdb.0x1935bb0", hash_size=, 
tdb_flags=0, open_flags=, mode=, 
log_fn=0x4093b0 , hash_fn=) at tdb.c:1958
1958SAFE_FREE(tdb->locked);

(gdb) x/96x tdb
0x1921270:  0x  0x  0x  0x
0x1921280:  0x001f  0x00ff  0xff00  0x00ff
0x1921290:  0x  0x00ff  0xff00  0x00ff
0x19212a0:  0x  0x00ff  0xff00  0x00ff
0x19212b0:  0x  0x00ff  0xff00  0x00ff
0x19212c0:  0x  0x00ff  0xff00  0x00ff
0x19212d0:  0x  0x00ff  0xff00  0x00ff
0x19212e0:  0x  0x00ff  0xff00  0x00ff
0x19212f0:  0x  0x00ff  0xff00  0x00ff
0x1921300:  0x  0x00ff  0xff00  0x00ff
0x1921310:  0x  0x00ff  0xff00  0x00ff
0x1921320:  0x  0x00ff  0xff00  0x00ff
0x1921330:  0x  0x00ff  0xff00  0x00ff
0x1921340:  0x  0x  0xff00  0x00ff
0x1921350:  0x  0x00ff  0xff00  0x00ff
0x1921360:  0x  0x00ff  0xff00  0x00ff
0x1921370:  0x004093b0  0x  0x004092f0  0x
0x1921380:  0x0002  0x  0x0091  0x
0x1921390:  0x0193de70  0x  0x01963600  0x
0x19213a0:  0x  0x  0x0193fbb0  0x
0x19213b0:  0x  0x  0x  0x
0x19213c0:  0x00405870  0x  0x0040e3e0  0x
0x19213d0:  0x0038  0x  0xe814ec70  0x6f2f6567
0x19213e0:  0x01963650  0x  0x0193dec0  0x

Something has clearly done a number on the ram of this process.
0x1921270 through 0x192136f is 256 bytes...

Since it appears to be happening to other processes too I would hazard
that this is not a xenstored issue.

Ian.




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Ian Campbell
On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> > I notice in your bugzilla (for a different occurrence, I think):
> >> [2090451.721705] univention-conf[2512]: segfault at ff ip 
> >> 0045e238 sp 768dfa30 error 6 in python2.6[40+21e000]
> > 
> > Which appears to have faulted access 0xff0 too. It looks like
> > this process is a python thing, it's nothing to do with xenstored I
> > assume?
> 
> Yes, that's one univention-config, which is completely independent of
> xen(stored).
> 
> > It seems rather coincidental that it should be accessing the 
> > same sort of address and be faulting.
> 
> Yes, good catch. I'll have another look at those core dumps.

With this in mind, please can you confirm what model of machines you've
seen this on, and in particular whether they are all the same class of
machine or whether they are significantly different.

The reason being that randomly placed 0xff values in a field of 0x00
could possibly indicate hardware (e.g. a GPU) DMAing over the wrong
memory pages.

Ian.




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Ian Campbell
On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> (gdb) print *tdb
> $1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
> 16711680,
>   locked = 0xff00, ecode = 16711680, header = {
> magic_food =
> "\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377",
> version = 0, hash_size = 0,

tdb->fd has been initialised, but version and hash_size have not yet.
This means we must have failed somewhere between the open() and the call
to tdb_new_database() (the second one, the first one is only if
TDB_INTERNAL, which is not the case here).

There are three interesting actions in that space.

The first is tdb_brlock, which could have gone wrong.

The second is ftruncate(). This is not a candidate because tdb->flags
doesn't have TDB_CLEAR_IF_FIRST (the actual test is on tdb_flags, which
is changed by the time of the stack trace, but it is stored in
tdb->flags where we can see it. tdb_flags isn't changed before the
check, so barring compiler problems I think we can rule that out).

The third is the read of the header itself. The fact that
tdb->header.magic_food isn't either all zeroes or the requisite magic
string "TDB file\n" is suspicious. Instead it is mostly zero with the
off 0xff in it. An interesting pattern of 0xff..00..00, may be a
coincidence, or not.

Ian.





Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-16 Thread Ian Campbell
On Mon, 2014-12-15 at 23:29 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 15.12.2014 18:45, Ian Campbell wrote:
> > On Mon, 2014-12-15 at 14:50 +, Ian Campbell wrote:
> >> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> >>> I just noticed something strange:
> >>>
>  #3  0x0040a684 in tdb_open (name=0xff   0xff out of bounds>, hash_size=0,
>  tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> ...
> > I'm reasonably convinced now that this is just a weird artefact of
> > running gdb on an optimised binary, probably a shortcoming in the debug
> > info leading to gdb getting confused.
> > 
> > Unfortunately this also calls into doubt the parameter to talloc_free,
> > perhaps in that context 0xff000 is a similar artefact.
> > 
> > Please can you print the entire contents of tdb in the second frame
> > ("print *tdb" ought to do it). I'm curious whether it is all sane or
> > not.
> 
> (gdb) print *tdb
> $1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
> 16711680,
>   locked = 0xff00,

So it really does seem to be 0xff00 in memory.

> flags = 0,
> travlocks = {
> next = 0xff, off = 0, hash = 65280}, next = 0xff,
>   device = 280375465082880, inode = 16711680, log_fn = 0x4093b0
> ,
>   hash_fn = 0x4092f0 , open_flags = 2}

And here we can see tdb->{flags,open_flags} == 0 and 2, contrary to what
the stack trace says we were called with, which was nonsense. Since 0
and 2 are sensible and correspond to what the caller passes I think the
stack trace is just confused.

> (gdb) info registers
> rax            0x0      0
> rbx            0x16bff70        23854960
> rcx            0x       -1
> rdx            0x40ecd0 4254928
> rsi            0x0      0
> rdi            0xff00   280375465082880

And here it is in the registers.

> rbp            0x7fcaed6c96a8   0x7fcaed6c96a8
> rsp            0x7fff9dc86330   0x7fff9dc86330
> r8             0x7fcaece54c08   140509534571528
> r9             0xff00   -72057594037927936
> r10            0x7fcaed08c14c   140509536895308
> r11            0x246    582
> r12            0xd      13
> r13            0xff00   280375465082880

And again.

> r14            0x4093b0 4232112
> r15            0x167d620        23582240
> rip            0x4075c4 0x4075c4 

This must be the faulting address.

> eflags         0x10206  [ PF IF RF ]
> cs             0x33     51
> ss             0x2b     43
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> fctrl          0x0      0
> fstat          0x0      0
> ftag           0x0      0
> fiseg          0x0      0
> fioff          0x0      0
> foseg          0x0      0
> fooff          0x0      0
> fop            0x0      0
> mxcsr          0x0      [ ]
> 
> (gdb) disassemble
> Dump of assembler code for function talloc_chunk_from_ptr:
> 0x004075c0 :   sub    $0x8,%rsp
> 0x004075c4 :   mov    -0x8(%rdi),%edx

This is the line corresponding to %rip above which is doing a read via %
rdi, which is 0xff00.

It's reading tc->flags. It's been optimised, tc = pp - SIZE, so it is
loading *(pp-SIZE+offsetof(flags)), which is pp-8 (flags is the last
field in the struct).

So rdi contains pp which == the ptr given as an argument to the
function, so ptr was bogus.

So it seems we really do have tdb->locked containing 0xff00.

This is only allocated in one place which is:
tdb->locked = talloc_zero_array(tdb, struct tdb_lock_type,
tdb->header.hash_size+1);
midway through tdb_open_ex. It might be worth inserting a check+log for
this returning  0xff, 0xff00, 0xff ... 0xff00 etc.
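
Something like this, perhaps (a sketch; the 0xff.. test is deliberately
crude, and TDB_LOG() is assumed to be the logging macro used elsewhere in
tdb.c):

	tdb->locked = talloc_zero_array(tdb, struct tdb_lock_type,
					tdb->header.hash_size+1);
	/* Crude sanity check: a non-NULL pointer whose low 24 bits are all
	 * zero (0xff000000, 0xff00000000, ...) is almost certainly bogus. */
	if (tdb->locked != NULL &&
	    ((unsigned long)tdb->locked & 0xffffffUL) == 0) {
		TDB_LOG((tdb, 0, "tdb_open_ex: suspicious locked pointer %p "
			 "(hash_size %u)\n", (void *)tdb->locked,
			 tdb->header.hash_size));
	}
	if (!tdb->locked)
		goto fail;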

> 0x004075c7 :   lea    -0x50(%rdi),%rax

This is actually calculating tc, ready for return upon success.

> 0x004075cb :   mov    %edx,%ecx
> 0x004075cd :   and    $0xfff0,%ecx
> 0x004075d0 :   cmp    $0xe814ec70,%ecx
> 0x004075d6 :   jne    0x4075e2 
> 

(tc->flags & ~0xF) != TALLOC_MAGIC

> 0x004075d8 :   and    $0x1,%edx
> 0x004075db :   jne    0x4075e2 
> 

tc->flags & TALLOC_FLAG_FREE

> 0x004075dd :   add    $0x8,%rsp
> 0x004075e1 :  retq

Success, return.

> 0x004075e2 :  nopw   0x0(%rax,%rax,1)
> 0x004075e8 :  callq  0x401b98 

The two TALLOC_ABORTS both end up here if the checks above fail.
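
In C the function is therefore roughly as follows (reconstructed from the
disassembly; names and abort messages are approximations, not a quote of
the bundled talloc.c, and struct talloc_chunk is the header layout
discussed elsewhere in this thread):

#define TALLOC_MAGIC      0xe814ec70
#define TALLOC_FLAG_FREE  0x01
#define TC_HDR_SIZE       0x50                /* the -0x50 in the lea above */
#define TALLOC_ABORT(msg) abort()             /* simplification */

static struct talloc_chunk *talloc_chunk_from_ptr(const void *ptr)
{
	const char *pp = (const char *)ptr;
	struct talloc_chunk *tc = (struct talloc_chunk *)(pp - TC_HDR_SIZE);

	if ((tc->flags & ~0xF) != TALLOC_MAGIC)   /* the cmp $0xe814ec70 */
		TALLOC_ABORT("Bad talloc magic value - unknown value");
	if (tc->flags & TALLOC_FLAG_FREE)         /* the and $0x1,%edx   */
		TALLOC_ABORT("Bad talloc magic value - double free");
	return tc;
}

With a bogus ptr the very first load of tc->flags, i.e. the
"mov -0x8(%rdi),%edx" at the crash %rip, is what faults.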

> > Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
> > This ought to be the actual faulting address, which ought to give a hint
> > on how much we can trust the parameters in the stack trace.
> 
> Hmm, my gdb refused to access $_siginfo:
> (gdb) show convenience
> $_siginfo = Unable to read siginfo

That's ok, I think I've convinced myself above what the crash is.

Ian.



Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-15 Thread Philipp Hahn
Hello Ian,

On 15.12.2014 18:45, Ian Campbell wrote:
> On Mon, 2014-12-15 at 14:50 +, Ian Campbell wrote:
>> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
>>> I just noticed something strange:
>>>
 #3  0x0040a684 in tdb_open (name=0xff >>> 0xff out of bounds>, hash_size=0,
 tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
...
> I'm reasonably convinced now that this is just a weird artefact of
> running gdb on an optimised binary, probably a shortcoming in the debug
> info leading to gdb getting confused.
> 
> Unfortunately this also calls into doubt the parameter to talloc_free,
> perhaps in that context 0xff000 is a similar artefact.
> 
> Please can you print the entire contents of tdb in the second frame
> ("print *tdb" ought to do it). I'm curious whether it is all sane or
> not.

(gdb) print *tdb
$1 = {name = 0x0, map_ptr = 0x0, fd = 47, map_size = 65280, read_only =
16711680,
  locked = 0xff00, ecode = 16711680, header = {
magic_food =
"\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377\000\000\000\000\000\000\000\000\000\000\377\000\000\000\000\377",
version = 0, hash_size = 0,
rwlocks = 65280, reserved = {16711680, 0, 0, 65280, 16711680, 0, 0,
65280,
  16711680, 0, 0, 65280, 16711680, 0, 0, 65280, 16711680, 0, 0,
65280, 16711680,
  0, 0, 65280, 16711680, 0, 0, 65280, 16711680, 0, 0}}, flags = 0,
travlocks = {
next = 0xff, off = 0, hash = 65280}, next = 0xff,
  device = 280375465082880, inode = 16711680, log_fn = 0x4093b0
,
  hash_fn = 0x4092f0 , open_flags = 2}

> Please can you also print "info regs" at the point of the segv (in frame
> 0) as well as "disas" at that point.

(gdb) info registers
rax            0x0      0
rbx            0x16bff70        23854960
rcx            0x       -1
rdx            0x40ecd0 4254928
rsi            0x0      0
rdi            0xff00   280375465082880
rbp            0x7fcaed6c96a8   0x7fcaed6c96a8
rsp            0x7fff9dc86330   0x7fff9dc86330
r8             0x7fcaece54c08   140509534571528
r9             0xff00   -72057594037927936
r10            0x7fcaed08c14c   140509536895308
r11            0x246    582
r12            0xd      13
r13            0xff00   280375465082880
r14            0x4093b0 4232112
r15            0x167d620        23582240
rip            0x4075c4 0x4075c4 
eflags         0x10206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
fctrl          0x0      0
fstat          0x0      0
ftag           0x0      0
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x0      [ ]

(gdb) disassemble
Dump of assembler code for function talloc_chunk_from_ptr:
0x004075c0 :   sub    $0x8,%rsp
0x004075c4 :   mov    -0x8(%rdi),%edx
0x004075c7 :   lea    -0x50(%rdi),%rax
0x004075cb :   mov    %edx,%ecx
0x004075cd :   and    $0xfff0,%ecx
0x004075d0 :   cmp    $0xe814ec70,%ecx
0x004075d6 :   jne    0x4075e2 
0x004075d8 :   and    $0x1,%edx
0x004075db :   jne    0x4075e2 
0x004075dd :   add    $0x8,%rsp
0x004075e1 :   retq
0x004075e2 :   nopw   0x0(%rax,%rax,1)
0x004075e8 :   callq  0x401b98 

> Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
> This ought to be the actual faulting address, which ought to give a hint
> on how much we can trust the parameters in the stack trace.

Hmm, my gdb refused to access $_siginfo:
(gdb) show convenience
$_siginfo = Unable to read siginfo

> Since I'm asking for the world I may as well ask you to dump the raw
> stack too "x/64x $sp" ought to be a good starting point.

(gdb) x/64x $sp
0x7fff9dc86330: 0xed6c96a8  0x7fca  0x00407edf  0x
0x7fff9dc86340: 0x  0x  0x016bff70  0x
0x7fff9dc86350: 0xed6c96a8  0x7fca  0x000d  0x
0x7fff9dc86360: 0x  0x  0x004093b0  0x
0x7fff9dc86370: 0x0167d620  0x  0x0040a348  0x
0x7fff9dc86380: 0x  0x  0x  0x
0x7fff9dc86390: 0x  0x  0x  0x
0x7fff9dc863a0: 0x0011  0x  0x411d4816  0x
0x7fff9dc863b0: 0x0001  0x  0x81a0  0x
0x7fff9dc863c0: 0x  0x  0x  0x
0x7fff9dc863d0: 0x00096000  0x  0x1000  0x
0x7fff9dc863e0: 0x04b0  0x  0x5438ba01  0x
0x7fff9dc863f0: 0x07fd332e  0x  0x5438ba01  0x
0x7fff9dc86400: 0x07fd332e  0x  0x5438ba01  0x
0x7fff9dc86410: 0x07fd332

Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-15 Thread Ian Campbell
On Mon, 2014-12-15 at 14:50 +, Ian Campbell wrote:
> On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> > I just noticed something strange:
> > 
> > > #3  0x0040a684 in tdb_open (name=0xff  > > 0xff out of bounds>, hash_size=0,
> > > tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> > > #4  0x0040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> > > "/var/lib/xenstored/tdb.0x1935bb0")
> > 
> > Why does gdb-7.0.1 print "name=0xff00" here for frame 3, but for
> > frame 2 and 4 the pointers are correct again?
> > Verifying the values with an explicit "print" shows them as correct.
> 
> I had just noticed that and was wondering about that same thing. I'm
> starting to worry that 0xff might just be a gdb thing, similar
> to , but infinitely more misleading.

I'm reasonably convinced now that this is just a weird artefact of
running gdb on an optimised binary, probably a shortcoming in the debug
info leading to gdb getting confused.

Unfortunately this also calls into doubt the parameter to talloc_free,
perhaps in that context 0xff000 is a similar artefact.

Please can you print the entire contents of tdb in the second frame
("print *tdb" ought to do it). I'm curious whether it is all sane or
not.

Please can you also print "info regs" at the point of the segv (in frame
0) as well as "disas" at that point.

Can you also "p $_siginfo._sifields._sigfault.si_addr" (in frame 0).
This ought to be the actual faulting address, which ought to give a hint
on how much we can trust the parameters in the stack trace.

Since I'm asking for the world I may as well ask you to dump the raw
stack too "x/64x $sp" ought to be a good starting point.

I notice in your bugzilla (for a different occurrence, I think):
> [2090451.721705] univention-conf[2512]: segfault at ff ip 
> 0045e238 sp 768dfa30 error 6 in python2.6[40+21e000]

Which appears to have faulted access 0xff0 too. It looks like
this process is a python thing, it's nothing to do with xenstored I
assume? It seems rather coincidental that it should be accessing the 
same sort of address and be faulting.

Ian.




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-15 Thread Ian Campbell
On Mon, 2014-12-15 at 15:19 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 15.12.2014 14:17, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:58 +, Ian Campbell wrote:
> >>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> >>> On 12.12.2014 17:56, Ian Campbell wrote:
>  On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> > On 12.12.2014 17:32, Ian Campbell wrote:
> >> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> ...
> >>> The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
> >>>
> >>> (gdb) bt full
> >>> #0  talloc_chunk_from_ptr (ptr=0xff) at talloc.c:116
> >>> tc = 
> >>> #1  0x00407edf in talloc_free (ptr=0xff) at talloc.c:551
> >>> tc = 
> >>> #2  0x0040a348 in tdb_open_ex (name=0x1941fb0
> >>> "/var/lib/xenstored/tdb.0x1935bb0",
> 
> I just noticed something strange:
> 
> > #3  0x0040a684 in tdb_open (name=0xff  > 0xff out of bounds>, hash_size=0,
> > tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> > #4  0x0040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> > "/var/lib/xenstored/tdb.0x1935bb0")
> 
> Why does gdb-7.0.1 print "name=0xff00" here for frame 3, but for
> frame 2 and 4 the pointers are correct again?
> Verifying the values with an explicit "print" shows them as correct.

I had just noticed that and was wondering about that same thing. I'm
starting to worry that 0xff might just be a gdb thing, similar
to , but infinitely more misleading.

I've also noticed in
https://forge.univention.org/bugzilla/show_bug.cgi?id=35104 that the
constant can be either 0xff00, 0xff or 0xff00 (6, 8
or 10 zeroes).

> >>> hash_size=, tdb_flags=0, open_flags= >>> optimized out>, mode=,
> >>> log_fn=0x4093b0 , hash_fn=) at
> >>> tdb.c:1958
> > 
> > Please can you confirm what is at line 1958 of your copy of tdb.c. I
> > think it will be tdb->locked, but I'd like to be sure.
> 
> Yes, that's the line:
> # sed -ne 1958p tdb.c
> SAFE_FREE(tdb->locked);

Good, thanks.

> > You are running a 64-bit dom0, correct?
> 
> yes: x86_64

Thanks for confirming. I'm resurrecting the 64-bit root partition on my
test box (which it turns out was still Debian Squeeze!)

> 
> > I've only just noticed that
> > 0xff is >32bits. My testing so far was 32-bit, I don't think it
> > should matter wrt use of uninitialised data etc.
> > 
> > I can't help feeling that 0xff must be some sort of magic
> > sentinel value to someone. I can't figure out what though.
> 
> 0xff is too much for bit-flip errors, and two crashes on different
> machines in the same location very much rule out any HW error for me.
> 
> My 2nd idea was that someone decremented 0 one too many, but then that
> would have to be an 8 bit value - reading the code I didn't see anything
> like that.

I was wondering if it was an overflow or sign-extension thing, but it
doesn't seem likely, not enough high bits set for one thing.

> One more thing we noticed: /var/lib/xenstored/ contained the tdb file
> and two bit-identical copies after the crash, so I would read that as two
> transactions being in progress at the time of the crash. Might be that
> this is important.

It's certainly worth noting, thanks.

Ian.




Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-15 Thread Philipp Hahn
Hello Ian,

On 15.12.2014 14:17, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:58 +, Ian Campbell wrote:
>>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
>>> On 12.12.2014 17:56, Ian Campbell wrote:
 On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> On 12.12.2014 17:32, Ian Campbell wrote:
>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
...
>>> The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
>>>
>>> (gdb) bt full
>>> #0  talloc_chunk_from_ptr (ptr=0xff) at talloc.c:116
>>> tc = 
>>> #1  0x00407edf in talloc_free (ptr=0xff) at talloc.c:551
>>> tc = 
>>> #2  0x0040a348 in tdb_open_ex (name=0x1941fb0
>>> "/var/lib/xenstored/tdb.0x1935bb0",

I just noticed something strange:

> #3  0x0040a684 in tdb_open (name=0xff  0xff out of bounds>, hash_size=0,
> tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> #4  0x0040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> "/var/lib/xenstored/tdb.0x1935bb0")

Why does gdb-7.0.1 print "name=0xff00" here for frame 3, but for
frame 2 and 4 the pointers are correct again?
Verifying the values with an explicit "print" shows them as correct.

>> I've timed out for tonight will try and have another look next week.
> 
> I've had another dig, and have instrumented all of the error paths from
> this function and I can't see any way for an invalid pointer to be
> produced, let alone freed. I've been running under valgrind which should
> have caught any uninitialised memory type errors.

Thank you for testing that.

>>> hash_size=, tdb_flags=0, open_flags=>> optimized out>, mode=,
>>> log_fn=0x4093b0 , hash_fn=) at
>>> tdb.c:1958
> 
> Please can you confirm what is at line 1958 of your copy of tdb.c. I
> think it will be tdb->locked, but I'd like to be sure.

Yes, that's the line:
# sed -ne 1958p tdb.c
SAFE_FREE(tdb->locked);
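
For reference, in the tdb.c bundled with xenstored SAFE_FREE is (as far as I
can see, so treat the exact spelling as an assumption) a NULL-checking wrapper
around talloc_free(), which is why frame #1 is talloc_free():

/* Hedged reconstruction of the macro in xenstored's bundled tdb.c; check
 * the local copy, the exact definition may differ slightly. */
#define SAFE_FREE(x) do { if ((x) != NULL) { talloc_free(x); (x) = NULL; } } while (0)

/* so tdb.c:1958 effectively expands to: */
if (tdb->locked != NULL) {
        talloc_free(tdb->locked);   /* frame #1, reached with ptr=0xff... here */
        tdb->locked = NULL;
}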

> You are running a 64-bit dom0, correct?

yes: x86_64

> I've only just noticed that
> 0xff is >32bits. My testing so far was 32-bit, I don't think it
> should matter wrt use of uninitialised data etc.
> 
> I can't help feeling that 0xff must be some sort of magic
> sentinel value to someone. I can't figure out what though.

0xff is too much for bit-flip errors. And also two crashes on different
machines in the same location very much rule out any HW error for me.

My 2nd idea was that someone decremented 0 one time too many, but then that
would have to be an 8-bit value - reading the code I didn't see anything
like that.

> Have you observed the xenstored processes growing especially large
> before this happens? I'm wondering if there might be a leak somewhere
> which after a time is resulting a 

I have no monitoring of the memory usage for the crashed systems, but
the core files look reasonably sane.
Looking at the test system running
/usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py, the memory
usage has stayed constant since last Friday.

> I'm about to send out a patch which plumbs tdb's logging into
> xenstored's logging, in the hopes that next time you see this it might
> say something as it dies.

Thank you for the patch: I'll try to incorporate it and will continue
trying to reproduce the crash.


One more thing we noticed: /var/lib/xenstored/ contained the tdb file
and two bit-identical copies after the crash, so I would read that as two
transactions being in progress at the time of the crash. Might be that
this is important.
But /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py seems to
create more transactions in parallel, and my test system has so far
survived this since Friday.

Sincerely
Philipp Hahn

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-15 Thread Ian Campbell
On Fri, 2014-12-12 at 17:58 +, Ian Campbell wrote:
> (adding Ian J who knows a bit more about C xenstored than me...)
> 
>  On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> > Hello Ian,
> > 
> > On 12.12.2014 17:56, Ian Campbell wrote:
> > > On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> > >> On 12.12.2014 17:32, Ian Campbell wrote:
> > >>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> >  We did enable tracing and now have the xenstored-trace.log of one 
> >  crash:
> >  It contains 1.6 billion lines and is 83 GiB.
> >  It just shows xenstored to crash on TRANSACTION_START.
> > 
> >  Is there some tool to feed that trace back into a newly launched 
> >  xenstored?
> > >>>
> > >>> Not that I know of I'm afraid.
> > >>
> > >> Okay, then I have to continue with my own tool.
> > > 
> > > If you do end up developing a tool to replay a xenstore trace then I
> > > think that'd be something great to have in tree!
> > 
> > I just need to figure out how to talk to xenstored on the wire: for some
> > strange reason xenstored is closing the connection to the UNIX socket on
> > the first write inside a transaction.
> > Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
> > 
> > >>> Do you get a core dump when this happens? You might need to fiddle with
> > >>> ulimits (some distros disable by default). IIRC there is also some /proc
> > >>> nob which controls where core dumps go on the filesystem.
> > >>
> > >> Not for that specific trace: We first enabled generating core files, but
> > >> only then discovered that this is not enough.
> > > 
> > > How wasn't it enough? You mean you couldn't use gdb to extract a
> > > backtrace from the core file? Or was something else wrong?
> > 
> > The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
> > 
> > (gdb) bt full
> > #0  talloc_chunk_from_ptr (ptr=0xff) at talloc.c:116
> > tc = <value optimized out>
> > #1  0x00407edf in talloc_free (ptr=0xff) at talloc.c:551
> > tc = <value optimized out>
> > #2  0x0040a348 in tdb_open_ex (name=0x1941fb0
> > "/var/lib/xenstored/tdb.0x1935bb0",
> 
> I've timed out for tonight will try and have another look next week.

I've had another dig, and have instrumented all of the error paths from
this function and I can't see any way for an invalid pointer to be
produced, let alone freed. I've been running under valgrind which should
have caught any uninitialised memory type errors.

> > hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
> > optimized out>, mode=<value optimized out>,
> > log_fn=0x4093b0 , hash_fn=<value optimized out>) at
> > tdb.c:1958

Please can you confirm what is at line 1958 of your copy of tdb.c. I
think it will be tdb->locked, but I'd like to be sure.

You are running a 64-bit dom0, correct? I've only just noticed that
0xff is >32bits. My testing so far was 32-bit, I don't think it
should matter wrt use of uninitialised data etc.

I can't help feeling that 0xff must be some sort of magic
sentinel value to someone. I can't figure out what though.

Have you observed the xenstored processes growing especially large
before this happens? I'm wondering if there might be a leak somewhere
which after a time is resulting a 

I'm about to send out a patch which plumbs tdb's logging into
xenstored's logging, in the hopes that next time you see this it might
say something as it dies.
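
For anyone wanting to do the same by hand before that patch lands, the shape
of the callback is roughly the following; the exact tdb_log_func signature
should be taken from the bundled tdb.h, this is only a sketch under the
assumption that it is (tdb, level, fmt, ...):

#include <stdarg.h>
#include <stdio.h>
#include "tdb.h"   /* for TDB_CONTEXT; adjust the include to the local tree */

/* Sketch of a log callback to pass as the log_fn argument of tdb_open_ex();
 * in the real patch this would forward into xenstored's own logging rather
 * than stderr. */
static void tdb_log_to_stderr(TDB_CONTEXT *tdb, int level, const char *fmt, ...)
{
        va_list ap;

        fprintf(stderr, "tdb(%p) level %d: ", (void *)tdb, level);
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
}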

Ian.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-12 Thread Ian Campbell
(adding Ian J who knows a bit more about C xenstored than me...)

 On Fri, 2014-12-12 at 18:20 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 12.12.2014 17:56, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> >> On 12.12.2014 17:32, Ian Campbell wrote:
> >>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
>  We did enable tracing and now have the xenstored-trace.log of one crash:
>  It contains 1.6 billion lines and is 83 GiB.
>  It just shows xenstored to crash on TRANSACTION_START.
> 
>  Is there some tool to feed that trace back into a newly launched 
>  xenstored?
> >>>
> >>> Not that I know of I'm afraid.
> >>
> >> Okay, then I have to continue with my own tool.
> > 
> > If you do end up developing a tool to replay a xenstore trace then I
> > think that'd be something great to have in tree!
> 
> I just need to figure out how to talk to xenstored on the wire: for some
> strange reason xenstored is closing the connection to the UNIX socket on
> the first write inside a transaction.
> Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
> 
> >>> Do you get a core dump when this happens? You might need to fiddle with
> >>> ulimits (some distros disable by default). IIRC there is also some /proc
> >>> nob which controls where core dumps go on the filesystem.
> >>
> >> Not for that specific trace: We first enabled generating core files, but
> >> only then discovered that this is not enough.
> > 
> > How wasn't it enough? You mean you couldn't use gdb to extract a
> > backtrace from the core file? Or was something else wrong?
> 
> The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.
> 
> (gdb) bt full
> #0  talloc_chunk_from_ptr (ptr=0xff) at talloc.c:116
> tc = <value optimized out>
> #1  0x00407edf in talloc_free (ptr=0xff) at talloc.c:551
> tc = <value optimized out>
> #2  0x0040a348 in tdb_open_ex (name=0x1941fb0
> "/var/lib/xenstored/tdb.0x1935bb0",

This is interesting actually. There are only a small number of calls to
talloc_free in tdb_open_ex (all wrapped in "SAFE_FREE") and they are all
in the fail: error exit path. So I think the reason the crash is
rare is that you have to hit some other failure first.

About half of the "goto fail" statements are preceded by a TDB_LOG
statement. But given the presence of logfn= in the trace
that doesn't seem likely to be helpful right now.

It might be worth splurging some debug of your own before each of those
failure points and/or wiring up the tdb log function to xenstored's
logging.

The calls to SAFE_FREE are
SAFE_FREE(tdb->map_ptr);
SAFE_FREE(tdb->name);
SAFE_FREE(tdb->locked);
SAFE_FREE(tdb);

I think those should all have been allocated by the time we get to fail
though, so not sure where 0xff00 in the trace comes from.
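
One low-tech way of doing the "splurge some debug" suggestion above is to
replace every "goto fail;" in tdb_open_ex() with a macro that records where
it was hit first. A sketch only, the macro name is made up:

/* Hypothetical helper, to be pasted near the top of tdb.c; then change each
 * "goto fail;" in tdb_open_ex() into "GOTO_FAIL;" so the log shows which of
 * the error paths was taken before the SAFE_FREE() calls run. */
#include <errno.h>
#include <stdio.h>

#define GOTO_FAIL do {                                                   \
        fprintf(stderr, "tdb_open_ex: failing at %s:%d (errno=%d)\n",    \
                __FILE__, __LINE__, errno);                              \
        goto fail;                                                       \
} while (0)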

I've timed out for tonight will try and have another look next week.

Ian.

> hash_size=<value optimized out>, tdb_flags=0, open_flags=<value
> optimized out>, mode=<value optimized out>,
> log_fn=0x4093b0 , hash_fn=<value optimized out>) at
> tdb.c:1958
> tdb = 0x1921270
> st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode =
> 33184, st_uid = 0, st_gid = 0, __pad0 = 0,
>   st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks =
> 592, st_atim = {tv_sec = 1415748063,
> tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec
> = 87562634}, st_ctim = {
> tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}}
> rev = <value optimized out>
> locked = 4232112
> vp = <value optimized out>
> #3  0x0040a684 in tdb_open (name=0xff <Address 0xff out of bounds>, hash_size=0,
> tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
> No locals.
> #4  0x0040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
> "/var/lib/xenstored/tdb.0x1935bb0")
> at tdb.c:2124
> fd = <value optimized out>
> saved_errno = <value optimized out>
> copy = 0x0
> #5  0x00406c2d in do_transaction_start (conn=0x1939550,
> in=<value optimized out>)
> at xenstored_transaction.c:164
> trans = 0x1935bb0
> exists = <value optimized out>
> id_str =
> "\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001"
> #6  0x004045ca in process_message (conn=0x1939550) at
> xenstored_core.c:1214
> trans = <value optimized out>
> #7  consider_message (conn=0x1939550) at xenstored_core.c:1261
> No locals.
> #8  handle_input (conn=0x1939550) at xenstored_core.c:1308
> bytes = <value optimized out>
> in = <value optimized out>
> #9  0x00405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964
> 
> A 3rd trace is somewhere completely different:
> (gdb) bt
> #0  0x7fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010,
> format=<value optimized out>, ap=0x7fff46ac3170)
> at vfprintf.c:1617
> #1  0x7fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "",
> maxlen=<value optimized out>,
> format=0x40d4a4 "%.*s", args=0x7fff46ac3170) at vsnprintf.c:120
> #2  0x0040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4
> "%.*s", ap=0x7fff46ac31d0) at talloc.c:1104

Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-12 Thread Philipp Hahn
Hello Ian,

On 12.12.2014 17:56, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
>> On 12.12.2014 17:32, Ian Campbell wrote:
>>> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
 We did enable tracing and now have the xenstored-trace.log of one crash:
 It contains 1.6 billion lines and is 83 GiB.
 It just shows xenstored to crash on TRANSACTION_START.

 Is there some tool to feed that trace back into a newly launched xenstored?
>>>
>>> Not that I know of I'm afraid.
>>
>> Okay, then I have to continue with my own tool.
> 
> If you do end up developing a tool to replay a xenstore trace then I
> think that'd be something great to have in tree!

I just need to figure out how to talk to xenstored on the wire: for some
strange reason xenstored is closing the connection to the UNIX socket on
the first write inside a transaction.
Or switch to /usr/share/pyshared/xen/xend/xenstore/xstransact.py...
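
In case it helps with the replay tool: the wire format is just four
little-endian uint32s (type, req_id, tx_id, len) followed by len payload
bytes, per xen's public io/xs_wire.h. Below is a rough sketch of starting a
transaction and doing one write inside it over the unix socket; the socket
path, the message type numbers and the payload conventions are assumptions
to re-check against the local headers, and there is no error handling:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Header layout as in xen's public io/xs_wire.h: four little-endian uint32s. */
struct xsd_sockmsg { uint32_t type, req_id, tx_id, len; };

static int xs_send(int fd, uint32_t type, uint32_t tx,
                   const void *payload, uint32_t len)
{
        struct xsd_sockmsg h = { .type = type, .req_id = 1, .tx_id = tx, .len = len };

        if (write(fd, &h, sizeof(h)) != sizeof(h))
                return -1;
        return write(fd, payload, len) == (ssize_t)len ? 0 : -1;
}

int main(void)
{
        /* XS_TRANSACTION_START = 6, XS_WRITE = 11 in xs_wire.h (worth re-checking). */
        static const char req[] = "/local/domain/0/test\0hello";  /* path\0value */
        struct sockaddr_un sun = { .sun_family = AF_UNIX };
        struct xsd_sockmsg rh;
        char reply[4096];
        uint32_t tx;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        strcpy(sun.sun_path, "/var/run/xenstored/socket");        /* assumed path */
        if (fd < 0 || connect(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0)
                return 1;

        /* Start a transaction: the payload is a single empty string. */
        xs_send(fd, 6, 0, "", 1);
        read(fd, &rh, sizeof(rh));          /* reply header; no XS_ERROR handling */
        if (rh.len >= sizeof(reply))
                return 1;
        read(fd, reply, rh.len);
        reply[rh.len] = '\0';
        printf("transaction id: %s\n", reply);
        tx = (uint32_t)strtoul(reply, NULL, 10);

        /* One write inside that transaction; -1 drops the NUL after the value. */
        xs_send(fd, 11, tx, req, sizeof(req) - 1);
        read(fd, &rh, sizeof(rh));
        if (rh.len < sizeof(reply))
                read(fd, reply, rh.len);
        close(fd);
        return 0;
}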

>>> Do you get a core dump when this happens? You might need to fiddle with
>>> ulimits (some distros disable by default). IIRC there is also some /proc
>>> nob which controls where core dumps go on the filesystem.
>>
>> Not for that specific trace: We first enabled generating core files, but
>> only then discovered that this is not enough.
> 
> How wasn't it enough? You mean you couldn't use gdb to extract a
> backtrace from the core file? Or was something else wrong?

The 1st and 2nd trace look like this: ptr in frame #2 looks very bogus.

(gdb) bt full
#0  talloc_chunk_from_ptr (ptr=0xff) at talloc.c:116
tc = <value optimized out>
#1  0x00407edf in talloc_free (ptr=0xff) at talloc.c:551
tc = <value optimized out>
#2  0x0040a348 in tdb_open_ex (name=0x1941fb0
"/var/lib/xenstored/tdb.0x1935bb0",
hash_size=<value optimized out>, tdb_flags=0, open_flags=<value optimized out>, mode=<value optimized out>,
log_fn=0x4093b0 , hash_fn=<value optimized out>) at
tdb.c:1958
tdb = 0x1921270
st = {st_dev = 17, st_ino = 816913342, st_nlink = 1, st_mode =
33184, st_uid = 0, st_gid = 0, __pad0 = 0,
  st_rdev = 0, st_size = 303104, st_blksize = 4096, st_blocks =
592, st_atim = {tv_sec = 1415748063,
tv_nsec = 87562634}, st_mtim = {tv_sec = 1415748063, tv_nsec
= 87562634}, st_ctim = {
tv_sec = 1415748063, tv_nsec = 87562634}, __unused = {0, 0, 0}}
rev = <value optimized out>
locked = 4232112
vp = <value optimized out>
#3  0x0040a684 in tdb_open (name=0xff <Address 0xff out of bounds>, hash_size=0,
tdb_flags=4254928, open_flags=-1, mode=3119127560) at tdb.c:1773
No locals.
#4  0x0040a70b in tdb_copy (tdb=0x192e540, outfile=0x1941fb0
"/var/lib/xenstored/tdb.0x1935bb0")
at tdb.c:2124
fd = <value optimized out>
saved_errno = <value optimized out>
copy = 0x0
#5  0x00406c2d in do_transaction_start (conn=0x1939550,
in=<value optimized out>)
at xenstored_transaction.c:164
trans = 0x1935bb0
exists = <value optimized out>
id_str =
"\300L\222\001\000\000\000\000\330!@\000\000\000\000\000P\225\223\001"
#6  0x004045ca in process_message (conn=0x1939550) at
xenstored_core.c:1214
trans = <value optimized out>
#7  consider_message (conn=0x1939550) at xenstored_core.c:1261
No locals.
#8  handle_input (conn=0x1939550) at xenstored_core.c:1308
bytes = <value optimized out>
in = <value optimized out>
#9  0x00405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964

A 3rd trace is somewhere completely different:
(gdb) bt
#0  0x7fcbf066088d in _IO_vfprintf_internal (s=0x7fff46ac3010,
format=<value optimized out>, ap=0x7fff46ac3170)
at vfprintf.c:1617
#1  0x7fcbf0682732 in _IO_vsnprintf (string=0x7fff46ac318f "",
maxlen=<value optimized out>,
format=0x40d4a4 "%.*s", args=0x7fff46ac3170) at vsnprintf.c:120
#2  0x0040855b in talloc_vasprintf (t=0x17aaf20, fmt=0x40d4a4
"%.*s", ap=0x7fff46ac31d0) at talloc.c:1104
#3  0x00408666 in talloc_asprintf (t=0x1f, fmt=0xe938 <Address 0xe938 out of bounds>)
at talloc.c:1129
#4  0x00403a38 in ask_parents (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel",
perm=XS_PERM_READ) at xenstored_core.c:492
#5  errno_from_parents (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
at xenstored_core.c:516
#6  get_node (conn=0x177a1f0, name=0x17aaf20
"/local/domain/0/backend/vif/1/0/accel", perm=XS_PERM_READ)
at xenstored_core.c:543
#7  0x0040481d in do_read (conn=0x177a1f0) at xenstored_core.c:744
#8  process_message (conn=0x177a1f0) at xenstored_core.c:1178
#9  consider_message (conn=0x177a1f0) at xenstored_core.c:1261
#10 handle_input (conn=0x177a1f0) at xenstored_core.c:1308
#11 0x00405170 in main (argc=<value optimized out>, argv=<value optimized out>) at xenstored_core.c:1964


>> It might be interesting to see what happens if you preserve the db and
>> reboot arranging for the new xenstored to start with the old file. If
>> the corruption is part of the file then maybe it can be induced to crash
>> again more quickly.
> 
> Thanks for the pointer, will try.

Didn't crash immediately.
Now running /usr/share/pyshared/xen/xend/xenstore/tests/stress_xs.py for
the weekend.

Thanks again.
Philipp

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-12 Thread Ian Campbell
On Fri, 2014-12-12 at 17:45 +0100, Philipp Hahn wrote:
> Hello Ian,
> 
> On 12.12.2014 17:32, Ian Campbell wrote:
> > On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> >> We did enable tracing and now have the xenstored-trace.log of one crash:
> >> It contains 1.6 billion lines and is 83 GiB.
> >> It just shows xenstored to crash on TRANSACTION_START.
> >>
> >> Is there some tool to feed that trace back into a newly launched xenstored?
> > 
> > Not that I know of I'm afraid.
> 
> Okay, then I have to continue with my own tool.

If you do end up developing a tool to replay a xenstore trace then I
think that'd be something great to have in tree!

> > Do you get a core dump when this happens? You might need to fiddle with
> > ulimits (some distros disable by default). IIRC there is also some /proc
> > knob which controls where core dumps go on the filesystem.
> 
> Not for that specific trace: We first enabled generating core files, but
> only then discovered that this is not enough.

How wasn't it enough? You mean you couldn't use gdb to extract a
backtrace from the core file? Or was something else wrong?

>  Then we enabled
> --trace-file, but on that host something reset core file generation again.
> We have hopefully fixed all hosts, so on the next crash we should get
> both a core file and the trace.

Great.

> >> My hope would be that xenstored crashes again, because then we could use
> >> all those other tools like valgrind more easily.
> > 
> > That would be handy. My fear would be that this bug is likely to be a
> > race condition of some sort, and the granularity/accuracy of the
> > playback would possibly need to be quite high to trigger the issue.
> 
> cxenstored looks single threaded to me, or am I wrong?

Nope, you are right, my mistake.

Ian.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-12 Thread Philipp Hahn
Hello Ian,

On 12.12.2014 17:32, Ian Campbell wrote:
> On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
>> We did enable tracing and now have the xenstored-trace.log of one crash:
>> It contains 1.6 billion lines and is 83 GiB.
>> It just shows xenstored to crash on TRANSACTION_START.
>>
>> Is there some tool to feed that trace back into a newly launched xenstored?
> 
> Not that I know of I'm afraid.

Okay, then I have to continue with my own tool.

> Do you get a core dump when this happens? You might need to fiddle with
> ulimits (some distros disable by default). IIRC there is also some /proc
> knob which controls where core dumps go on the filesystem.

Not for that specific trace: We first enabled generating core files, but
only then discovered that this is not enough. Then we enabled
--trace-file, but on that host something reset core file generation again.
We have hopefully fixed all hosts, so on the next crash we should get
both a core file and the trace.

>> My hope would be that xenstored crashes again, because then we could use
>> all those other tools like valgrind more easily.
> 
> That would be handy. My fear would be that this bug is likely to be a
> race condition of some sort, and the granularity/accuracy of the
> playback would possibly need to be quite high to trigger the issue.

cxenstored looks single threaded to me, or am I wrong?

>>> Do you rm the xenstore db on boot? It might have a persistent
>>> corruption, aiui most folks using C xenstored are doing so or even
>>> placing it on a tmpfs for performance reasons.
>>
> >> We're using a tmpfs for /var/lib/xenstored/, as we had some severe
>> performance problem with something updating
>> /local/domain/0/backend/console/*/0/uuid too often, which put xenstored
>> in permanent D state.
> 
> But this is just a process crashing and not the whole host so you still
> have the db file at the point of the crash?

Yes: Running xs_tdb_dump or tdb_dump on it didn't show anything
obviously wrong.

> It might be interesting to see what happens if you preserve the db and
> reboot arranging for the new xenstored to start with the old file. If
> the corruption is part of the file then maybe it can be induced to crash
> again more quickly.

Thanks for the pointer, will try.

Thank you again for your fast reply.
Philipp Hahn


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-12 Thread Ian Campbell
On Fri, 2014-12-12 at 17:14 +0100, Philipp Hahn wrote:
> Hello,
> 
> On 13.11.2014 10:12, Ian Campbell wrote:
> > On Thu, 2014-11-13 at 08:45 +0100, Philipp Hahn wrote:
> >> To me this looks like some memory corruption by some unknown code
> >> writing into some random memory space, which happens to be the tdb here.
> > 
> > I wonder if running xenstored under valgrind would be useful. I think
> > you'd want to stop xenstored from starting during normal boot and then
> > launch it with:
> > valgrind /usr/local/sbin/xenstored -N
> > -N is to stay in the foreground, you might want to do this in a screen
> > session or something, alternatively you could investigate the --log-*
> > options in the valgrind manpage, together with the various
> > --trace-children* in order to follow the processes over its
> > daemonization.
> 
> We did enable tracing and now have the xenstored-trace.log of one crash:
> It contains 1.6 billion lines and is 83 GiB.
> It just shows xenstored to crash on TRANSACTION_START.
> 
> Is there some tool to feed that trace back into a newly launched xenstored?

Not that I know of I'm afraid.

Do you get a core dump when this happens? You might need to fiddle with
ulimits (some distros disable by default). IIRC there is also some /proc
knob which controls where core dumps go on the filesystem.

> My hope would be that xenstored crashes again, because then we could use
> all those other tools like valgrind more easily.

That would be handy. My fear would be that this bug is likely to be a
race condition of some sort, and the granularity/accuracy of the
playback would possibly need to be quite high to trigger the issue.
 
> > Do you rm the xenstore db on boot? It might have a persistent
> > corruption, aiui most folks using C xenstored are doing so or even
> > placing it on a tmpfs for performance reasons.
> 
> We're using a tmpfs for /var/lib/xenstored/, as we had some severe
> performance problem with something updating
> /local/domain/0/backend/console/*/0/uuid too often, which put xenstored
> in permanent D state.

But this is just a process crashing and not the whole host so you still
have the db file at the point of the crash?

It might be interesting to see what happens if you preserve the db and
reboot arranging for the new xenstored to start with the old file. If
the corruption is part of the file then maybe it can be induced to crash
again more quickly.

Ian.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-12-12 Thread Philipp Hahn
Hello,

On 13.11.2014 10:12, Ian Campbell wrote:
> On Thu, 2014-11-13 at 08:45 +0100, Philipp Hahn wrote:
>> To me this looks like some memory corruption by some unknown code
>> writing into some random memory space, which happens to be the tdb here.
> 
> I wonder if running xenstored under valgrind would be useful. I think
> you'd want to stop xenstored from starting during normal boot and then
> launch it with:
> valgrind /usr/local/sbin/xenstored -N
> -N is to stay in the foreground, you might want to do this in a screen
> session or something, alternatively you could investigate the --log-*
> options in the valgrind manpage, together with the various
> --trace-children* in order to follow the processes over its
> daemonization.

We did enable tracing and now have the xenstored-trace.log of one crash:
It contains 1.6 billion lines and is 83 GiB.
It just shows xenstored to crash on TRANSACTION_START.

Is there some tool to feed that trace back into a newly launched xenstored?

My hope would be that xenstored crashes again, because then we could use
all those other tools like valgrind more easily.

>> 3. the crash happens rarely and the hosts run fine most of the time. The
>> crash mostly happens around midnight and seems to be guest-triggered, as
>> the logs on the host don't show any activity like starting new or
>> destroying running VMs. So far the problem only showed on hosts running
>> Linux VMs. Other hosts running Windows VMs have so far never shown that crash.

Now we also observed a crash on a host running Windows VMs.

> If it is really mostly happening around midnight then it might be worth
> digging into the host and guest configs for cronjobs and the like, e.g.
> log rotation stuff like that which might be tweaking things somehow.
> 
> Does this happen on multiple hosts, or just the one?

Multiple hosts in two different data centers.

> Do you rm the xenstore db on boot? It might have a persistent
> corruption, aiui most folks using C xenstored are doing so or even
> placing it on a tmpfs for performance reasons.

We're using a tmpfs for /var/lib/xenstored/, as we had some severe
performance problem with something updating
/local/domain/0/backend/console/*/0/uuid too often, which put xenstored
in permanent D state.

> If you are running 4.1.x then I think oxenstored isn't an option, but it
> might be something to consider when you upgrade.

Thank you for the hint, I'll have another look at the Ocaml version.

Thank you again.
Philipp Hahn

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] xenstored crashes with SIGSEGV

2014-11-13 Thread Ian Campbell
On Thu, 2014-11-13 at 08:45 +0100, Philipp Hahn wrote:
> To me this looks like some memory corruption by some unknown code
> writing into some random memory space, which happens to be the tdb here.

I wonder if running xenstored under valgrind would be useful. I think
you'd want to stop xenstored from starting during normal boot and then
launch it with:
valgrind /usr/local/sbin/xenstored -N
-N is to stay in the foreground, you might want to do this in a screen
session or something, alternatively you could investigate the --log-*
options in the valgrind manpage, together with the various
--trace-children* in order to follow the processes over its
daemonization.
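
Concretely, something along these lines should work; the flag spellings are
from the valgrind manpage and the log path is just a placeholder. Either in
the foreground (output goes to the terminal/screen session):

  valgrind --track-origins=yes /usr/local/sbin/xenstored -N

or letting it daemonize and following the children into per-process log files:

  valgrind --trace-children=yes --track-origins=yes \
           --log-file=/var/log/xenstored-valgrind.%p.log \
           /usr/local/sbin/xenstored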

I'm not sure what the impact on the system would be with this, but I
think it is probably ok unless you have massive xs load.

You'll need a version of valgrind with xen support in it, anything from
the last year or so should do I think.

Other than that we don't really have anyone who is an expert in that
aspect of the C xenstore/tdb who we can lean on for pointers (no pun
intended) etc, so in the absence of some sort of ability to trigger on
demand I'm not sure what else to suggest.

> 1. Has someone observed a similar crash?

I think you are the only one I've seen reporting this.

> 2. We've now also enabled "xenstored -T /log --verbose" to log the
> messages in the hope to find the triggering transaction, but until then
> is there something more we can do to track down the problem?
> 
> 3. the crash happens rarely and the hosts run fine most of the time. The
> crash mostly happens around midnight and seems to be guest-triggered, as
> the logs on the host don't show any activity like starting new or
> destroying running VMs. So far the problem only showed on hosts running
> Linux VMs. Other hosts running Windows VMs have so far never shown that crash.

If it is really mostly happening around midnight then it might be worth
digging into the host and guest configs for cronjobs and the like, e.g.
log rotation stuff like that which might be tweaking things somehow.

Does this happen on multiple hosts, or just the one?

Do you rm the xenstore db on boot? It might have a persistent
corruption, aiui most folks using C xenstored are doing so or even
placing it on a tmpfs for performance reasons.

If you are running 4.1.x then I think oxenstored isn't an option, but it
might be something to consider when you upgrade.

Ian.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel