Infrequent heap corruption, XO-4, Fedora 20

2015-02-04 Thread James Cameron
Following up a thread from last September.

This problem has just become more interesting, because it hit during
an activity startup.

I'm quite used to seeing it with yum.  But seeing it without yum now
points us at kernel, glibc or python.

http://dev.laptop.org/ticket/12837#comment:4 has the details of the
most recent event.

On Wed, Sep 10, 2014 at 01:56:27PM +1000, James Cameron wrote:
 G'day Peter,
 
 Thanks for any ideas you may have.
 
 The problem also reproduces on OLPC Fedora 20 image for XO-4:
 
 http://build.laptop.org/14.1.0/os1/xo-4/41001o4.zd (552 MB)
 
 *** Error in `/usr/bin/python': free(): invalid pointer: 0x047c79ae ***
 === Backtrace: =
 /lib/libc.so.6(+0x6c8b4)[0xb6c828b4]
 /lib/libc.so.6(+0x754e8)[0xb6c8b4e8]
 === Memory map: 
 [...]
 
 The error varies in detail, but always suggests corruption of heap or
 pointers to heap.
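
One way to read the two backtrace frames above is to subtract each in-library offset from its absolute address, which should give the same libc load base for both. The following is a minimal sketch of that arithmetic, not part of the original report. The 8-byte malloc alignment is an assumption about 32-bit glibc, and the names FRAMES, BAD_POINTER and load_base are made up for illustration.

```python
import re

# Frames and pointer copied verbatim from the error report above.
FRAMES = [
    "/lib/libc.so.6(+0x6c8b4)[0xb6c828b4]",
    "/lib/libc.so.6(+0x754e8)[0xb6c8b4e8]",
]
BAD_POINTER = 0x047c79ae

# Matches "object(+0xOFFSET)[0xADDRESS]" as printed in the backtrace.
FRAME_RE = re.compile(
    r"^(?P<obj>[^(]+)\(\+0x(?P<off>[0-9a-fA-F]+)\)\[0x(?P<addr>[0-9a-fA-F]+)\]$"
)


def load_base(frame):
    """Return (object path, load base) for one backtrace frame."""
    match = FRAME_RE.match(frame.strip())
    if match is None:
        raise ValueError("unrecognised frame: %r" % frame)
    address = int(match.group("addr"), 16)
    offset = int(match.group("off"), 16)
    return match.group("obj"), address - offset


# Both frames should agree on where libc was mapped.
bases = {load_base(frame) for frame in FRAMES}

# Assumption: 32-bit glibc hands out chunks aligned to 8 bytes, so a
# pointer that is not a multiple of 8 cannot have come from malloc().
MALLOC_ALIGNMENT = 8
misaligned = BAD_POINTER % MALLOC_ALIGNMENT != 0

for obj, base in sorted(bases):
    print("%s loaded at 0x%08x" % (obj, base))
print("pointer 0x%08x misaligned: %s" % (BAD_POINTER, misaligned))
```

The offsets can then be handed to a symbol lookup tool against the matching libc debuginfo, if that is available. A misaligned pointer is consistent with the "invalid pointer" message but does not by itself say what corrupted it.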
 
 The triggering conditions are interactive use of yum, yum update, or
 yum used by olpc-os-builder.  The latter is a simple reproducer for me.
 
 I'm reproducing it on an XO-4 with 2 GB of RAM, no swap, 8 GB eMMC, and
 an 8 GB USB flash drive.
 
 While memory demand by yum is large by comparison to other programs,
 the available memory at the time of failure is ample.  There are no
 kernel out of memory (OOM) events.  It seems more likely to occur when
 the filesystem cache is under heavy demand.
 
 The method to recreate the problem was:
 
 1.  install the system image 41001o4.zd using fs-update and then boot,
 
 2.  configure wireless network,
 
 3.  yum install -y git olpc-os-builder
 
 4.  clone the master branch of
 git://dev.laptop.org/projects/olpc-os-builder
 (last verified with b87e6ee)
 
 5.  run ./osbuilder.py examples/olpc-os-14.1.0-xo4.ini repeatedly
 until the error occurs (usually within about five attempts).
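
Step 5 can be automated with a small wrapper that reruns the build until the glibc message shows up. This is a rough sketch under stated assumptions, not the method used in the original report. The function name run_until_marker and the attempt limit are hypothetical. Setting LIBC_FATAL_STDERR_ assumes the glibc on this image honours that variable and would otherwise write the message to the controlling terminal, where it could not be captured.

```python
import os
import subprocess

# Start of the message glibc prints when it detects the corruption,
# as quoted in the report above.
MARKER = "*** Error in"


def run_until_marker(cmd, max_attempts=20, marker=MARKER):
    """Run cmd repeatedly until its output contains marker.

    Returns the 1-based attempt number on which the marker was seen,
    or None if it never appeared within max_attempts runs.
    """
    # Assumption: ask glibc to send fatal messages to stderr so that
    # they end up in the captured output.
    env = dict(os.environ, LIBC_FATAL_STDERR_="1")
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            universal_newlines=True,
            env=env,
        )
        if marker in proc.stdout:
            return attempt
    return None


# Example call, from a checkout of olpc-os-builder:
# run_until_marker(["./osbuilder.py", "examples/olpc-os-14.1.0-xo4.ini"])
```

The wrapper only watches for the message text. It ignores the exit status, because a build can fail for unrelated reasons such as a network error.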
 
 
 I've also tried running under valgrind, but that fails with an illegal
 instruction error.  It is quite likely I'm not using valgrind correctly.
 http://dev.laptop.org/~quozl/z/1XRYtO.txt
 
 The workaround at the moment is to build our Fedora 20 images on
 Fedora 18.  Fedora 18 shows no sign of the problem.  I'm worried that
 a low-probability heap corruptor may cause instability of applications
 in the field.
 
 The exact same kernel is being used for Fedora 18 and Fedora 20.
 
 On Tue, Sep 09, 2014 at 03:55:24PM +0100, Peter Robinson wrote:
  What version of OOB are you using, and what config files? I can try
  and recreate the problem here on other devices.
 

-- 
James Cameron
http://quozl.linux.org.au/
___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: Infrequent heap corruption, XO-4, Fedora 20

2015-02-04 Thread James Cameron
Thanks.

Can I make it happen more often?

Is there a later version of the driver?

We have a different version that I may look into, on arm-3.5-android
branch.

On Wed, Feb 04, 2015 at 12:14:02PM +0100, Jon Nettleton wrote:
 It is a problem with the v4 version of the galcore driver.  We have replicated
 it on a couple of platforms.
 

-- 
James Cameron
http://quozl.linux.org.au/


Re: Infrequent heap corruption, XO-4, Fedora 20

2015-02-04 Thread Jon Nettleton
On Thu, Feb 5, 2015 at 8:00 AM, James Cameron qu...@laptop.org wrote:

 Thanks.

 Can I make it happen more often?

 Is there a later version of the driver?

 We have a different version that I may look into, on arm-3.5-android
 branch.


Run memtester against the majority of your machine's memory and then run
gtkperf in an X session.  That is usually enough to trigger it.
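
A rough sketch of how the two command lines might be derived from /proc/meminfo follows. It is not part of the original message. Treating "the majority" as 75% of MemTotal is an assumption, as is the use of gtkperf's -a option to run all tests and quit. The message does not say whether memtester should still be running while gtkperf runs, so the sketch only builds the commands and leaves launching them to the reader. The function names are hypothetical.

```python
def memtester_size_mb(meminfo_text, fraction=0.75):
    """Return the number of megabytes to hand to memtester.

    meminfo_text is the content of /proc/meminfo; fraction is the
    share of MemTotal to test (0.75 is an assumed reading of
    "the majority").
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total_kb = int(line.split()[1])  # value is reported in kB
            return int(total_kb // 1024 * fraction)
    raise ValueError("MemTotal not found")


def stress_commands(meminfo_text, fraction=0.75):
    """Build argv lists for memtester and gtkperf."""
    size_mb = memtester_size_mb(meminfo_text, fraction)
    return [
        ["memtester", "%dM" % size_mb],  # memtester <mem>[B|K|M|G] [loops]
        ["gtkperf", "-a"],               # assumed: run all tests, then quit
    ]


# Example, on the machine under test:
# with open("/proc/meminfo") as f:
#     for argv in stress_commands(f.read()):
#         print(" ".join(argv))
```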

Considering that the bug exists in all the 4.xx Vivante galcore drivers I
have seen, I doubt it is fixed in the other version.  Android is much
simpler on memory because it runs everything through a single GL context
against a framebuffer.

I have some tentative patches to fix parts of it in my trees, but I doubt
many of them would apply to 3.5 without backporting a lot of upstream work.




Re: Infrequent heap corruption, XO-4, Fedora 20

2015-02-04 Thread Jon Nettleton
It is a problem with the v4 version of the galcore driver.  We have
replicated it on a couple of platforms.



Re: Infrequent heap corruption, XO-4, Fedora 20

2015-02-04 Thread Peter Robinson
On Wed, Feb 4, 2015 at 8:10 AM, James Cameron qu...@laptop.org wrote:
 Following up a thread from last September.

 This problem has just become more interesting, because it hit during
 an activity startup.

 I'm quite used to seeing it with yum.  But seeing it without yum now
 points us at kernel, glibc or python.

We've not seen this in the wider F-20 Fedora ARM distro so my bet
would be on the kernel.

Peter
