Infrequent heap corruption, XO-4, Fedora 20
Following up a thread from last September.

This problem has just become more interesting, because it hit during an activity startup. I'm quite used to seeing it with yum. But seeing it without yum now points us at kernel, glibc or python.

http://dev.laptop.org/ticket/12837#comment:4 has the details of the most recent event.

On Wed, Sep 10, 2014 at 01:56:27PM +1000, James Cameron wrote:
> G'day Peter,
>
> Thanks for any ideas you may have.
>
> The problem also reproduces on OLPC Fedora 20 image for XO-4:
>
> http://build.laptop.org/14.1.0/os1/xo-4/41001o4.zd (552 MB)
>
> *** Error in `/usr/bin/python': free(): invalid pointer: 0x047c79ae ***
> === Backtrace: =
> /lib/libc.so.6(+0x6c8b4)[0xb6c828b4]
> /lib/libc.so.6(+0x754e8)[0xb6c8b4e8]
> === Memory map:
> [...]
>
> The error varies in detail, but always suggests corruption of heap or pointers to heap.
>
> The triggering conditions are interactive use of yum, yum update, or yum used by olpc-os-builder. The latter is a simple reproducer for me.
>
> I'm reproducing it on an XO-4, with 2GB of RAM, no swap, 8 GB eMMC, 8 GB USB flash drive.
>
> While memory demand by yum is large by comparison to other programs, the available memory at the time of failure is ample. There are no kernel out of memory (OOM) events. It seems more likely to occur when the filesystem cache is under heavy demand.
>
> The method to recreate the problem was:
>
> 1. install the system image 41001o4.zd using fs-update and then boot,
>
> 2. configure wireless network,
>
> 3. yum install -y git olpc-os-builder
>
> 4. clone the master branch of git://dev.laptop.org/projects/olpc-os-builder (last verified with b87e6ee)
>
> 5. run ./osbuilder.py examples/olpc-os-14.1.0-xo4.ini repeatedly until the error occurs (usually within about five attempts),
>
> I've also tried running under valgrind, but that causes illegal instruction. It is quite likely I'm not using valgrind correctly.
>
> http://dev.laptop.org/~quozl/z/1XRYtO.txt
>
> The workaround at the moment is to build our Fedora 20 images on Fedora 18. Fedora 18 shows no sign of the problem.
>
> I'm worried that a low probability heap corruptor may cause instability of applications in the field.
>
> The exact same kernel is being used for Fedora 18 and Fedora 20.
>
> On Tue, Sep 09, 2014 at 03:55:24PM +0100, Peter Robinson wrote:
> > What version of OOB are you using, and what config files? I can try and recreate the problem here on other devices.
>
> --
> James Cameron
> http://quozl.linux.org.au/

--
James Cameron
http://quozl.linux.org.au/
___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel
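Step 5 of the reproduction method ("run repeatedly until the error occurs") could be automated along the following lines. This is only a sketch, not something posted in the thread: the helper names `find_glibc_abort` and `run_until_failure` are made up for illustration, and it assumes the glibc banner reaches stderr, which on glibc of this era generally requires setting `LIBC_FATAL_STDERR_` because the message otherwise goes to the controlling terminal.

```python
import os
import re
import subprocess
import sys

# Matches the banner glibc prints when it detects heap damage, e.g.
# *** Error in `/usr/bin/python': free(): invalid pointer: 0x047c79ae ***
GLIBC_ABORT = re.compile(r"\*\*\* Error in `(?P<prog>[^']+)': (?P<what>.+?) \*\*\*")


def find_glibc_abort(text):
    """Return (program, message) if text contains a glibc abort banner, else None."""
    match = GLIBC_ABORT.search(text)
    if match is None:
        return None
    return match.group("prog"), match.group("what")


def run_until_failure(cmd, max_attempts=5):
    """Run cmd up to max_attempts times; return (attempt, banner) on the first hit."""
    # Assumption: ask glibc to write the banner to stderr rather than /dev/tty.
    env = dict(os.environ, LIBC_FATAL_STDERR_="1")
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        hit = find_glibc_abort(proc.stderr) or find_glibc_abort(proc.stdout)
        if hit:
            return attempt, hit
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   python repro.py ./osbuilder.py examples/olpc-os-14.1.0-xo4.ini
    print(run_until_failure(sys.argv[1:]))
```

The attempt limit of five mirrors the "usually within about five attempts" observation above; raise it if the failure is rarer on a given unit.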
Re: Infrequent heap corruption, XO-4, Fedora 20
Thanks. Can I make it happen more often?

Is there a later version of the driver? We have a different version that I may look into, on arm-3.5-android branch.

On Wed, Feb 04, 2015 at 12:14:02PM +0100, Jon Nettleton wrote:
> It is a problem with the v4 version of the galcore driver. We have replicated it on a couple of platforms.

--
James Cameron
http://quozl.linux.org.au/
Re: Infrequent heap corruption, XO-4, Fedora 20
On Thu, Feb 5, 2015 at 8:00 AM, James Cameron qu...@laptop.org wrote:
> Thanks. Can I make it happen more often?
>
> Is there a later version of the driver? We have a different version that I may look into, on arm-3.5-android branch.

Run memtester against the majority of your machine's memory and then run gtkperf in an X session. That is usually enough to trigger it.

Considering that the bug exists in all the 4.xx Vivante galcore drivers I have seen, I doubt it is fixed in the other version. Android is much simpler on memory because it runs everything through a single GL context against a framebuffer. I have some tentative patches to fix parts of it in my trees, but I doubt a lot of them would apply to 3.5 without backporting a lot of upstream work.

> On Wed, Feb 04, 2015 at 12:14:02PM +0100, Jon Nettleton wrote:
> > It is a problem with the v4 version of the galcore driver. We have replicated it on a couple of platforms.
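The memtester plus gtkperf trigger described above might be scripted roughly as follows. Treat it as a sketch under stated assumptions: the 75% fraction is one reading of "the majority of your machine's memory", the two programs are assumed to run at the same time, `gtkperf -a` is assumed to run its tests automatically and exit, and the helper names `meminfo_kb` and `memtester_command` are hypothetical. memtester normally needs root to lock the memory it tests.

```python
import re
import subprocess
import sys


def meminfo_kb(text, field="MemTotal"):
    """Extract a kB value such as MemTotal from /proc/meminfo-style text."""
    match = re.search(r"^%s:\s+(\d+) kB$" % re.escape(field), text, re.MULTILINE)
    if match is None:
        raise ValueError("field not found: %s" % field)
    return int(match.group(1))


def memtester_command(total_kb, fraction=0.75, loops=1):
    """Build a memtester command line covering a fraction of total memory."""
    megabytes = int(total_kb * fraction) // 1024
    return ["memtester", "%dM" % megabytes, str(loops)]


if __name__ == "__main__" and "--run" in sys.argv:
    with open("/proc/meminfo") as meminfo:
        total = meminfo_kb(meminfo.read())
    # Keep memtester running in the background while gtkperf exercises the GPU.
    stress = subprocess.Popen(memtester_command(total))
    try:
        # Must be started from inside an X session.
        subprocess.call(["gtkperf", "-a"])
    finally:
        stress.terminate()
        stress.wait()
```

Nothing is executed unless the script is started with `--run`, so the command line it would use can be inspected first by calling `memtester_command` directly.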
Re: Infrequent heap corruption, XO-4, Fedora 20
It is a problem with the v4 version of the galcore driver. We have replicated it on a couple of platforms.

On Wed, Feb 4, 2015 at 11:26 AM, Peter Robinson pbrobin...@gmail.com wrote:
> On Wed, Feb 4, 2015 at 8:10 AM, James Cameron qu...@laptop.org wrote:
> > Following up a thread from last September. This problem has just become more interesting, because it hit during an activity startup. I'm quite used to seeing it with yum. But seeing it without yum now points us at kernel, glibc or python.
>
> We've not seen this in the wider F-20 Fedora ARM distro so my bet would be on the kernel.
>
> Peter
Re: Infrequent heap corruption, XO-4, Fedora 20
On Wed, Feb 4, 2015 at 8:10 AM, James Cameron qu...@laptop.org wrote:
> Following up a thread from last September. This problem has just become more interesting, because it hit during an activity startup. I'm quite used to seeing it with yum. But seeing it without yum now points us at kernel, glibc or python.

We've not seen this in the wider F-20 Fedora ARM distro so my bet would be on the kernel.

Peter

> http://dev.laptop.org/ticket/12837#comment:4 has the details of the most recent event.