[Bug 1640518] Re: MongoDB Memory corruption

2016-11-16 Thread Adam Conrad
** Also affects: glibc (Ubuntu Xenial) Importance: Undecided Status: New ** Also affects: glibc (Ubuntu Yakkety) Importance: Undecided Status: New ** Changed in: glibc (Ubuntu) Assignee: Taco Screen team (taco-screen-team) => Adam Conrad (adconrad) ** Changed in: glibc (

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-16 Thread William J. Schmidt
Hi Andrew, Canonical's plans for handling this in the short term are described here: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1642390. We will continue to work the POWER patch, but the SRU described there is all you should need to track for your relnote. Bill -- You received this

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-17 Thread Brian W Hart
Howdy, I'm the originator of: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1641241 (which is dup'd to this bug). I tested the new ("ubuntu5") libc6 packages from xenial-proposed. They prevent the crash with TensorFlow, and I have not noticed any other problems. Will update: https://bugs

[Bug 1640518] Re: MongoDB Memory corruption

2016-12-28 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: glibc (Ubuntu Xenial) Status: New => Confirmed -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1640518 Title:

[Bug 1640518] Re: MongoDB Memory corruption

2016-12-28 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: glibc (Ubuntu Yakkety) Status: New => Confirmed

[Bug 1640518] Re: MongoDB Memory corruption

2016-12-28 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users. ** Changed in: glibc (Ubuntu) Status: New => Confirmed

[Bug 1640518] Re: MongoDB Memory corruption

2017-01-24 Thread Bug Watch Updater
** Changed in: glibc Status: Unknown => Fix Released ** Changed in: glibc Importance: Unknown => Medium

[Bug 1640518] Re: MongoDB Memory corruption

2017-04-07 Thread Launchpad Bug Tracker
This bug was fixed in the package glibc - 2.24-9ubuntu2 --- glibc (2.24-9ubuntu2) zesty; urgency=medium * debian/patches/any/cvs-resolv-internal-qtype.diff: Revert to avoid failure in name resolution on upgrades from yakkety (LP: #1674532) -- Adam Conrad Tue, 21 Mar 2017 15:

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread Matthias Klose
** Package changed: gcc-4.8 (Ubuntu) => gcc-5 (Ubuntu)

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread William J. Schmidt
For what it's worth, GCC is just a placeholder package for now. The compiler and libraries aren't directly implicated, at least as of now. The origin of the stack corruption remains unknown.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread William J. Schmidt
Andrew, according to the valgrind community, the error Memcheck: mc_machine.c:329 (get_otrack_shadow_offset_wrk): the 'impossible' happened. is probably due to the lock elision code accessing a hardware register that valgrind doesn't know about, so there isn't a shadow register to consult. Our

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread William J. Schmidt
I had another thought this evening. In case this is a threading problem, have you tried building with Clang and using ThreadSanitizer? Support for this was added to ppc64el in 2015.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread Andrew Morrow
Here is the patch for the above comment ** Patch added: "Apply to 3220495083b0d678578a76591f54ee1d7a5ec5df" https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+attachment/4775111/+files/acm.nov9.patch

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread Andrew Morrow
The following are reproduction instructions for the behavior that we are observing on Ubuntu 16.04 ppc64le. Note that we have run this same test on RHEL 7.1 ppc64le, and we do not observe any stack corruption. Note also that building and running this repro may depend on certain system libraries (SS

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread Andrew Morrow
Bill - I will try again with valgrind without --track-origins=yes and post any interesting findings. Re ThreadSanitizer, we have tried before without success. The last time we tried, it didn't work because clang TSAN didn't support exceptions. Perhaps that has changed? We really like the sanitizer

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread William J. Schmidt
Hi Andrew -- not sure about Clang TSAN supporting exceptions at this point; it is probable that we don't have a solution there as I would expect that to require target support, and I've not heard of that happening for POWER. That said, I've been less connected to the Clang community for the last y

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-09 Thread Ubuntu Foundations Team Bug Bot
The attachment "Apply to 3220495083b0d678578a76591f54ee1d7a5ec5df" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team. [This is an automated message performed by a Lau

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Andrew Morrow
Overnight, I ran this test case on both an Ubuntu 16.04 ppc64le system and a RHEL 7.1 ppc64le system. The test ran 219 times on Ubuntu, with 15 cores, for a failure rate of around 5%. Most of the time corruption was detected in the Canary ctor (before doing other work), but a few times in the dtor

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Andrew Morrow
I tried valgrind as suggested above. By adding --show-mismatched-frees=no and removing --track-origins=yes I was able to get the process to start up without a lot of false positives. However, the server process fails to open its listening socket, because valgrind reports an unsupported syscall: [

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Andrew Morrow
OK, I upgraded valgrind to 3.12 on the power machine and I can now get it to run meaningfully. We are seeing many error reports of the following form: [js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+ s40019| ==34604== Thread 50: [js_test:fsm_all_sharded_replication] 2016-11-10T1

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread William J. Schmidt
Hi Andrew, That indeed looks suspicious. I've been talking with our libc team. It appears that the existing patch that provides for disabling lock elision dynamically isn't present in the libc on Ubuntu 16.04, which is very unfortunate. They are thinking about other possible solutions. This se

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Peter Bergner
The following might override the HTM lock elision. Can someone try it to see if it works? bergner@ampere:~$ cat pthread_mutex_lock.c #include #define PTHREAD_MUTEX_NO_ELISION_NP 512 extern int __pthread_mutex_lock (pthread_mutex_t *); int pthread_mutex_lock (pthread_mutex_t *mutex) { mutex-

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Andrew Morrow
I have the libfoo.so.1 interposer running; I will let it run overnight and report back tomorrow with any interesting findings.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Andrew Morrow
I don't think the interposition is working, or I'm doing something wrong. I changed pthread_mutex_lock.c to the following: $ cat pthread_mutex_lock.c #include #include #define PTHREAD_MUTEX_NO_ELISION_NP 512 extern int __pthread_mutex_lock (pthread_mutex_t *); int pthread_mutex_lock (pthread

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Peter Bergner
When I add the abort and use your C++ test case, I see the abort: bergner@ampere:~$ cat pthread_mutex_lock.c #include #include #define PTHREAD_MUTEX_NO_ELISION_NP 512 extern int __pthread_mutex_lock (pthread_mutex_t *); int pthread_mutex_lock (pthread_mutex_t *mutex) { abort(); mutex->__d

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-10 Thread Peter Bergner
gdb shows the abort is from the shim library too: bergner@ampere:~$ gdb -q ./a.out Reading symbols from ./a.out...(no debugging symbols found)...done. (gdb) set environment LD_PRELOAD=./libbar.so.1 (gdb) run Starting program: /home/bergner/a.out [Thread debugging using libthread_db enabled] Using

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Adam Conrad
A test build of glibc with lock elision disabled is in progress here: https://launchpad.net/~adconrad/+archive/ubuntu/nole/+packages That said, the above trace looks suspiciously like a double-unlock. That breaks pthread rules, but the software implementation has historically let you do it anyway

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Paul Clarke
from "man pthread_mutex_unlock": If the mutex type is PTHREAD_MUTEX_ERRORCHECK, then error checking shall be provided. If a thread attempts to relock a mutex that it has already locked, an error shall be returned. If a thread attempts to unlock a mutex that i

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread William J. Schmidt
Per the blog post mentioned from the thread in #35, this sort of problem should also manifest on a Broadwell or Skylake processor. Andrew, have you tried running on such machines?

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread William J. Schmidt
From that Debian thread: "Per logs from message #15 on bug #842796: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15 SIGSEGV on __lll_unlock_elision is a signature (IME with very high confidence) of an attempt to unlock an already unlocked lock while running under hardware lock elisio

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Andrew Morrow
First, I'm not sure what I was doing wrong yesterday, but I now have the LD_PRELOAD lock-elision-disablement running. And, when running under valgrind, we no longer see the reports from valgrind. I'm now running without valgrind to see whether we still observe stack corruption. A few comments on t

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Andrew Morrow
That is good news that you have been able to reproduce the issue. I'm currently running the reproducer with the LD_PRELOAD disable-lock-elision hack in place, without valgrind, and I'm currently at 55 runs with no crashes. I will let it run overnight. Also, per the earlier comment about double un

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Peter Bergner
I'll note that the LD_PRELOAD interposer library is only needed for binaries that are already compiled and you want to override the pthread_mutex_lock() routine. If you can recompile your source, then you can place the interposer directly into your source and there is no need for LD_PRELOADing any

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Peter Bergner
...and possibly wrap the above in: #ifdef __powerpc__ ... #endif so it's only used on POWER?
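
For code that can be recompiled, the idea in the two comments above might look roughly like the C sketch below. It is not the file posted in this thread, which is cut off in this archive. Instead of redefining pthread_mutex_lock it uses a separately named wrapper, lock_without_elision, which is a hypothetical name. The value 512 comes from the snippets in this thread; the assumption that glibc keeps its mutex flags in __data.__kind is mine and depends on glibc internals that may change between versions.

```c
#include <pthread.h>

/* Value taken from the snippets in this thread. It is a glibc-internal
 * flag and does not appear in any public header. */
#define PTHREAD_MUTEX_NO_ELISION_NP 512

/* Hypothetical wrapper: mark the mutex as "do not elide" and then take
 * it through the normal pthread_mutex_lock(). On anything other than
 * glibc on POWER it is a plain pass-through. */
int lock_without_elision(pthread_mutex_t *mutex)
{
#if defined(__GLIBC__) && defined(__powerpc__)
    /* Assumption: glibc stores the mutex kind and flags in
     * __data.__kind. The store is not atomic, as in the original hack. */
    mutex->__data.__kind |= PTHREAD_MUTEX_NO_ELISION_NP;
#endif
    return pthread_mutex_lock(mutex);
}
```

Call sites would use lock_without_elision(&m) in place of pthread_mutex_lock(&m) and unlock as usual. The LD_PRELOAD route discussed elsewhere in the thread avoids touching call sites at all, at the cost of building a separate shim library.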

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Aaron Sawdey
Question: Is there any magic I can do to this test case: python buildscripts/resmoke.py --suites=concurrency_sharded --storageEngine=wiredTiger --excludeWithAnyTags=requires_mmapv1 --dbpathPrefix=... --repeat=500 --continueOnFailure that would allow me to run multiple copies on the same machine?

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Aaron Sawdey
Found it, looks like the --basePort option to resmoke is what I want.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Andrew Morrow
Peter, re #47, yes, that is certainly true. However, I'm actually finding it advantageous to load it via LD_PRELOAD exactly because I don't need to recompile. So I can toggle back and forth between lock elision on/off without needing to recompile.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Andrew Morrow
Aaron, re #50, yes, you can run as many copies as you want simultaneously, as long as: 1) The --dbpathPrefix argument points to distinct paths. So resmoke.py ... --dbpathPrefix=/var/tmp/run1 and resmoke.py ... --dbpathPrefix=/var/tmp/run2, etc. 2) You specify disjoint "port ranges" with the --base

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Adam Conrad
Ahh, if you're never actually seeing a SEGV in __lll_unlock_elision, this may indeed be something far more subtle than a double unlock and, in fact, I'd be even MORE interested in having you run the same batch of testing on an otherwise identical (ie: Ubuntu 16.04, blah blah blah) Skylake system.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Aaron Sawdey
Andrew, yes, that is working nicely: with separate DB dirs and basePort I'm running multiple copies on one machine. Thanks!

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Andrew Morrow
Adam, I agree on all points. So far, my repro running with the LD_PRELOAD hack is at 118 iterations with no crashes and going strong. Given that we had an ~5% repro rate without the LD_PRELOAD hack, this is looking very encouraging, but I'm going to let it run all weekend just to be sure. As for s
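
To put a number on "very encouraging": if every run failed independently with probability p, the chance of seeing no failures in n runs would be (1 - p)^n. The C sketch below computes that; the independence assumption and the function name prob_no_failures are mine, not something stated in the thread.

```c
/* Probability of observing zero failures in `runs` independent runs
 * when each run fails with probability `p`, i.e. (1 - p)^runs.
 * Computed with a loop so no math library is needed. */
double prob_no_failures(double p, unsigned runs)
{
    double q = 1.0;
    for (unsigned i = 0; i < runs; ++i)
        q *= 1.0 - p;
    return q;
}
```

For the figures quoted above, prob_no_failures(0.05, 118) comes out at roughly 0.2%, so 118 clean runs would be very unlikely if the underlying failure rate were still around 5%.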

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Aaron Sawdey
One other thing, if you use the mprotect thing, it may be necessary to bump up the value of /proc/sys/vm/max_map_count, depending on how many of these Canary objects get constructed.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Aaron Sawdey
This is the other thing I am trying. I've modified the Canary object to use a 128k stack zone and then use mprotect to mark the aligned 64k page that's in the middle of it read-only. When the destructor is called, it changes it back to read-write. This should cause any write to this region to get a
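
The approach described above could be sketched in C along these lines. This is not the modified Canary class itself, which is not shown in this archive; the function names guard_page_in, canary_arm and canary_disarm are hypothetical, and the page size is queried at run time rather than hard-coded to 64k. The sketch also assumes the zone was readable and writable before arming.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return the first page-aligned, page-sized region that lies entirely
 * inside [zone, zone + len), or NULL if the zone is too small. A zone
 * of two pages always contains one such region. */
void *guard_page_in(void *zone, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return NULL;
    uintptr_t page = (uintptr_t)ps;
    uintptr_t start = (uintptr_t)zone;
    uintptr_t first = (start + page - 1) & ~(page - 1); /* round up */
    if (first - start + page > len)
        return NULL;
    return (void *)first;
}

/* "Constructor" step: make the guard page read-only so that any stray
 * write into it faults immediately. */
int canary_arm(void *zone, size_t len)
{
    void *p = guard_page_in(zone, len);
    if (p == NULL)
        return -1;
    return mprotect(p, (size_t)sysconf(_SC_PAGESIZE), PROT_READ);
}

/* "Destructor" step: restore read-write access (assumed to be the
 * original protection) before the memory is reused. */
int canary_disarm(void *zone, size_t len)
{
    void *p = guard_page_in(zone, len);
    if (p == NULL)
        return -1;
    return mprotect(p, (size_t)sysconf(_SC_PAGESIZE),
                    PROT_READ | PROT_WRITE);
}
```

Each armed page splits a memory mapping, which is consistent with the remark elsewhere in the thread about possibly needing to raise /proc/sys/vm/max_map_count when many Canary objects are live.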

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-11 Thread Andrew Morrow
An engineer on our side did some Canary+mprotect experiments as well, but I don't happen to have details on what the approach/results were right now. I'll ask them to update this ticket with any interesting findings they may have.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-12 Thread William J. Schmidt
Here's another interesting data point. The original bug description specifies that the memory corruption is not seen on Ubuntu 15. Per https://bugzilla.linux.ibm.com/show_bug.cgi?id=117535, however, transactional lock elision has been enabled by default since Ubuntu 15.04 (glibc 2.21). Yet on 16

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-12 Thread Andrew Morrow
Regarding Ubuntu 15, I think that was a miscommunication somewhere along the line. The only versions of Ubuntu that we build for are the LTS releases (12.04, 14.04, and 16.04), and the only one of those we have ever built on POWER is 16.04. Other than Ubuntu 16.04, the only other POWER distro we t

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-12 Thread Aaron Sawdey
An update on my experiments: * 500 runs no failures with TLE disabled * 500 runs no failures TLE enabled but mprotect() syscall in Canary constructor/destructor * 500 runs 11 failed with TLE enabled so about 2% fail rate * Tried switching SMT off and interestingly got 200 runs no fails with TLE e

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-13 Thread Dimitri John Ledkov
Hello, could bugproxy please be silenced, and/or prevented from reposting the same python command line over and over again? It seems there are sometimes attachments/comments that get amplified and re-posted with every other comment.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-13 Thread William J. Schmidt
We'll try to get to the bottom of the bugproxy oddity. It seems to be echoing a chunk out of comment #5 for no reason that I can see.

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread William J. Schmidt
Given the results of Aaron's experiments over the weekend, here's a summary of what we think we're seeing: - There is an unsafe interaction between two threads. - This interaction can only be observed in a very small time window, and then relatively rarely. - The interaction is observable only

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread William J. Schmidt
Well, let me back that off a little. We're going to look into the TLE code a little more. The various __lll_*_elision routines are handed a pointer to short that they update, which certainly looks suspicious. So it's certainly possible that something in the pthreads implementation that calls thi

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread William J. Schmidt
Ulrich Weigand made an interesting comment on the glibc code in our internal bug (no longer mirrored), so I'm mirroring it by hand here. --- Comment #77 from Ulrich Weigand --- According to comment #45, we see an invalid read at __lll_unlock_elision (elision-unlock.c:36) and an invalid write at

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread Andrew Morrow
I think it is very likely that we are doing the sort of stack-based mutex pattern described above, or something similar. In particular, I'd expect that we certainly have states where we wait on a stack mutex, and then immediately unwind and destroy the mutex after we unblock. I'm working on gettin
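
A stack-based mutex pattern of the kind described above might look like the following C sketch. It is an illustration only, not MongoDB code; the names handoff, worker and stack_mutex_handoff are hypothetical. The waiter owns a mutex that lives in its own stack frame, and destroys it as soon as it has observed, under the lock, that the worker is finished.

```c
#include <pthread.h>
#include <sched.h>

struct handoff {
    pthread_mutex_t lock;
    int done;               /* protected by lock */
};

static void *worker(void *arg)
{
    struct handoff *h = arg;

    pthread_mutex_lock(&h->lock);
    h->done = 1;
    pthread_mutex_unlock(&h->lock);
    /* h must not be touched after this point: the waiter may already
     * have destroyed the mutex. */
    return NULL;
}

/* Returns 0 on success, -1 if the mutex or thread could not be set up. */
int stack_mutex_handoff(void)
{
    struct handoff h;       /* lives in this stack frame */
    pthread_t t;
    int seen = 0;

    h.done = 0;
    if (pthread_mutex_init(&h.lock, NULL) != 0)
        return -1;
    if (pthread_create(&t, NULL, worker, &h) != 0) {
        pthread_mutex_destroy(&h.lock);
        return -1;
    }

    /* Poll the flag under the lock until the worker has set it. */
    while (!seen) {
        pthread_mutex_lock(&h.lock);
        seen = h.done;
        pthread_mutex_unlock(&h.lock);
        if (!seen)
            sched_yield();
    }

    /* The mutex is unlocked here, so destroying it is allowed even if
     * the worker has not yet returned from pthread_mutex_unlock(). */
    pthread_mutex_destroy(&h.lock);
    pthread_join(t, NULL);
    return 0;
}
```

The program is correct as long as the unlock implementation does not touch the mutex after releasing it. If the unlock path writes to the mutex's memory afterwards, as the adapt_count update discussed in this thread appears to do, that write can land in a frame the waiter has already unwound and reused, which would match the stack corruption reported here.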

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread Aaron Sawdey
So, I rebuilt the glibc 2.23 from the 16.04 sources and modified the values written to the adapt_count parm in the lock elision code. It's a short and the original code may store values 0, 1, 2, 3. We were seeing either 1 (canary hit in constructor) or 0 (canary hit in destructor). I changed it to

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread Andrew Morrow
Aaron, thank you very much for running that experiment and confirming that this is an issue in libc. I think the component should probably be updated? Also, would you like us to try to continue to repro on a Skylake machine, or is this all architecture neutral code and therefore the POWER repro is

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread William J. Schmidt
Hi Andrew, This is a POWER-specific "optimization" that dates to last December (so it in fact wouldn't show up in Ubuntu 15.04 or 15.10, it appears). The decrement used to be attached to the lock rather than the unlock and it was apparently moved to the present location because it showed a good p

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread Andrew Morrow
Hi Bill - Thanks for the update, and for clarifying that this is POWER 16.04 only. We are very happy to be at a root cause for this issue - it had us pretty worried! We really appreciate all the help from everyone involved here. Will there be an upstream glibc bug associated with this ticket that

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-14 Thread William J. Schmidt
Hi Andrew, Aaron has just opened https://sourceware.org/bugzilla/show_bug.cgi?id=20822. An odd confluence of vacations, world holidays, and family leave has conspired to take all of our glibc experts out of the office tomorrow, so we may be a little delayed with testing and submitting the final f

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-15 Thread Andrew Morrow
Hi Bill - Thanks for the glibc bug link. Totally understand about people being out, not a problem. However, I'm not very familiar with the development process for upstream glibc fixes to make their way into an LTS release. Do you have a rough estimate of the timeline for that landing in somethin

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-15 Thread William J. Schmidt
Hi Andrew - I don't work directly in the glibc community, so I'm not completely familiar with their policies. However, the first step is to get a bug approved upstream so that it can be backported to the 2.23 release (in this case). Adam Conrad at Canonical has volunteered to help us shepherd th

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-15 Thread Adam Conrad
** Also affects: glibc via https://sourceware.org/bugzilla/show_bug.cgi?id=20822 Importance: Unknown Status: Unknown

[Bug 1640518] Re: MongoDB Memory corruption

2016-11-16 Thread William J. Schmidt
Patch submission is here: https://sourceware.org/ml/libc-alpha/2016-11/msg00568.html