On 09.02.2016 21:01, Nathaniel Smith wrote:
> On Tue, Feb 9, 2016 at 11:37 AM, Julian Taylor
> <jtaylor.deb...@googlemail.com> wrote:
>> On 09.02.2016 04:59, Nathaniel Smith wrote:
>>> On Mon, Feb 8, 2016 at 6:07 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>>> On Mon, Feb 8, 2016 at 6:04 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
>>>>> On Mon, Feb 8, 2016 at 5:26 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>>>>> On Mon, Feb 8, 2016 at 4:37 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
>>>>>> [...]
>>>>>>> I can't replicate the segfault with manylinux wheels and scipy. On
>>>>>>> the other hand, I get a new test error for numpy from manylinux,
>>>>>>> scipy from manylinux, like this:
>>>>>>>
>>>>>>> $ python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>>>
>>>>>>> ======================================================================
>>>>>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>>>>>> ----------------------------------------------------------------------
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
>>>>>>>     self.test(*self.arg)
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/scipy/linalg/tests/test_decomp.py", line 658, in eigenhproblem_general
>>>>>>>     assert_array_almost_equal(diag2_, ones(diag2_.shape[0]), DIGITS[dtype])
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 892, in assert_array_almost_equal
>>>>>>>     precision=decimal)
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 713, in assert_array_compare
>>>>>>>     raise AssertionError(msg)
>>>>>>> AssertionError:
>>>>>>> Arrays are not almost equal to 4 decimals
>>>>>>>
>>>>>>> (mismatch 100.0%)
>>>>>>>  x: array([ 0.,  0.,  0.], dtype=float32)
>>>>>>>  y: array([ 1.,  1.,  1.])
>>>>>>>
>>>>>>> ----------------------------------------------------------------------
>>>>>>> Ran 1507 tests in 14.928s
>>>>>>>
>>>>>>> FAILED (KNOWNFAIL=4, SKIP=1, failures=1)
>>>>>>>
>>>>>>> This is a very odd error, which we don't get when running over a
>>>>>>> numpy installed from source and linked to ATLAS, and which doesn't
>>>>>>> happen when running the tests via:
>>>>>>>
>>>>>>>     nosetests /usr/local/lib/python2.7/dist-packages/scipy/linalg
>>>>>>>
>>>>>>> So, something about the copy of numpy (linked to openblas) is
>>>>>>> affecting the results of scipy (also linked to openblas), and only
>>>>>>> with a particular environment / test order.
>>>>>>>
>>>>>>> If you'd like to try and see whether y'all can do a better job of
>>>>>>> debugging than me:
>>>>>>>
>>>>>>> # Run this script inside a docker container started with this incantation:
>>>>>>> #   docker run -ti --rm ubuntu:12.04 /bin/bash
>>>>>>> apt-get update
>>>>>>> apt-get install -y python curl
>>>>>>> apt-get install libpython2.7  # this won't be necessary with the next
>>>>>>>                               # iteration of manylinux wheel builds
>>>>>>> curl -LO https://bootstrap.pypa.io/get-pip.py
>>>>>>> python get-pip.py
>>>>>>> pip install -f https://nipy.bic.berkeley.edu/manylinux numpy scipy nose
>>>>>>> python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>>
>>>>>> I just tried this and on my laptop it completed without error.
>>>>>>
>>>>>> Best guess is that we're dealing with some memory corruption bug
>>>>>> inside openblas, so it's getting perturbed by things like exactly what
>>>>>> other calls to openblas have happened (which is different depending on
>>>>>> whether numpy is linked to openblas), and which core type openblas has
>>>>>> detected.
>>>>>>
>>>>>> On my laptop, which *doesn't* show the problem, running with
>>>>>> OPENBLAS_VERBOSE=2 says "Core: Haswell".
>>>>>>
>>>>>> Guess the next step is checking what core type the failing machines
>>>>>> use, and running valgrind... anyone have a good valgrind suppressions
>>>>>> file?
>>>>>
>>>>> My machine (which does give the failure) gives
>>>>>
>>>>>     Core: Core2
>>>>>
>>>>> with OPENBLAS_VERBOSE=2
>>>>
>>>> Yep, that allows me to reproduce it:
>>>>
>>>> root@f7153f0cc841:/# OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=Core2 python -c 'import scipy.linalg; scipy.linalg.test()'
>>>> Core: Core2
>>>> [...]
>>>> ======================================================================
>>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>>> ----------------------------------------------------------------------
>>>> [...]
>>>>
>>>> So this is indeed sounding like an OpenBLAS issue... next stop
>>>> valgrind, I guess :-/
>>>
>>> Here's the valgrind output:
>>> https://gist.github.com/njsmith/577d028e79f0a80d2797
>>>
>>> There's a lot of it, but no smoking guns have jumped out at me :-/
>>>
>>> -n
>>
>> Plenty of smoking guns, e.g.:
>>
>> ==3695== Invalid read of size 8
>> ==3695==    at 0x7AAA9C0: daxpy_k_CORE2 (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x76BEEFC: ger_kernel (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x788F618: exec_blas (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x76BF099: dger_thread (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x767DC37: dger_ (in /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>>
>> I think I have already reported that to openblas; they said they do that
>> intentionally, though last I checked they are missing the code that
>> verifies this is actually allowed (if you're not crossing a page you can
>> read beyond the boundaries). It's pretty likely a pointless
>> micro-optimization; you normally only use that trick for string
>> functions, where you don't know the size of the string.
>
> Yeah, I thought that was intentional, and we're not getting a segfault,
> so I don't think they're hitting any page boundaries. It's possible
> they're screwing it up and somehow the random data they're reading can
> affect the results, and that's why we get the wrong answer sometimes,
> but that's just a wild guess.
With openblas everything is possible, especially this exact type of issue.
See e.g. https://github.com/xianyi/OpenBLAS/issues/171 : there it loaded
too much data, partly uninitialized, and if that memory happened to be
filled with NaN, it spread into the actually used data. That was a lot of
fun to debug, and openblas is riddled with this stuff... e.g. here is my
favourite comment in openblas (which is probably the source of
https://github.com/scipy/scipy/issues/5528):

    /* make it volatile because some function (ex: dgemv_n.S) */ \
    /* do not restore all register */ \

https://github.com/xianyi/OpenBLAS/blob/develop/common_stackalloc.h#L51

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion