On 08.05.2012 08:59, Alexander Grayver wrote:
> Barry,
>
> Not it works.
I mean NOW, sorry :)
> Thanks everybody!
>
> On 07.05.2012 22:36, Barry Smith wrote:
>> Alexander
>>
>> Satish and I have determined the problem (it took some valgrind and
>> debugger work). We were not allocating enough "workspace" for one of
>> the work arrays passed to zgesvd(). We have fixed it in petsc-dev.
>> You should be able to do an hg pull -u, then recompile the libraries
>> with make cmake, then relink and run your example.
>>
>> Thank you for your patience.
>>
>> Barry
>>
>> On May 7, 2012, at 9:10 AM, Alexander Grayver wrote:
>>
>>> On 07.05.2012 15:04, Barry Smith wrote:
>>>> I am also running complex.
>>>>
>>>> Look in the file dlasq2.f (it will be in the externalpackages
>>>> subdirectory of the PETSc directory). Look at line 215; this is
>>>> where valgrind has a problem. In my copy
>>>>
>>>>       END IF
>>>> *
>>>> *     Check for negative data and compute sums of q's and e's.
>>>> *<------ this is line 215
>>>>       Z( 2*N ) = ZERO
>>>>
>>>> it is a comment, which is not good. Is line 215 also a comment in
>>>> your copy of dlasq2.f?
>>> Barry,
>>>
>>> *
>>> *     Rearrange data for locality: Z=(q1,qq1,e1,ee1,q2,qq2,e2,ee2,...).
>>> *
>>>       DO 30 K = 2*N, 2, -2
>>>          Z( 2*K ) = ZERO
>>> !<----------- LINE 215
>>>          Z( 2*K-1 ) = Z( K )
>>>          Z( 2*K-2 ) = ZERO
>>>          Z( 2*K-3 ) = Z( K-1 )
>>>    30 CONTINUE
>>>
>>> In the valgrind log you can see that it complains about the following
>>> lines as well:
>>>
>>> ==9009== Invalid write of size 8
>>> ==9009==    at 0x10651D5: dlasq2_ (dlasq2.f:215)
>>> ==9009==    by 0x1064683: dlasq1_ (dlasq1.f:135)
>>> ==9009==    by 0x104EB3F: zbdsqr_ (zbdsqr.f:225)
>>> ==9009==    by 0x1023B74: zgesvd_ (zgesvd.f:2040)
>>> ==9009==    by 0xD38725: KSPComputeExtremeSingularValues_GMRES (gmreig.c:46)
>>> ==9009==    by 0xCB3CC7: KSPComputeExtremeSingularValues (itfunc.c:47)
>>> ==9009==    by 0x406DF2: main (solveTest.c:47)
>>> ==9009==  Address 0x6ef5d88 is 8 bytes before a block of size 832 alloc'd
>>> ==9009==    at 0x4C2786E: memalign (vg_replace_malloc.c:581)
>>> ==9009==    by 0x47E3CB: PetscMallocAlign (mal.c:30)
>>> ==9009==    by 0xD2E286: KSPSetUp_GMRES (gmres.c:73)
>>> ==9009==    by 0xCB5464: KSPSetUp (itfunc.c:239)
>>> ==9009==    by 0xCB6E56: KSPSolve (itfunc.c:402)
>>> ==9009==    by 0x406DDB: main (solveTest.c:46)
>>> ==9009==
>>> ==9009== Invalid write of size 8
>>> ==9009==    at 0x1065204: dlasq2_ (dlasq2.f:216)
>>> ....
>>> ==9009==
>>> ==9009== Invalid write of size 8
>>> ==9009==    at 0x1065223: dlasq2_ (dlasq2.f:217)
>>> ....
>>> ==9009==
>>> ==9009== Invalid write of size 8
>>> ==9009==    at 0x1065255: dlasq2_ (dlasq2.f:218)
>>> ....
>>>
>>> All further output is also related to the Z array.
>>> Hard to believe this is a LAPACK problem... I tried 3
>>> implementations on 2 machines.
>>> I have a bad feeling it's my stupid mistake somewhere...
>>> :)
>>>
>>> Just in case, I run Ubuntu 11.1 and PETSc is configured like this
>>> with the default gcc compiler:
>>> ./configure --with-petsc-arch=mpich-gcc-complex-debug-c
>>> --download-f-blas-lapack --with-precision=double
>>> --with-scalar-type=complex --download-mpich
>>>
>>>> There are two possible causes I can think of for your problem:
>>>>
>>>> 1) PETSc does not allocate enough work space for zgesvd(), or
>>>> 2) the BLAS/LAPACK routines have a bug where they sometimes access
>>>> memory outside their work space.
>>>>
>>>>
>>>> Satish,
>>>>
>>>> Can you try the same build options on a Linux machine as
>>>> close to Alexander's as we have and see if you can reproduce this?
>>>>
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>> On May 7, 2012, at 2:16 AM, Alexander Grayver wrote:
>>>>
>>>>> On 06.05.2012 22:24, Barry Smith wrote:
>>>>>> Alexander,
>>>>>>
>>>>>> I cannot reproduce this on my Mac with 3 different
>>>>>> BLAS/LAPACK implementations.
>>>>> Barry,
>>>>>
>>>>> I'm surprised. I ran it on my home PC with Ubuntu and PETSc
>>>>> configured from scratch as follows:
>>>>> --download-mpich --with-fortran-interfaces=1 --download-scalapack
>>>>> --download-blacs --with-scalar-type=complex --download-blas-lapack
>>>>> --with-precision=double
>>>>>
>>>>> And it's still there.
>>>>> Please note that all my numbers are complex.
>>>>>
>>>>>> Could you please run the case below but with
>>>>>> --download-f-blas-lapack (you forgot the -f last time)? Send us
>>>>>> the valgrind results. This will tell us the exact line number in
>>>>>> dlasq3() that is triggering the bad read.
>>>>> I did:
>>>>> ./configure --with-petsc-arch=openmpi-intel-complex-debug-c
>>>>> --download-scalapack --download-blacs --download-f-blas-lapack
>>>>> --with-precision=double --with-scalar-type=complex
>>>>>
>>>>> And then ran the program under valgrind.
>>>>> The first message from the log:
>>>>>
>>>>> ==27656== Invalid write of size 8
>>>>> ==27656==    at 0x15A8E9E: dlasq2_ (dlasq2.f:215)
>>>>> ==27656==    by 0x15A83A4: dlasq1_ (dlasq1.f:135)
>>>>> ==27656==    by 0x158ACEC: zbdsqr_ (zbdsqr.f:225)
>>>>> ==27656==    by 0x154EC27: zgesvd_ (zgesvd.f:2038)
>>>>> ==27656==    by 0x695DD3: KSPComputeExtremeSingularValues_GMRES (gmreig.c:46)
>>>>> ==27656==    by 0x69DD76: KSPComputeExtremeSingularValues (itfunc.c:47)
>>>>> ==27656==    by 0x44E98C: main (solveTest.c:62)
>>>>> ==27656==  Address 0xfad2d98 is 8 bytes before a block of size 832 alloc'd
>>>>> ==27656==    at 0x4C25D66: memalign (vg_replace_malloc.c:694)
>>>>> ==27656==    by 0x4B642B: PetscMallocAlign (mal.c:30)
>>>>> ==27656==    by 0x687775: KSPSetUp_GMRES (gmres.c:73)
>>>>> ==27656==    by 0x69FE4A: KSPSetUp (itfunc.c:239)
>>>>> ==27656==    by 0x6A2058: KSPSolve (itfunc.c:402)
>>>>> ==27656==    by 0x44E969: main (solveTest.c:61)
>>>>>
>>>>> Please find the full log attached.
>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>> On May 6, 2012, at 9:16 AM, Alexander Grayver wrote:
>>>>>>
>>>>>>> On 06.05.2012 15:34, Matthew Knepley wrote:
>>>>>>>> On Sun, May 6, 2012 at 9:24 AM, Alexander
>>>>>>>> Grayver <agrayver at gfz-potsdam.de> wrote:
>>>>>>>> Hm, valgrind gives a lot of output like that (see full log in
>>>>>>>> previous message):
>>>>>>>>
>>>>>>>> Can you run this with --download-f-blas-lapack? This sounds
>>>>>>>> much more like an MKL bug.
>>>>>>> I did:
>>>>>>> --download-scalapack --download-blacs --download-blas-lapack
>>>>>>> --with-precision=double --with-scalar-type=complex
>>>>>>>
>>>>>>> The error is still there. I checked "ldd solveTest"; MKL is
>>>>>>> definitely not used.
>>>>>>> This is not an MKL problem, I guess:
>>>>>>>
>>>>>>> ==13600== Invalid read of size 8
>>>>>>> ==13600==    at 0x58636AF: dlasq3_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>>>> ==13600==    by 0x5862C84: dlasq2_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>>>> ==13600==    by 0x5861F2C: dlasq1_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>>>> ==13600==    by 0x571A479: zbdsqr_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>>>> ==13600==    by 0x57466A7: zgesvd_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>>>> ==13600==    by 0x694687: KSPComputeExtremeSingularValues_GMRES (gmreig.c:46)
>>>>>>> ==13600==    by 0x69C62A: KSPComputeExtremeSingularValues (itfunc.c:47)
>>>>>>> ==13600==    by 0x44E02C: main (solveTest.c:62)
>>>>>>> ==13600==  Address 0x10826b90 is 16 bytes before a block of size 832 alloc'd
>>>>>>> ==13600==    at 0x4C25D66: memalign (vg_replace_malloc.c:694)
>>>>>>> ==13600==    by 0x4B5ACB: PetscMallocAlign (mal.c:30)
>>>>>>> ==13600==    by 0x686181: KSPSetUp_GMRES (gmres.c:73)
>>>>>>> ==13600==    by 0x69E6FE: KSPSetUp (itfunc.c:239)
>>>>>>> ==13600==    by 0x6A090C: KSPSolve (itfunc.c:402)
>>>>>>> ==13600==    by 0x44E009: main (solveTest.c:61)
>>>>>>>
>>>>>>> The weird thing is that it gives the correct result, so zgesvd
>>>>>>> works fine.
>>>>>>>
>>>>>>> Also, running this program with 10 iterations in valgrind
>>>>>>> doesn't produce the error. The log above is with 100 iterations.
>>>>>>> Without valgrind the error is always there.
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Alexander
>>>>>>>
>>>>> --
>>>>> Regards,
>>>>> Alexander
>>>>>
>>>>> <valgrind.zip>
>>>
>>> --
>>> Regards,
>>> Alexander
>>>
>
>
--
Regards,
Alexander
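A note on the dlasq2.f excerpt in this thread. The four invalid writes valgrind reports (dlasq2.f:215 to 218 in Alexander's copy) correspond to the four assignments in the body of the DO 30 loop. The first pass, with K = 2*N, already writes Z(4*N) down to Z(4*N-3).

The C function below is a hypothetical transliteration of that loop, for illustration only. The name rearrange_for_locality and its return value are not part of LAPACK or PETSc. It assumes Fortran's 1-based Z(i) maps to z[i - 1]. It shows that the loop needs Z to hold at least 4*N doubles. That is consistent with Barry's finding that one of the work arrays passed to zgesvd() was too small.

```c
#include <stddef.h>

/* Hypothetical C transliteration of the DO 30 loop quoted from dlasq2.f
 * (illustration only, not LAPACK code).
 * Fortran is 1-based, so Z(i) is written as z[i - 1] via the macro Z1.
 * On entry z[0 .. 2n-1] holds the input data; z must have room for 4n doubles.
 * Returns the highest 1-based index written, i.e. the minimum length of z. */
static size_t rearrange_for_locality(double *z, size_t n)
{
#define Z1(i) z[(i) - 1]
    size_t max_written = 0;

    /* DO 30 K = 2*N, 2, -2: walk downwards so entries not yet read survive */
    for (size_t k = 2 * n; k >= 2; k -= 2) {
        Z1(2 * k) = 0.0;            /* dlasq2.f:215 in Alexander's copy */
        Z1(2 * k - 1) = Z1(k);      /* dlasq2.f:216 */
        Z1(2 * k - 2) = 0.0;        /* dlasq2.f:217 */
        Z1(2 * k - 3) = Z1(k - 1);  /* dlasq2.f:218 */
        if (2 * k > max_written)
            max_written = 2 * k;    /* first pass (k = 2n) touches Z(4n) */
    }
#undef Z1
    return max_written;
}
```

With n = 2 and z = {1, 2, 3, 4, 0, 0, 0, 0}, the call leaves z = {1, 0, 2, 0, 3, 0, 4, 0} and returns 8, that is, 4*N. Looping from K = 2*N downwards is what allows the expansion to happen in place. Each pass only reads Z(K) and Z(K-1), and no earlier pass has overwritten those entries.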