This is great to hear! More to do ..
On Monday, May 26, 2014 7:36:30 PM UTC-4, Jason Riedy wrote:

> First, the caveats:
>
>   - on large matrices,
>   - with few threads, and
>   - punching a hole directly to the BLAS library.
>
> Sivan Toledo's recursive LU factorization
> (http://dx.doi.org/10.1137/S0895479896297744) isn't just good for
> optimizing cache behavior but also for avoiding top-level overheads. A
> somewhat straightforward iterative implementation in Julia, attached,
> performs about as quickly as the current LAPACK call for square
> matrices of dimension 3000 or above, so long as you limit OpenBLAS to a
> single thread. I haven't tested against MKL.
>
> Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM
> and xTRSV directly. The allocations and the construction of
> StridedMatrix views at every step destroy any performance. Similarly
> with a light-weight wrapper attempt called LASubDomain below. Also, I
> have to write loops rather than vector expressions. That leads to ugly
> code compared to modern Fortran versions.
>
> Current ratio of recursive LU time to the Julia LAPACK dispatch on a
> dual-core, Sandy Bridge server by dimension (vertical) and number of
> threads (horizontal):
>
>     N \ NT        1        2        4        8       16       24       32
>        10   2301.01  2291.16  2335.39   2268.1  2532.12  2304.23  2320.84
>        30    661.25   560.34   443.07   514.52   392.04   425.56      472
>       100     98.62    90.38    96.69    94.31     74.6    73.61    76.33
>       300      8.12     8.14    12.19    16.15    14.55    14.16    15.82
>      1000      1.6      1.55     2.36     3.35     3.5      3.86     3.48
>      3000      1.04     1.13     1.36     1.99     1.62     1.36     1.39
>     10000      1.02     1.06     1.13     1.26     1.55     1.54     1.58
>
> The best case for traditional LU factorization relying on xGEMM for the
> Schur complement is 7.7x slower than the LAPACK dispatch, and it scales
> far worse.
>
> If there's interest, I'd love to work this into a faster generic
> base/linalg/lu.jl for use with other data types. I know I need to
> consider whether laswap!, lutrisolve!, and luschur! are the right
> utility methods. I'm interested in supporting doubled double, etc.,
> with relatively high performance. That implies building everything out
> of at least dot products. For "multiplied" precisions, those can be
> optimized to avoid continual re-normalization (see Siegfried Rump et
> al.'s work).
>
> (I actually wrote the code last November, but this is the first chance
> I've had to describe it. It might be a tad out of date style-wise with
> respect to current Julia. Given that lag, I suppose I shouldn't promise
> to beat this into shape at any fast pace...)
>
> [Attachments: ~/src/julia-stuff/reclu/reclu.jl and
> ~/src/julia-stuff/reclu/tst.jl]
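
For anyone skimming, the recursion Jason describes has a simple shape. Below is a minimal sketch in present-day Julia, not his attached reclu.jl: the names reclu! and swaprows! are mine, there is no singularity handling, and mul!/ldiv! stand in for the direct xGEMM/xTRSV calls his version punches through to.

    using LinearAlgebra

    # Swap rows j <-> piv[j] of B for each j in jr (LAPACK xLASWP-style).
    function swaprows!(B::AbstractMatrix, piv::AbstractVector{Int}, jr::AbstractUnitRange)
        @inbounds for j in jr
            p = piv[j]
            if p != j
                for c in axes(B, 2)
                    B[j, c], B[p, c] = B[p, c], B[j, c]
                end
            end
        end
        return B
    end

    # Recursive LU with partial pivoting on an m-by-n panel, m >= n.
    # On exit A holds the unit-lower L and upper U; ipiv holds the interchanges.
    function reclu!(A::AbstractMatrix{T}, ipiv::AbstractVector{Int}) where T
        m, n = size(A)
        if n == 1
            p = 1
            @inbounds for i in 2:m
                abs(A[i, 1]) > abs(A[p, 1]) && (p = i)
            end
            ipiv[1] = p
            @inbounds if p != 1
                A[1, 1], A[p, 1] = A[p, 1], A[1, 1]
            end
            d = A[1, 1]                # no singularity check in this sketch
            @inbounds for i in 2:m
                A[i, 1] /= d
            end
            return A, ipiv
        end
        k = n >> 1
        reclu!(view(A, :, 1:k), view(ipiv, 1:k))           # factor the left half
        AR = view(A, :, k+1:n)
        swaprows!(AR, ipiv, 1:k)                           # carry its pivots right
        ldiv!(UnitLowerTriangular(view(A, 1:k, 1:k)), view(AR, 1:k, :))  # U12
        mul!(view(AR, k+1:m, :), view(A, k+1:m, 1:k), view(AR, 1:k, :),
             -one(T), one(T))                              # Schur complement
        reclu!(view(A, k+1:m, k+1:n), view(ipiv, k+1:n))   # recurse on the rest
        @inbounds for j in k+1:n
            ipiv[j] += k                                   # globalize its pivots
        end
        swaprows!(view(A, :, 1:k), ipiv, k+1:n)            # and carry them left
        return A, ipiv
    end

As a quick check against the stdlib, reclu!(copy(A), Vector{Int}(undef, size(A, 1))) on a random square A should reproduce lu(A).factors and lu(A).ipiv up to roundoff, since both pick the first maximal pivot in each column.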
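On the "avoid the pretty BLAS interface" point: the 2014-era complaint was that every StridedMatrix view built to hand to gemm! heap-allocated, so the attached code apparently goes straight to the Fortran symbols with raw pointers and leading dimensions. Here is a hedged sketch of that style against today's stdlib; schur_gemm! and its offset-argument convention are my invention, and the @blasfunc/libblastrampoline plumbing follows the current LinearAlgebra stdlib, so older Julias need different names.

    using LinearAlgebra: BlasInt
    using LinearAlgebra.BLAS: @blasfunc, libblastrampoline

    # C[ci:ci+m-1, cj:cj+n-1] -= A[ai:ai+m-1, ak:ak+kk-1] * B[bk:bk+kk-1, bj:bj+n-1]
    # via a direct dgemm_ call: raw pointers plus leading dimensions, no views.
    function schur_gemm!(C::Matrix{Float64}, ci::Int, cj::Int,
                         A::Matrix{Float64}, ai::Int, ak::Int,
                         B::Matrix{Float64}, bk::Int, bj::Int,
                         m::Int, n::Int, kk::Int)
        lda, ldb, ldc = stride(A, 2), stride(B, 2), stride(C, 2)
        GC.@preserve A B C ccall(
            (@blasfunc(dgemm_), libblastrampoline), Cvoid,
            (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
             Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
             Ptr{Float64}, Ref{BlasInt},
             Ref{Float64}, Ptr{Float64}, Ref{BlasInt}, Clong, Clong),
            'N', 'N', m, n, kk,
            -1.0, pointer(A, ai + (ak - 1) * lda), lda,
            pointer(B, bk + (bj - 1) * ldb), ldb,
            1.0, pointer(C, ci + (cj - 1) * ldc), ldc, 1, 1)
        return C
    end

In current Julia this is mostly of historical interest: views are cheap now, and five-argument mul! forwards strided Float64 views to gemm! without copies, which is why the recursion sketch above just uses mul! on views.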
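Last, on the generic base/linalg/lu.jl idea: laswap!, lutrisolve!, and luschur! are Jason's names, but the bodies below are only my guess at the intended split. The point of writing them this way is that every inner accumulation is an explicit dot product, which is exactly where a doubled-double element type could defer renormalization in the Rump style he mentions.

    # Apply recorded row interchanges j <-> ipiv[j], in order (xLASWP-style).
    function laswap!(A::AbstractMatrix, ipiv::AbstractVector{Int})
        for j in eachindex(ipiv)
            p = ipiv[j]
            if p != j
                for c in axes(A, 2)
                    A[j, c], A[p, c] = A[p, c], A[j, c]
                end
            end
        end
        return A
    end

    # In-place forward substitution B = L \ B for unit-lower-triangular L,
    # with each update accumulated as one dot product.
    function lutrisolve!(L::AbstractMatrix{T}, B::AbstractMatrix{T}) where T
        n = size(L, 1)
        for c in axes(B, 2), i in 2:n
            acc = zero(T)
            for k in 1:i-1
                acc += L[i, k] * B[k, c]  # a doubled type could renormalize once, here
            end
            B[i, c] -= acc
        end
        return B
    end

    # Schur-complement update C .-= A * B, again as explicit dot products.
    function luschur!(C::AbstractMatrix{T}, A::AbstractMatrix{T},
                      B::AbstractMatrix{T}) where T
        for j in axes(C, 2), i in axes(C, 1)
            acc = zero(T)
            for k in axes(A, 2)
                acc += A[i, k] * B[k, j]
            end
            C[i, j] -= acc
        end
        return C
    end

Swap these in for the ldiv!/mul! calls in the recursion sketch and the whole factorization runs for any element type supporting +, *, /, and an ordered abs.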