This is great to hear! More to do...

On Monday, May 26, 2014 7:36:30 PM UTC-4, Jason Riedy wrote:
>
> First, the caveats:
>
>    - on large matrices,
>    - with few threads, and
>    - punching a hole directly to the BLAS library.
>
> Sivan Toledo's recursive LU factorization
> (http://dx.doi.org/10.1137/S0895479896297744) isn't just good for
> optimizing cache behavior; it also avoids top-level overheads. A fairly
> straightforward iterative implementation in Julia, attached, performs
> about as quickly as the current LAPACK call for square matrices of
> dimension 3000 or above, so long as you limit OpenBLAS to a single
> thread. I haven't tested against MKL.
>  
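> For illustration only, here is a minimal recursive sketch of that two-way
> column split in present-day Julia syntax. The names reclu!/_reclu! are
> mine, and this is not the attached reclu.jl (which is an iterative
> formulation); it leans on views, mul!, and ldiv!, which is exactly the
> overhead discussed in the next paragraph:
>
>     using LinearAlgebra
>
>     # In-place recursive LU with partial pivoting on an m-by-n matrix,
>     # m >= n.  On exit A holds unit lower triangular L below the diagonal
>     # and U on and above it; ipiv is LAPACK-style (row i was swapped with
>     # row ipiv[i]).
>     function reclu!(A::AbstractMatrix)
>         m, n = size(A)
>         @assert m >= n
>         ipiv = Vector{Int}(undef, n)
>         _reclu!(A, ipiv)
>         return ipiv
>     end
>
>     function _reclu!(A, ipiv)
>         m, n = size(A)
>         if n == 1
>             # Base case: pick the largest entry as pivot, swap it up, scale.
>             k = 1
>             for i in 2:m
>                 abs(A[i, 1]) > abs(A[k, 1]) && (k = i)
>             end
>             ipiv[1] = k
>             A[1, 1], A[k, 1] = A[k, 1], A[1, 1]
>             for i in 2:m
>                 A[i, 1] /= A[1, 1]
>             end
>             return
>         end
>         n1 = n ÷ 2
>         # Factor the left panel recursively.
>         _reclu!(view(A, :, 1:n1), view(ipiv, 1:n1))
>         # Apply the left panel's row swaps to the right panel.
>         right = view(A, :, (n1 + 1):n)
>         for i in 1:n1
>             k = ipiv[i]
>             k == i && continue
>             for j in 1:(n - n1)
>                 right[i, j], right[k, j] = right[k, j], right[i, j]
>             end
>         end
>         # A12 <- L11 \ A12 (unit lower triangular solve).
>         A12 = view(A, 1:n1, (n1 + 1):n)
>         ldiv!(UnitLowerTriangular(view(A, 1:n1, 1:n1)), A12)
>         # Schur complement: A22 <- A22 - L21 * A12.
>         L21 = view(A, (n1 + 1):m, 1:n1)
>         A22 = view(A, (n1 + 1):m, (n1 + 1):n)
>         mul!(A22, L21, A12, -one(eltype(A)), one(eltype(A)))
>         # Factor the trailing block, then fold its swaps back into L21.
>         ipiv2 = view(ipiv, (n1 + 1):n)
>         _reclu!(A22, ipiv2)
>         for i in 1:(n - n1)
>             k = ipiv2[i]
>             if k != i
>                 for j in 1:n1
>                     L21[i, j], L21[k, j] = L21[k, j], L21[i, j]
>                 end
>             end
>             ipiv2[i] += n1    # make the index relative to the whole block
>         end
>         return
>     end
>
> A quick sanity check is to run it on a copy of a random square matrix and
> compare the unit-lower and upper factors it leaves in place against lu(A).
>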
> Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM
> and xTRSV directly. Allocating and constructing StridedMatrix views at
> every step destroys performance, and the same holds for a light-weight
> wrapper attempt called LASubDomain below. I also have to write loops
> rather than vector expressions, which leads to ugly code compared to
> modern Fortran versions.
>  
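> To make the call pattern concrete: in current Julia, "calling xGEMM and
> xTRSV directly" corresponds to the stdlib BLAS.gemm! and BLAS.trsv!
> wrappers. A hypothetical right-panel update (names mine, not the
> attachment's) might look like the sketch below; it still builds views for
> brevity, which is the per-step cost the attached code also avoids:
>
>     using LinearAlgebra: BLAS
>
>     # Hypothetical update step for a leading n1-column panel of A:
>     # A12 <- L11 \ A12 via one xTRSV per column, then
>     # A22 <- A22 - L21 * A12 via a single xGEMM.
>     function blas_update!(A::Matrix{Float64}, n1::Int)
>         m, n = size(A)
>         L11 = view(A, 1:n1, 1:n1)
>         A12 = view(A, 1:n1, (n1 + 1):n)
>         L21 = view(A, (n1 + 1):m, 1:n1)
>         A22 = view(A, (n1 + 1):m, (n1 + 1):n)
>         for j in 1:size(A12, 2)
>             # Unit lower triangular solve on one column (xTRSV).
>             BLAS.trsv!('L', 'N', 'U', L11, view(A12, :, j))
>         end
>         # Rank-n1 update of the trailing block (xGEMM).
>         BLAS.gemm!('N', 'N', -1.0, L21, A12, 1.0, A22)
>         return A
>     end
>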
> Current ratio of recursive LU time to the Julia LAPACK dispatch on a 
> dual-core, Sandy Bridge server by dimension (vertical) and number of 
> threads (horizontal):
>    N \ NT         1         2         4         8        16        24        32
>        10   2301.01   2291.16   2335.39    2268.1   2532.12   2304.23   2320.84
>        30    661.25    560.34    443.07    514.52    392.04    425.56       472
>       100     98.62     90.38     96.69     94.31      74.6     73.61     76.33
>       300      8.12      8.14     12.19     16.15     14.55     14.16     15.82
>      1000       1.6      1.55      2.36      3.35       3.5      3.86      3.48
>      3000      1.04      1.13      1.36      1.99      1.62      1.36      1.39
>     10000      1.02      1.06      1.13      1.26      1.55      1.54      1.58
>
> The best case for traditional LU factorization relying on xGEMM for the 
> Schur complement is 7.7x slower than the LAPACK dispatch, and it scales far 
> worse.
>  
> If there's interest, I'd love to work this into a faster generic
> base/linalg/lu.jl for use with other data types. I know I need to
> consider whether laswap!, lutrisolve!, and luschur! are the right utility
> methods. I'm interested in supporting double-double and similar types
> with relatively high performance, which implies building everything out
> of at least dot products. For such "multiplied" precisions, the dot
> products can be optimized to avoid continual re-normalization (see the
> work of Siegfried Rump et al.).
>  
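> As a rough sketch of what "built out of dot products" could mean for a
> generic element type (the function name and layout here are mine, not the
> proposed luschur!): finishing each output entry with a single inner
> product leaves one obvious place to swap in a Rump-style,
> renormalization-free accumulation for extended-precision types.
>
>     # Hypothetical dot-product-style Schur update, A22 -= L21 * A12, for a
>     # generic element type T.  The inner loop is the one spot a
>     # double-double-like T would replace with a specialized accumulation
>     # that defers renormalization until the entry is finished.
>     function schur_by_dots!(A22::AbstractMatrix{T}, L21::AbstractMatrix{T},
>                             A12::AbstractMatrix{T}) where {T}
>         m, n = size(A22)
>         k = size(L21, 2)
>         size(L21, 1) == m && size(A12) == (k, n) ||
>             throw(DimensionMismatch("incompatible block sizes"))
>         @inbounds for j in 1:n, i in 1:m
>             s = zero(T)
>             for l in 1:k
>                 s += L21[i, l] * A12[l, j]
>             end
>             A22[i, j] -= s
>         end
>         return A22
>     end
>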
> (I actually wrote the code last November, but this is the first chance
> I've had to describe it. It might be a tad out of date style-wise with
> respect to current Julia. Given that lag, I suppose I shouldn't promise
> to beat this into shape at any fast pace…)
>  
> [Attachments, inline as text/plain: ~/src/julia-stuff/reclu/reclu.jl and
> ~/src/julia-stuff/reclu/tst.jl]
>  
