How silly of me; of course Matlab is using Tim Davis' code here, but it's 
the more advanced SSMULT code (it's GPL; you can see it under 
SuiteSparse/MATLAB_Tools/SSMULT if the license isn't a concern for you), 
rather than the simpler version in CSparse and his book.

According to "On the Representation and Multiplication of Hypersparse 
Matrices" by Buluc and Gilbert, the Sulatycke-Ghose algorithm "examines all 
possible (i, j) positions of the input matrix A in the outermost loop and 
tests whether they are nonzero. Therefore, their algorithm has O(flops + 
n^2) complexity, performing unnecessary operations when flops < n^2."
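To see why that distinction matters for this thread's test case, here's a quick scipy sketch (my own illustration, not code from either paper) that builds a matrix shaped like Michael's (2000 x 10000, 10 entries per column) and counts the multiply-adds for C = A*A^T the Gustavson way, comparing against n^2 for the 2000 x 2000 output:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
# Same shape as the benchmark: 10000 columns, 10 random row indices each.
rows = rng.integers(0, 2000, size=(10000, 10)).ravel()
cols = np.repeat(np.arange(10000), 10)
vals = np.ones(rows.size, dtype=np.int64)
A = sp.csc_matrix((vals, (rows, cols)), shape=(2000, 10000))

# For C = A * A^T, column k of A pairs with row k of A^T, so it
# contributes nnz(A[:, k])^2 multiply-adds (Gustavson-style counting).
nnz_per_col = np.diff(A.indptr)
flops = int(np.sum(nnz_per_col.astype(np.int64) ** 2))
n_sq = 2000 * 2000

print(flops, n_sq)  # flops is well under n^2 for this matrix
```

So here flops is roughly a quarter of n^2, which is exactly the regime where an O(flops + n^2) algorithm does unnecessary work.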

One thing you can play with is trying
    x = Base.LinAlg.CHOLMOD.CholmodSparse(1.0*x)
which cuts another 5 seconds off your test case, but be warned that it 
appears to leak memory (CholmodSparse apparently needs a finalizer?) if you 
don't run gc() every few iterations. Changing
    y = x*transpose(x)
to
    y = x*x'
cuts off another few seconds, bringing it very close to Matlab's time. 
CholmodSparse does have A_mul_Bt methods defined, unlike SparseMatrixCSC.

I can get the pure-Julia version down to about 16 seconds by splitting the 
operation up into two passes, one just to determine the number of nonzeros 
in each column of the product then a separate pass to do the row indices 
and nonzeros. This avoids having to dynamically reallocate memory multiple 
times: https://gist.github.com/tkelman/9175190
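Roughly, the two-pass idea looks like this (a Python/scipy sketch of the general Gustavson-style approach, not the actual Julia code from the gist):

```python
import numpy as np
import scipy.sparse as sp

def two_pass_spmatmul(A, B):
    """Sparse CSC product C = A*B in two passes: first count the
    nonzeros in each column of C, then allocate exactly once and fill."""
    A = sp.csc_matrix(A)
    B = sp.csc_matrix(B)
    m, n = A.shape[0], B.shape[1]

    # Pass 1: count nnz in each column of C, using a scatter marker
    # so each row index is counted at most once per column.
    mark = np.full(m, -1)
    colcounts = np.zeros(n, dtype=np.int64)
    for j in range(n):
        for bptr in range(B.indptr[j], B.indptr[j + 1]):
            k = B.indices[bptr]
            for aptr in range(A.indptr[k], A.indptr[k + 1]):
                i = A.indices[aptr]
                if mark[i] != j:
                    mark[i] = j
                    colcounts[j] += 1

    # Allocate the output exactly once; no dynamic reallocation.
    indptr = np.zeros(n + 1, dtype=np.int64)
    np.cumsum(colcounts, out=indptr[1:])
    indices = np.empty(indptr[-1], dtype=np.int64)
    data = np.zeros(indptr[-1])

    # Pass 2: fill in row indices and accumulate values.
    mark[:] = -1
    pos = np.empty(m, dtype=np.int64)  # slot of row i within column j
    for j in range(n):
        head = indptr[j]
        for bptr in range(B.indptr[j], B.indptr[j + 1]):
            k = B.indices[bptr]
            bval = B.data[bptr]
            for aptr in range(A.indptr[k], A.indptr[k + 1]):
                i = A.indices[aptr]
                if mark[i] != j:
                    mark[i] = j
                    pos[i] = head
                    indices[head] = i
                    head += 1
                data[pos[i]] += bval * A.data[aptr]
    return sp.csc_matrix((data, indices, indptr), shape=(m, n))
```

The pure-Python loops are of course slow; the point is only the structure, where the symbolic pass sizes each column of the result up front so the numeric pass writes straight into a preallocated buffer.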

Michael, your matrix generator is a much more representative test case of 
real sparse matrices than the current sparse matrix multiplication tracked 
in Julia's performance tests 
(https://github.com/JuliaLang/julia/blob/master/test/perf/kernel/perf.jl#L25, 
which multiplies a "sparse" matrix of all ones with itself). I think your 
case would be good to add.

-Tony


On Saturday, February 22, 2014 4:34:43 AM UTC-8, Tim Holy wrote:
>
> Looks like our algorithm is based on Gustavson 78, and on modern machines 
> (i.e., cache-miss dominated) there seems to be a much faster, very simple 
> algorithm available. They advertise multithreading in the title, but note 
> they 
> show ~10x better performance even for single-threaded. 
>
> CACHING-EFFICIENT MULTITHREADED FAST MULTIPLICATION OF 
> SPARSE MATRICES 
> Peter D. Sulatycke and Kanad Ghose 
>
> Their improvements boil down to changing the loop order, which does not 
> seem 
> like it would be a very challenging thing to implement. Would be great if 
> someone who uses sparse matrices (currently, I don't) looked into this. 
>
> --Tim 
>
> On Friday, February 21, 2014 06:18:42 PM Michael Schnall-Levin wrote: 
> > I've been doing some benchmarking of Julia vs Scipy for sparse matrix 
> > multiplication and I'm finding that julia is significantly (~4X - 5X) 
> > faster in some instances. 
> > 
> > I'm wondering if I'm doing something wrong, or if this is really true. 
> >  Below are some code snippets for Julia and python.  Any help would be 
> very 
> > appreciated! 
> > 
> > ----- Julia: 
> > Elapsed Time on my laptop: 24.9 seconds ----- 
> > x_inds = Int[] 
> > y_inds = Int[] 
> > vals = Int[] 
> > 
> > for n = 1:10000 
> >     inds = rand(1:2000,10,1) 
> >     for ind in inds 
> >         push!(x_inds, ind) 
> >         push!(y_inds, n) 
> >         push!(vals,1) 
> >     end 
> > end 
> > 
> > x = sparse(x_inds, y_inds, vals, 2000, 10000) 
> > 
> > t = time() 
> > for j = 1:250 
> >     y = x*transpose(x) 
> > end 
> > print(string(time() - t, "\n")) 
> > ----- 
> > 
> > ---- Python       Elapsed Time on my laptop: 5.8 seconds ----- 
> > import numpy 
> > import scipy.sparse 
> > import time 
> > 
> > x_inds = [] 
> > y_inds = [] 
> > vals = [] 
> > for n in xrange(10000): 
> >     inds = numpy.random.randint(0, 2000,10) 
> > 
> >     for ind in inds: 
> >         x_inds.append(ind) 
> >         y_inds.append(n) 
> >         vals.append(1) 
> > 
> > x_inds = numpy.array(x_inds) 
> > y_inds = numpy.array(y_inds) 
> > vals = numpy.array(vals) 
> > 
> > x = scipy.sparse.csc_matrix((vals, (x_inds, y_inds)), shape=(2000, 
> 10000)) 
> > 
> > 
> > t = time.time() 
> > for j in xrange(250): 
> >     y = x*x.transpose() 
> > print time.time() - t 
>
