How silly of me, of course Matlab is using Tim Davis' code here, but it's the more advanced SSMULT code (it's GPL, you can see it under SuiteSparse/MATLAB_Tools/SSMULT if the license isn't a concern for you), rather than the simplistic version in CSparse and his book.
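For anyone curious what the "simplistic" CSparse-style approach looks like, it is essentially a one-pass column-by-column multiply: for each output column, scatter scaled columns of the left operand into a dense accumulator, growing the output arrays as you go. Here is a minimal Python sketch over raw 0-based CSC arrays (the function name and layout are my own for illustration — CSparse itself is C, and it leaves row indices within a column unsorted; this sketch sorts them for readability):

```python
def csc_multiply(A_colptr, A_rowval, A_nzval, B_colptr, B_rowval, B_nzval, m):
    """Sketch of a one-pass CSC sparse-sparse product C = A*B.

    A is m-by-k, B is k-by-n, both given as 0-based CSC arrays.
    A dense scatter workspace of length m is reused for every column,
    and the output arrays grow dynamically as nonzeros are discovered.
    """
    n = len(B_colptr) - 1
    C_colptr = [0]
    C_rowval, C_nzval = [], []
    mark = [-1] * m    # last output column in which each row was touched
    acc = [0.0] * m    # dense accumulator for the current output column
    for j in range(n):
        rows = []
        # C[:, j] = sum over k of A[:, k] * B[k, j]
        for p in range(B_colptr[j], B_colptr[j + 1]):
            k, bkj = B_rowval[p], B_nzval[p]
            for q in range(A_colptr[k], A_colptr[k + 1]):
                i = A_rowval[q]
                if mark[i] != j:       # first touch of row i in this column
                    mark[i] = j
                    acc[i] = 0.0
                    rows.append(i)
                acc[i] += A_nzval[q] * bkj
        rows.sort()                    # canonical (sorted) row order per column
        for i in rows:
            C_rowval.append(i)
            C_nzval.append(acc[i])
        C_colptr.append(len(C_rowval))
    return C_colptr, C_rowval, C_nzval
```

The repeated `append` calls on the output arrays are exactly the dynamic reallocation cost discussed below.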
According to "On the Representation and Multiplication of Hypersparse Matrices" by Buluc and Gilbert, the Sulatycke-Ghose algorithm "examines all possible (i, j) positions of the input matrix A in the outermost loop and tests whether they are nonzero. Therefore, their algorithm has O(flops + n^2) complexity, performing unnecessary operations when flops < n^2."

One thing you can play with is trying x = Base.LinAlg.CHOLMOD.CholmodSparse(1.0*x), which cuts another 5 seconds off your test case, but be warned that it appears to leak memory (CholmodSparse apparently needs a finalizer?) if you don't run gc() every few iterations. Changing y = x*transpose(x) to y = x*x' cuts another few seconds, making it very comparable to Matlab. CholmodSparse does have A_mul_Bt methods defined, unlike SparseMatrixCSC.

I can get the pure-Julia version down to about 16 seconds by splitting the operation into two passes: one just to determine the number of nonzeros in each column of the product, then a separate pass to fill in the row indices and nonzero values. This avoids having to dynamically reallocate memory multiple times: https://gist.github.com/tkelman/9175190

Michael, your matrix generator is a much more representative test case of real sparse matrices than the sparse matrix multiplication currently tracked in Julia's performance tests (https://github.com/JuliaLang/julia/blob/master/test/perf/kernel/perf.jl#L25, multiplying a "sparse" matrix of all ones with itself). I think your case would be good to add.

-Tony

On Saturday, February 22, 2014 4:34:43 AM UTC-8, Tim Holy wrote:
>
> Looks like our algorithm is based on Gustavson 78, and on modern machines
> (i.e., cache-miss dominated) there seems to be a much faster, very simple
> algorithm available. They advertise multithreading in the title, but note
> they show ~10x better performance even for single-threaded.
>
> CACHING-EFFICIENT MULTITHREADED FAST MULTIPLICATION OF
> SPARSE MATRICES
> Peter D. Sulatycke and Kanad Ghose
>
> Their improvements boil down to changing the loop order, which does not
> seem like it would be a very challenging thing to implement. Would be
> great if someone who uses sparse matrices (currently, I don't) looked
> into this.
>
> --Tim
>
> On Friday, February 21, 2014 06:18:42 PM Michael Schnall-Levin wrote:
> > I've been doing some benchmarking of Julia vs Scipy for sparse matrix
> > multiplication and I'm finding that julia is significantly (~4X - 5X)
> > slower in some instances.
> >
> > I'm wondering if I'm doing something wrong, or if this is really true.
> > Below are some code snippets for Julia and python. Any help would be
> > very appreciated!
> >
> > ----- Julia: Elapsed Time on my laptop: 24.9 seconds -----
> > x_inds = Int[]
> > y_inds = Int[]
> > vals = Int[]
> >
> > for n = 1:10000
> >     inds = rand(1:2000, 10, 1)
> >     for ind in inds
> >         push!(x_inds, ind)
> >         push!(y_inds, n)
> >         push!(vals, 1)
> >     end
> > end
> >
> > x = sparse(x_inds, y_inds, vals, 2000, 10000)
> >
> > t = time()
> > for j = 1:250
> >     y = x*transpose(x)
> > end
> > print(string(time() - t, "\n"))
> > -----
> >
> > ----- Python: Elapsed Time on my laptop: 5.8 seconds -----
> > import numpy
> > import scipy.sparse
> > import time
> >
> > x_inds = []
> > y_inds = []
> > vals = []
> > for n in xrange(10000):
> >     inds = numpy.random.randint(0, 2000, 10)
> >
> >     for ind in inds:
> >         x_inds.append(ind)
> >         y_inds.append(n)
> >         vals.append(1)
> >
> > x_inds = numpy.array(x_inds)
> > y_inds = numpy.array(y_inds)
> > vals = numpy.array(vals)
> >
> > x = scipy.sparse.csc_matrix((vals, (x_inds, y_inds)), shape=(2000, 10000))
> >
> > t = time.time()
> > for j in xrange(250):
> >     y = x*x.transpose()
> > print time.time() - t
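To illustrate the two-pass split mentioned above (the gist itself is Julia code; this is only an illustrative Python sketch over raw 0-based CSC arrays, with hypothetical names): a first symbolic pass counts the nonzeros in each column of the product so that the result arrays can be allocated exactly once, then a second numeric pass fills in row indices and values with no dynamic reallocation:

```python
def csc_multiply_twopass(A_colptr, A_rowval, A_nzval,
                         B_colptr, B_rowval, B_nzval, m):
    """Sketch of a two-pass CSC sparse-sparse product C = A*B.

    A is m-by-k, B is k-by-n, both as 0-based CSC arrays.
    Pass 1 (symbolic) counts nonzeros per output column; pass 2
    (numeric) writes into exactly-sized, preallocated arrays.
    """
    n = len(B_colptr) - 1
    # Pass 1: count nonzeros in each column of C.
    mark = [-1] * m
    C_colptr = [0] * (n + 1)
    for j in range(n):
        count = 0
        for p in range(B_colptr[j], B_colptr[j + 1]):
            k = B_rowval[p]
            for q in range(A_colptr[k], A_colptr[k + 1]):
                i = A_rowval[q]
                if mark[i] != j:
                    mark[i] = j
                    count += 1
        C_colptr[j + 1] = C_colptr[j] + count
    nnz = C_colptr[n]
    # Pass 2: fill preallocated arrays -- no resizing needed.
    C_rowval = [0] * nnz
    C_nzval = [0.0] * nnz
    mark = [-1] * m
    acc = [0.0] * m
    pos = 0
    for j in range(n):
        rows = []
        for p in range(B_colptr[j], B_colptr[j + 1]):
            k, bkj = B_rowval[p], B_nzval[p]
            for q in range(A_colptr[k], A_colptr[k + 1]):
                i = A_rowval[q]
                if mark[i] != j:
                    mark[i] = j
                    acc[i] = 0.0
                    rows.append(i)
                acc[i] += A_nzval[q] * bkj
        rows.sort()
        for i in rows:
            C_rowval[pos] = i
            C_nzval[pos] = acc[i]
            pos += 1
    return C_colptr, C_rowval, C_nzval
```

The symbolic pass roughly doubles the traversal work, but in a language where growing arrays means repeated copy-and-reallocate, trading that for a single exact allocation is usually a net win, which matches the ~24.9 s to ~16 s improvement reported above.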