Very impressive! It's pretty cool to have native-Julia performance matching 
MKL! And linear performance up to 40 threads...not too shabby...

I was actually hoping that sparse matvec multiplication would be one of the 
first major uses of SharedArrays, and you've just done it. This will certainly 
raise the value of IterativeSolvers. This is probably the best demonstration 
of the power of parallelization yet. Congratulations, and thanks!

When you're ready, it would be great to register this as a package. 
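
Re: the overloading question in your Feb 11 message below: you don't need to 
overload the '* operation itself. A'*x is lowered by the parser to 
Ac_mul_B(A, x) (and A.'*x to At_mul_B(A, x)), so adding methods for those is 
enough. A sketch, untested, assuming your operator stores the matrix in a 
field called A:

    import Base: Ac_mul_B, At_mul_B
    # A'*x on the operator reuses the transpose-multiply kernel on the stored A
    Ac_mul_B(L::SharedBilinearOperator, x) = At_mul_B(L.A, x)
    At_mul_B(L::SharedBilinearOperator, x) = At_mul_B(L.A, x)

(For a real element type the two coincide.)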

Best,
--Tim

On Thursday, February 13, 2014 03:49:20 PM Madeleine Udell wrote:
> Thanks for all the advice, everyone. I've just finished a parallel sparse
> matrix vector multiplication library written in straight julia for shared
> memory machines, using Amit Murthy's SharedArrays. You can find it at
> https://github.com/madeleineudell/ParallelSparseMatMul.jl
> 
> For a matrix with 1e9 nonzeros, its performance on 6 processors matches
> MKL's, and it retains linear parallel scaling up to the highest thread count
> I've tested (40).
> 
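> The shape of the approach, for the curious (a sketch only, not the package
> code; a real implementation would share the matrix's colptr/rowval/nzval
> buffers with the workers and hand each worker a contiguous chunk of columns
> rather than one column per task):
> 
>     # compute y = S'*x for a CSC matrix S: one output entry per column of S
>     y = SharedArray(Float64, size(S, 2))
>     @sync @parallel for j in 1:size(S, 2)
>         s = 0.0
>         for k in S.colptr[j]:(S.colptr[j+1] - 1)  # nonzeros in column j
>             s += S.nzval[k] * x[S.rowval[k]]
>         end
>         y[j] = s
>     end
> 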
> (serial julia)
> julia> @time S'*x;
> elapsed time: 17.63098703 seconds (8410364 bytes allocated)
> 
> (mkl, 6 threads)
> julia> @time Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y);
> elapsed time: 4.334168257 seconds (48 bytes allocated)
> 
> (ParallelSparseMatMul, 6 threads)
> julia> @time y = A'*x;
> elapsed time: 4.18557078 seconds (1858156 bytes allocated)
> 
> On Tue, Feb 11, 2014 at 1:44 PM, Madeleine Udell <madeleine.ud...@gmail.com> wrote:
> > Thanks, Jake. It seems like I'm still not able to go above 6 threads, but
> > that may be deep in my MKL install.
> > 
> > I've begun implementing a shared sparse matrix class in native Julia that
> > seems to scale better. The code is attached to this email.  How can I
> > overload A'*b for my SharedBilinearOperator class, so that I can call
> > At_mul_B on a stored copy of A', rather than computing A' each time I want
> > to multiply by it? I.e., can I write something like
> > 
> > ('*)(L::SharedBilinearOperator,x) = At_mul_B(L.A,x)
> > 
> > ?
> > 
> > On Tue, Feb 11, 2014 at 8:40 AM, Jake Bolewski <jakebolew...@gmail.com> wrote:
> >> Hey Madeleine,
> >> 
> >> First I would check that your global environment variables
> >> <http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/GUID-0DE6A77B-00E0-4ED6-9CAE-52FCF49E5623.htm>
> >> are set up correctly. If you want to set the number of threads
> >> programmatically:
> >>                _
> >>    _       _ _(_)_     |  A fresh approach to technical computing
> >>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
> >>    _ _   _| |_  __ _   |  Type "help()" to list help topics
> >>   | | | | | | |/ _` |  |
> >>   | | |_| | | | (_| |  |  Version 0.3.0-prerelease+1470 (2014-02-08 16:23 UTC)
> >>  _/ |\__'_|_|_|\__'_|  |  Commit 596d5c4* (3 days old master)
> >> |__/                   |  x86_64-unknown-linux-gnu
> >> 
> >> julia> Base.blas_vendor()
> >> :mkl
> >> 
> >> julia> Base.blas_set_num_threads(12)
> >> 
> >> You can see the relevant code here:
> >> https://github.com/JuliaLang/julia/blob/8ac5a7f7fff1c54a768c7bc9ae85cf053d310f42/base/util.jl#L293
> >> It's always worth a quick search through the code base to figure out what
> >> is going on behind the scenes.
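> >> 
> >> In particular, when the vendor is :mkl that function just ccalls MKL's own
> >> entry point, so something like this (untested) should also work directly:
> >> 
> >> julia> ccall((:MKL_Set_Num_Threads, Base.libblas_name), Void, (Cint,), 32)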
> >> 
> >> Hope that helps!
> >> Jake
> >> 
> >> On Tuesday, February 11, 2014 12:34:56 AM UTC-5, Madeleine Udell wrote:
> >>> Jake, thanks for the reference
> >>> <http://software.intel.com/en-us/forums/topic/294954>; I have 32
> >>> hyperthreaded cores, so there's definitely something else going on, in
> >>> addition to not exploiting hyperthreading, that limits me to 6.
> >>> 
> >>> Perhaps I need to call something like omp_set_num_threads()? But there
> >>> doesn't seem to be a function by this name in the libmkl_rt library.
> >>> 
> >>> julia> ccall((:omp_set_num_threads,Base.libblas_name),Ptr{Void},(Uint8,),32)
> >>> ERROR: ccall: could not find function omp_set_num_threads in library libmkl_rt
> >>>  in anonymous at no file
> >>> 
> >>> On Mon, Feb 10, 2014 at 4:05 PM, Jake Bolewski <jakebo...@gmail.com> wrote:
> >>>> Are those hyper-threaded "cores"?  If so, check Intel MKL's
> >>>> documentation on hyper-threading.
> >>>> 
> >>>> -Best
> >>>> Jake
> >>>> 
> >>>> On Monday, February 10, 2014 6:38:50 PM UTC-5, Madeleine Udell wrote:
> >>>>> It looks like only 6 threads are being used when I call MKL from Julia.
> >>>>> If I do blas_set_num_threads(n), then watching top, I see julia is
> >>>>> running at min(n,6)*100% CPU. Any idea why this would be, or how to get
> >>>>> MKL to use more threads? I'm not sure whether this is an issue in Julia
> >>>>> or with my MKL installation.
> >>>>> 
> >>>>> On Mon, Feb 10, 2014 at 2:09 PM, Andreas Noack Jensen <andreasno...@gmail.com> wrote:
> >>>>> > You are welcome. Good to hear that it worked.
> >>>>> > 
> >>>>> > 2014-02-10 22:35 GMT+01:00 Madeleine Udell <madelei...@gmail.com>:
> >>>>> >> Fantastic, thank you. I now see Base.libblas_name = "libmkl_rt", and
> >>>>> >> am able to run the following test successfully:
> >>>>> >> 
> >>>>> >> transa = 'N'::Base.LinAlg.BlasChar # multiply by A, not A'
> >>>>> >> matdescra = "GXXF" # G = general, X = ignored, F = one-based indexing
> >>>>> >> m,n = 50,100
> >>>>> >> A = sprand(m,n,.01)
> >>>>> >> y = zeros(m)
> >>>>> >> x = rand(n)
> >>>>> >> alpha = 1.0
> >>>>> >> beta = 1.0
> >>>>> >> 
> >>>>> >> Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y)
> >>>>> >> y_builtin = A*x
> >>>>> >> 
> >>>>> >> julia> y == y_builtin
> >>>>> >> true
> >>>>> >> 
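> >>>>> >> (Exact == equality happened to hold here; in general
> >>>>> >> isapprox(y, y_builtin) is the safer check, since the MKL kernel
> >>>>> >> needn't produce bitwise-identical results to the native one.)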
> >>>>> >> 
> >>>>> >> On Mon, Feb 10, 2014 at 12:08 PM, Andreas Noack Jensen <andreasno...@gmail.com> wrote:
> >>>>> >>> Hi Madeleine
> >>>>> >>> 
> >>>>> >>> When compiling Julia with MKL you'll have to do make cleanall as well
> >>>>> >>> as rebuild ARPACK and SuiteSparse, i.e. make -C deps distclean-arpack
> >>>>> >>> and make -C deps distclean-suitesparse. It is also easier to create a
> >>>>> >>> Make.user where you set USE_MKL=1 and MKLROOT to the location of your
> >>>>> >>> MKL library files, e.g. /opt/intel/mkl (see the snippet at the end of
> >>>>> >>> this message). The arguments are explained here:
> >>>>> >>> 
> >>>>> >>> http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2.htm#GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2
> >>>>> >>> 
> >>>>> >>> but briefly, transa determines whether the operation is Ax or A'x,
> >>>>> >>> and matdescra carries information about the structure of the matrix,
> >>>>> >>> e.g. whether it is triangular.
> >>>>> >>> 
> >>>>> >>> When you have succeeded in compiling Julia with MKL, the libblas
> >>>>> >>> variable should just be Base.libblas_name.
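> >>>>> >>> 
> >>>>> >>> A minimal Make.user along those lines would contain just (path
> >>>>> >>> adjusted to your install):
> >>>>> >>> 
> >>>>> >>>     USE_MKL = 1
> >>>>> >>>     MKLROOT = /opt/intel/mkl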
> >>>>> >>> 
> >>>>> >>> 2014-02-10 20:37 GMT+01:00 Madeleine Udell <madelei...@gmail.com>:
> >>>>> >>>> I'm having trouble using MKL in Julia. I changed Make.inc so that
> >>>>> >>>> USE_MKL = 1, but when I make and run julia, I find that
> >>>>> >>>> Base.libblas_name = "libopenblas". Is this expected? I would have
> >>>>> >>>> thought it would be e.g. "libmkl_core".
> >>>>> >>>> 
> >>>>> >>>> Andreas, I found your wrappers for MKL in this PR. I've not used MKL
> >>>>> >>>> before, so I don't understand the call signature of those functions
> >>>>> >>>> well enough to call MKL directly. Any chance you could explain what
> >>>>> >>>> transa::BlasChar and matdescra::ASCIIString are in the following
> >>>>> >>>> function, and which MKL library is expected in the libblas variable?
> >>>>> >>>> I see many .so files in the lib/intel64 directory of my MKL
> >>>>> >>>> installation, and I'm not sure which one to point Julia to.
> >>>>> >>>> 
> >>>>> >>>> function cscmv!(transa::BlasChar, α::$T, matdescra::ASCIIString,
> >>>>> >>>>                 A::SparseMatrixCSC{$T, BlasInt}, x::StridedVector{$T},
> >>>>> >>>>                 β::$T, y::StridedVector{$T})
> >>>>> >>>>     length(x) == A.n || throw(DimensionMismatch("Matrix with $(A.n) columns multiplied with vector of length $(length(x))"))
> >>>>> >>>>     length(y) == A.m || throw(DimensionMismatch("Vector of length $(A.m) added to vector of length $(length(y))"))
> >>>>> >>>>     ccall(($(string(mv)), libblas), Void,
> >>>>> >>>>           (Ptr{Uint8}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$T},
> >>>>> >>>>            Ptr{Uint8}, Ptr{$T}, Ptr{BlasInt}, Ptr{BlasInt},
> >>>>> >>>>            Ptr{BlasInt}, Ptr{$T}, Ptr{$T}, Ptr{$T}),
> >>>>> >>>>           &transa, &A.m, &A.n, &α,
> >>>>> >>>>           matdescra, A.nzval, A.rowval, pointer(A.colptr, 1),
> >>>>> >>>>           pointer(A.colptr, 2), x, &β, y)
> >>>>> >>>>     return y
> >>>>> >>>> end
> >>>>> >>>> 
> >>>>> >>>> Thanks for your help!
> >>>>> >>>> Madeleine
> >>>>> >>>> 
> >>>>> >>>> 
> >>>>> >>>> On Wed, Feb 5, 2014 at 1:49 PM, Andreas Noack Jensen <andreasno...@gmail.com> wrote:
> >>>>> >>>>> A*b will not call MKL when A is sparse. There has been some writing
> >>>>> >>>>> about making an MKL package that overrides A_mul_B(Matrix,Vector)
> >>>>> >>>>> with the MKL versions, and I actually wrote wrappers for the sparse
> >>>>> >>>>> MKL subroutines in the fall for the same reason.
> >>>>> >>>>> 
> >>>>> >>>>> 2014-02-05 Madeleine Udell <madelei...@gmail.com>:
> >>>>> >>>>>> Miles, you're right that writing sparse matrix-vector products in
> >>>>> >>>>>> native Julia probably won't be the best idea given Julia's model
> >>>>> >>>>>> of parallelism. That's why I'm interested in calling an outside
> >>>>> >>>>>> library like PETSc.
> >>>>> >>>>>> 
> >>>>> >>>>>> I see it's possible to link Julia with MKL. I haven't tried this
> >>>>> >>>>>> yet, but if I do, will A*b (where A is sparse) call MKL to perform
> >>>>> >>>>>> the matrix-vector product?
> >>>>> >>>>>> 
> >>>>> >>>>>> 
> >>>>> >>>>>> On Wed, Feb 5, 2014 at 11:43 AM, Miles Lubin <miles...@gmail.com> wrote:
> >>>>> >>>>>>> Memory access is typically a significant bottleneck in sparse
> >>>>> >>>>>>> mat-vec, so unfortunately I'm skeptical that one could achieve
> >>>>> >>>>>>> good performance using Julia's current distributed-memory
> >>>>> >>>>>>> approach on a multicore machine. This really calls for something
> >>>>> >>>>>>> like OpenMP.
> >>>>> >>>>>>> 
> >>>>> >>>>>>> 
> >>>>> >>>>>>> On Wednesday, February 5, 2014 11:42:00 AM UTC-5, Madeleine Udell wrote:
> >>>>> >>>>>>>> I'm developing an iterative optimization algorithm in Julia,
> >>>>> >>>>>>>> along the lines of other contributions to the IterativeSolvers
> >>>>> >>>>>>>> project or the Krylov Subspace module, whose only
> >>>>> >>>>>>>> computationally intensive step is computing A*b or A'*b. I
> >>>>> >>>>>>>> would like to parallelize the method using a parallel sparse
> >>>>> >>>>>>>> matrix-vector multiply. Is there a standard backend
> >>>>> >>>>>>>> matrix-vector multiply that's recommended in Julia if I'm
> >>>>> >>>>>>>> targeting a shared-memory computer with a large number of
> >>>>> >>>>>>>> processors? Similarly, is there a recommended backend for
> >>>>> >>>>>>>> targeting a cluster? My matrices can easily reach 10 million
> >>>>> >>>>>>>> rows by 1 million columns, with sparsity anywhere from .01% to
> >>>>> >>>>>>>> problems that are nearly diagonal.
> >>>>> >>>>>>>> 
> >>>>> >>>>>>>> I've seen many posts talking about integrating PETSc as a
> >>>>> >>>>>>>> backend for this purpose, but it looks like the project has
> >>>>> >>>>>>>> stalled; the last commits I see are a year old. I'm also
> >>>>> >>>>>>>> interested in other backends, e.g. Spark, SciDB, etc.
> >>>>> >>>>>>>> 
> >>>>> >>>>>>>> I'm more interested in solving sparse problems, but as a side
> >>>>> >>>>>>>> note, the built-in BLAS acceleration from raising the number of
> >>>>> >>>>>>>> threads with `blas_set_num_threads` works ok for dense problems
> >>>>> >>>>>>>> using a moderate number of processors. I wonder why the number
> >>>>> >>>>>>>> of threads isn't set higher than one by default, for example to
> >>>>> >>>>>>>> as many as nprocs() cores?