Thanks, Jake. It seems I'm still not able to go above 6 threads, but the
cause may be buried deep in my MKL install.

I've begun implementing a shared sparse matrix type in native Julia that
seems to scale better; the code is attached to this email. How can I
overload A'*b for my SharedBilinearOperator type, so that multiplying by
A' calls At_mul_B on the stored A rather than computing A' each time?
I.e., can I write something like

('*)(L::SharedBilinearOperator,x) = At_mul_B(L.A,x)

?
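
Or, more concretely: is the right move to extend Ac_mul_B and At_mul_B?
An untested sketch of what I mean (assuming L'*x lowers to Ac_mul_B(L, x)
and L.'*x to At_mul_B(L, x) on 0.3):

import Base: Ac_mul_B, At_mul_B

# L'*x: for a real operator the adjoint equals the transpose, so dispatch
# to At_mul_B on the stored A instead of materializing A'
Ac_mul_B(L::SharedBilinearOperator, x) = At_mul_B(L.A, x)
# L.'*x: same thing, no copy of A' needed
At_mul_B(L::SharedBilinearOperator, x) = At_mul_B(L.A, x)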


On Tue, Feb 11, 2014 at 8:40 AM, Jake Bolewski <jakebolew...@gmail.com> wrote:

> Hey Madeleine,
>
> First I would check that your global environment variables
> <http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/GUID-0DE6A77B-00E0-4ED6-9CAE-52FCF49E5623.htm>
> are set up correctly. If you want to set the number of threads
> programmatically:
>
>    _       _ _(_)_     |  A fresh approach to technical computing
>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>    _ _   _| |_  __ _   |  Type "help()" to list help topics
>   | | | | | | |/ _` |  |
>   | | |_| | | | (_| |  |  Version 0.3.0-prerelease+1470 (2014-02-08 16:23 UTC)
>  _/ |\__'_|_|_|\__'_|  |  Commit 596d5c4* (3 days old master)
> |__/                   |  x86_64-unknown-linux-gnu
>
> julia> Base.blas_vendor()
> :mkl
>
> julia> Base.blas_set_num_threads(12)
>
>
> You can see the relevant code here:
> https://github.com/JuliaLang/julia/blob/8ac5a7f7fff1c54a768c7bc9ae85cf053d310f42/base/util.jl#L293
> It's always worth a quick search through the code base to figure out what
> is going on behind the scenes.
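>
> One more thing worth checking: MKL_DYNAMIC defaults to true, which lets
> MKL use fewer threads than you request, so a cap like yours could come
> from there. A quick sketch from the REPL:
>
> @show get(ENV, "MKL_NUM_THREADS", "unset")  # picked up by MKL
> @show get(ENV, "MKL_DYNAMIC", "unset")      # "true" lets MKL shrink the pool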
>
> Hope that helps!
> Jake
>
> On Tuesday, February 11, 2014 12:34:56 AM UTC-5, Madeleine Udell wrote:
>>
>> Jake, thanks for the reference
>> <http://software.intel.com/en-us/forums/topic/294954>;
>> I have 32 hyperthreaded cores, so something else is definitely limiting
>> me to 6, on top of hyperthreading not being exploited.
>>
>> Perhaps I need to call something like omp_set_num_threads()? But there
>> doesn't seem to be a function by this name in the libmkl_rt library.
>>
>> julia> ccall((:omp_set_num_threads,Base.libblas_name),Ptr{Void},(Uint8,),32)
>> ERROR: ccall: could not find function omp_set_num_threads in library
>> libmkl_rt
>>  in anonymous at no file
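>>
>> Maybe MKL's own entry point is the one to call instead? A sketch of
>> what I mean (untested; I believe mkl_set_num_threads takes a plain int
>> by value):
>>
>> ccall((:mkl_set_num_threads, Base.libblas_name), Void, (Cint,), 32)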
>>
>>
>> On Mon, Feb 10, 2014 at 4:05 PM, Jake Bolewski <jakebo...@gmail.com> wrote:
>>
>>> Are those hyper-threaded "cores"?  If so, check Intel MKL's
>>> documentation on hyper-threading.
>>>
>>> -Best
>>> Jake
>>>
>>> On Monday, February 10, 2014 6:38:50 PM UTC-5, Madeleine Udell wrote:
>>>
>>>> It looks like only 6 threads are being used when I run MKL from Julia.
>>>> If I do blas_set_num_threads(n), then in top I see julia running at
>>>> min(n,6)*100% CPU. Any idea why this would be, or how to get MKL to
>>>> use more threads? I'm not sure whether this is an issue in Julia or
>>>> with my installation of MKL.
>>>>
>>>> On Mon, Feb 10, 2014 at 2:09 PM, Andreas Noack Jensen
>>>> <andreasno...@gmail.com> wrote:
>>>> > You are welcome. Good to hear that it worked.
>>>> >
>>>> >
>>>> > 2014-02-10 22:35 GMT+01:00 Madeleine Udell <madelei...@gmail.com>:
>>>> >
>>>> >> fantastic, thank you. I now see Base.libblas_name = "libmkl_rt",
>>>> >> and am able to run the following test successfully:
>>>> >>
>>>> >> transa = 'N'::Base.LinAlg.BlasChar # multiply by A, not A'
>>>> >> matdescra = "GXXF" # G = general, X = ignored, F = one-based indexing
>>>> >> m,n = 50,100
>>>> >> A = sprand(m,n,.01)
>>>> >> y = zeros(m)
>>>> >> x = rand(n)
>>>> >> alpha = 1.0
>>>> >> beta = 1.0
>>>> >>
>>>> >> Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y)
>>>> >> y_builtin = A*x
>>>> >>
>>>> >> julia> y == y_builtin
>>>> >> true
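>>>> >>
>>>> >> (Exact equality happened to hold here; since these are
>>>> >> floating-point results, a tolerance check is probably safer in
>>>> >> general, e.g. norm(y - y_builtin) <= sqrt(eps()) * norm(y_builtin).)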
>>>> >>
>>>> >>
>>>> >> On Mon, Feb 10, 2014 at 12:08 PM, Andreas Noack Jensen
>>>> >> <andreasno...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi Madeleine
>>>> >>>
>>>> >>> When compiling julia with MKL you'll have to do make cleanall as
>>>> >>> well as rebuild ARPACK and SuiteSparse, i.e. make -C deps
>>>> >>> distclean-arpack and make -C deps distclean-suitesparse. It is
>>>> >>> also easier to create a Make.user where you set USE_MKL=1 and
>>>> >>> MKLROOT to the location of your MKL library files, e.g.
>>>> >>> /opt/intel/mkl (a minimal example is sketched below). The
>>>> >>> arguments are explained here:
>>>> >>>
>>>> >>> http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2.htm#GUID-C2EE93F0-B573-4538-A994-202CB3ADFFC2
>>>> >>>
>>>> >>> As for the call signature: transa determines whether the operation
>>>> >>> is Ax or A'x, and matdescra carries information about the structure
>>>> >>> of the matrix, e.g. whether it is triangular. When you have
>>>> >>> succeeded in compiling julia with MKL, the libblas variable should
>>>> >>> just be Base.libblas_name.
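>>>> >>>
>>>> >>> The whole Make.user can be as small as this (a sketch; adjust
>>>> >>> MKLROOT to your install):
>>>> >>>
>>>> >>> USE_MKL = 1
>>>> >>> MKLROOT = /opt/intel/mkl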
>>>> >>>
>>>> >>>
>>>> >>> 2014-02-10 20:37 GMT+01:00 Madeleine Udell <madelei...@gmail.com>:
>>>> >>>
>>>> >>>> I'm having trouble using MKL in julia. I changed Make.inc so that
>>>> >>>> USE_MKL = 1, but when I make and run julia, I find that
>>>> >>>> Base.libblas_name = "libopenblas". Is this expected? I would have
>>>> >>>> thought it would be e.g. "libmkl_core".
>>>> >>>>
>>>> >>>> Andreas, I found your wrappers for MKL in this PR. I've not used
>>>> >>>> MKL before, so I don't understand the call signature of those
>>>> >>>> functions well enough to call MKL directly. Any chance you could
>>>> >>>> explain what transa::BlasChar and matdescra::ASCIIString are in
>>>> >>>> the following function, and which MKL library is expected in the
>>>> >>>> libblas variable? I see many .so files in the lib/intel64
>>>> >>>> directory of my MKL installation, and I'm not sure which one to
>>>> >>>> point julia to.
>>>> >>>>
>>>> >>>> # ($T and $(string(mv)) suggest this sits inside an @eval loop
>>>> >>>> # over element types in the PR, so it's a template, not
>>>> >>>> # standalone code.)
>>>> >>>> function cscmv!(transa::BlasChar, α::$T, matdescra::ASCIIString,
>>>> >>>>                 A::SparseMatrixCSC{$T, BlasInt}, x::StridedVector{$T},
>>>> >>>>                 β::$T, y::StridedVector{$T})
>>>> >>>>     length(x) == A.n || throw(DimensionMismatch("Matrix with $(A.n) columns multiplied with vector of length $(length(x))"))
>>>> >>>>     length(y) == A.m || throw(DimensionMismatch("Vector of length $(A.m) added to vector of length $(length(y))"))
>>>> >>>>     # y := alpha*op(A)*x + beta*y via MKL's mkl_?cscmv
>>>> >>>>     ccall(($(string(mv)), libblas), Void,
>>>> >>>>           (Ptr{Uint8}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$T},
>>>> >>>>            Ptr{Uint8}, Ptr{$T}, Ptr{BlasInt}, Ptr{BlasInt},
>>>> >>>>            Ptr{BlasInt}, Ptr{$T}, Ptr{$T}, Ptr{$T}),
>>>> >>>>           &transa, &A.m, &A.n, &α,
>>>> >>>>           matdescra, A.nzval, A.rowval, pointer(A.colptr, 1),
>>>> >>>>           pointer(A.colptr, 2), x, &β, y)
>>>> >>>>     return y
>>>> >>>> end
>>>> >>>>
>>>> >>>> Thanks for your help!
>>>> >>>> Madeleine
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Feb 5, 2014 at 1:49 PM, Andreas Noack Jensen
>>>> >>>> <andreasno...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>> A*b will not call MKL when A is sparse. There has been some
>>>> >>>>> discussion about making an MKL package that overrides
>>>> >>>>> A_mul_B(Matrix,Vector) with the MKL versions, and I actually
>>>> >>>>> wrote wrappers for the sparse MKL subroutines in the fall for
>>>> >>>>> the same reason; the idea is sketched below.
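>>>> >>>>>
>>>> >>>>> The idea would be along these lines (only a sketch, not the
>>>> >>>>> actual package; cscmv! is from my wrappers, and "GXXF" describes
>>>> >>>>> a general matrix with one-based indexing):
>>>> >>>>>
>>>> >>>>> import Base: A_mul_B!
>>>> >>>>> # overwrite y in place with A*x through MKL's sparse mat-vec
>>>> >>>>> function A_mul_B!(y::StridedVector{Float64},
>>>> >>>>>                   A::SparseMatrixCSC{Float64,Base.LinAlg.BlasInt},
>>>> >>>>>                   x::StridedVector{Float64})
>>>> >>>>>     cscmv!('N', 1.0, "GXXF", A, x, 0.0, y)
>>>> >>>>> end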
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> 2014-02-05 Madeleine Udell <madelei...@gmail.com>:
>>>> >>>>>
>>>> >>>>>> Miles, you're right that writing sparse matrix-vector products
>>>> >>>>>> in native Julia probably won't be the best idea given Julia's
>>>> >>>>>> model of parallelism. That's why I'm interested in calling an
>>>> >>>>>> outside library like PETSc.
>>>> >>>>>>
>>>> >>>>>> I see it's possible to link Julia with MKL. I haven't tried
>>>> >>>>>> this yet, but if I do, will A*b (where A is sparse) call MKL to
>>>> >>>>>> perform the matrix-vector product?
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Wed, Feb 5, 2014 at 11:43 AM, Miles Lubin <miles...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Memory access is typically a significant bottleneck in sparse
>>>> >>>>>>> mat-vec, so unfortunately I'm skeptical that one could achieve
>>>> >>>>>>> good performance using Julia's current distributed memory
>>>> >>>>>>> approach on a multicore machine. This really calls for
>>>> >>>>>>> something like OpenMP.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Wednesday, February 5, 2014 11:42:00 AM UTC-5, Madeleine Udell wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> I'm developing an iterative optimization algorithm in Julia
>>>> >>>>>>>> along the lines of other contributions to the Iterative
>>>> >>>>>>>> Solvers project or the Krylov Subspace module, whose only
>>>> >>>>>>>> computationally intensive step is computing A*b or A'*b. I
>>>> >>>>>>>> would like to parallelize the method using a parallel sparse
>>>> >>>>>>>> matrix-vector multiply. Is there a standard backend
>>>> >>>>>>>> matrix-vector multiply that's recommended in Julia if I'm
>>>> >>>>>>>> targeting a shared memory computer with a large number of
>>>> >>>>>>>> processors? Similarly, is there a recommended backend for
>>>> >>>>>>>> targeting a cluster? My matrices can easily reach 10 million
>>>> >>>>>>>> rows by 1 million columns, with sparsity anywhere from .01%
>>>> >>>>>>>> down to problems that are nearly diagonal.
>>>> >>>>>>>>
>>>> >>>>>>>> I've seen many posts talking about integrating PETSc as a
>>>> >>>>>>>> backend for this purpose, but it looks like the project has
>>>> >>>>>>>> stalled: the last commits I see are from a year ago. I'm also
>>>> >>>>>>>> interested in other backends, e.g. Spark, SciDB, etc.
>>>> >>>>>>>>
>>>> >>>>>>>> I'm more interested in solving sparse problems, but as a side
>>>> >>>>>>>> note, the built-in BLAS acceleration controlled by
>>>> >>>>>>>> `blas_set_num_threads` works OK for dense problems using a
>>>> >>>>>>>> moderate number of processors. I wonder why the number of
>>>> >>>>>>>> threads isn't set higher than one by default, for example to
>>>> >>>>>>>> as many as nprocs() cores?
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> --
>>>> >>>>>> Madeleine Udell
>>>> >>>>>> PhD Candidate in Computational and Mathematical Engineering
>>>> >>>>>> Stanford University
>>>> >>>>>> www.stanford.edu/~udell
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Med venlig hilsen
>>>> >>>>>
>>>> >>>>> Andreas Noack Jensen
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Madeleine Udell
>>>> >>>> PhD Candidate in Computational and Mathematical Engineering
>>>> >>>> Stanford University
>>>> >>>> www.stanford.edu/~udell
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Med venlig hilsen
>>>> >>>
>>>> >>> Andreas Noack Jensen
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Madeleine Udell
>>>> >> PhD Candidate in Computational and Mathematical Engineering
>>>> >> Stanford University
>>>> >> www.stanford.edu/~udell
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Med venlig hilsen
>>>> >
>>>> > Andreas Noack Jensen
>>>>
>>>>
>>>>
>>>> --
>>>> Madeleine Udell
>>>> PhD Candidate in Computational and Mathematical Engineering
>>>> Stanford University
>>>> www.stanford.edu/~udell
>>>>
>>>
>>
>>
>> --
>> Madeleine Udell
>> PhD Candidate in Computational and Mathematical Engineering
>> Stanford University
>> www.stanford.edu/~udell
>>
>


-- 
Madeleine Udell
PhD Candidate in Computational and Mathematical Engineering
Stanford University
www.stanford.edu/~udell

Attachment: parallel_matmul.jl

Attachment: test_parallel_matmul.jl
