Hi again,

 > So fast_copy still copies the memory and has copying overhead, even with
> MAIN_MEMORY context?

Yes. It's a copy() operation, so it just does what the name suggests.

> Is there a way to do shallow copying  (i.e. just pointer initialization)
> to the matrix data buffer? Isn't it what some constructors of matrix or
> matrix_base do?

Yes, you can pass your pointer via the constructors, e.g.
https://github.com/viennacl/viennacl-dev/blob/master/viennacl/matrix.hpp#L721
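Roughly like this (untested sketch; this assumes the pointer constructor with explicit internal (padded) dimensions as in the link above, and the sizes are made up):

```cpp
#include <viennacl/matrix.hpp>

#include <vector>

int main()
{
  // Row-major host buffer that already uses the padded internal layout.
  std::size_t const rows = 4, cols = 4;
  std::size_t const internal_rows = 4, internal_cols = 4; // your padding
  std::vector<double> buf(internal_rows * internal_cols, 0.0);

  // Wrap the existing buffer: no copy, the matrix works on 'buf' directly.
  viennacl::matrix<double> A(buf.data(), viennacl::MAIN_MEMORY,
                             rows, internal_rows,
                             cols, internal_cols);

  A(0, 0) = 42.0;  // with MAIN_MEMORY this writes through to buf[0]
  return 0;
}
```

With MAIN_MEMORY the matrix then operates directly on your buffer, so there is no copy at all.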


> What i am getting at, it looks like i am getting a significant overhead
> for just copying -- actually, it seems i am getting double overhead --
> once when i prepare padding and all as required by the internal_size?(),
> and then i pass it into the fast_copy() which apparently does copying
> again, even if we are using host memory matrices.

If you want to 'wrap' your data in a ViennaCL matrix, pass the pointer 
to the constructors. If you want to quickly copy your data over to 
memory managed by a ViennaCL matrix, use copy() or fast_copy(). From 
your description it looks like you are now looking for the constructor 
calls, whereas from your earlier email I thought you were looking for 
fast_copy().
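For the fast_copy() route, the host buffer has to match the padded internal layout, which you query via internal_size1()/internal_size2() (again an untested sketch, with made-up sizes):

```cpp
#include <viennacl/matrix.hpp>

#include <vector>

int main()
{
  std::size_t const n = 4;
  viennacl::matrix<double> A(n, n);  // ViennaCL allocates (and pads) itself

  // Host data laid out row-major with the padded internal dimensions:
  std::vector<double> host(A.internal_size1() * A.internal_size2(), 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      host[i * A.internal_size2() + j] = 1.0;  // your actual data here

  // One memcpy-like transfer, no per-entry conversion:
  viennacl::fast_copy(&host[0], &host[0] + host.size(), A);
  return 0;
}
```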



> all in all, by my estimates this copying back and forth (which is,
> granted, is not greatly optimized on our side) takes ~15..17 seconds out
> of 60 seconds total when multiplying 10k x 10k dense arguments via
> ViennaCL. I also optimize to -march=haswell  and use -ffast-math,
> without those i seem to fall too far behind what R + openblas can do in
> this test. Then, my processing time swells up to 2 minutes without
> optimizing for non-compliant arithmetics.

15 seconds of copying for a 10k-by-10k matrix is far too much. 
10k-by-10k is 800 MB of data in double precision, so a copy should not 
take much more than 100 ms even on a low-range laptop (10 GB/sec memory 
bandwidth). Even with multiple matrices and copies you should stay in 
the 1-second regime.


> If i can wrap the buffer and avoid copying for MAIN_MEMORY context, i'd
> be shaving off another 10% or so of the execution time. Which would make
> me happier, as i probably would be able to beat openblas given custom
> cpu architecture flags.

Why do you expect to beat OpenBLAS? Their kernels are really well 
optimized, and for large dense matrix-matrix products you are always 
FLOP-limited.


> On the other hand, bidmat (which allegedly uses mkl) does the same test,
> double precision, in under 10 seconds. I can't fathom how, but it does.
> I have a haswell-E platform.

Multiplication of two 10k-by-10k matrices amounts to about 2 TFLOP of 
compute in double precision (2*n^3 with n = 10^4). A Haswell-E machine 
gets through that in roughly 10-20 seconds, depending on the number of 
cores (2.4 GHz * 4 doubles with AVX * 2 for FMA = 19.2 GFLOP/sec peak 
per core; MKL achieves about 15 GFLOP/sec per core).

ViennaCL's host-backend is not strong on dense matrix-matrix multiplies 
(even though we've got some improvements in a pull request), so for this 
particular operation you will get better performance from MKL, OpenBLAS, 
or libflame.

Best regards,
Karli





> On Tue, Jul 12, 2016 at 9:27 AM, Karl Rupp <r...@iue.tuwien.ac.at
> <mailto:r...@iue.tuwien.ac.at>> wrote:
>
>     Hi,
>
>     > One question: you mentioned padding for the `matrix` type. When i
>     > initialize the `matrix` instance, i only specify dimensions. how do I
>     > know padding values?
>
>     if you want to provide your own padded dimensions, consider using
>     matrix_base directly. If you want to query the padded dimensions,
>     use internal_size1() and internal_size2() for the internal number of
>     rows and columns.
>
>     http://viennacl.sourceforge.net/doc/manual-types.html#manual-types-matrix
>
>     Best regards,
>     Karli
>
>
>
>
>         On Tue, Jul 12, 2016 at 5:53 AM, Karl Rupp
>         <r...@iue.tuwien.ac.at <mailto:r...@iue.tuwien.ac.at>
>         <mailto:r...@iue.tuwien.ac.at <mailto:r...@iue.tuwien.ac.at>>>
>         wrote:
>
>              Hi Dmitriy,
>
>              On 07/12/2016 07:17 AM, Dmitriy Lyubimov wrote:
>
>                  Hi,
>
>                  I am trying to create some elementary wrappers for VCL
>         in javacpp.
>
>                  Everything goes fine, except i really would rather not
>         use those
>                  "cpu"
>                  types (std::map,
>                  std::vector) and rather initialize matrices directly by
>         feeding
>                  row-major or CCS formats.
>
>                  I see that matrix () constructor accepts this form of
>                  initialization;
>                  but it really states that
>                  it does "wrapping" for the device memory.
>
>
>              Yes, the constructors either create their own memory buffer
>              (zero-initialized) or wrap an existing buffer. These are
>         the only
>              reasonable options.
>
>
>                  Now, i can create a host matrix() using host memory and
>         row-major
>                  packing. This works ok it seems.
>
>                  However, these are still host instances. Can i copy host
>                  instances to
>                  instances on opencl context?
>
>
>              Did you look at viennacl::copy() or viennacl::fast_copy()?
>
>
>                  That might be one way bypassing unnecessary (in my case)
>                  complexities of
>                  working with std::vector and std::map classes from java
>         side.
>
>                  But it looks like there's no copy() variation that
>         would accept a
>                  matrix-on-host and matrix-on-opencl arguments (or
>         rather, it of
>                  course
>                  declares those to be ambiguous since two methods fit).
>
>
>              If you want to copy your OpenCL data into a
>         viennacl::matrix, you
>              may wrap the memory handle (obtained with .elements()) into
>         a vector
>              and copy that. If you have plain host data, use
>              viennacl::fast_copy() and mind the data layout (padding of
>              rows/columns!)
>
>
>                  For compressed_matrix, there seems to be a set()
>         method, but i guess
>                  this also requires CCS arrays in the device memory if I
>         use it. Same
>                  question, is there a way to send-and-wrap CCS arrays to an
>                  opencl device
>                  instance of compressed matrix without using std::map?
>
>
>              Currently you have to use .set() if you want to bypass
>              viennacl::copy() and std::map.
>
>              I acknowledge that the C++ type system is a pain when
>         interfacing
>              from other languages. We will make this much more convenient in
>              ViennaCL 2.0. The existing interface in ViennaCL 1.x is too
>         hard to
>              fix without breaking lots of user code, so we won't invest
>         time in
>              that (contributions welcome, though :-) )
>
>              Best regards,
>              Karli
>
>
>
>
>

