So just to make sure I understand correctly: right now we compiled the few example kernels with PTX version 4.3, implying that this is the minimum requirement and SystemML's GPU backend will not run, for example, on Kepler devices (with PTX version 3), right?

Also, is there a performance difference (generated code, or just-in-time compilation overhead) between CUBIN and PTX files? If so, can we quantify this difference to make a decision here? Thanks.

Regards,
Matthias

On 11/24/2016 8:34 AM, Nakul Jindal wrote:
@Matthias -
PTX (parallel thread execution) objects are intermediate compiled objects.

As of the current master, they are maintained under git version control.
This decision was agreed upon after discussing the hassle that a developer
of systemml without the nvidia cuda compiler might face.
It was decided that a person modifying the .cu files will be responsible
for regenerating the .ptx file and committing it to version control.
So far, between the active developers of systemml, this practice has not
disrupted their regular workflow.

About PTX version.
Newer PTX versions support newer architectures. As and when we upgrade to
newer CUDA versions, we shall use the cuda compiler that ships with that
version of the toolkit and compile the .cu files in the project and commit
the resulting .ptx files.
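
As an illustration of how the PTX version of a committed file can be
checked: every PTX module begins with a `.version` directive (the PTX ISA
version) followed by a `.target` directive (the architecture it was
compiled for, e.g. `sm_30`). These are two separate numbers; the "3"
associated with Kepler is its compute capability, not a PTX ISA version.
The Java sketch below reads both directives from PTX text. The class name
`PtxHeader` and its methods are hypothetical and are not part of SystemML
or JCuda, and the sample header in `main` is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads the .version and .target directives of a PTX module. */
final class PtxHeader {
    // ".version <major>.<minor>" at the start of a line (comment lines begin with "//").
    private static final Pattern VERSION =
        Pattern.compile("^\\s*\\.version\\s+(\\d+)\\.(\\d+)", Pattern.MULTILINE);
    // ".target <name>", e.g. ".target sm_30"; only the first target name is captured.
    private static final Pattern TARGET =
        Pattern.compile("^\\s*\\.target\\s+(\\w+)", Pattern.MULTILINE);

    final int major;
    final int minor;
    final String target;

    private PtxHeader(int major, int minor, String target) {
        this.major = major;
        this.minor = minor;
        this.target = target;
    }

    /** Parses the two header directives; throws if either one is missing. */
    static PtxHeader parse(String ptxSource) {
        Matcher v = VERSION.matcher(ptxSource);
        Matcher t = TARGET.matcher(ptxSource);
        if (!v.find() || !t.find()) {
            throw new IllegalArgumentException("missing .version or .target directive");
        }
        return new PtxHeader(Integer.parseInt(v.group(1)),
                             Integer.parseInt(v.group(2)),
                             t.group(1));
    }

    /** True if this module's PTX ISA version is less than or equal to the given one. */
    boolean isAtMost(int maxMajor, int maxMinor) {
        return major < maxMajor || (major == maxMajor && minor <= maxMinor);
    }

    public static void main(String[] args) {
        // Made-up header resembling what the CUDA compiler writes at the top of a .ptx file.
        String sample = "//\n// Generated by NVIDIA NVVM Compiler\n//\n"
                      + ".version 4.3\n.target sm_30\n.address_size 64\n";
        PtxHeader h = PtxHeader.parse(sample);
        System.out.println(h.major + "." + h.minor + " " + h.target); // prints "4.3 sm_30"
    }
}
```

A check like this could run in a unit test over the committed .ptx files,
so that a version bump caused by a toolkit upgrade becomes visible in
review. Whether a given driver accepts a given PTX version still has to be
verified against the CUDA documentation.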

Thoughts, comments?

-Nakul






On Wed, Nov 23, 2016 at 2:43 PM, Matthias Boehm <mboe...@googlemail.com>
wrote:

thanks for sharing Nakul. Could you please also comment on the PTX story
for custom kernels and different PTX versions?

Regards,
Matthias


On 11/23/2016 10:13 PM, Nakul Jindal wrote:

Hi,

SystemML has experimental GPU support, which we are working to solidify.
Currently, GPU is supported in CP (Standalone/Single Node) mode. It uses a
single GPU (even if the node has more than 1 GPU).

Communication between the GPU and JVM happens through JCuda (MIT License),
a light Java wrapper over CUDA that uses JNI. To that end, JCuda needs to
compile a platform-specific shared library, which is then used to
communicate with the locally installed CUDA.
To avoid compiling a piece of C/C++ code each time, we use the
Mavenized-JCuda project (MIT License). This project internally has a
repository which contains compiled shared objects (for JCuda) for different
platforms and different versions of CUDA.


For developers of SystemML (People who compile SystemML from source) :
As of today, one can checkout the master branch and follow a series of
setup steps to get SystemML in GPU mode running.
These are the steps -
https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/gpu-backend.md

1a)
Broadly,
0. Compile systemml & mavenized jcuda.
1. Mavenized JCuda jars are put into the classpath of SystemML.
2. The native shared library should be put in the LD_LIBRARY_PATH or
java.library.path.
3. SystemML should be run with the "-gpu" flag. Like so:
(In the incubator-systemml directory)

bin/systemml "file.dml" -gpu force=true

PR 291 (https://github.com/apache/incubator-systemml/pull/291) tries to
change this so that setup becomes simpler.  (Given that mavenized-jcuda is
available in one of the repositories specified in systemml's pom.xml)

1b)
0. Compile systemml
1. Run systemml

bin/systemml "file.dml" -gpu force=true



For users of SystemML:
We haven't yet decided on how to ship SystemML with GPU support. Here are
the 2 ways we can think of:

2a)
0. User installs pre-requisites (java, cuda, etc)
1. User "installs" Mavenized-JCuda or JCuda (i.e. the package jars are
made available in the classpath). Also, the relevant shared library
files (.so, .dll) are made available to the JVM through the
LD_LIBRARY_PATH environment variable or through the java.library.path
setting. (Note: this needs to happen if using CUDA < 8.0.)
2. Download and run the systemml jar.
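
To make step 1 of option 2a more concrete, here is a minimal Java sketch
of how a user (or a startup check) could verify that a native library is
visible on a search path such as java.library.path. The class
`NativeLibCheck`, the method `locate`, and the library name "example" are
hypothetical; the actual JCuda library names depend on the JCuda version
and platform and are not assumed here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for a native library file on a search path. */
final class NativeLibCheck {

    /**
     * Returns every file on searchPath whose name is the platform-specific
     * file name of libName (e.g. "libfoo.so" on Linux, "foo.dll" on Windows).
     */
    static List<File> locate(String libName, String searchPath) {
        List<File> hits = new ArrayList<>();
        if (searchPath == null) {
            return hits;
        }
        String fileName = System.mapLibraryName(libName);
        // Entries are separated by ':' on Unix-like systems and ';' on Windows.
        for (String dir : searchPath.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "example" is a placeholder; pass the real library name as the first argument.
        String lib = args.length > 0 ? args[0] : "example";
        List<File> found = locate(lib, System.getProperty("java.library.path"));
        if (found.isEmpty()) {
            System.out.println(System.mapLibraryName(lib) + " not found on java.library.path");
        } else {
            System.out.println("found: " + found);
        }
    }
}
```

This only checks that a file with the expected name exists; it does not
load the library, so architecture mismatches or missing CUDA dependencies
would still surface later at load time.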

2b)
We package JCuda/Mavenized-JCuda with the SystemML distribution. We already
package ANTLR and Wink with our jar. Our other dependencies are "provided"
scope and are not pulled in by the maven shade plugin.
A separate jar will be released for every platform.
0. User installs pre-requisites
1. Download and run systemml jar


There is also the matter of running SystemML with GPU in distributed mode.
In hybrid_spark mode, with option 2a, we'd need to install
JCuda/Mavenized-JCuda on all the worker nodes.
With option 2b, we wouldn't need to.


Berthold, Niketan and I have had a discussion and agree on option 2a, for
now.

Are there any thoughts? Inputs?

-Nakul Jindal


