@Matthias, thanks for your questions :) This thread will also serve as a public record of the discussion about the decision to put the PTX under version control.
From what I understand, we compile for a certain virtual architecture and for a certain real GPU (using the -arch and -code nvcc options). Currently, we compile for sm_20:
https://github.com/apache/incubator-systemml/blob/master/src/main/cpp/kernels/SystemML.ptx#L26
This PTX also works on "higher" real architectures (sm_30, sm_32, sm_35, sm_50, sm_52, sm_53).

Further reading / references:
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

So to answer your first question, whether it will run on Kepler devices - yes, it will, because Kepler is higher than sm_20.

For your second question - is there a performance difference between CUBIN and PTX - yes, there is. CUBINs are compiled ahead of time for a specific target architecture; PTX targets the virtual GPU ISA (forward compatible) and is compiled for the actual device at runtime by the driver's JIT. That JIT compilation is a startup cost. This post describes approaches to mitigate it:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/

The blog post suggests either shipping a fat binary - which contains the PTX plus compiled code for more than one target GPU architecture - or using JIT caching, which is controlled by setting environment variables. Shipping a fat binary is obviously much more heavyweight than shipping just the PTX.

Realistically, the PTX JIT compilation adds under 5 seconds of startup overhead (on the platforms I tested on) when the "-gpu" flag is used. It can be argued that in a long-running job, a constant startup cost is justified.
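For concreteness, here is a rough sketch of what the two options look like with nvcc (file names and values are illustrative; the exact invocation we use may differ):

  # What we effectively ship today: PTX for the compute_20 virtual
  # architecture; the driver JIT-compiles it for whatever GPU is present.
  nvcc -ptx -arch=compute_20 SystemML.cu -o SystemML.ptx

  # Fat binary alternative: embed precompiled code for selected real GPUs,
  # plus compute_20 PTX as a JIT fallback for newer architectures.
  nvcc -fatbin \
    -gencode arch=compute_20,code=sm_20 \
    -gencode arch=compute_35,code=sm_35 \
    -gencode arch=compute_20,code=compute_20 \
    SystemML.cu -o SystemML.fatbin

The JIT cache described in the blog post above is controlled through these environment variables, which let the driver pay the PTX compilation cost once per machine rather than once per run:

  export CUDA_CACHE_PATH=~/.nv/ComputeCache  # where compiled kernels are cached
  export CUDA_CACHE_MAXSIZE=268435456        # cache size in bytes (example: 256 MiB)
  # CUDA_CACHE_DISABLE=1 turns the cache off; CUDA_FORCE_PTX_JIT=1 forces
  # JIT compilation even when compatible precompiled code is embedded.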
-Nakul

On Thu, Nov 24, 2016 at 12:53 AM, Matthias Boehm <[email protected]> wrote:

> So just to make sure I understand correctly: right now we compile the few
> example kernels with PTX version 4.3, implying that this is the minimum
> requirement and SystemML's GPU backend will not run, for example, on Kepler
> devices (with PTX version 3), right?
>
> Also, is there a performance difference (generated code, or just-in-time
> compilation overhead) between CUBIN and PTX files? If so, can we quantify
> this difference to make a decision here? Thanks.
>
> Regards,
> Matthias
>
> On 11/24/2016 8:34 AM, Nakul Jindal wrote:
>
>> @Matthias -
>> PTX (parallel thread execution) objects are intermediate compiled objects.
>>
>> As of the current master, they are maintained under git version control.
>> This decision was agreed upon after discussing the hassle that a developer
>> of SystemML without the NVIDIA CUDA compiler might face.
>> It was decided that a person modifying the .cu files will be responsible
>> for regenerating the .ptx file and committing it to version control.
>> So far, among the active developers of SystemML, this practice has not
>> disrupted the regular workflow.
>>
>> About PTX versions: newer PTX versions support newer architectures. As
>> and when we upgrade to newer CUDA versions, we shall use the CUDA
>> compiler that ships with that version of the toolkit, compile the .cu
>> files in the project, and commit the resulting .ptx files.
>>
>> Thoughts, comments?
>>
>> -Nakul
>>
>> On Wed, Nov 23, 2016 at 2:43 PM, Matthias Boehm <[email protected]>
>> wrote:
>>
>>> thanks for sharing Nakul. Could you please also comment on the PTX story
>>> for custom kernels and different PTX versions?
>>>
>>> Regards,
>>> Matthias
>>>
>>> On 11/23/2016 10:13 PM, Nakul Jindal wrote:
>>>
>>>> Hi,
>>>>
>>>> SystemML has experimental GPU support, which we are working to solidify.
>>>> Currently, GPU is supported in CP (standalone/single node) mode. It
>>>> uses a single GPU (even if the node has more than one GPU).
>>>>
>>>> Communication between the GPU and the JVM happens through JCuda (MIT
>>>> license) - a light Java wrapper over CUDA that uses JNI. To that end,
>>>> JCuda needs to compile a platform-specific shared library, which is
>>>> then used to communicate with the locally installed CUDA.
>>>> To avoid having to compile a piece of C/C++ code each time, we use the
>>>> Mavenized-JCuda project (MIT license). This project internally has a
>>>> repository which contains compiled shared objects (for JCuda) for
>>>> different platforms and different versions of CUDA.
>>>>
>>>> For developers of SystemML (people who compile SystemML from source):
>>>> As of today, one can check out the master branch and follow a series of
>>>> setup steps to get SystemML running in GPU mode.
>>>> These are the steps:
>>>> https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/gpu-backend.md
>>>>
>>>> 1a)
>>>> Broadly:
>>>> 0. Compile SystemML & Mavenized-JCuda.
>>>> 1. Put the Mavenized-JCuda jars on the classpath of SystemML.
>>>> 2. Put the native shared library on the LD_LIBRARY_PATH or
>>>> java.library.path.
>>>> 3. Run SystemML with the "-gpu" flag, like so (in the
>>>> incubator-systemml directory):
>>>>
>>>> bin/systemml "file.dml" -gpu force=true
>>>>
>>>> PR 291 (https://github.com/apache/incubator-systemml/pull/291) tries to
>>>> simplify this setup (given that Mavenized-JCuda is available in one of
>>>> the repositories specified in SystemML's pom.xml):
>>>>
>>>> 1b)
>>>> 0. Compile SystemML.
>>>> 1. Run SystemML:
>>>>
>>>> bin/systemml "file.dml" -gpu force=true
>>>>
>>>> For users of SystemML:
>>>> We haven't yet decided how to ship SystemML with GPU support. Here are
>>>> the two ways we can think of:
>>>>
>>>> 2a)
>>>> 0. User installs the prerequisites (Java, CUDA, etc.).
>>>> 1. User "installs" Mavenized-JCuda or JCuda, i.e. the package jars are
>>>> made available on the classpath. The relevant shared library files
>>>> (.so, .dll) are also made available to the JVM through the
>>>> LD_LIBRARY_PATH environment variable or the java.library.path setting.
>>>> (Note: this needs to happen if using CUDA < 8.0.)
>>>> 2. User downloads and runs the SystemML jar.
>>>>
>>>> 2b)
>>>> We package JCuda/Mavenized-JCuda with the SystemML distribution. We
>>>> already package ANTLR and Wink with our jar; our other dependencies are
>>>> "provided" scope and are not pulled in by the Maven shade plugin.
>>>> A separate jar will be released for every platform.
>>>> 0. User installs the prerequisites.
>>>> 1. User downloads and runs the SystemML jar.
>>>>
>>>> There is also the matter of running SystemML with GPUs in distributed
>>>> mode. In hybrid_spark mode, with option 2a, we'd need to install
>>>> JCuda/Mavenized-JCuda on all the worker nodes. With option 2b, we
>>>> wouldn't need to.
>>>>
>>>> Berthold, Niketan and I have had a discussion and agree on option 2a
>>>> for now.
>>>>
>>>> Are there any thoughts? Inputs?
>>>>
>>>> -Nakul Jindal
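P.S. For anyone trying the developer setup (1a) quoted above, the steps boil down to roughly the following sketch (jar names and paths are illustrative; see the gpu-backend.md doc linked above for the exact ones):

  # 0. build SystemML and Mavenized-JCuda (both are Maven projects)
  mvn package
  # 1+2. JCuda jars on the classpath, native shared libraries on LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/path/to/mavenized-jcuda/native/libs:$LD_LIBRARY_PATH
  # 3. run with the "-gpu" flag
  bin/systemml "file.dml" -gpu force=true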
