@Matthias, thanks for your questions :) This thread will also serve as a public record of the discussion about the decision to put the PTX under version control.
From what I understand, we compile for a certain virtual architecture and for a certain real GPU (using the -arch and -code nvcc options). Currently, we compile for sm_20:
https://github.com/apache/incubator-systemml/blob/master/src/main/cpp/kernels/SystemML.ptx#L26
This PTX also works on "higher" real architectures (sm_30, sm_32, sm_35, sm_50, sm_52, sm_53).

Further reading / references:
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

So to answer your first question, whether it will run on Kepler devices - yes, it will, because Kepler is higher than sm_20.

For your second question - is there a performance difference between CUBIN and PTX - yes, there is. CUBINs are compiled ahead of time for a specific target architecture; PTX targets the virtual GPU ISA (forward compatible) and is compiled for the actual device at runtime by the driver's JIT. That JIT compilation is a startup cost. This post describes approaches to mitigate it:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/

The blog post suggests either shipping a fat binary - which contains the PTX plus compiled code for more than one target GPU architecture - or using JIT caching, which is controlled by setting environment variables. Shipping a fat binary is obviously much more heavyweight than shipping just the PTX.

Realistically, the PTX JIT compilation adds under 5 seconds of startup overhead (on the platforms I tested on) when the "-gpu" flag is used. It can be argued that in a long-running job, a constant startup cost is justified.
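For concreteness, here is a rough sketch of what the two options look like with nvcc (file names and values are illustrative; the exact invocation we use may differ):

  # What we effectively ship today: PTX for the compute_20 virtual
  # architecture; the driver JIT-compiles it for whatever GPU is present.
  nvcc -ptx -arch=compute_20 SystemML.cu -o SystemML.ptx

  # Fat binary alternative: embed precompiled code for selected real GPUs,
  # plus compute_20 PTX as a JIT fallback for newer architectures.
  nvcc -fatbin \
    -gencode arch=compute_20,code=sm_20 \
    -gencode arch=compute_35,code=sm_35 \
    -gencode arch=compute_20,code=compute_20 \
    SystemML.cu -o SystemML.fatbin

The JIT cache described in the blog post above is controlled through these environment variables, which let the driver pay the PTX compilation cost once per machine rather than once per run:

  export CUDA_CACHE_PATH=~/.nv/ComputeCache  # where compiled kernels are cached
  export CUDA_CACHE_MAXSIZE=268435456        # cache size in bytes (example: 256 MiB)
  # CUDA_CACHE_DISABLE=1 turns the cache off; CUDA_FORCE_PTX_JIT=1 forces
  # JIT compilation even when compatible precompiled code is embedded.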
-Nakul

On Thu, Nov 24, 2016 at 12:53 AM, Matthias Boehm <[email protected]> wrote:

> So just to make sure I understand correctly: right now we compile the few
> example kernels with PTX version 4.3, implying that this is the minimum
> requirement and SystemML's GPU backend will not run, for example, on Kepler
> devices (with PTX version 3), right?
>
> Also, is there a performance difference (generated code, or just-in-time
> compilation overhead) between CUBIN and PTX files? If so, can we quantify
> this difference to make a decision here? Thanks.
>
> Regards,
> Matthias
>
> On 11/24/2016 8:34 AM, Nakul Jindal wrote:
>
>> @Matthias -
>> PTX (parallel thread execution) objects are intermediate compiled objects.
>>
>> As of the current master, they are maintained under git version control.
>> This decision was agreed upon after discussing the hassle that a developer
>> of SystemML without the NVIDIA CUDA compiler might face.
>> It was decided that a person modifying the .cu files will be responsible
>> for regenerating the .ptx file and committing it to version control.
>> So far, among the active developers of SystemML, this practice has not
>> disrupted the regular workflow.
>>
>> About PTX versions: newer PTX versions support newer architectures. As
>> and when we upgrade to newer CUDA versions, we shall use the CUDA
>> compiler that ships with that version of the toolkit, compile the .cu
>> files in the project, and commit the resulting .ptx files.
>>
>> Thoughts, comments?
>>
>> -Nakul
>>
>> On Wed, Nov 23, 2016 at 2:43 PM, Matthias Boehm <[email protected]>
>> wrote:
>>
>>> thanks for sharing Nakul. Could you please also comment on the PTX story
>>> for custom kernels and different PTX versions?
>>>
>>> Regards,
>>> Matthias
>>>
>>> On 11/23/2016 10:13 PM, Nakul Jindal wrote:
>>>
>>>> Hi,
>>>>
>>>> SystemML has experimental GPU support, which we are working to solidify.
>>>> Currently, GPU is supported in CP (standalone/single node) mode. It
>>>> uses a single GPU (even if the node has more than one GPU).
>>>>
>>>> Communication between the GPU and the JVM happens through JCuda (MIT
>>>> license) - a light Java wrapper over CUDA that uses JNI. To that end,
>>>> JCuda needs to compile a platform-specific shared library, which is
>>>> then used to communicate with the locally installed CUDA.
>>>> To avoid having to compile a piece of C/C++ code each time, we use the
>>>> Mavenized-JCuda project (MIT license). This project internally has a
>>>> repository which contains compiled shared objects (for JCuda) for
>>>> different platforms and different versions of CUDA.
>>>>
>>>> For developers of SystemML (people who compile SystemML from source):
>>>> As of today, one can check out the master branch and follow a series of
>>>> setup steps to get SystemML running in GPU mode.
>>>> These are the steps:
>>>> https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/gpu-backend.md
>>>>
>>>> 1a)
>>>> Broadly:
>>>> 0. Compile SystemML & Mavenized-JCuda.
>>>> 1. Put the Mavenized-JCuda jars on the classpath of SystemML.
>>>> 2. Put the native shared library on the LD_LIBRARY_PATH or
>>>> java.library.path.
>>>> 3. Run SystemML with the "-gpu" flag, like so (in the
>>>> incubator-systemml directory):
>>>>
>>>> bin/systemml "file.dml" -gpu force=true
>>>>
>>>> PR 291 (https://github.com/apache/incubator-systemml/pull/291) tries to
>>>> simplify this setup (given that Mavenized-JCuda is available in one of
>>>> the repositories specified in SystemML's pom.xml):
>>>>
>>>> 1b)
>>>> 0. Compile SystemML.
>>>> 1. Run SystemML:
>>>>
>>>> bin/systemml "file.dml" -gpu force=true
>>>>
>>>> For users of SystemML:
>>>> We haven't yet decided how to ship SystemML with GPU support. Here are
>>>> the two ways we can think of:
>>>>
>>>> 2a)
>>>> 0. User installs the prerequisites (Java, CUDA, etc.).
>>>> 1. User "installs" Mavenized-JCuda or JCuda, i.e. the package jars are
>>>> made available on the classpath. The relevant shared library files
>>>> (.so, .dll) are also made available to the JVM through the
>>>> LD_LIBRARY_PATH environment variable or the java.library.path setting.
>>>> (Note: this needs to happen if using CUDA < 8.0.)
>>>> 2. User downloads and runs the SystemML jar.
>>>>
>>>> 2b)
>>>> We package JCuda/Mavenized-JCuda with the SystemML distribution. We
>>>> already package ANTLR and Wink with our jar; our other dependencies are
>>>> "provided" scope and are not pulled in by the Maven shade plugin.
>>>> A separate jar will be released for every platform.
>>>> 0. User installs the prerequisites.
>>>> 1. User downloads and runs the SystemML jar.
>>>>
>>>> There is also the matter of running SystemML with GPUs in distributed
>>>> mode. In hybrid_spark mode, with option 2a, we'd need to install
>>>> JCuda/Mavenized-JCuda on all the worker nodes. With option 2b, we
>>>> wouldn't need to.
>>>>
>>>> Berthold, Niketan and I have had a discussion and agree on option 2a
>>>> for now.
>>>>
>>>> Are there any thoughts? Inputs?
>>>>
>>>> -Nakul Jindal
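P.S. For anyone trying the developer setup (1a) quoted above, the steps boil down to roughly the following sketch (jar names and paths are illustrative; see the gpu-backend.md doc linked above for the exact ones):

  # 0. build SystemML and Mavenized-JCuda (both are Maven projects)
  mvn package
  # 1+2. JCuda jars on the classpath, native shared libraries on LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/path/to/mavenized-jcuda/native/libs:$LD_LIBRARY_PATH
  # 3. run with the "-gpu" flag
  bin/systemml "file.dml" -gpu force=true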
