[jira] [Comment Edited] (FLINK-5782) Support GPU calculations

Kate Eri (JIRA) Mon, 20 Feb 2017 08:03:06 -0800

    [ https://issues.apache.org/jira/browse/FLINK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871803#comment-15871803 ]


Kate Eri edited comment on FLINK-5782 at 2/20/17 4:02 PM:
----------------------------------------------------------

I have checked out ND4J and Apache Mahout for their GPU integrations and found 
the following:
1)      ND4J:
        Advantages: 
        * seems to be better suited to different computational tasks
        * supports a wide range of natively optimized libraries such as 
OpenBLAS, Intel MKL, cuBLAS and so on. 
        * ND4J automatically detects whether a CPU or GPU is available and uses 
the appropriate backend: see [ND4J Backends: How They 
Work|http://nd4j.org/backend.html] and the source code of Nd4jBackend.java. 
There is no API to specify which backend to use first, other than the priority 
mechanism. 

        Disadvantages:  
        * ND4J doesn’t support sparse matrices. 

2)      Apache Mahout:
        Advantages: 
        * supports sparse matrices
        * falls back from GPU to CPU when no GPU is available 
        * supports distributed algebraic calculations

        Disadvantages: 
        * no natively optimized linear algebra routines (BLAS, LAPACK and so 
on) 
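
The backend selection described above for ND4J (the highest-priority available 
backend wins, and the CPU is used when no GPU is present, much like the GPU/CPU 
switch listed for Mahout) can be sketched in plain Java. This is a minimal 
sketch under stated assumptions: ComputeBackend, StaticBackend and 
BackendSelector are hypothetical names for illustration only; they are not 
ND4J's Nd4jBackend API or any Mahout class.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical backend abstraction for illustration; NOT ND4J's Nd4jBackend API.
interface ComputeBackend {
    String name();

    // Higher value = preferred when several backends are available.
    int priority();

    // A real GPU backend would check for a usable device and driver here.
    boolean isAvailable();
}

// Fixed-value backend used to model CPU and GPU candidates in this sketch.
final class StaticBackend implements ComputeBackend {
    private final String name;
    private final int priority;
    private final boolean available;

    StaticBackend(String name, int priority, boolean available) {
        this.name = name;
        this.priority = priority;
        this.available = available;
    }

    @Override public String name() { return name; }
    @Override public int priority() { return priority; }
    @Override public boolean isAvailable() { return available; }
}

final class BackendSelector {
    private BackendSelector() {}

    // Returns the available backend with the highest priority (the first one wins on ties),
    // or an empty Optional if no backend is available at all.
    static Optional<ComputeBackend> select(List<? extends ComputeBackend> candidates) {
        ComputeBackend best = null;
        for (ComputeBackend candidate : candidates) {
            if (candidate.isAvailable()
                    && (best == null || candidate.priority() > best.priority())) {
                best = candidate;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With a CPU backend at priority 0 and a GPU backend at priority 100, select 
returns the GPU backend on a machine with a usable GPU and the CPU backend 
otherwise. The caller influences the outcome only through priorities, not by 
naming a backend directly, which matches the limitation noted for ND4J.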

But the main problem with both integrations, as I see it now, is described here: 
[Caching/Persisting RDD<DataSets> and 
RDD<INDArrays>|https://deeplearning4j.org/spark]. 
In short: Flink has a custom memory model and manages memory itself, both 
on-heap and off-heap; it allocates memory segments even when off-heap memory is 
used. Apache Mahout and ND4J, however, manage native memory on their own, and 
they don’t have any API that exposes:
* how much memory they have consumed
* the addresses of data in off-heap memory, in case we need to move this data 
over the network without copying it to the heap first.

If Flink has no control over the consumed memory, this could cause OOM 
situations in which all RAM is consumed but Flink is unable to react, not even 
by spilling some data to disk.  
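
To illustrate what the missing memory-accounting API would enable, here is a 
minimal sketch of a shared budget that native allocations would be reserved 
against, so that a refused reservation becomes a signal to spill rather than an 
OOM. It assumes the libraries could report their native allocations, which, as 
described above, they currently cannot. NativeMemoryBudget and its methods are 
hypothetical and are not part of Flink's memory manager API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical budget for native (off-heap or GPU-side) memory; all names are illustrative.
final class NativeMemoryBudget {
    private final long capacityBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    NativeMemoryBudget(long capacityBytes) {
        if (capacityBytes < 0) {
            throw new IllegalArgumentException("capacity must be non-negative");
        }
        this.capacityBytes = capacityBytes;
    }

    // Atomically reserves `bytes` if they fit into the remaining budget.
    // Returns false instead of allocating, so the caller can spill to disk first.
    boolean tryReserve(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        while (true) {
            long current = usedBytes.get();
            // Compare against the remaining space to avoid overflowing current + bytes.
            if (bytes > capacityBytes - current) {
                return false;
            }
            if (usedBytes.compareAndSet(current, current + bytes)) {
                return true;
            }
            // Another thread changed usedBytes concurrently; retry with the new value.
        }
    }

    // Returns previously reserved bytes to the budget once the native buffer is freed.
    void release(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        while (true) {
            long current = usedBytes.get();
            if (bytes > current) {
                throw new IllegalStateException("released more than was reserved");
            }
            if (usedBytes.compareAndSet(current, current - bytes)) {
                return;
            }
        }
    }

    long usedBytes() {
        return usedBytes.get();
    }

    long remainingBytes() {
        return capacityBytes - usedBytes.get();
    }
}
```

In this sketch, a library such as ND4J would call tryReserve before each native 
allocation and release after freeing the buffer; when tryReserve returns false, 
Flink could spill intermediate data to disk and retry. Refusing a reservation 
is cheap and explicit, whereas an unaccounted native allocation can only fail 
with an OOM.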



was (Author: kateri):
I have checked out ND4J and Apache Mahout for their GPU integrations and found 
the following:
1)      ND4J:
        Advantages: 
        * seems to be better suited to different computational tasks
        * supports a wide range of natively optimized libraries such as 
OpenBLAS, Intel MKL, cuBLAS and so on. 

        Disadvantages:  
        * ND4J doesn’t support sparse matrices. 
        * For now it is impossible to configure ND4J programmatically to use 
either the CPU or the GPU for calculations; to switch, you need to depend on a 
different library. See [Configuring the POM.xml file|http://nd4j.org/dependencies.html]

2)      Apache Mahout:
        Advantages: 
        * supports sparse matrices
        * falls back from GPU to CPU when no GPU is available 
        * supports distributed algebraic calculations

        Disadvantages: 
        * no natively optimized linear algebra routines (BLAS, LAPACK and so 
on) 

But the main problem with both integrations, as I see it now, is described here: 
[Caching/Persisting RDD<DataSets> and 
RDD<INDArrays>|https://deeplearning4j.org/spark]. 
In short: Flink has a custom memory model and manages memory itself, both 
on-heap and off-heap; it allocates memory segments even when off-heap memory is 
used. Apache Mahout and ND4J, however, manage native memory on their own, and 
they don’t have any API that exposes:
* how much memory they have consumed
* the addresses of data in off-heap memory, in case we need to move this data 
over the network without copying it to the heap first.

If Flink has no control over the consumed memory, this could cause OOM 
situations in which all RAM is consumed but Flink is unable to react, not even 
by spilling some data to disk.  


> Support GPU calculations
> ------------------------
>
>                 Key: FLINK-5782
>                 URL: https://issues.apache.org/jira/browse/FLINK-5782
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.3.0
>            Reporter: Kate Eri
>            Priority: Minor
>
> This ticket was initiated as a continuation of the dev discussion thread: [New 
> Flink team member - Kate Eri (Integration with DL4J 
> topic)|http://mail-archives.apache.org/mod_mbox/flink-dev/201702.mbox/browser]
>   
> We recently proposed integrating 
> [Deeplearning4J|https://deeplearning4j.org/index.html] with Apache Flink. 
> It is well known that training DL models is a resource-demanding process, so 
> training on a CPU can take much longer to converge than on a GPU.  
> GPUs could be used not only for DL training but also to speed up graph 
> analytics and other typical data manipulations; a nice overview of 
> GPU-related problems is presented in [Accelerating Spark workloads using 
> GPUs|https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus].
> So far, the community has pointed out the following issues to consider:
> 1)    Flink would like to avoid implementing its own GPU support, to reduce 
> the engineering burden. That is why libraries such as 
> [ND4J|http://nd4j.org/userguide] should be considered. 
> 2)    Currently Flink uses [Breeze|https://github.com/scalanlp/breeze] to 
> optimize linear algebra calculations. ND4J can’t be integrated as is, because 
> it still doesn’t support [sparse arrays|http://nd4j.org/userguide#faq]. Maybe 
> sparse support should simply be contributed to ND4J to enable its use?
> 3)    The calculations would have to work whether or not GPUs are 
> available. If the system detects that GPUs are available, then ideally it 
> would exploit them. Thus GPU resource management could be incorporated into 
> [FLINK-5131|https://issues.apache.org/jira/browse/FLINK-5131] (only 
> suggested).
> 4)    It was mentioned that since Flink takes care of shipping data around 
> the cluster, it would also handle moving data out to the GPU for calculation 
> and loading the results back. In practice, the lack of a persist method for 
> intermediate results makes this troublesome (not because of GPUs, but because 
> for any sort of complex algorithm we expect to be able to cache intermediate 
> results). That is why 
> [FLINK-1730|https://issues.apache.org/jira/browse/FLINK-1730] must be 
> implemented to solve this problem.  
> 5)    It was also recommended to take a look at Apache Mahout, at least to 
> learn from its GPU integration experience, and to check its ViennaCL modules:
> https://github.com/apache/mahout/tree/master/viennacl-omp
> https://github.com/apache/mahout/tree/master/viennacl 
> 6)    For now, GPU support is proposed only for optimizing batch 
> calculations. GPU support for streaming should be handled in a separate 
> ticket, because GPU optimization of streaming requires additional research.
> 7)    Netflix’s experience with this question could also be considered: 
> [Distributed Neural Networks with GPUs in the AWS 
> Cloud|http://techblog.netflix.com/search/label/CUDA]   
> This is considered the master ticket for GPU-related tickets.
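
Item 2 above and the sparse-matrix comparison in the comment both hinge on 
sparse storage. A minimal sketch of the Compressed Sparse Row (CSR) format and 
its matrix-vector product shows what such support means in practice: only 
non-zero entries are stored and visited. This is a generic illustration of the 
technique with a hypothetical class name, not the API of Breeze, Mahout or 
ND4J.

```java
// Minimal Compressed Sparse Row (CSR) matrix; a generic sketch, not a library API.
final class CsrMatrix {
    private final int rows;
    private final int cols;
    private final int[] rowPtr;    // length rows + 1; row i occupies [rowPtr[i], rowPtr[i + 1])
    private final int[] colIdx;    // column index of each stored non-zero
    private final double[] values; // the stored non-zero values

    CsrMatrix(int rows, int cols, int[] rowPtr, int[] colIdx, double[] values) {
        if (rowPtr.length != rows + 1 || colIdx.length != values.length
                || rowPtr[0] != 0 || rowPtr[rows] != values.length) {
            throw new IllegalArgumentException("inconsistent CSR arrays");
        }
        this.rows = rows;
        this.cols = cols;
        this.rowPtr = rowPtr.clone();
        this.colIdx = colIdx.clone();
        this.values = values.clone();
    }

    // Computes y = A * x, touching only the stored non-zero entries.
    double[] multiply(double[] x) {
        if (x.length != cols) {
            throw new IllegalArgumentException("expected vector of length " + cols);
        }
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++) {
            double sum = 0.0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                sum += values[k] * x[colIdx[k]];
            }
            y[i] = sum;
        }
        return y;
    }

    int nonZeros() {
        return values.length;
    }
}
```

For example, the 3x3 matrix with rows (1, 0, 2), (0, 0, 3) and (4, 0, 0) is 
stored as rowPtr = {0, 2, 3, 4}, colIdx = {0, 2, 2, 0}, values = {1, 2, 3, 4}; 
multiplying it by (1, 2, 3) gives (7, 9, 4) while visiting 4 stored entries 
instead of 9.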



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

