Re: [nupic-dev] NuPIC performance requirements

Oreste Villa Thu, 22 Aug 2013 11:00:17 -0700

As of today, server GPUs typically have 4-6 GB of memory, while consumer
GPUs have, I would say, 2-4 GB.


Transfer from CPU to GPU runs at around 4-6 GB/sec (PCI Express limited)
for contiguous transfers larger than 256 KB (better at 1 MB). With smaller
transfer sizes the bandwidth is much lower. You also pay a latency of about
10 microseconds every time you pass through PCI Express.
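
To put rough numbers on this, here is a minimal sketch of the usual
latency-plus-bandwidth cost model for a single transfer. The function names,
the 5 GB/sec midpoint and the model itself are assumptions for illustration
only. It does not model the extra bandwidth drop for small transfers, so it
is optimistic for small sizes.

```python
# Rough cost model: time = fixed latency + bytes / peak bandwidth.
# The defaults (5 GB/sec, 10 microseconds) are illustrative assumptions
# taken from the ballpark figures above, not measurements.

def transfer_time_us(num_bytes, bandwidth_gb_per_s=5.0, latency_us=10.0):
    """Estimated time in microseconds to move num_bytes across PCI Express."""
    bytes_per_us = bandwidth_gb_per_s * 1e9 / 1e6  # GB/sec -> bytes per microsecond
    return latency_us + num_bytes / bytes_per_us


def effective_bandwidth_gb_per_s(num_bytes, **kwargs):
    """Achieved bandwidth once the fixed latency is included."""
    # bytes per microsecond equals MB/sec; divide by 1e3 to get GB/sec.
    return num_bytes / transfer_time_us(num_bytes, **kwargs) / 1e3


if __name__ == "__main__":
    # 256 bytes is a 2048-bit column vector packed as bits.
    for size in (256, 256 * 1024, 1024 * 1024):
        print(size, transfer_time_us(size), effective_bandwidth_gb_per_s(size))
```

Under this model a 256-byte transfer costs essentially the full 10
microseconds of latency, which is why many small per-time-step copies are
the thing to avoid.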

Could someone please answer these questions:

1) During a time-step, is the SP the bottleneck (I thought it was the TP)?
2) Assuming it is the SP, how many milliseconds are spent doing the
sparse matrix-vector operations?
3) How big are the matrices and the vectors, and how many matrices and
vectors are there?
4) What are these operations? (I am assuming they are sparse matrix-vector
multiplications, but I am not sure.)
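
To make question 4 concrete, this is the kind of operation I have in mind.
It is only a sketch of my assumption, not what the NuPIC code actually does:
a sparse binary matrix (one row per column, listing the input bits it is
connected to) multiplied by a sparse binary input vector, giving one overlap
count per column. The function and variable names are hypothetical.

```python
# Sketch of a sparse binary matrix-vector product (assumed operation).
# Each row stores only the indices of its non-zero entries, and the input
# vector is given as the list of indices of its ON bits.

def sparse_binary_matvec(row_indices, active_inputs):
    """Return, for each row, how many of its non-zero entries hit an ON bit."""
    active = set(active_inputs)
    return [sum(1 for j in cols if j in active) for cols in row_indices]


if __name__ == "__main__":
    # Three hypothetical columns connected to a handful of input bits.
    connections = [[0, 3, 5], [1, 2], [3, 4, 5, 7]]
    active_bits = [3, 5, 6]
    print(sparse_binary_matvec(connections, active_bits))  # [2, 0, 2]
```

If the real operation is something else (for example, if it involves
permanence values rather than binary connections), the sizes and costs
would be different, hence the questions.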

Thanks,

Oreste


On Thu, Aug 22, 2013 at 9:47 AM, Ian Danforth <[email protected]> wrote:

> Oreste,
>
>  For the novices on the list, can you give us some reference numbers for
> modern GPUs, their memory spaces and transfer speeds? For example, a
> single trained CLA region varies greatly in size, but for single-step
> prediction 100 MB resident in memory is not unusual. Under the current
> implementation, if you increase the number of prediction steps you double
> the memory requirements. (Scott, correct me if this has changed recently.)
>
>  I've always thought that a parallel CLA implementation would start at the
> region level, and what would be passed between regions would be the state
> of the columns at each time step (2048 bit sparse vector) which could be
> quite small. But that is thinking about a hierarchy of regions where data
> moves up or down, not necessarily within a level.
>
>   Perhaps Subutai could also comment on thoughts for within-region
> parallel implementations. Can a region be logically subdivided? If so,
> where would the cross-sub-region TP connections live?
>
> Thanks!
>
> Ian
>
>
> On Thu, Aug 22, 2013 at 8:36 AM, Oreste Villa <[email protected]> wrote:
>
>> Thanks for the info, I am starting to have a better picture now.
>>
>> A quick question: it is my understanding from the pseudo-code in the
>> white paper that the most computationally intensive part of the algorithm
>> is actually the TP and not the SP (simply because one iterates over the
>> columns and the other iterates over all the cells).
>> Is there a reason why you started the C++ port with the SP
>> instead of the TP?
>>
>> Also, "porting" to GPUs is a completely different story from porting from
>> Python to C++. MPI is even more different.
>>
>> For instance, moving the sparse matrix and vector to and from CPU-GPU
>> memory on every iteration, starting from the existing code, is most likely
>> not going to work (especially if the matrices are relatively small and
>> computation on the GPU is on the order of sub-milliseconds). Kernel launch
>> and copy latencies are going to kill performance. The data needs to be
>> resident in GPU memory, and only the input/output data to regions (if
>> needed) should be moved at each time-step; the larger the regions, the
>> better.
>>
>> More specific questions for the current C++ implementation of the SP:
>>
>> 1) What is the size of the sparse matrices and vectors for the example with
>> 2048 columns and 65536 cells?
>> 2) What is the exact operation on sparse matrices and vectors? (Is this a
>> simple multiplication among them? Only one per time step?)
>>
>> Also, "porting" to high performance AFTER having implemented a lot of
>> functionality in single-threaded code is usually much more difficult than
>> starting by designing high-performance code and then slowly adding new
>> functionality.
>>
>> Having said that, one idea I am thinking about (for which I would like to
>> get your feedback) is to actually make a "simple" MPI + GPU (or just GPU to
>> start) prototype.
>> I am thinking of starting from scratch from the white paper (or the latest
>> incarnation of it) to evaluate trade-offs and benefits of the approach.
>> It is my intuition that significantly speeding up a single region with 2048
>> columns and 65536 cells is not going to be easy with GPUs unless the
>> full code and data are running inside the GPU and only the minimal necessary
>> input/output is done.
>> However, when you have tens of regions with millions of cells each, things
>> are going to be much different. It would be nice to know where the trade-off
>> point is, hence a simple prototype.
>>
>> Please let me know if the above makes some sense to you.
>>
>> Oreste
>>
>>
>>
>> On Wed, Aug 21, 2013 at 9:21 PM, Subutai Ahmad <[email protected]> wrote:
>>
>>>
>>> Oreste and Doug - thank you for the comments! It is great to see these
>>> ideas. I agree that performance optimization is extremely important. I want
>>> to encourage you guys to continue the discussion and hopefully implement
>>> something too. A speed improvement will help the community and accelerate
>>> CLA based experimentation. (One of the main reasons we stopped focusing on
>>> vision was the inability to run large experiments in reasonable time.)
>>>
>>> I don't know if this is useful, but here's a quick guide to the current
>>> optimized code:
>>>
>>> Currently within NuPIC the CLA is implemented as a combination of Python
>>> and C++. Over time we moved the slower portions of CLA to C++.   Today, the
>>> code is reasonably fast. A CLA region with 2048 columns and 65536 cells
>>> takes about 20 msecs per iteration on my laptop. This includes SP, TP, the
>>> classifier and full online learning turned on.
>>>
>>> There are two main building blocks in C++ from a performance standpoint.
>>>
>>> 1) There is a set of SparseMatrix classes that implement fast data
>>> structures for sparse vectors and matrices. They include a large number of
>>> utility routines that are optimized for CLA functions. Today these are
>>> primarily used by the Spatial Pooler. There are Python bindings for these
>>> classes. You can go to
>>> nupic/examples/bindings/sparse_matrix_how_to.py
>>> <https://github.com/numenta/nupic/blob/master/examples/bindings/sparse_matrix_how_to.py>
>>> for a tutorial on this.
>>>
>>> 2) There is a set of classes that implement fast data structures for the
>>> temporal pooler. These are very specific to temporal pooling. Sparse
>>> matrices were not enough for the TP. The input is very sparse and we
>>> implemented some strategies for evaluating only cells that are connected
>>> to those ON input bits. The main starting point for this code is
>>> nupic/nta/algorithms/Cells4.hpp
>>> <https://github.com/numenta/nupic/blob/master/nta/algorithms/Cells4.hpp>.
>>> There are Python bindings for this as well, and they are called by the
>>> Python temporal pooler class TP10X2.py
>>> <https://github.com/numenta/nupic/blob/master/py/nupic/research/TP10X2.py>.
>>>
>>> Our current plan is to create a pure C++ spatial pooler implementation
>>> (see Gil's email). This will also be much cleaner than the current
>>> implementation. Perhaps it can serve as a base for some of the ideas that
>>> Oreste and Doug have mentioned.
>>>
>>> We have not extensively explored multi-threaded options. We mainly
>>> focused on serial optimizations so far.
>>>
>>> Hope this helps!
>>>
>>> --Subutai
>>>
>>>
>>>
>>> On Wed, Aug 21, 2013 at 2:02 PM, Oreste Villa <[email protected]> wrote:
>>>
>>>> Thanks for the pointers, I can start looking into the code and the
>>>> tickets. Unfortunately I have limited time and I will be looking into it as
>>>> a "hobby" project for the moment.
>>>>
>>>> I think there is still some time before moving to OpenCL (or other). As
>>>> you said, the first thing on the list is to have a very good C++
>>>> parallelizable CPU implementation. I would say the best thing is most
>>>> likely OpenMP; also, GCC 4.9 should in the very near future support OpenACC,
>>>> which is basically the equivalent of OpenMP but for GPUs. So maybe we can
>>>> skip OpenCL completely. BTW, I am not a big fan of OpenCL (because I think
>>>> it is very verbose and tedious to use) and I have worked a lot with CUDA and
>>>> MPI. I agree OpenCL is more portable than CUDA, but OpenACC targets both
>>>> and should solve this issue.
>>>>
>>>> Regarding unified address space: from the programming model point of
>>>> view it is going to come very soon, but hardware support with
>>>> indistinguishable performance should come in 2 generations (crossing
>>>> fingers).
>>>>
>>>> Oreste
>>>> On Aug 21, 2013 10:32 AM, "Doug King" <[email protected]> wrote:
>>>>
>>>>> Hi Oreste,
>>>>>
>>>>> Good points and observations. My comments below.
>>>>>
>>>>> *1) Make sure all the SP and TP (basically the code with a lot of
>>>>> nested loops) is at least C++. I would also make sure I am not relying too
>>>>> much on the STL or Boost libraries (for instance in the list of
>>>>> active columns, etc.) as this is usually tricky (or low performance) when
>>>>> ported to GPUs.*
>>>>>
>>>>> We are doing some of this in the current codebase. See Jira ticket
>>>>> https://issues.numenta.org/browse/NPC-286  - porting entire CLA to
>>>>> c++ with language bindings for other languages. First steps are to migrate
>>>>> spatial pooler to c++. To see progress on this check here:
>>>>> https://issues.numenta.org/browse/NPC-246  I am not sure of the
>>>>> approach they are taking regarding STL and Boost library.
>>>>>
>>>>> *2) I would convert the above nested loops (for all columns, for all
>>>>> cells, etc) to OpenCL. We have to be careful about data movement today and
>>>>> try to keep most of the data in GPU space (although this problem is going
>>>>> to disappear with next or next-next generation of GPUs which are going to
>>>>> have unified address space). *
>>>>>
>>>>> Yes, you are right about data movement. I don't know much about how
>>>>> OpenCL abstracts storage and message passing to nodes. Data movement could
>>>>> kill any gains you get from parallel computing if not handled correctly -
>>>>> biggest issue is, how to move input data onto the GPU and extract
>>>>> prediction back out. I/O for each time frame is the problem. First steps I
>>>>> think would be to convert those areas to CPU manycore parallel code in C++
>>>>> to sort out any issues, then OpenCL.
>>>>>
>>>>> Interesting about next-gen GPUs and unified address space - how close
>>>>> are we to getting unified memory space GPU?
>>>>>
>>>>> *3) Support for multiple GPUs within a single node. I would map
>>>>> different regions to different host threads and GPUs, and for large
>>>>> regions I would try to partition them across multiple host threads and
>>>>> GPUs. Considering that they are 2D regions and communication is mostly
>>>>> localized around the columns, it should be doable. However, the boundaries
>>>>> of the partitions are going to be tricky, as updates to cells or columns
>>>>> will depend on values that are in another GPU's address space (again, the
>>>>> next-next generation of accelerators should solve this problem with
>>>>> unified address space).*
>>>>>
>>>>> Perhaps we should be looking at cheap multi-core CPUs for now to
>>>>> address this, or would we need to port the entire CLA to OpenCL to run
>>>>> in shared memory on the GPU? I don't know enough about the architecture of
>>>>> the current code, OpenCL and GPU memory space to understand the right
>>>>> approach, but we should take logical steps that we can build on to get
>>>>> there.
>>>>>
>>>>> *4) To parallelize across multiple nodes in a cluster I would
>>>>> definitely go for MPI and not map-reduce. The reason is that map-reduce
>>>>> is used mostly for embarrassingly parallel jobs with a final reduce phase
>>>>> to compute the final result. In our case, considering that the
>>>>> "computation" is based around the concept of a step (or clock), at every
>>>>> step there is going to be a significant amount of communication across
>>>>> regions and within the region. Thankfully, if we stick to the concept of a
>>>>> time step (not exactly brain-like) we can batch that communication and
>>>>> perform it at the end of each step. If using MPI, I would map different
>>>>> regions to different MPI ranks, and for large regions I would partition
>>>>> them across multiple MPI ranks as explained for point 3 within the node
>>>>> (basically a hierarchy of parallelization).*
>>>>>
>>>>> Agreed - Map-Reduce is not the right paradigm for the reasons you
>>>>> state. For moving lots of data across multiple compute nodes MPI is the
>>>>> standard and might be appropriate to adopt here I think.
>>>>>
>>>>> When we start talking about hierarchy though, I think operation would
>>>>> be similar to map/reduce with lower regions computing on nodes that are
>>>>> independent of each other and which feed higher regions at a slower rate.
>>>>> For example, audio (speech) prediction - break audio into spectrum
>>>>> (frequency bands), feed each band in the spectrum to individual regions.
>>>>> Feed predictions of each region into a higher region that aggregates all
>>>>> the predictions of individual audio bands.
>>>>>
>>>>> If we want to move this forward we should come up with a roadmap and
>>>>> next steps. Input from others here would be appreciated.
>>>>>
>>>>> -Doug
>>>>>
>>>>>
>>>>> On Wed, Aug 21, 2013 at 8:26 AM, Oreste Villa <[email protected]> wrote:
>>>>>
>>>>>> So if I understand correctly, most of the code is Python but the
>>>>>> "core" is mostly C++ or is going to be C++.
>>>>>>
>>>>>> The way I would approach this problem is the following (in temporal
>>>>>> order):
>>>>>>
>>>>>> 1) Make sure all the SP and TP (basically the code with a lot of
>>>>>> nested loops) is at least C++. I would also make sure I am not relying too
>>>>>> much on the STL or Boost libraries (for instance in the list of
>>>>>> active columns, etc.) as this is usually tricky (or low performance) when
>>>>>> ported to GPUs.
>>>>>>
>>>>>> 2) I would convert the above nested loops (for all columns, for all
>>>>>> cells, etc.) to OpenCL. We have to be careful about data movement today
>>>>>> and try to keep most of the data in GPU space (although this problem is
>>>>>> going to disappear with the next or next-next generation of GPUs, which
>>>>>> are going to have unified address space).
>>>>>>
>>>>>> 3) Support for multiple GPUs within a single node. I would map
>>>>>> different regions to different host threads and GPUs, and for large
>>>>>> regions I would try to partition them across multiple host threads and
>>>>>> GPUs. Considering that they are 2D regions and communication is mostly
>>>>>> localized around the columns, it should be doable. However, the boundaries
>>>>>> of the partitions are going to be tricky, as updates to cells or columns
>>>>>> will depend on values that are in another GPU's address space (again, the
>>>>>> next-next generation of accelerators should solve this problem with
>>>>>> unified address space).
>>>>>>
>>>>>> 4) To parallelize across multiple nodes in a cluster I would
>>>>>> definitely go for MPI and not map-reduce. The reason is that map-reduce
>>>>>> is used mostly for embarrassingly parallel jobs with a final reduce phase
>>>>>> to compute the final result. In our case, considering that the
>>>>>> "computation" is based around the concept of a step (or clock), at every
>>>>>> step there is going to be a significant amount of communication across
>>>>>> regions and within the region. Thankfully, if we stick to the concept of a
>>>>>> time step (not exactly brain-like) we can batch that communication and
>>>>>> perform it at the end of each step. If using MPI, I would map different
>>>>>> regions to different MPI ranks, and for large regions I would partition
>>>>>> them across multiple MPI ranks as explained for point 3 within the node
>>>>>> (basically a hierarchy of parallelization).
>>>>>>
>>>>>> The code would obviously also work with a single MPI process and
>>>>>> therefore on a normal workstation with or without GPUs. BTW, by GPU I
>>>>>> also mean Intel Phi. Also, I estimate that points 1 to 4 will take at
>>>>>> least 1 or 2 years of work, depending on the number of people involved.
>>>>>>
>>>>>> Regarding hardware implementation, I also feel it is the right way to go
>>>>>> in the long term, but for now I would definitely go with the above
>>>>>> solution (considering that the algorithm will most likely change in the
>>>>>> next years).
>>>>>> If well implemented, the above approach could increase performance by
>>>>>> at least 2 orders of magnitude within a node and most likely scale
>>>>>> linearly across a moderate number of cluster nodes.
>>>>>>
>>>>>> I know some of the people on this mailing list have implemented their
>>>>>> own C++ version of HTM in the past, so I am sure they would definitely be
>>>>>> interested. Comments are welcome.
>>>>>>
>>>>>> Oreste
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 20, 2013 at 11:22 PM, Doug King <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Oreste,
>>>>>>>
>>>>>>> You are right, performance will be a central issue. There are a few
>>>>>>> bottlenecks in the algorithm that can be attacked with hardware
>>>>>>> acceleration. The best approach I can think of for now is to use
>>>>>>> parallelization (some form of map-reduce) to solve this. OpenCL would
>>>>>>> be a good choice to use in place of some of the C++ or Python code. The
>>>>>>> rest of the Python code could be kept as-is to allow for easy
>>>>>>> experimentation for optimization of parameters or changes to features
>>>>>>> that are not core CLA algorithms.
>>>>>>>
>>>>>>> There are many OpenCL drivers for GPUs and there is even a platform
>>>>>>> for converting OpenCL code to FPGA hardware. Eventually the CLA will be
>>>>>>> ported to some sort of digital/analog hybrid device that simulates
>>>>>>> dendrite/synapse connections on neuromorphic silicon. This will not be
>>>>>>> far off - maybe 5 years or less for early experiments, 10 years for cheap
>>>>>>> commodity devices.
>>>>>>>
>>>>>>> For now, most of us are trying to get proof-of-concept results
>>>>>>> with the current code base; then we will figure out how to scale up
>>>>>>> and optimize.
>>>>>>>
>>>>>>> Another key to acceleration will be the sharing of trained networks
>>>>>>> that have encapsulated many CPU hours of training on fundamental
>>>>>>> streams of data, for example speech audio, that once trained will be
>>>>>>> shared or sold. If this happens, the building blocks of lower HTM regions
>>>>>>> could be leveraged to get to the next level. We need to work towards some
>>>>>>> CLA network serialization standards for this to happen.
>>>>>>>
>>>>>>> I think you are correct in your assumptions, and if you want to
>>>>>>> contribute to the effort to move to a more performant version of the
>>>>>>> code, I would love to see someone port some of the critical segments of
>>>>>>> the CLA code to OpenCL. For an analysis of where the bottlenecks are in
>>>>>>> the CLA and hardware solutions, you can start by checking out this paper:
>>>>>>> http://www.pdx.edu/sites/www.pdx.edu.sysc/files/SySc.Seminar.Hammestrom.May.2011.pdf
>>>>>>>
>>>>>>> -Doug
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 20, 2013 at 9:59 PM, oreste villa <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello everybody, this is my first post on this list so please
>>>>>>>> forgive me if this has already been addressed before.
>>>>>>>>
>>>>>>>> I have seen that the current NuPIC source code is mostly Python
>>>>>>>> and I am wondering....
>>>>>>>>
>>>>>>>> I don't know about the problems people are trying to solve today
>>>>>>>> (maybe for demand and response of power in a building this is not
>>>>>>>> true) but in the future I believe performance is going to be a central
>>>>>>>> issue. Python seems to be a non-optimal choice in this respect (as are
>>>>>>>> single-threaded Java, single-threaded C#, single-threaded C++, or
>>>>>>>> anything not parallel).
>>>>>>>>
>>>>>>>> I keep thinking, for instance, that the Large Hadron Collider at CERN
>>>>>>>> produces something like 3 GByte/s of raw data, and it would be really
>>>>>>>> nice if we were able to feed a full year of experiments in real time to
>>>>>>>> a system based on the CLA. Also, in robotics, performance and I/O
>>>>>>>> bandwidth requirements for vision, sensing and motion control are
>>>>>>>> impressive.
>>>>>>>>
>>>>>>>> The question/discussion point I wanted to make is, where does the
>>>>>>>> project stand in terms of performance? More specifically, are there any
>>>>>>>> plans to design high performance code inside NuPIC (openMP, CUDA,
>>>>>>>> MPI)? Is this something much less emphasized because the focus of the
>>>>>>>> project is more on learning the basic CLA principles?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Oreste
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> nupic mailing list
>>>>>>>> [email protected]
>>>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org