Hi Ian,

Subdividing a region is a bit tricky since the CLA models a continuous sheet of neurons. It's much easier when you have a topology, since connections are local. In that case you can make a cut anywhere; you just need to guarantee that cells on both sides of the cut get the correct input.
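To make "cells on both sides of the cut get the correct input" concrete, here is a minimal sketch (plain Python, a hypothetical helper, not NuPIC code): partition a 1-D input sheet and pad each partition with a halo equal to the receptive-field radius, so columns adjacent to a cut still see their full input.

```python
def split_with_halo(n_inputs, n_parts, radius):
    """Partition a 1-D input sheet of n_inputs bits into n_parts slices,
    each padded with a 'halo' of `radius` extra bits on each side, so
    that columns next to a cut still see their full receptive field."""
    bounds = [i * n_inputs // n_parts for i in range(n_parts + 1)]
    parts = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        parts.append((max(0, lo - radius), min(n_inputs, hi + radius)))
    return parts

# 100-bit input, 4 partitions, receptive-field radius 3:
# interior partitions overlap their neighbours by 2*radius bits.
print(split_with_halo(100, 4, 3))
# -> [(0, 28), (22, 53), (47, 78), (72, 100)]
```

Each worker then computes only its own columns, but reads the overlapping input slice; the overlap is what guarantees correctness at the cut.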
However, you don't have to subdivide a region. As you point out, when you have a hierarchy you will have multiple regions that are logically separate. In this case one or more regions feed into a second, higher level region, and so on. It can make sense to parallelize each region separately. The trick here is to keep each region busy. You can do that by pipelining: cells at level 1 can process input from time T while cells at level 2 are processing inputs from time T-1.

--Subutai

On Thu, Aug 22, 2013 at 9:47 AM, Ian Danforth <[email protected]> wrote:

> Oreste,
>
> For the novices on the list can you give us some reference numbers for modern GPUs, their memory spaces and transfer speeds? For example, a single, trained CLA region varies greatly in size, but for single step prediction 100MB resident in memory is not unusual. Under the current implementation, if you increase the number of prediction steps you double the memory requirements. (Scott, correct me if this has changed recently.)
>
> I've always thought that a parallel CLA implementation would start at the region level, and what would be passed between regions would be the state of the columns at each time step (a 2048-bit sparse vector), which could be quite small. But that is thinking about a hierarchy of regions where data moves up or down, not necessarily within a level.
>
> Perhaps Subutai could also comment on thoughts for within-region parallel implementations. Can a region be logically subdivided? If so, where would the cross-sub-region TP connections live?
>
> Thanks!
>
> Ian
>
> On Thu, Aug 22, 2013 at 8:36 AM, Oreste Villa <[email protected]> wrote:
>
>> Thanks for the info, I am starting to have a better picture now.
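Subutai's pipelining suggestion above (level 1 works on time T while level 2 works on T-1) can be sketched with two worker threads; `level1` and `level2` here are toy stand-ins, not the NuPIC region API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(inputs, level1, level2):
    """Two-level pipeline: level 1 processes the input for time T while
    level 2 is still digesting level 1's output for time T-1.
    `level1` and `level2` are toy stand-ins for region compute calls."""
    results = []
    prev_l1_out = None
    with ThreadPoolExecutor(max_workers=2) as pool:
        for x in inputs:
            f1 = pool.submit(level1, x)                # time T
            if prev_l1_out is not None:
                f2 = pool.submit(level2, prev_l1_out)  # time T-1, concurrent
                results.append(f2.result())
            prev_l1_out = f1.result()
        if prev_l1_out is not None:
            results.append(level2(prev_l1_out))        # drain the pipeline
    return results

# Toy regions: level 1 doubles its input, level 2 adds one.
print(run_pipelined([1, 2, 3], lambda x: 2 * x, lambda y: y + 1))
# -> [3, 5, 7]
```

The same shape generalizes to one worker (thread, process, or MPI rank) per region, with only the sparse column-state vector crossing the boundary each step.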
>> A quick question: it is my understanding from the pseudo-code in the white paper that the most computationally intensive part of the algorithm is actually the TP and not the SP (simply because one iterates over the columns and the other iterates over all the cells). Is there a reason why you have started the C++ porting from the SP instead of the TP?
>>
>> Also, "porting" to GPUs is an altogether different story than porting from Python to C++. MPI is even more different.
>>
>> For instance, moving the sparse matrix and vector to and from CPU-GPU memory on every iteration, starting from the existing code, is most likely not going to make it (especially if the matrices are relatively small and computation on the GPU is of order sub-millisecond). Kernel launch and copy latencies are going to kill performance. The data needs to be resident in GPU memory, and only the input/output data to regions (if needed) needs to be moved at each time step; the larger the regions the better.
>>
>> More specific questions for the current C++ implementation of the SP:
>>
>> 1) What is the size of the sparse matrices and vectors for the example with 2048 columns and 65536 cells?
>> 2) What is the exact operation on the sparse matrices and vectors? (Is it a simple multiplication between them? Only one per time step?)
>>
>> Also, usually "porting" to high performance AFTER having implemented a lot of functionality in single threaded code is much more difficult than designing high performance code first and then slowly adding new functionality.
>>
>> Having said that, one idea I am thinking about (on which I would like to get your feedback) is to actually make a "simple" MPI + GPU (or just GPU to start) prototype. I am thinking of starting from scratch from the white paper (or the latest incarnation of it) to evaluate trade-offs and benefits of the approach.
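A quick back-of-envelope on Oreste's latency point (the ~10 µs launch/copy latency and ~6 GB/s PCIe bandwidth below are assumed round numbers, not measurements): shipping a whole 100 MB region to and from the GPU every step costs tens of milliseconds in transfers alone, dwarfing a sub-millisecond kernel, while a 2048-bit column-state vector is dominated by the fixed latency.

```python
def step_transfer_ms(payload_bytes, latency_us=10.0, bw_gb_s=6.0):
    """Rough cost (ms) of one host<->device copy per time step: a fixed
    launch/copy latency plus the payload over PCIe bandwidth.
    The latency and bandwidth figures are assumptions, not measurements."""
    return latency_us / 1000.0 + payload_bytes / (bw_gb_s * 1e9) * 1e3

full_region = step_transfer_ms(100 * 1024**2)  # ship a whole 100 MB region
column_sdr = step_transfer_ms(2048 // 8)       # 2048-bit column state vector
print(round(full_region, 1), round(column_sdr, 3))
# -> 17.5 0.01
```

This is exactly why the data needs to stay resident in GPU memory, with only region inputs/outputs crossing the bus each step.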
>> It is my intuition that significantly speeding up a single region with 2048 columns and 65536 cells is not going to be easy with GPUs unless the full code and data is running inside the GPU and only the minimal necessary input/output is done. However, when you have 10s of regions with 1,000,000s of cells each, things are going to be much different. It would be nice to know where the trade-off point is, hence a simple prototype.
>>
>> Please let me know if the above makes some sense to you.
>>
>> Oreste
>>
>> On Wed, Aug 21, 2013 at 9:21 PM, Subutai Ahmad <[email protected]> wrote:
>>
>>> Oreste and Doug - thank you for the comments! It is great to see these ideas. I agree that performance optimization is extremely important. I want to encourage you guys to continue the discussion and hopefully implement something too. A speed improvement will help the community and accelerate CLA based experimentation. (One of the main reasons we stopped focusing on vision was the inability to run large experiments in reasonable time.)
>>>
>>> I don't know if this is useful, but here's a quick guide to the current optimized code:
>>>
>>> Currently within NuPIC the CLA is implemented as a combination of Python and C++. Over time we moved the slower portions of the CLA to C++. Today, the code is reasonably fast. A CLA region with 2048 columns and 65536 cells takes about 20 msecs per iteration on my laptop. This includes the SP, TP, the classifier, and full online learning turned on.
>>>
>>> There are two main building blocks in C++ from a performance standpoint.
>>>
>>> 1) There is a set of SparseMatrix classes that implement fast data structures for sparse vectors and matrices. They include a large number of utility routines that are optimized for CLA functions. Today these are primarily used by the Spatial Pooler. There are Python bindings for these classes.
>>> You can go to nupic/examples/bindings/sparse_matrix_how_to.py <https://github.com/numenta/nupic/blob/master/examples/bindings/sparse_matrix_how_to.py> for a tutorial on this.
>>>
>>> 2) There is a set of classes that implement fast data structures for the temporal pooler. These are very specific to temporal pooling; sparse matrices were not enough for the TP. The input is very sparse, and we implemented some strategies for evaluating only cells that connect to those ON input bits. The main starting point for this code is nupic/nta/algorithms/Cells4.hpp <https://github.com/numenta/nupic/blob/master/nta/algorithms/Cells4.hpp>. There are Python bindings for this as well, and they are called by the Python temporal pooler class TP10X2.py <https://github.com/numenta/nupic/blob/master/py/nupic/research/TP10X2.py>.
>>>
>>> Our current plan is to create a pure C++ spatial pooler implementation (see Gil's email). This will also be much cleaner than the current implementation. Perhaps it can serve as a base for some of the ideas that Oreste and Doug have mentioned.
>>>
>>> We have not extensively explored multi-threaded options. We have mainly focused on serial optimizations so far.
>>>
>>> Hope this helps!
>>>
>>> --Subutai
>>>
>>> On Wed, Aug 21, 2013 at 2:02 PM, Oreste Villa <[email protected]> wrote:
>>>
>>>> Thanks for the pointers, I can start looking into the code and the tickets. Unfortunately I have limited time and I will be looking into this as a "hobby" project for the moment.
>>>>
>>>> I think there is still some time before moving to OpenCL (or other). As you said, the first thing on the list is to have a very good, parallelizable C++ CPU implementation. I would say the best choice is most likely OpenMP; also, gcc 4.9 should in the very near future support OpenACC, which is basically the equivalent of OpenMP but for GPUs. So maybe we can skip OpenCL completely.
>>>> BTW I am not a big fan of OpenCL (because I think it is very verbose and tedious to use) and I have worked a lot with CUDA and MPI. I agree OpenCL is more portable than CUDA, but OpenACC targets both and should solve this issue.
>>>>
>>>> Regarding unified address space: from the programming model point of view it is going to come very soon, but hardware support with indistinguishable performance should come in 2 generations (crossing fingers).
>>>>
>>>> Oreste
>>>>
>>>> On Aug 21, 2013 10:32 AM, "Doug King" <[email protected]> wrote:
>>>>
>>>>> Hi Oreste,
>>>>>
>>>>> Good points and observations. My comments below.
>>>>>
>>>>> *1) Make sure all the SP and TP (basically the code with a lot of nested loops) is at least C++. I would also make sure I am not overusing the STL or Boost libraries (like for instance in the list of active columns, etc.) as this is usually tricky (or low performance) when ported to GPUs.*
>>>>>
>>>>> We are doing some of this in the current codebase. See Jira ticket https://issues.numenta.org/browse/NPC-286 - porting the entire CLA to C++ with language bindings for other languages. First steps are to migrate the spatial pooler to C++. To see progress on this check here: https://issues.numenta.org/browse/NPC-246 I am not sure of the approach they are taking regarding the STL and Boost libraries.
>>>>>
>>>>> *2) I would convert the above nested loops (for all columns, for all cells, etc.) to OpenCL. We have to be careful about data movement today and try to keep most of the data in GPU space (although this problem is going to disappear with the next or next-next generation of GPUs, which are going to have a unified address space).*
>>>>>
>>>>> Yes, you are right about data movement. I don't know much about how OpenCL abstracts storage and message passing to nodes.
>>>>> Data movement could kill any gains you get from parallel computing if not handled correctly - the biggest issue is how to move input data onto the GPU and extract predictions back out. I/O for each time frame is the problem. First steps, I think, would be to convert those areas to many-core parallel CPU code in C++ to sort out any issues, then OpenCL.
>>>>>
>>>>> Interesting about next-gen GPUs and unified address space - how close are we to getting a unified memory space GPU?
>>>>>
>>>>> *3) Support for multiple GPUs within a single node. I would map different regions onto different host-threads and GPUs, and for large regions I would try to partition them across multiple host-threads and GPUs. Considering that they are 2D regions and communication is mostly localized around the columns, it should be doable. However, the boundaries of the partitions are going to be tricky, as updates to cells or columns will depend on values that are in another GPU address space (again, the next-next generation of accelerators should solve this problem with a unified address space).*
>>>>>
>>>>> Perhaps we should be looking at cheap multi-core CPUs for now to address this, or would we need to port the entire CLA into OpenCL to run on shared memory in the GPU? I don't know enough about the architecture of the current code, OpenCL, and GPU memory space to understand the right approach, but we should take logical steps that we can build on to get there.
>>>>>
>>>>> *4) To parallelize across multiple nodes in a cluster I would definitely go for MPI and not map-reduce. The reason is that map-reduce is used mostly on embarrassingly parallel jobs with a final reduce phase to compute the final result.
>>>>> In our case, considering that the "computation" is based around the concept of a step (or clock), at every step there is going to be a significant amount of communication across regions and within the region. Thankfully, if we stick to the concept of a time step (not exactly brain-like) we can batch that communication and perform it at the end of each step. If using MPI, I would map different regions onto different MPI ranks, and for large regions I would partition them over multiple MPI ranks as explained for point 3 within the node (basically a hierarchy of parallelization).*
>>>>>
>>>>> Agreed - map-reduce is not the right paradigm, for the reasons you state. For moving lots of data across multiple compute nodes MPI is the standard and might be appropriate to adopt here, I think.
>>>>>
>>>>> When we start talking about hierarchy though, I think operation would be similar to map/reduce, with lower regions computing on nodes that are independent of each other and which feed higher regions at a slower rate. For example, audio (speech) prediction - break audio into a spectrum (frequency bands), feed each band in the spectrum to an individual region, and feed the predictions of each region into a higher region that aggregates all the predictions of the individual audio bands.
>>>>>
>>>>> If we want to move this forward we should come up with a roadmap and next steps. Input from others here would be appreciated.
>>>>>
>>>>> -Doug
>>>>>
>>>>> On Wed, Aug 21, 2013 at 8:26 AM, Oreste Villa <[email protected]> wrote:
>>>>>
>>>>>> So if I understand correctly, most of the code is Python but the "core" is mostly C++ or is going to be C++.
>>>>>>
>>>>>> The way I would approach this problem is the following (in temporal order):
>>>>>>
>>>>>> 1) Make sure all the SP and TP (basically the code with a lot of nested loops) is at least C++.
>>>>>> I would also make sure I am not overusing the STL or Boost libraries (like for instance in the list of active columns, etc.) as this is usually tricky (or low performance) when ported to GPUs.
>>>>>>
>>>>>> 2) I would convert the above nested loops (for all columns, for all cells, etc.) to OpenCL. We have to be careful about data movement today and try to keep most of the data in GPU space (although this problem is going to disappear with the next or next-next generation of GPUs, which are going to have a unified address space).
>>>>>>
>>>>>> 3) Support for multiple GPUs within a single node. I would map different regions onto different host-threads and GPUs, and for large regions I would try to partition them across multiple host-threads and GPUs. Considering that they are 2D regions and communication is mostly localized around the columns, it should be doable. However, the boundaries of the partitions are going to be tricky, as updates to cells or columns will depend on values that are in another GPU address space (again, the next-next generation of accelerators should solve this problem with a unified address space).
>>>>>>
>>>>>> 4) To parallelize across multiple nodes in a cluster I would definitely go for MPI and not map-reduce. The reason is that map-reduce is used mostly on embarrassingly parallel jobs with a final reduce phase to compute the final result. In our case, considering that the "computation" is based around the concept of a step (or clock), at every step there is going to be a significant amount of communication across regions and within the region. Thankfully, if we stick to the concept of a time step (not exactly brain-like) we can batch that communication and perform it at the end of each step.
>>>>>> If using MPI, I would map different regions onto different MPI ranks, and for large regions I would partition them over multiple MPI ranks as explained for point 3 within the node (basically a hierarchy of parallelization).
>>>>>>
>>>>>> The code would obviously work also with a single MPI process, and therefore on a normal workstation with or without GPUs. BTW, by GPU I also mean the Intel Phi. I estimate points 1 to 4 at least 1 or 2 years of work, depending on the number of people involved.
>>>>>>
>>>>>> Regarding a hardware implementation, I also feel it is the right way to go in the long term, but for now I would definitely go with the above solution (considering that the algorithm will most likely change in the next few years). If well implemented, the above approach could increase performance by at least 2 orders of magnitude within a node and most likely scale linearly across a moderate number of cluster nodes.
>>>>>>
>>>>>> I know some of the people on this mailing list have implemented their own C++ version of HTM in the past, so I am sure they would definitely be interested. Comments are welcome.
>>>>>>
>>>>>> Oreste
>>>>>>
>>>>>> On Tue, Aug 20, 2013 at 11:22 PM, Doug King <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Oreste,
>>>>>>>
>>>>>>> You are right, performance will be a central issue. There are a few bottlenecks in the algorithm that can be attacked with hardware acceleration. The best approach I can think of for now is to use parallelization (some form of map-reduce) to solve this. OpenCL would be a good choice to use in place of some of the C++ or Python code.
>>>>>>> The rest of the Python code could be kept as-is to allow for easy experimentation with optimization of parameters or changes to features that are not core CLA algorithms.
>>>>>>>
>>>>>>> There are many OpenCL drivers for GPUs, and there is even a platform for converting OpenCL code to FPGA hardware. Eventually the CLA will be ported to some sort of digital/analog hybrid device that simulates dendrite/synapse connections on neuromorphic silicon. This will not be far off - maybe 5 years or less for early experiments, 10 years for cheap commodity devices.
>>>>>>>
>>>>>>> For now, most of us are trying to get results that are proof of concept with the current code base; then we will figure out how to scale up and optimize.
>>>>>>>
>>>>>>> Another key to acceleration will be the sharing of trained networks that have encapsulated many CPU hours of training on fundamental streams of data, for example speech audio, and that once trained will be shared or sold. If this happens, the building blocks of lower HTM regions could be leveraged to get to the next level. We need to work towards some CLA network serialization standards for this to happen.
>>>>>>>
>>>>>>> I think you are correct in your assumptions, and if you want to contribute to the effort to move to a more performant version of the code I would love to see someone port some of the critical segments of the CLA code to OpenCL.
>>>>>>> For an analysis of where the bottlenecks are in the CLA, and hardware solutions, you can start by checking out this paper: http://www.pdx.edu/sites/www.pdx.edu.sysc/files/SySc.Seminar.Hammestrom.May.2011.pdf
>>>>>>>
>>>>>>> -Doug
>>>>>>>
>>>>>>> On Tue, Aug 20, 2013 at 9:59 PM, oreste villa <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello everybody, this is my first post on this list, so please forgive me if this has already been addressed before.
>>>>>>>>
>>>>>>>> I have seen that the current NuPIC source code is mostly Python and I am wondering....
>>>>>>>>
>>>>>>>> I don't know about the problems people are trying to solve today (maybe for demand and response of power in a building this is not true), but in the future I believe performance is going to be a central issue. Python seems to be a non-optimal choice in this respect (as is single threaded Java, single threaded C#, or single threaded C++, or anything not parallel).
>>>>>>>>
>>>>>>>> I keep thinking, for instance, that the Large Hadron Collider at CERN produces something like 3 GByte/s of raw data, and it would be really nice if we were able to feed a full year of experiments in real time to a system based on the CLA. Also in robotics, the performance and I/O bandwidth requirements for vision, sensing and motion control are impressive.
>>>>>>>>
>>>>>>>> The question/discussion point I wanted to make is: where does the project stand in terms of performance? More specifically, are there any plans to design high performance code inside NuPIC (OpenMP, CUDA, MPI)? Is this something much less emphasized because the focus of the project is more on learning the basic CLA principles?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Oreste
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> nupic mailing list
>>>>>>>> [email protected]
>>>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
