Hi Ian,

Subdividing a region is a bit tricky since the CLA models a continuous sheet of neurons. It's much easier when you have a topology, since connections are local. In that case you can make a cut anywhere; you just need to guarantee that cells on both sides of the cut get the correct input.
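To make "cells on both sides of the cut get the correct input" concrete, here is a minimal sketch (plain Python, a hypothetical helper, not NuPIC code): partition a 1-D input sheet and pad each partition with a halo equal to the receptive-field radius, so columns adjacent to a cut still see their full input.

```python
def split_with_halo(n_inputs, n_parts, radius):
    """Partition a 1-D input sheet of n_inputs bits into n_parts slices,
    each padded with a 'halo' of `radius` extra bits on each side, so
    that columns next to a cut still see their full receptive field."""
    bounds = [i * n_inputs // n_parts for i in range(n_parts + 1)]
    parts = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        parts.append((max(0, lo - radius), min(n_inputs, hi + radius)))
    return parts

# 100-bit input, 4 partitions, receptive-field radius 3:
# interior partitions overlap their neighbours by 2*radius bits.
print(split_with_halo(100, 4, 3))
# -> [(0, 28), (22, 53), (47, 78), (72, 100)]
```

Each worker then computes only its own columns, but reads the overlapping input slice; the overlap is what guarantees correctness at the cut.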
However, you don't have to subdivide a region. As you point out, when you have a hierarchy you will have multiple regions that are logically separate. In this case one or more regions feed into a second, higher level region, and so on. It can make sense to parallelize each region separately. The trick here is to keep each region busy. You can do that by pipelining: cells at level 1 can process input from time T while cells at level 2 are processing inputs from time T-1.

--Subutai

On Thu, Aug 22, 2013 at 9:47 AM, Ian Danforth <[email protected]> wrote:

> Oreste,
>
> For the novices on the list can you give us some reference numbers for modern GPUs, their memory spaces and transfer speeds? For example, a single, trained CLA region varies greatly in size, but for single step prediction 100MB resident in memory is not unusual. Under the current implementation, if you increase the number of prediction steps you double the memory requirements. (Scott, correct me if this has changed recently.)
>
> I've always thought that a parallel CLA implementation would start at the region level, and what would be passed between regions would be the state of the columns at each time step (a 2048-bit sparse vector), which could be quite small. But that is thinking about a hierarchy of regions where data moves up or down, not necessarily within a level.
>
> Perhaps Subutai could also comment on thoughts for within-region parallel implementations. Can a region be logically subdivided? If so, where would the cross-sub-region TP connections live?
>
> Thanks!
>
> Ian
>
> On Thu, Aug 22, 2013 at 8:36 AM, Oreste Villa <[email protected]> wrote:
>
>> Thanks for the info, I am starting to have a better picture now.
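Subutai's pipelining suggestion above (level 1 works on time T while level 2 works on T-1) can be sketched with two worker threads; `level1` and `level2` here are toy stand-ins, not the NuPIC region API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(inputs, level1, level2):
    """Two-level pipeline: level 1 processes the input for time T while
    level 2 is still digesting level 1's output for time T-1.
    `level1` and `level2` are toy stand-ins for region compute calls."""
    results = []
    prev_l1_out = None
    with ThreadPoolExecutor(max_workers=2) as pool:
        for x in inputs:
            f1 = pool.submit(level1, x)                # time T
            if prev_l1_out is not None:
                f2 = pool.submit(level2, prev_l1_out)  # time T-1, concurrent
                results.append(f2.result())
            prev_l1_out = f1.result()
        if prev_l1_out is not None:
            results.append(level2(prev_l1_out))        # drain the pipeline
    return results

# Toy regions: level 1 doubles its input, level 2 adds one.
print(run_pipelined([1, 2, 3], lambda x: 2 * x, lambda y: y + 1))
# -> [3, 5, 7]
```

The same shape generalizes to one worker (thread, process, or MPI rank) per region, with only the sparse column-state vector crossing the boundary each step.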
>> A quick question: it is my understanding from the pseudo-code in the white paper that the most computationally intensive part of the algorithm is actually the TP and not the SP (simply because one iterates over the columns and the other iterates over all the cells). Is there a reason why you have started the C++ porting from the SP instead of the TP?
>>
>> Also, "porting" to GPUs is an altogether different story than porting from Python to C++. MPI is even more different.
>>
>> For instance, moving the sparse matrix and vector to and from CPU-GPU memory on every iteration, starting from the existing code, is most likely not going to make it (especially if the matrices are relatively small and computation on the GPU is of order sub-millisecond). Kernel launch and copy latencies are going to kill performance. The data needs to be resident in GPU memory, and only the input/output data to regions (if needed) needs to be moved at each time step; the larger the regions the better.
>>
>> More specific questions for the current C++ implementation of the SP:
>>
>> 1) What is the size of the sparse matrices and vectors for the example with 2048 columns and 65536 cells?
>> 2) What is the exact operation on the sparse matrices and vectors? (Is it a simple multiplication between them? Only one per time step?)
>>
>> Also, usually "porting" to high performance AFTER having implemented a lot of functionality in single threaded code is much more difficult than designing high performance code first and then slowly adding new functionality.
>>
>> Having said that, one idea I am thinking about (on which I would like to get your feedback) is to actually make a "simple" MPI + GPU (or just GPU to start) prototype. I am thinking of starting from scratch from the white paper (or the latest incarnation of it) to evaluate trade-offs and benefits of the approach.
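A quick back-of-envelope on Oreste's latency point (the ~10 µs launch/copy latency and ~6 GB/s PCIe bandwidth below are assumed round numbers, not measurements): shipping a whole 100 MB region to and from the GPU every step costs tens of milliseconds in transfers alone, dwarfing a sub-millisecond kernel, while a 2048-bit column-state vector is dominated by the fixed latency.

```python
def step_transfer_ms(payload_bytes, latency_us=10.0, bw_gb_s=6.0):
    """Rough cost (ms) of one host<->device copy per time step: a fixed
    launch/copy latency plus the payload over PCIe bandwidth.
    The latency and bandwidth figures are assumptions, not measurements."""
    return latency_us / 1000.0 + payload_bytes / (bw_gb_s * 1e9) * 1e3

full_region = step_transfer_ms(100 * 1024**2)  # ship a whole 100 MB region
column_sdr = step_transfer_ms(2048 // 8)       # 2048-bit column state vector
print(round(full_region, 1), round(column_sdr, 3))
# -> 17.5 0.01
```

This is exactly why the data needs to stay resident in GPU memory, with only region inputs/outputs crossing the bus each step.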
>> It is my intuition that significantly speeding up a single region with 2048 columns and 65536 cells is not going to be easy with GPUs unless the full code and data is running inside the GPU and only the minimal necessary input/output is done. However, when you have 10s of regions with 1,000,000s of cells each, things are going to be much different. It would be nice to know where the trade-off point is, hence a simple prototype.
>>
>> Please let me know if the above makes some sense to you.
>>
>> Oreste
>>
>> On Wed, Aug 21, 2013 at 9:21 PM, Subutai Ahmad <[email protected]> wrote:
>>
>>> Oreste and Doug - thank you for the comments! It is great to see these ideas. I agree that performance optimization is extremely important. I want to encourage you guys to continue the discussion and hopefully implement something too. A speed improvement will help the community and accelerate CLA based experimentation. (One of the main reasons we stopped focusing on vision was the inability to run large experiments in reasonable time.)
>>>
>>> I don't know if this is useful, but here's a quick guide to the current optimized code:
>>>
>>> Currently within NuPIC the CLA is implemented as a combination of Python and C++. Over time we moved the slower portions of the CLA to C++. Today, the code is reasonably fast. A CLA region with 2048 columns and 65536 cells takes about 20 msecs per iteration on my laptop. This includes the SP, TP, the classifier, and full online learning turned on.
>>>
>>> There are two main building blocks in C++ from a performance standpoint.
>>>
>>> 1) There is a set of SparseMatrix classes that implement fast data structures for sparse vectors and matrices. They include a large number of utility routines that are optimized for CLA functions. Today these are primarily used by the Spatial Pooler. There are Python bindings for these classes.
>>> You can go to nupic/examples/bindings/sparse_matrix_how_to.py <https://github.com/numenta/nupic/blob/master/examples/bindings/sparse_matrix_how_to.py> for a tutorial on this.
>>>
>>> 2) There is a set of classes that implement fast data structures for the temporal pooler. These are very specific to temporal pooling; sparse matrices were not enough for the TP. The input is very sparse, and we implemented some strategies for evaluating only cells that connect to those ON input bits. The main starting point for this code is nupic/nta/algorithms/Cells4.hpp <https://github.com/numenta/nupic/blob/master/nta/algorithms/Cells4.hpp>. There are Python bindings for this as well, and they are called by the Python temporal pooler class TP10X2.py <https://github.com/numenta/nupic/blob/master/py/nupic/research/TP10X2.py>.
>>>
>>> Our current plan is to create a pure C++ spatial pooler implementation (see Gil's email). This will also be much cleaner than the current implementation. Perhaps it can serve as a base for some of the ideas that Oreste and Doug have mentioned.
>>>
>>> We have not extensively explored multi-threaded options. We have mainly focused on serial optimizations so far.
>>>
>>> Hope this helps!
>>>
>>> --Subutai
>>>
>>> On Wed, Aug 21, 2013 at 2:02 PM, Oreste Villa <[email protected]> wrote:
>>>
>>>> Thanks for the pointers, I can start looking into the code and the tickets. Unfortunately I have limited time and I will be looking into this as a "hobby" project for the moment.
>>>>
>>>> I think there is still some time before moving to OpenCL (or other). As you said, the first thing on the list is to have a very good, parallelizable C++ CPU implementation. I would say the best choice is most likely OpenMP; also, gcc 4.9 should in the very near future support OpenACC, which is basically the equivalent of OpenMP but for GPUs. So maybe we can skip OpenCL completely.
>>>> BTW I am not a big fan of OpenCL (because I think it is very verbose and tedious to use) and I have worked a lot with CUDA and MPI. I agree OpenCL is more portable than CUDA, but OpenACC targets both and should solve this issue.
>>>>
>>>> Regarding unified address space: from the programming model point of view it is going to come very soon, but hardware support with indistinguishable performance should come in 2 generations (crossing fingers).
>>>>
>>>> Oreste
>>>>
>>>> On Aug 21, 2013 10:32 AM, "Doug King" <[email protected]> wrote:
>>>>
>>>>> Hi Oreste,
>>>>>
>>>>> Good points and observations. My comments below.
>>>>>
>>>>> *1) Make sure all the SP and TP (basically the code with a lot of nested loops) is at least C++. I would also make sure I am not overusing the STL or Boost libraries (like for instance in the list of active columns, etc.) as this is usually tricky (or low performance) when ported to GPUs.*
>>>>>
>>>>> We are doing some of this in the current codebase. See Jira ticket https://issues.numenta.org/browse/NPC-286 - porting the entire CLA to C++ with language bindings for other languages. First steps are to migrate the spatial pooler to C++. To see progress on this check here: https://issues.numenta.org/browse/NPC-246 I am not sure of the approach they are taking regarding the STL and Boost libraries.
>>>>>
>>>>> *2) I would convert the above nested loops (for all columns, for all cells, etc.) to OpenCL. We have to be careful about data movement today and try to keep most of the data in GPU space (although this problem is going to disappear with the next or next-next generation of GPUs, which are going to have a unified address space).*
>>>>>
>>>>> Yes, you are right about data movement. I don't know much about how OpenCL abstracts storage and message passing to nodes.
>>>>> Data movement could kill any gains you get from parallel computing if not handled correctly - the biggest issue is how to move input data onto the GPU and extract predictions back out. I/O for each time frame is the problem. First steps, I think, would be to convert those areas to many-core parallel CPU code in C++ to sort out any issues, then OpenCL.
>>>>>
>>>>> Interesting about next-gen GPUs and unified address space - how close are we to getting a unified memory space GPU?
>>>>>
>>>>> *3) Support for multiple GPUs within a single node. I would map different regions onto different host-threads and GPUs, and for large regions I would try to partition them across multiple host-threads and GPUs. Considering that they are 2D regions and communication is mostly localized around the columns, it should be doable. However, the boundaries of the partitions are going to be tricky, as updates to cells or columns will depend on values that are in another GPU address space (again, the next-next generation of accelerators should solve this problem with a unified address space).*
>>>>>
>>>>> Perhaps we should be looking at cheap multi-core CPUs for now to address this, or would we need to port the entire CLA into OpenCL to run on shared memory in the GPU? I don't know enough about the architecture of the current code, OpenCL, and GPU memory space to understand the right approach, but we should take logical steps that we can build on to get there.
>>>>>
>>>>> *4) To parallelize across multiple nodes in a cluster I would definitely go for MPI and not map-reduce. The reason is that map-reduce is used mostly on embarrassingly parallel jobs with a final reduce phase to compute the final result.
>>>>> In our case, considering that the "computation" is based around the concept of a step (or clock), at every step there is going to be a significant amount of communication across regions and within the region. Thankfully, if we stick to the concept of a time step (not exactly brain-like) we can batch that communication and perform it at the end of each step. If using MPI, I would map different regions onto different MPI ranks, and for large regions I would partition them over multiple MPI ranks as explained for point 3 within the node (basically a hierarchy of parallelization).*
>>>>>
>>>>> Agreed - map-reduce is not the right paradigm, for the reasons you state. For moving lots of data across multiple compute nodes MPI is the standard and might be appropriate to adopt here, I think.
>>>>>
>>>>> When we start talking about hierarchy though, I think operation would be similar to map/reduce, with lower regions computing on nodes that are independent of each other and which feed higher regions at a slower rate. For example, audio (speech) prediction - break audio into a spectrum (frequency bands), feed each band in the spectrum to an individual region, and feed the predictions of each region into a higher region that aggregates all the predictions of the individual audio bands.
>>>>>
>>>>> If we want to move this forward we should come up with a roadmap and next steps. Input from others here would be appreciated.
>>>>>
>>>>> -Doug
>>>>>
>>>>> On Wed, Aug 21, 2013 at 8:26 AM, Oreste Villa <[email protected]> wrote:
>>>>>
>>>>>> So if I understand correctly, most of the code is Python but the "core" is mostly C++ or is going to be C++.
>>>>>>
>>>>>> The way I would approach this problem is the following (in temporal order):
>>>>>>
>>>>>> 1) Make sure all the SP and TP (basically the code with a lot of nested loops) is at least C++.
>>>>>> I would also make sure I am not overusing the STL or Boost libraries (like for instance in the list of active columns, etc.) as this is usually tricky (or low performance) when ported to GPUs.
>>>>>>
>>>>>> 2) I would convert the above nested loops (for all columns, for all cells, etc.) to OpenCL. We have to be careful about data movement today and try to keep most of the data in GPU space (although this problem is going to disappear with the next or next-next generation of GPUs, which are going to have a unified address space).
>>>>>>
>>>>>> 3) Support for multiple GPUs within a single node. I would map different regions onto different host-threads and GPUs, and for large regions I would try to partition them across multiple host-threads and GPUs. Considering that they are 2D regions and communication is mostly localized around the columns, it should be doable. However, the boundaries of the partitions are going to be tricky, as updates to cells or columns will depend on values that are in another GPU address space (again, the next-next generation of accelerators should solve this problem with a unified address space).
>>>>>>
>>>>>> 4) To parallelize across multiple nodes in a cluster I would definitely go for MPI and not map-reduce. The reason is that map-reduce is used mostly on embarrassingly parallel jobs with a final reduce phase to compute the final result. In our case, considering that the "computation" is based around the concept of a step (or clock), at every step there is going to be a significant amount of communication across regions and within the region. Thankfully, if we stick to the concept of a time step (not exactly brain-like) we can batch that communication and perform it at the end of each step.
>>>>>> If using MPI, I would map different regions onto different MPI ranks, and for large regions I would partition them over multiple MPI ranks as explained for point 3 within the node (basically a hierarchy of parallelization).
>>>>>>
>>>>>> The code would obviously work also with a single MPI process, and therefore on a normal workstation with or without GPUs. BTW, by GPU I also mean the Intel Phi. I estimate points 1 to 4 at least 1 or 2 years of work, depending on the number of people involved.
>>>>>>
>>>>>> Regarding a hardware implementation, I also feel it is the right way to go in the long term, but for now I would definitely go with the above solution (considering that the algorithm will most likely change in the next few years). If well implemented, the above approach could increase performance by at least 2 orders of magnitude within a node and most likely scale linearly across a moderate number of cluster nodes.
>>>>>>
>>>>>> I know some of the people on this mailing list have implemented their own C++ version of HTM in the past, so I am sure they would definitely be interested. Comments are welcome.
>>>>>>
>>>>>> Oreste
>>>>>>
>>>>>> On Tue, Aug 20, 2013 at 11:22 PM, Doug King <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Oreste,
>>>>>>>
>>>>>>> You are right, performance will be a central issue. There are a few bottlenecks in the algorithm that can be attacked with hardware acceleration. The best approach I can think of for now is to use parallelization (some form of map-reduce) to solve this. OpenCL would be a good choice to use in place of some of the C++ or Python code.
>>>>>>> The rest of the Python code could be kept as-is to allow for easy experimentation with optimization of parameters or changes to features that are not core CLA algorithms.
>>>>>>>
>>>>>>> There are many OpenCL drivers for GPUs, and there is even a platform for converting OpenCL code to FPGA hardware. Eventually the CLA will be ported to some sort of digital/analog hybrid device that simulates dendrite/synapse connections on neuromorphic silicon. This will not be far off - maybe 5 years or less for early experiments, 10 years for cheap commodity devices.
>>>>>>>
>>>>>>> For now, most of us are trying to get results that are proof of concept with the current code base; then we will figure out how to scale up and optimize.
>>>>>>>
>>>>>>> Another key to acceleration will be the sharing of trained networks that have encapsulated many CPU hours of training on fundamental streams of data, for example speech audio, and that once trained will be shared or sold. If this happens, the building blocks of lower HTM regions could be leveraged to get to the next level. We need to work towards some CLA network serialization standards for this to happen.
>>>>>>>
>>>>>>> I think you are correct in your assumptions, and if you want to contribute to the effort to move to a more performant version of the code I would love to see someone port some of the critical segments of the CLA code to OpenCL.
>>>>>>> For an analysis of where the bottlenecks are in the CLA, and hardware solutions, you can start by checking out this paper: http://www.pdx.edu/sites/www.pdx.edu.sysc/files/SySc.Seminar.Hammestrom.May.2011.pdf
>>>>>>>
>>>>>>> -Doug
>>>>>>>
>>>>>>> On Tue, Aug 20, 2013 at 9:59 PM, oreste villa <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello everybody, this is my first post on this list, so please forgive me if this has already been addressed before.
>>>>>>>>
>>>>>>>> I have seen that the current NuPIC source code is mostly Python and I am wondering....
>>>>>>>>
>>>>>>>> I don't know about the problems people are trying to solve today (maybe for demand and response of power in a building this is not true), but in the future I believe performance is going to be a central issue. Python seems to be a non-optimal choice in this respect (as is single threaded Java, single threaded C#, or single threaded C++, or anything not parallel).
>>>>>>>>
>>>>>>>> I keep thinking, for instance, that the Large Hadron Collider at CERN produces something like 3 GByte/s of raw data, and it would be really nice if we were able to feed a full year of experiments in real time to a system based on the CLA. Also in robotics, the performance and I/O bandwidth requirements for vision, sensing and motion control are impressive.
>>>>>>>>
>>>>>>>> The question/discussion point I wanted to make is: where does the project stand in terms of performance? More specifically, are there any plans to design high performance code inside NuPIC (OpenMP, CUDA, MPI)? Is this something much less emphasized because the focus of the project is more on learning the basic CLA principles?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Oreste
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> nupic mailing list
>>>>>>>> [email protected]
>>>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
