Subutai, thanks for the valuable info! To all, thanks for all the ideas submitted. I'll add a JIRA issue to the discussion, https://issues.numenta.org/browse/NPC-320 , about taking advantage of specialized math-optimization libraries.
I definitely support the multi-core/parallel architecture for nupic, https://issues.numenta.org/browse/NPC-259 , and I think this could be achieved quite easily, at least to some extent. I would be interested in the GPGPU support, https://issues.numenta.org/browse/NPC-258 , but there was a discussion about whether it is reasonable for this kind of task, given the time needed to transfer data between GPU and CPU. I hope to put some time, or money, into the Eigen optimizations, so I would like to know whether that work is useful for, or orthogonal to, the GPGPU task.

Best regards,
Mark

On Thu, Aug 22, 2013 at 6:21 AM, Subutai Ahmad <[email protected]> wrote:

> Oreste and Doug - thank you for the comments! It is great to see these ideas. I agree that performance optimization is extremely important. I want to encourage you guys to continue the discussion and hopefully implement something too. A speed improvement will help the community and accelerate CLA-based experimentation. (One of the main reasons we stopped focusing on vision was the inability to run large experiments in reasonable time.)
>
> I don't know if this is useful, but here's a quick guide to the current optimized code:
>
> Currently within NuPIC the CLA is implemented as a combination of Python and C++. Over time we moved the slower portions of the CLA to C++. Today, the code is reasonably fast. A CLA region with 2048 columns and 65536 cells takes about 20 msecs per iteration on my laptop. This includes the SP, the TP, the classifier, and full online learning turned on.
>
> There are two main building blocks in C++ from a performance standpoint.
>
> 1) There is a set of SparseMatrix classes that implement fast data structures for sparse vectors and matrices. They include a large number of utility routines that are optimized for CLA functions. Today these are primarily used by the Spatial Pooler. There are Python bindings for these classes.
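[Editor's note: the overlap computation is a good illustration of why these sparse structures pay off. The sketch below is not the actual SparseMatrix API; the names and the dense-flag trick are illustrative only. The point is that each column touches only its own connected synapses rather than the whole input.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Overlap of one column: how many of its connected input bits are ON.
// Instead of a dense dot product over every input bit, we touch only the
// column's connected synapses -- the idea the sparse structures exploit
// when both the input and the permanence matrix are sparse.
int columnOverlap(const std::vector<int>& connectedInputs,
                  const std::vector<char>& inputIsOn) {
    int overlap = 0;
    for (int idx : connectedInputs)
        overlap += inputIsOn[idx];
    return overlap;
}

// Overlaps for all columns, given only the ON bit indices of the input.
std::vector<int> computeOverlaps(
        const std::vector<std::vector<int>>& columns,
        const std::vector<int>& onBits, std::size_t inputSize) {
    std::vector<char> inputIsOn(inputSize, 0); // dense flag array, built once
    for (int b : onBits) inputIsOn[b] = 1;
    std::vector<int> overlaps;
    overlaps.reserve(columns.size());
    for (const auto& col : columns)
        overlaps.push_back(columnOverlap(col, inputIsOn));
    return overlaps;
}
```

With a 2% sparse input, this does roughly 50x less work per column than a dense inner product.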
> You can go to nupic/examples/bindings/sparse_matrix_how_to.py (https://github.com/numenta/nupic/blob/master/examples/bindings/sparse_matrix_how_to.py) for a tutorial on this.
>
> 2) There is a set of classes that implement fast data structures for the temporal pooler. These are very specific to temporal pooling; sparse matrices were not enough for the TP. The input is very sparse, and we implemented some strategies for evaluating only the cells connected to those ON input bits. The main starting point for this code is nupic/nta/algorithms/Cells4.hpp (https://github.com/numenta/nupic/blob/master/nta/algorithms/Cells4.hpp). There are Python bindings for this as well, and they are called by the Python temporal pooler class TP10X2.py (https://github.com/numenta/nupic/blob/master/py/nupic/research/TP10X2.py).
>
> Our current plan is to create a pure C++ spatial pooler implementation (see Gil's email). This will also be much cleaner than the current implementation. Perhaps it can serve as a base for some of the ideas that Oreste and Doug have mentioned.
>
> We have not extensively explored multi-threaded options. We have mainly focused on serial optimizations so far.
>
> Hope this helps!
>
> --Subutai
>
> On Wed, Aug 21, 2013 at 2:02 PM, Oreste Villa <[email protected]> wrote:
>
>> Thanks for the pointers, I can start looking into the code and the tickets. Unfortunately I have limited time, and I will be looking into it as a "hobby" project for the moment.
>>
>> I think there is still some time before moving to OpenCL (or something else). As you said, the first thing on the list is to have a very good, parallelizable C++ CPU implementation. I would say the best choice is most likely OpenMP; also, gcc 4.9 should in the very near future support OpenACC, which is basically the equivalent of OpenMP but for GPUs. So maybe we can skip OpenCL completely.
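[Editor's note: Oreste's OpenMP suggestion maps directly onto the per-column loops, since each column's overlap is independent of the others. A minimal sketch with toy data structures (not NuPIC's); without -fopenmp the pragma is simply ignored and the loop runs serially, so the code is correct either way.]

```cpp
#include <cassert>
#include <vector>

// Each iteration of the outer loop writes only overlaps[c] and reads
// shared data that no iteration modifies, so the loop parallelizes
// trivially with an OpenMP parallel-for -- no locks needed.
std::vector<int> overlapsParallel(
        const std::vector<std::vector<int>>& columns,
        const std::vector<char>& inputIsOn) {
    std::vector<int> overlaps(columns.size(), 0);
    #pragma omp parallel for
    for (long c = 0; c < (long)columns.size(); ++c) {
        int sum = 0;
        for (int idx : columns[c]) sum += inputIsOn[idx];
        overlaps[c] = sum;
    }
    return overlaps;
}
```

The same shape applies to the TP's per-cell loops, as long as each iteration's writes stay disjoint.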
>> BTW, I am not a big fan of OpenCL (because I think it is very verbose and tedious to use), and I have worked a lot with CUDA and MPI. I agree OpenCL is more portable than CUDA, but OpenACC targets both and should solve this issue.
>>
>> Regarding a unified address space: from the programming-model point of view it is going to come very soon, but hardware support with indistinguishable performance should come in two generations (crossing fingers).
>>
>> Oreste
>>
>> On Aug 21, 2013 10:32 AM, "Doug King" <[email protected]> wrote:
>>
>>> Hi Oreste,
>>>
>>> Good points and observations. My comments below.
>>>
>>> *1) Make sure all the SP and TP code (basically the code with a lot of nested loops) is at least C++. I would also make sure I am not abusing the STL or Boost libraries too much (for instance in the list of active columns, etc.), as this is usually tricky (or low performance) when ported to GPUs.*
>>>
>>> We are doing some of this in the current codebase. See Jira ticket https://issues.numenta.org/browse/NPC-286 - porting the entire CLA to C++ with language bindings for other languages. The first step is to migrate the spatial pooler to C++. To see progress on this, check here: https://issues.numenta.org/browse/NPC-246 I am not sure of the approach they are taking regarding the STL and Boost libraries.
>>>
>>> *2) I would convert the above nested loops (for all columns, for all cells, etc.) to OpenCL. We have to be careful about data movement today and try to keep most of the data in GPU space (although this problem is going to disappear with the next or next-next generation of GPUs, which are going to have a unified address space).*
>>>
>>> Yes, you are right about data movement. I don't know much about how OpenCL abstracts storage and message passing to nodes.
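[Editor's note: the per-step transfer cost can be estimated with a back-of-envelope model. The bandwidth (~12 GB/s effective PCIe 3.0 x16) and per-call launch latency (~10 microseconds) used below are assumptions, not measurements; the takeaway is that for kilobyte-sized SDRs the latency term dominates, and both are small next to the ~20 ms/iteration figure quoted earlier in the thread.]

```cpp
#include <cassert>

// Rough time to ship `bytes` across the host<->GPU link:
// pure bandwidth cost plus a fixed per-transfer launch latency.
double transferMicroseconds(double bytes, double bandwidthBytesPerSec,
                            double perCallLatencyUs) {
    return bytes / bandwidthBytesPerSec * 1e6 + perCallLatencyUs;
}
```

A 4 KB input SDR at the assumed figures costs about 10.3 us per direction, so a single transfer in and out per time step would not be the bottleneck; the danger is many small transfers per step, each paying the fixed latency.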
>>> Data movement could kill any gains you get from parallel computing if not handled correctly - the biggest issue is how to move input data onto the GPU and extract the prediction back out. I/O for each time frame is the problem. The first steps, I think, would be to convert those areas to many-core parallel CPU code in C++ to sort out any issues, then OpenCL.
>>>
>>> Interesting about next-gen GPUs and unified address space - how close are we to getting a unified-memory-space GPU?
>>>
>>> *3) Support for multiple GPUs within a single node. I would map different regions onto different host threads and GPUs, and for large regions I would try to partition them across multiple host threads and GPUs. Considering that they are 2D regions and communication is mostly localized around the columns, it should be doable. However, the boundaries of the partitions are going to be tricky, as updates to cells or columns will depend on values that are in another GPU's address space (again, the next-next generation of accelerators should solve this problem with a unified address space).*
>>>
>>> Perhaps we should be looking at cheap multi-core CPUs for now to address this, or would we need to port the entire CLA to OpenCL to run on shared memory in the GPU? I don't know enough about the architecture of the current code, OpenCL, and GPU memory space to understand the right approach, but we should take logical steps that we can build on to get there.
>>>
>>> *4) To parallelize across multiple nodes in a cluster I would definitely go for MPI and not map-reduce. The reason is that map-reduce is used mostly on embarrassingly parallel jobs with a final reduce phase to compute the final result. In our case, considering that the "computation" is based around the concept of a step (or clock), at every step there is going to be a significant amount of communication across regions and within the region.
>>> Thankfully, if we stick to the concept of a time step (not exactly brain-like) we can batch that communication and perform it at the end of each step. If using MPI, I would map different regions onto different MPI ranks, and for large regions I would partition them across multiple MPI ranks as explained for point 3 within the node (basically a hierarchy of parallelization).*
>>>
>>> Agreed - map-reduce is not the right paradigm, for the reasons you state. For moving lots of data across multiple compute nodes MPI is the standard, and I think it might be appropriate to adopt here.
>>>
>>> When we start talking about hierarchy, though, I think operation would be similar to map-reduce, with lower regions computing on nodes that are independent of each other and which feed higher regions at a slower rate. For example, audio (speech) prediction: break audio into a spectrum (frequency bands), feed each band in the spectrum to an individual region, and feed the predictions of each region into a higher region that aggregates all the predictions of the individual audio bands.
>>>
>>> If we want to move this forward we should come up with a roadmap and next steps. Input from others here would be appreciated.
>>>
>>> -Doug
>>>
>>> On Wed, Aug 21, 2013 at 8:26 AM, Oreste Villa <[email protected]> wrote:
>>>
>>>> So if I understand correctly, most of the code is Python but the "core" is mostly C++ or is going to be C++.
>>>>
>>>> The way I would approach this problem is the following (in temporal order):
>>>>
>>>> 1) Make sure all the SP and TP code (basically the code with a lot of nested loops) is at least C++. I would also make sure I am not abusing the STL or Boost libraries too much (for instance in the list of active columns, etc.), as this is usually tricky (or low performance) when ported to GPUs.
>>>>
>>>> 2) I would convert the above nested loops (for all columns, for all cells, etc.) to OpenCL.
>>>> We have to be careful about data movement today and try to keep most of the data in GPU space (although this problem is going to disappear with the next or next-next generation of GPUs, which are going to have a unified address space).
>>>>
>>>> 3) Support for multiple GPUs within a single node. I would map different regions onto different host threads and GPUs, and for large regions I would try to partition them across multiple host threads and GPUs. Considering that they are 2D regions and communication is mostly localized around the columns, it should be doable. However, the boundaries of the partitions are going to be tricky, as updates to cells or columns will depend on values that are in another GPU's address space (again, the next-next generation of accelerators should solve this problem with a unified address space).
>>>>
>>>> 4) To parallelize across multiple nodes in a cluster I would definitely go for MPI and not map-reduce. The reason is that map-reduce is used mostly on embarrassingly parallel jobs with a final reduce phase to compute the final result. In our case, considering that the "computation" is based around the concept of a step (or clock), at every step there is going to be a significant amount of communication across regions and within the region. Thankfully, if we stick to the concept of a time step (not exactly brain-like) we can batch that communication and perform it at the end of each step. If using MPI, I would map different regions onto different MPI ranks, and for large regions I would partition them across multiple MPI ranks as explained for point 3 within the node (basically a hierarchy of parallelization).
>>>>
>>>> The code would obviously also work with a single MPI process, and therefore on a normal workstation with or without GPUs. BTW, by GPU I also mean the Intel Phi.
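[Editor's note: the end-of-step batching in point 4 can be sketched with plain threads before any MPI is involved. Workers update disjoint slices of the next state while reading only the previous step's state, so no locking is needed during the step; joining the threads is the single, batched synchronization point (an MPI version would exchange slice boundaries between ranks there instead). The neighbour-sum update rule is a toy stand-in, not the CLA.]

```cpp
#include <cassert>
#include <thread>
#include <vector>

// One "time step": each worker owns the slice [lo, hi) of `next` and may
// freely read any part of `prev`, including cells owned by other workers
// (the partition-boundary reads from point 3). `prev` is never written
// during the step, so the reads race with nothing.
std::vector<int> stepParallel(const std::vector<int>& prev, int nWorkers) {
    const long n = (long)prev.size();
    std::vector<int> next(n, 0);
    std::vector<std::thread> workers;
    for (int w = 0; w < nWorkers; ++w) {
        long lo = n * w / nWorkers, hi = n * (w + 1) / nWorkers;
        workers.emplace_back([&, lo, hi] {
            for (long i = lo; i < hi; ++i) {
                int sum = prev[i];
                if (i > 0)     sum += prev[i - 1]; // neighbour may belong to
                if (i + 1 < n) sum += prev[i + 1]; // another worker's slice
                next[i] = sum;
            }
        });
    }
    for (auto& t : workers) t.join(); // end-of-step barrier: `next` published
    return next;
}
```

Because all cross-slice communication happens through the read-only previous state, the result is identical for any worker count, which is exactly the property that makes the time-step model cluster-friendly.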
>>>> Also, I estimate points 1 to 4 to be at least 1 or 2 years of work, depending on the number of people involved.
>>>>
>>>> Regarding a hardware implementation, I also feel it is the right way to go in the long term, but for now I would definitely go with the above solution (considering that the algorithm will most likely change in the next years). If well implemented, the above approach could increase performance by at least 2 orders of magnitude within a node, and most likely scale linearly across a moderate number of cluster nodes.
>>>>
>>>> I know some of the people on this mailing list have implemented their own C++ version of HTM in the past, so I am sure they would definitely be interested. Comments are welcome.
>>>>
>>>> Oreste
>>>>
>>>> On Tue, Aug 20, 2013 at 11:22 PM, Doug King <[email protected]> wrote:
>>>>
>>>>> Hi Oreste,
>>>>>
>>>>> You are right, performance will be a central issue. There are a few bottlenecks in the algorithm that can be attacked with hardware acceleration. The best approach I can think of for now is to use parallelization (some form of map-reduce) to solve this. OpenCL would be a good choice to use in place of some of the C++ or Python code. The rest of the Python code could be kept as-is to allow for easy experimentation with parameter optimization, or with changes to features that are not core CLA algorithms.
>>>>>
>>>>> There are many OpenCL drivers for GPUs, and there is even a platform for converting OpenCL code to FPGA hardware. Eventually the CLA will be ported to some sort of digital/analog hybrid device that simulates dendrite/synapse connections on neuromorphic silicon. This will not be far off - maybe 5 years or less for early experiments, 10 years for cheap commodity devices.
>>>>> For now, most of us are trying to get proof-of-concept results with the current code base; then we will figure out how to scale up and optimize.
>>>>>
>>>>> Another key to acceleration will be the sharing of trained networks that have encapsulated many CPU hours of training on fundamental streams of data, for example speech audio, and that, once trained, will be shared or sold. If this happens, the building blocks of lower HTM regions could be leveraged to get to the next level. We need to work towards some CLA network serialization standards for this to happen.
>>>>>
>>>>> I think you are correct in your assumptions, and if you want to contribute to the effort to move to a more performant version of the code, I would love to see someone port some of the critical segments of the CLA code to OpenCL. For an analysis of where the bottlenecks are in the CLA, and of hardware solutions, you can start by checking out this paper: http://www.pdx.edu/sites/www.pdx.edu.sysc/files/SySc.Seminar.Hammestrom.May.2011.pdf
>>>>>
>>>>> -Doug
>>>>>
>>>>> On Tue, Aug 20, 2013 at 9:59 PM, oreste villa <[email protected]> wrote:
>>>>>
>>>>>> Hello everybody, this is my first post on this list, so please forgive me if this has already been addressed before.
>>>>>>
>>>>>> I have seen that the current NuPIC source code is mostly Python, and I am wondering....
>>>>>>
>>>>>> I don't know about the problems people are trying to solve today (maybe for demand and response of power in a building this is not true), but in the future I believe performance is going to be a central issue. Python seems to be a non-optimal choice in this respect (as are single-threaded Java, single-threaded C#, single-threaded C++, or anything else that is not parallel).
>>>>>> I keep thinking, for instance, that the Large Hadron Collider at CERN produces something like 3 GByte/s of raw data, and it would be really nice if we were able to feed a full year of experiments in real time to a system based on the CLA. Also in robotics, the performance and I/O bandwidth requirements for vision, sensing, and motion control are impressive.
>>>>>>
>>>>>> The question/discussion point I wanted to make is: where does the project stand in terms of performance? More specifically, are there any plans to design high-performance code inside NuPIC (OpenMP, CUDA, MPI)? Is this something much less emphasized because the focus of the project is more on learning the basic CLA principles?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Oreste
>>>>>>
>>>>>> _______________________________________________
>>>>>> nupic mailing list
>>>>>> [email protected]
>>>>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org

--
Marek Otahal :o)
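[Editor's note: Oreste's 3 GByte/s LHC figure can be made concrete against Subutai's ~20 ms/iteration number. The 1 MB record size below is an assumption chosen only to illustrate the gap; the function returns the speedup factor a single region would need to keep up with a stream in real time.]

```cpp
#include <cassert>

// At `secsPerIteration` per step, one region consumes 1/secsPerIteration
// records per second; the stream delivers streamBytesPerSec/recordBytes
// records per second. The ratio is the required speedup.
double requiredSpeedup(double streamBytesPerSec, double recordBytes,
                       double secsPerIteration) {
    double recordsPerSec = streamBytesPerSec / recordBytes;
    return recordsPerSec * secsPerIteration;
}
```

With 3 GB/s chunked into 1 MB records and 20 ms per iteration, a single region falls short by a factor of about 60 before any fan-out across regions, which is why the two-orders-of-magnitude target discussed above matters.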
