Re: [nupic-dev] NuPIC performance requirements

Doug King Wed, 21 Aug 2013 10:34:47 -0700

Hi Oreste,

Good points and observations. My comments below.

*1) Make sure all the SP and TP code (basically the code with a lot of nested
loops) is at least in C++. I would also make sure not to lean too heavily on
the STL or Boost (for instance in the list of active columns, etc.), as this
is usually tricky (or slow) when ported to GPUs.*

We are doing some of this in the current codebase. See Jira ticket
https://issues.numenta.org/browse/NPC-286 - porting the entire CLA to C++ with
bindings for other languages. The first step is to migrate the spatial
pooler to C++; to see progress on this, check
https://issues.numenta.org/browse/NPC-246 . I am not sure what approach
they are taking regarding the STL and Boost.
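
As a rough illustration of what a GPU-friendly data layout might look like, here is a small sketch in Python (only for readability; the real code would be C++). It keeps synapse permanences in one flat, contiguous buffer indexed by column and synapse instead of nested containers, which is the kind of layout that tends to copy to a device in a single transfer. The sizes, the threshold, and every name below are made up for the example and are not taken from NuPIC.

```python
from array import array

# Hypothetical sizes, chosen only for illustration.
NUM_COLUMNS = 4
SYNAPSES_PER_COLUMN = 3
CONNECTED_THRESHOLD = 0.5  # assumed value, not NuPIC's

# One contiguous buffer instead of a list of per-column lists.
permanences = array("f", [0.0] * (NUM_COLUMNS * SYNAPSES_PER_COLUMN))


def flat_index(column, synapse):
    """Map (column, synapse) to a position in the flat buffer."""
    return column * SYNAPSES_PER_COLUMN + synapse


def connected_count(column):
    """Count the synapses of one column at or above the threshold."""
    start = flat_index(column, 0)
    row = permanences[start:start + SYNAPSES_PER_COLUMN]
    return sum(1 for p in row if p >= CONNECTED_THRESHOLD)


permanences[flat_index(2, 0)] = 0.75
permanences[flat_index(2, 1)] = 0.25
permanences[flat_index(2, 2)] = 0.5
print(connected_count(2))
```

The same indexing scheme carries over directly to a plain C array or an OpenCL buffer, which is the main reason to prefer it over per-column containers.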

*2) I would convert the above nested loops (for all columns, for all cells,
etc) to OpenCL. We have to be careful about data movement today and try to
keep most of the data in GPU space (although this problem is going to
disappear with next or next-next generation of GPUs which are going to have
unified address space).*

Yes, you are right about data movement. I don't know much about how OpenCL
abstracts storage and message passing to nodes. Data movement could kill
any gains you get from parallel computing if not handled correctly. The
biggest issue is how to move input data onto the GPU and get the
prediction back out: the I/O for each time frame is the problem. I think the
first step would be to convert those areas to many-core parallel CPU code in
C++ to sort out any issues, then move to OpenCL.
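
To make the data movement concern concrete, here is a back-of-the-envelope sketch in Python. It uses a simple linear cost model (fixed latency plus size divided by bandwidth) and compares keeping the region state resident on the GPU against copying it in and out on every step. The model itself and all the numbers (buffer sizes, bandwidth, latency, compute time) are assumptions for illustration, not measurements of any real system.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s):
    """Simple linear cost model: fixed latency plus size over bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def step_seconds(input_bytes, output_bytes, state_bytes, compute_s,
                 bandwidth_bytes_per_s, latency_s, state_resident):
    """Time for one step; state is copied both ways unless it stays on the GPU."""
    total = compute_s
    total += transfer_seconds(input_bytes, bandwidth_bytes_per_s, latency_s)
    total += transfer_seconds(output_bytes, bandwidth_bytes_per_s, latency_s)
    if not state_resident:
        total += 2 * transfer_seconds(state_bytes, bandwidth_bytes_per_s,
                                      latency_s)
    return total


# Assumed example numbers (not measurements).
example = dict(input_bytes=2048, output_bytes=2048,
               state_bytes=64 * 1024 * 1024, compute_s=1e-3,
               bandwidth_bytes_per_s=8e9, latency_s=10e-6)
resident = step_seconds(state_resident=True, **example)
copied = step_seconds(state_resident=False, **example)
print(f"state resident on GPU:  {resident * 1e3:.3f} ms per step")
print(f"state copied each step: {copied * 1e3:.3f} ms per step")
```

Under these assumptions the per-step input and output are small next to the state, so the design question is mostly whether the state can stay on the device between steps.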

Interesting about next-gen GPUs and unified address space - how close are
we to getting GPUs with a unified memory space?

*3) Support for multiple GPUs within a single node. I would map
different regions to different host threads and GPUs, and for large regions
I would try to partition them across multiple host threads and GPUs.
Considering that they are 2D regions and communication is mostly localized
around the columns, it should be doable. However, the boundaries of the
partitions are going to be tricky, as updates to cells or columns will depend
on values that are in another GPU's address space (again, the next-next
generation of accelerators should solve this problem with a unified address
space).*

Perhaps we should be looking at cheap multi-core CPUs for now to address
this, or would we need to port the entire CLA to OpenCL to run in shared
memory on the GPU? I don't know enough about the architecture of the current
code, OpenCL and GPU memory space to understand the right approach, but we
should take logical steps that we can build on to get there.
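
One common way to handle the partition boundary problem from point 3 is to give each partition a copy of a few cells from its neighbours (halo or ghost cells). The Python sketch below shows the idea on a 1D strip of columns for brevity (real regions are 2D). The neighbourhood sum is only a placeholder for whatever local update the CLA would do, and the function names are hypothetical. It assumes a non-empty input.

```python
def neighborhood_sums(values, radius):
    """Sum of each element and its neighbours within `radius` (clipped at edges)."""
    n = len(values)
    return [sum(values[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]


def partitioned_sums(values, radius, num_parts):
    """Same result, computed chunk by chunk with halo (ghost) cells."""
    n = len(values)
    chunk = (n + num_parts - 1) // num_parts  # ceiling division
    result = []
    for start in range(0, n, chunk):
        stop = min(n, start + chunk)
        # Halo: copy up to `radius` extra cells from each neighbouring partition.
        lo = max(0, start - radius)
        hi = min(n, stop + radius)
        local_sums = neighborhood_sums(values[lo:hi], radius)
        # Keep only the cells this partition owns.
        result.extend(local_sums[start - lo:stop - lo])
    return result


columns = list(range(10))
print(partitioned_sums(columns, radius=2, num_parts=3)
      == neighborhood_sums(columns, radius=2))
```

On separate GPUs the halo copy would be an explicit transfer between devices once per step, so the halo width (here `radius`) determines how much boundary traffic each step costs.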

*4) To parallelize across multiple nodes in a cluster I would definitely
go for MPI and not map-reduce. The reason is that map-reduce is used mostly
for embarrassingly parallel jobs with a final reduce phase to compute the
final result. In our case, considering that the "computation" is based
around the concept of a step (or clock), at every step there is going to be a
significant amount of communication across regions and within each region.
Thankfully, if we stick to the concept of a time step (not exactly brain-like)
we can batch that communication and perform it at the end of each step. If
using MPI, I would map different regions to different MPI ranks, and for
large regions I would partition them across multiple MPI ranks as explained
for point 3 within the node (basically a hierarchy of parallelization).*

Agreed - map-reduce is not the right paradigm, for the reasons you state.
For moving lots of data across multiple compute nodes, MPI is the standard,
and I think it might be appropriate to adopt here.
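
The "compute, then batch the communication at the end of each step" pattern from point 4 can be sketched without any real MPI. In the Python below, plain objects stand in for MPI ranks and a loop stands in for the exchange; the `Rank` class, its fields, and the placeholder update rule are all hypothetical. A real version would replace the exchange loop with MPI calls.

```python
class Rank:
    """Stand-in for one MPI rank owning one region (no real MPI involved)."""

    def __init__(self, rank_id, neighbours):
        self.rank_id = rank_id
        self.neighbours = neighbours  # ids of the ranks this one sends to
        self.state = 0
        self.inbox = []
        self.outbox = []

    def compute(self, step):
        """Local work for one step: consume the inbox, queue outgoing messages."""
        self.state += sum(self.inbox) + 1  # placeholder for the real update
        self.inbox = []
        self.outbox = [(dest, self.state) for dest in self.neighbours]


def run(ranks, num_steps):
    """Alternate a compute phase and one batched exchange per time step."""
    for step in range(num_steps):
        for rank in ranks:  # would run concurrently on real hardware
            rank.compute(step)
        # Batched exchange at the end of the step (the synchronisation point).
        for rank in ranks:
            for dest, payload in rank.outbox:
                ranks[dest].inbox.append(payload)
            rank.outbox = []


ranks = [Rank(0, [1]), Rank(1, [0])]
run(ranks, num_steps=2)
print([rank.state for rank in ranks])
```

Because messages are only delivered after every rank has finished computing, no rank ever sees a neighbour's state from the middle of the current step, which is what makes the batching safe.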

When we start talking about hierarchy though, I think operation would be
similar to map/reduce with lower regions computing on nodes that are
independent of each other and which feed higher regions at a slower rate.
For example, audio (speech) prediction - break audio into spectrum
(frequency bands), feed each band in the spectrum to individual regions.
Feed predictions of each region into a higher region that aggregates all
the predictions of individual audio bands.
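
Here is a minimal Python sketch of that audio example, assuming the per-band regions really are independent of each other within a frame. The two region functions are trivial placeholders (a mean and an arg-max), not CLA code, and the thread pool only stands in for separate compute nodes. The "higher regions run at a slower rate" part is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def lower_region(band_samples):
    """Placeholder for a per-band region: 'predicts' the mean of its input."""
    return sum(band_samples) / len(band_samples)


def higher_region(band_predictions):
    """Placeholder aggregator: index of the band with the largest prediction."""
    return max(range(len(band_predictions)),
               key=band_predictions.__getitem__)


def process_frame(bands):
    """Map: run the independent lower regions. Reduce: feed one higher region."""
    with ThreadPoolExecutor() as pool:
        predictions = list(pool.map(lower_region, bands))
    return higher_region(predictions)


# Three made-up frequency bands for one audio frame.
frame = [[0.1, 0.2], [0.9, 0.7], [0.3, 0.3]]
print(process_frame(frame))
```

The map step has no communication between bands, so this layer of the hierarchy looks like map-reduce, while communication inside each region would still follow the step-synchronised pattern discussed under point 4.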

If we want to move this forward we should come up with a roadmap and next
steps. Input from others here would be appreciated.

-Doug

On Wed, Aug 21, 2013 at 8:26 AM, Oreste Villa <[email protected]> wrote:

> So if I understand correctly, most of the code is Python but the "core"
> is mostly C++ or is going to be C++.
>
> The way I would approach this problem is the following (in temporal order):
>
> 1) Make sure all the SP and TP code (basically the code with a lot of nested
> loops) is at least in C++. I would also make sure not to lean too heavily on
> the STL or Boost (for instance in the list of active columns, etc.), as this
> is usually tricky (or slow) when ported to GPUs.
>
> 2) I would convert the above nested loops (for all columns, for all cells,
> etc) to OpenCL. We have to be careful about data movement today and try to
> keep most of the data in GPU space (although this problem is going to
> disappear with next or next-next generation of GPUs which are going to have
> unified address space).
>
> 3) Support for multiple GPUs within a single node. I would map
> different regions to different host threads and GPUs, and for large regions
> I would try to partition them across multiple host threads and GPUs.
> Considering that they are 2D regions and communication is mostly localized
> around the columns, it should be doable. However, the boundaries of the
> partitions are going to be tricky, as updates to cells or columns will depend
> on values that are in another GPU's address space (again, the next-next
> generation of accelerators should solve this problem with a unified address
> space).
>
> 4) To parallelize across multiple nodes in a cluster I would definitely
> go for MPI and not map-reduce. The reason is that map-reduce is used mostly
> for embarrassingly parallel jobs with a final reduce phase to compute the
> final result. In our case, considering that the "computation" is based
> around the concept of a step (or clock), at every step there is going to be a
> significant amount of communication across regions and within each region.
> Thankfully, if we stick to the concept of a time step (not exactly brain-like)
> we can batch that communication and perform it at the end of each step. If
> using MPI, I would map different regions to different MPI ranks, and for
> large regions I would partition them across multiple MPI ranks as explained
> for point 3 within the node (basically a hierarchy of parallelization).
>
> The code would obviously also work with a single MPI process, and
> therefore on a normal workstation with or without GPUs. BTW, by GPU I also
> mean Intel Phi. I also estimate that points 1 to 4 would take at least 1 or
> 2 years of work, depending on the number of people involved.
>
> Regarding a hardware implementation, I also feel it is the right way to go
> in the long term, but for now I would definitely go with the above solution
> (considering the algorithm will most likely change in the next few years).
> If well implemented, the above approach could increase performance by at
> least 2 orders of magnitude within a node and most likely scale linearly
> across a moderate number of cluster nodes.
>
> I know some of the people on this mailing list have implemented their own
> C++ version of HTM in the past, so I am sure they would definitely be
> interested. Comments are welcome.
>
> Oreste
>
>
>
>
> On Tue, Aug 20, 2013 at 11:22 PM, Doug King <[email protected]> wrote:
>
>> Hi Oreste,
>>
>> You are right, performance will be a central issue. There are a few
>> bottlenecks in the algorithm that can be attacked with hardware
>> acceleration. The best approach I can think of for now is to use
>> parallelization (some form of map-reduce) to solve this. OpenCL would be a
>> good choice to use in place of some of the C++ or Python code. The rest of
>> the Python code could be kept as-is to allow for easy experimentation for
>> optimization of parameters or changes to features that are not core CLA
>> algorithms.
>>
>> There are many OpenCL drivers for GPUs and there is even a platform for
>> converting OpenCL code to FPGA hardware. Eventually the CLA will be ported
>> to some sort of digital/analog hybrid device that simulates
>> dendrite/synapse connections on neuromorphic silicon. This will not be far
>> off - maybe 5 years or less for early experiments, 10 years for cheap
>> commodity devices.
>>
>> For now, most of us are trying to get results that are proof of concept
>> with the current code base, then we will figure out how to scale up and
>> optimize.
>>
>> Another key to acceleration will be the sharing of trained networks that
>> have encapsulated many CPU hours of training on fundamental streams of
>> data, for example speech audio, that once trained will be shared or sold.
>> If this happens the building blocks of lower HTM regions could be leveraged
>> to get to the next level. We need to work towards some CLA network
>> serialization standards for this to happen.
>>
>> I think you are correct in your assumptions, and if you want to
>> contribute to the effort to move to a more performant version of the code I
>> would love to see someone port some of the critical segments of the CLA code
>> to OpenCL. For an analysis of where the bottlenecks are in the CLA and
>> hardware solutions you can start by checking out this paper:
>> http://www.pdx.edu/sites/www.pdx.edu.sysc/files/SySc.Seminar.Hammestrom.May.2011.pdf
>>
>> -Doug
>>
>>
>> On Tue, Aug 20, 2013 at 9:59 PM, oreste villa <[email protected]> wrote:
>>
>>> Hello everybody, this is my first post on this list so please forgive me
>>> if this has already been addressed before.
>>>
>>> I have seen that the current NuPIC source code is mostly Python and I
>>> am wondering....
>>>
>>> I don't know about the problems people are trying to solve today (maybe
>>> for demand and response of power in a building this is not true), but in the
>>> future I believe performance is going to be a central issue. Python seems
>>> to be a non-optimal choice in this respect (as are single-threaded Java,
>>> single-threaded C#, single-threaded C++, or anything else not parallel).
>>>
>>> I keep thinking, for instance, that the Large Hadron Collider at CERN
>>> produces something like 3 GByte/s of raw data, and it would be really nice
>>> if we were able to feed a full year of experiments in real time to a
>>> system based on the CLA. Also, in robotics, the performance and I/O
>>> bandwidth requirements for vision, sensing and motion control are
>>> impressive.
>>>
>>> The question/discussion point I wanted to make is, where does the
>>> project stand in terms of performance? More specifically, are there any
>>> plans to design high performance code inside NuPIC (OpenMP, CUDA, MPI)?
>>> Is this something much less emphasized because the focus of the project is
>>> more on learning the basic CLA principles?
>>>
>>> Thanks,
>>>
>>> Oreste
>>>
>>> _______________________________________________
>>> nupic mailing list
>>> [email protected]
>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>
>>>
