I'll describe the process in more detail:

I'm trying to achieve a Monte Carlo financial simulation.

My inputs are:
1. circa 20,000 Python objects of class "Instrument". Each instrument is
defined by a subclass and a dict of attributes, which will be a handful of
scalars most of the time, or a couple of MB worth of numpy arrays in the
worst case.
Total size: 50MB after compressing the numpy arrays
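
To make the input layout concrete, here is a minimal sketch of how such instruments could be structured; the class names, attribute keys, and values are purely illustrative, not my actual codebase:

```python
import numpy as np

class Instrument:
    """Base class: an instrument is a subclass plus a dict of attributes."""
    def __init__(self, attributes):
        # 'attributes' holds a handful of scalars in most cases,
        # or numpy arrays worth a couple of MB in the worst case
        self.attributes = attributes

class Bond(Instrument):
    # Each subclass gets its own valuation kernel in the simulation phase
    pass

class ExoticOption(Instrument):
    pass

# ~20,000 of these in the real run; 3 here for illustration
instruments = [
    Bond({"notional": 1_000_000.0, "coupon": 0.05}),
    Bond({"notional": 500_000.0, "coupon": 0.03}),
    ExoticOption({"strike": 100.0, "vol_surface": np.zeros((50, 50))}),
]
```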

2. 120 risk factors, each of which is a numpy 1D array of 2 million doubles
(the risk factor values for each simulation scenario)
Total size: 1.8GB

My calculation happens in two phases:
1.
Simulation: for every one of the 20,000 instruments, calculate the
instrument value as a function of the instrument scalar settings and a
subset of the risk factor vectors. There will be different functions
(kernels) depending on the instrument subclass. The output is always a 1D
array of 2 million doubles per instrument, or if you prefer, a 2D array
of 20,000 x 2,000,000 doubles. Some instruments require as input the output
value of other instruments, in a recursive dependency tree.
Total output size: 300GB
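
As a rough sketch of this phase, with the dependency tree resolved by memoized recursion (the kernels, instrument names, and risk factor names below are toy placeholders, not the real ones):

```python
import numpy as np

N_SCENARIOS = 1_000  # 2 million in the real run

# Hypothetical risk factors: name -> 1D array of per-scenario values
rng = np.random.default_rng(0)
risk_factors = {"rate": rng.normal(size=N_SCENARIOS)}

def value_linear(attrs, factors, dep_values):
    # Toy kernel: scalar settings applied to a risk factor vector
    return attrs["notional"] * (1.0 + attrs["beta"] * factors["rate"])

def value_basket(attrs, factors, dep_values):
    # Toy kernel for an instrument that consumes other instruments' outputs
    return attrs["weight"] * sum(dep_values.values())

# Instrument id -> (kernel, attributes, ids of instruments it depends on)
instruments = {
    "a": (value_linear, {"notional": 100.0, "beta": 0.5}, []),
    "b": (value_linear, {"notional": 200.0, "beta": -0.2}, []),
    "c": (value_basket, {"weight": 0.5}, ["a", "b"]),
}

def simulate(instruments, factors):
    """Evaluate every instrument in dependency order."""
    values = {}
    def eval_one(iid):
        if iid not in values:
            kernel, attrs, deps = instruments[iid]
            dep_values = {d: eval_one(d) for d in deps}
            values[iid] = kernel(attrs, factors, dep_values)
        return values[iid]
    for iid in instruments:
        eval_one(iid)
    return values

values = simulate(instruments, risk_factors)  # one 1D array per instrument
```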

2.
Vertical aggregation:
I calculate the value of circa 150 nodes, each of which is a vector of 2
million doubles defined as a weighted sum of the values of up to 8,000
instruments (the weights being scalars): node_value = instr_value1 * k1
+ instr_value2 * k2 + ... + instr_valueN * kN
Each of the 20,000 instruments can contribute to more than one of the
150 output nodes.

Total output size: 2.3GB
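
Put differently, this phase is a (150 x 20,000) weight matrix, mostly zeros, multiplied by the (20,000 x 2,000,000) matrix of instrument values. A minimal dense sketch with made-up sizes and weights (at full scale a scipy.sparse weight matrix would be the natural fit):

```python
import numpy as np

N_INSTR, N_SCEN = 4, 10  # 20,000 and 2,000,000 in the real run

# Simulated instrument values, one row per instrument (toy data)
instr_values = np.arange(N_INSTR * N_SCEN, dtype=float).reshape(N_INSTR, N_SCEN)

# weights[node, instrument]; an instrument may feed several nodes
weights = np.array([[1.0, 0.5, 0.0, 0.0],
                    [0.0, 2.0, 1.0, 0.25]])

# Each node value is a weighted sum of instrument value vectors
node_values = weights @ instr_values  # shape: (n_nodes, N_SCEN)
```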

This process currently runs on a CPU grid and takes 5,000 CPU-hours
of calculation time.

The key points are:
- the raw outputs of the simulation phase can be discarded immediately
after aggregation
- both the initial inputs (1.9GB) and the final outputs (2.3GB) are
negligible in size. Moving around the 300GB of raw simulation outputs
instead will bring the network to its knees.
- there's a very heavy dependency between nodes on the vertical axis, but
there is no dependency whatsoever on the horizontal axis. Each of the 2
million points is completely independent from each other.

Ideally, I would split the 120 input risk factors horizontally into chunks
small enough that the whole simulation + aggregation for a single chunk
fits in VRAM, and send the chunks to different executors on the grid. The
50MB worth of instruments would be broadcast to all the executors. At the
end, I would recombine the output chunks horizontally.
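
The chunking idea in a nutshell (the chunk size, executor function, and factor counts below are stand-ins; in reality process_chunk would run simulation + aggregation on a GPU executor):

```python
import numpy as np

N_SCEN, CHUNK = 100, 32  # 2 million scenarios, VRAM-sized chunks in reality

def process_chunk(factor_chunk):
    # Stand-in for simulation + aggregation on one executor; returns the
    # per-node results for this horizontal slice of scenarios.
    # Valid because the 2 million scenario points are fully independent.
    return factor_chunk.sum(axis=0, keepdims=True)

# Toy risk factors: 3 rows here, 120 in the real run
factors = np.random.default_rng(1).normal(size=(3, N_SCEN))

# Split horizontally, process each chunk independently, recombine
chunks = [factors[:, i:i + CHUNK] for i in range(0, N_SCEN, CHUNK)]
result = np.concatenate([process_chunk(c) for c in chunks], axis=1)
```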
On 20 Oct 2015 6:00 p.m., "Andreas Kloeckner" <[email protected]>
wrote:

> Karl Czajkowski <[email protected]> writes:
>
> > You've only described a K x N processing problem, where you would run
> > N kernels that each process one row of K values.  You haven't
> > described any cross-communication or data shuffling if there are
> > multiple such sub-problems, nor approximate amount of work per input
> > or output data.  Are your tasks truly independent?  What data
> > management do you have to do to get your inputs into a parallel or
> > distributed decomposition?
> >
> > At one far extreme, a high throughput job manager could be used to
> > execute a set of independent PyOpenCL programs, each sized to fit on
> > your OpenCL devices, each processing a different input file containing
> > a subset of your N rows of data.
> >
> > In the middle are a huge number of choices to balance IO, memory, and
> > compute resources. This leads to a huge number of different research
> > programs all focusing on different niches and machine models.
> >
> > If you really want to abstract away the GPGPU devices, you might want
> > to look at OpenMP or similar projects that have tried to add such
> > devices to their targets for auto-vectorization.  I don't work in that
> > space, and so have no specific recommendations.
> >
> > At the other extreme, I adopted PyOpenCL to allow me to do my ad-hoc
> > processing in Python and Numpy with OpenCL callouts for certain
> > bottlenecks.  I have some image processing tasks where there isn't
> > even enough compute time per byte of input to warrant the PCIe
> > transfer bottleneck in some cases.  It is the same speed to run on an
> > i7-4700MQ mobile quad-core CPU (using just SIMD+multicore) as to run
> > on a desktop Kepler GPU.
> >
> > For me, the data IO from disk or network would also dominate, so
> > distributed processing is pointless. Even still, I have used
> > explicit sub-block decomposition to split my large images into smaller
> > OpenCL tasks that can be marshaled through the system RAM or GPU to
> > improve locality and limit the intermediate working set sizes.
>
> Karl's restatement of the problem has helped me understand better what
> you might be looking for. With some work, these two could likely also help:
>
> http://starpu.gforge.inria.fr/
> http://legion.stanford.edu/
>
> Andreas
>
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl