I will describe the process in more detail: I'm trying to run a Monte Carlo financial simulation.
My inputs are:

1. Circa 20,000 Python objects of class "Instrument". Each instrument is
   defined by a subclass and a dict of attributes, which will be a handful
   of scalars most of the time, or a couple of MB worth of numpy arrays in
   the worst case. Total size: 50MB after compressing the numpy arrays.

2. 120 risk factors, each of which is a numpy 1D array of 2 million doubles
   (the risk factor values for each simulation scenario). Total size: 1.8GB.

My calculation happens in two phases:

1. Simulation: for every one of the 20,000 instruments, calculate the
   instrument value as a function of the instrument's scalar settings and a
   subset of the risk factor vectors. There will be different functions
   (kernels) depending on the instrument subclass. The output is always a
   1D array of 2 million doubles per instrument - or, if you prefer, a 2D
   array of 20,000 x 2,000,000 doubles. Some instruments require as input
   the output values of other instruments, in a recursive dependency tree.
   Total output size: 300GB.

2. Vertical aggregation: I calculate the value of circa 150 nodes, each of
   which is a vector of 2 million doubles defined as a weighted sum of the
   values of up to 8,000 instruments (the weights being scalars):

       node_value = instr_value1 * k1 + instr_value2 * k2 + ... + instr_valueN * kN

   Each of the 20,000 instruments can contribute to more than one of the
   150 output nodes. Total output size: 2.3GB.

This process currently runs on a CPU grid and takes 5,000 CPU hours worth of
calculation time. The key points are:

- The raw outputs of the simulation phase can be discarded immediately
  after aggregation.
- Both the initial inputs (1.9GB) and the final outputs (2.3GB) are
  negligible in size. Moving around the 300GB of raw simulation outputs
  instead would bring the network to its knees.
- There is a very heavy dependency between nodes on the vertical axis, but
  no dependency whatsoever on the horizontal axis: each of the 2 million
  points is completely independent of the others.
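To make the aggregation phase concrete, here is a minimal numpy sketch of the weighted-sum formula above, accumulating in place so that only one node-sized vector is live at a time. The function name and argument layout are illustrative, not part of any existing code:

```python
import numpy as np

def aggregate_node(weights, instr_values):
    """node_value = instr_value1*k1 + instr_value2*k2 + ... + instr_valueN*kN.

    weights:      list of N scalars (k1..kN)
    instr_values: list of N 1-D arrays, one scenario vector per instrument
    """
    node = np.zeros_like(instr_values[0])
    for k, v in zip(weights, instr_values):
        node += k * v  # accumulate in place to avoid N temporary arrays
    return node

# Toy example with 5 scenarios instead of 2 million:
vals = [np.ones(5), np.full(5, 2.0)]
node = aggregate_node([0.5, 0.25], vals)  # 0.5*1 + 0.25*2 = 1.0 per scenario
```

Because each scenario is independent, this works unchanged on any horizontal slice of the scenario axis.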
Ideally, I would split the 120 input risk factors into chunks small enough that the whole simulation + aggregation for a single chunk fits in VRAM, and send them to different executors on the grid. The 50MB worth of instruments would be broadcast to all the executors. In the end, I would recombine the output chunks horizontally.

On 20 Oct 2015 6:00 p.m., "Andreas Kloeckner" <[email protected]> wrote:

> Karl Czajkowski <[email protected]> writes:
>
> > You've only described a K x N processing problem, where you would run
> > N kernels that each process one row of K values. You haven't
> > described any cross-communication or data shuffling if there are
> > multiple such sub-problems, nor approximate amount of work per input
> > or output data. Are your tasks truly independent? What data
> > management do you have to do to get your inputs into a parallel or
> > distributed decomposition?
> >
> > At one far extreme, a high-throughput job manager could be used to
> > execute a set of independent PyOpenCL programs, each sized to fit on
> > your OpenCL devices, each processing a different input file containing
> > a subset of your N rows of data.
> >
> > In the middle are a huge number of choices to balance IO, memory, and
> > compute resources. This leads to a huge number of different research
> > programs all focusing on different niches and machine models.
> >
> > If you really want to abstract away the GPGPU devices, you might want
> > to look at OpenMP or similar projects that have tried to add such
> > devices to their targets for auto-vectorization. I don't work in that
> > space, and so have no specific recommendations.
> >
> > At the other extreme, I adopted PyOpenCL to allow me to do my ad-hoc
> > processing in Python and Numpy with OpenCL callouts for certain
> > bottlenecks. I have some image processing tasks where there isn't
> > even enough compute time per byte of input to warrant the PCIe
> > transfer bottleneck in some cases.
> > It is the same speed to run on an i7-4700MQ mobile quad-core CPU
> > (using just SIMD + multicore) as to run on a desktop Kepler GPU.
> >
> > For me, the data IO from disk or network would also dominate, so
> > distributed processing is pointless. Even still, I have used
> > explicit sub-block decomposition to split my large images into smaller
> > OpenCL tasks that can be marshaled through the system RAM or GPU to
> > improve locality and limit the intermediate working set sizes.
>
> Karl's restatement of the problem has helped me understand better what
> you might be looking for. With some work, these two could likely also help:
>
> http://starpu.gforge.inria.fr/
> http://legion.stanford.edu/
>
> Andreas
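The chunked decomposition I described above (simulate instruments in dependency order per scenario chunk, aggregate, discard the raw per-instrument outputs, and recombine the node vectors horizontally) could be sketched in plain numpy roughly as follows. All names here - the instrument tuples, `kernel`, `run_chunk`, `run_all` - are illustrative placeholders, not an existing API, and the real per-chunk work would run on an executor/device:

```python
import numpy as np

def run_chunk(instruments, rf_chunk, nodes):
    """Simulate + aggregate one horizontal chunk of scenarios.

    instruments: list of (name, deps, kernel) in topological order,
                 where kernel(rf_chunk, dep_values) -> 1-D array
    rf_chunk:    dict of risk-factor name -> 1-D array (this chunk)
    nodes:       list of (node_name, [(instr_name, weight), ...])
    """
    values = {}  # raw per-instrument outputs, discarded after aggregation
    for name, deps, kernel in instruments:
        values[name] = kernel(rf_chunk, {d: values[d] for d in deps})
    out = {}
    for node_name, terms in nodes:
        acc = np.zeros_like(next(iter(values.values())))
        for instr_name, k in terms:
            acc += k * values[instr_name]
        out[node_name] = acc
    return out  # only the small node vectors survive the chunk

def run_all(instruments, risk_factors, nodes, n_scenarios, chunk):
    """Split the scenario axis into chunks, then recombine horizontally."""
    pieces = []
    for lo in range(0, n_scenarios, chunk):
        rf_chunk = {k: v[lo:lo + chunk] for k, v in risk_factors.items()}
        pieces.append(run_chunk(instruments, rf_chunk, nodes))
    return {name: np.concatenate([p[name] for p in pieces])
            for name in pieces[0]}

# Toy example: 2 instruments, 1 node, 6 scenarios in chunks of 2.
instruments = [
    ("a", [], lambda rf, dep: rf["r1"] * 2.0),
    ("b", ["a"], lambda rf, dep: dep["a"] + rf["r1"]),
]
nodes = [("n1", [("a", 1.0), ("b", 0.5)])]
rf = {"r1": np.arange(6, dtype=float)}
result = run_all(instruments, rf, nodes, n_scenarios=6, chunk=2)
```

The key property the sketch preserves is that the 300GB intermediate `values` dict never leaves a chunk: only the ~150 node vectors per chunk need to travel back for the final horizontal concatenation.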
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
