A few questions and comments in-line below...
On Oct 21, CRV§ADER//KY modulated:
> 1. circa 20,000 Python objects of class "Instrument". Each instrument
> is defined by a subclass and a dict of attributes, which will be a
> handful of scalars most of the time, or a couple of MB worth of numpy
> arrays in the worst case.
> Total size: 50MB after compressing the numpy arrays
>
If I understand correctly, each attribute may be a scalar or a 1D
Numpy array of variable length? Does the attribute shape vary by
individual instrument or by instrument sub-class?
> 2. 120 risk factors, each of which is a numpy 1D array of 2 million
> doubles (the risk factor values for each simulation scenario)
> Total size: 1.8GB
>
So each logical kernel instance takes as input one row of 120 risk-factor
scalars plus one instrument's attribute set, and outputs one scalar?
> My calculation happens in two phases:
> 1.
> Simulation: for every one of the 20,000 instruments, calculate the
> instrument value as a function of the instrument scalar settings and a
> subset of the risk factor vectors. There will be different functions
> (kernels) depending on the instrument subclass. The output is always a
> 1D array of 2 million doubles per instrument - or if you prefer, a 2D
> array of 20,000 x 2,000,000 doubles. Some instruments require as input
> the output value of other instruments, in a recursive dependency tree.
> Total output size: 300GB
>
Have you already micro-benchmarked any mappings of this to OpenCL? It
seems to me worth checking:
A. K OpenCL jobs of shape (N,), with each worker evaluating one of K
instruments for one of N scenarios.
B. B OpenCL jobs of shape (K/B, N), with each worker evaluating one
of K instruments for one of N scenarios in one of B blocks.
Can the instrument attributes fit in local device memory? If so, this
can easily benefit (A) and may also help in (B) if you can structure
the global shape (K/B, N) into smaller (W, N) workgroups that share a
single instrument...?
These are of course for K instruments in a single class, so they can
share the same kernel. The output should be a K x N array of scalars,
if I understand your problem statement. I'd limit K and N while testing,
before worrying about further decomposition to fit device and driver
limits, which probably cannot cope with a 20,000-element, much less a
2,000,000-element, axis in the job shape.
I'd test on both GPU and CPU devices, including existing devices in
your cluster. If your cluster isn't the latest generation of CPUs
and/or GPUs, I'd also try to test on newer equipment; there could be
dramatic performance improvements that would allow a much smaller
number of new devices to meet or exceed a large pool of older ones...
> 2.
> Vertical aggregation:
> I calculate the value of circa 150 nodes, each of which is a vector of
> 2 million doubles defined as a weighted sum of the value of up to 8,000
> instruments (with the weights being scalar): node_value = instr_value1
> * k1 + instr_value2 * k2 + ... + instr_valueN * kn
> Each of the 20,000 instruments can contribute to more than one of the
> output 150 nodes.
>
This phase seems trivially parallelizable and vectorizable. You can
almost dismiss it while optimizing the phase-1 work and the overall data
transfers.
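To make that concrete: the aggregation is just a (sparse) matrix product.
Stack the instrument values into a K x N array V and the node weights into
a 150 x K matrix W (zero where an instrument doesn't contribute), and all
150 node vectors are W @ V. A sketch with scaled-down, made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, NODES = 100, 1000, 5          # stand-ins for 20,000 / 2,000,000 / 150

instr_values = rng.normal(size=(K, N))   # phase-1 output: one row per instrument

# W[j, k] is instrument k's weight in node j, zero if it doesn't contribute.
# At full scale this is quite sparse (<= 8,000 of 20,000 instruments per
# node), so scipy.sparse would be the natural representation there.
W = np.zeros((NODES, K))
for j in range(NODES):
    contributors = rng.choice(K, size=K // 2, replace=False)
    W[j, contributors] = rng.uniform(0.1, 2.0, size=contributors.size)

node_values = W @ instr_values           # (NODES, N): all node sums at once

assert node_values.shape == (NODES, N)
# spot-check one node against the explicit weighted sum from the post
j = 3
assert np.allclose(node_values[j],
                   sum(W[j, k] * instr_values[k] for k in range(K)))
```

At full scale the practical constraint is that V is ~300GB, so you'd stream
it in instrument-sized chunks and accumulate into the 150 x N result rather
than materialize W @ V in one call.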
Karl
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl