Hi,

> > 1. circa 20,000 Python objects of class "Instrument". Each instrument
> > is defined by a subclass and a dict of attributes, which will be a
> > handful of scalars most of the times, or a couple of MB worth of numpy
> > arrays in the worst case.
> > Total size: 50MB after compressing the numpy arrays
> >
>
> If I understand correctly, each attribute may be a scalar or a 1D
> Numpy array of variable length?  Does the attribute shape vary by
> individual instrument or by instrument sub-class?
>
>
The attribute shape varies by instrument sub-class. Some instrument types
have attributes which are sampled curves. Individual instruments of the
same type may have more or less sample points for their curve attribute.


>
> > 2. 120 risk factors, each of which is a numpy 1D array of 2 million
> > doubles (the risk factor values for each simulation scenario)
> > Total size: 1.8GB
> >
>
> So, each logical kernel needs as input one 120 scalar row of risk
> factors and one instrument attribute set, and it outputs one scalar?
>

It needs a subset of the 120 scalar risk factors (between 0 and 20) and a
subset of the output of other instruments (between 0 and 150), and it
produces one scalar.
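
Since some instruments consume other instruments' outputs, I'm planning to
derive the evaluation order with a topological sort before launching any
kernels. A minimal sketch with the stdlib (Python 3.9+); the instrument
names and the `deps` map are made up for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: instrument -> set of instruments whose
# output it consumes (instruments priced from risk factors alone have
# an empty set).
deps = {
    "swap_1": set(),
    "swaption_1": {"swap_1"},
    "basket_1": {"swap_1", "swaption_1"},
}

# An evaluation order that guarantees every dependency is computed
# before its dependents.
order = list(TopologicalSorter(deps).static_order())
```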

>
> > My calculation happens in two phases:
> > 1.
> > Simulation: for every one of the 20,000 instruments, calculate the
> > instrument value as a function of the instrument scalar settings and a
> > subset of the risk factor vectors. There will be different functions
> > (kernels) depending on the instrument subclass. The output is always a
> > 1D array of 2 million doubles per instrument - or if you prefer, a 2D
> > array of 20,000 x 2,000,000 doubles. Some instruments require as input
> > the output value of other instruments, in a recursive dependency tree.
> > Total output size: 300GB
> >
>
> Have you already micro-benchmarked any mappings of this to OpenCL?


I haven't started writing the code yet... I'm still in the exploration phase.
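
That said, the rough shape I have in mind for a first micro-benchmark is
below. The kernel body is a toy placeholder (value = a * rf + b), and the
NumPy function is just a reference to validate the kernel's output against;
none of the names reflect real instrument math:

```python
import numpy as np

# Toy OpenCL kernel for mapping (A): one worker per scenario, one launch
# per instrument.  The body is a placeholder; real kernels would hold
# the per-subclass pricing math.
KERNEL_SRC = """
__kernel void instr(__global const double *rf,
                    const double a, const double b,
                    __global double *out)
{
    int i = get_global_id(0);
    out[i] = a * rf[i] + b;
}
"""

def instr_reference(rf, a, b):
    """NumPy reference implementation to validate the kernel against."""
    return a * rf + b

rf = np.linspace(0.0, 1.0, 1_000)   # scaled-down scenario vector
out = instr_reference(rf, a=2.0, b=1.0)
```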


>   It seems to me worth checking:
>
>   A. K OpenCL jobs of shape (N,), with each worker evaluating one of K
>      instruments for one of N scenarios.
>
>   B. B OpenCL jobs of shape (K/B, N), with each worker evaluating one
>      of K instruments for one of N scenarios in one of B blocks.
>
> Can the instrument attributes fit in local device memory?


Yes; as I said, it's 50MB after compressing the numpy arrays, and more like
300-500MB uncompressed.
I can optimize that, though.


> If so, this
> can easily benefit (A) and may also help in (B) if you can structure
> the global shape (K/B, N) into smaller (W, N) workgroups that share a
> single instrument...?
>
> These are of course for K instruments in a single class, so they can
> share the same kernel.  The output should be a K x N array of scalars,
> if I understand your problem statement.


Theoretically yes, but it's probably simpler to invoke the same kernel
multiple times, once per instrument.
With out-of-order execution enabled, I wouldn't expect too much of a
penalty for that.
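
To make the launch pattern concrete, here's a stub sketch. `launch_kernel`
only records calls; the real version would enqueue on a pyopencl queue
created with
`properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE`,
and the subclass names and ids are made up:

```python
# Stub sketch of the per-instrument launch loop.  launch_kernel() just
# records each call here; the real version would enqueue the kernel on
# an out-of-order pyopencl CommandQueue, leaving the driver free to
# overlap the independent launches.
launched = []

def launch_kernel(kernel_name, instrument_id):
    launched.append((kernel_name, instrument_id))

# Hypothetical grouping: instrument ids by subclass (= by kernel).
by_subclass = {
    "swap_kernel": [0, 1, 2],
    "option_kernel": [3, 4],
}
for kernel_name, ids in by_subclass.items():
    for iid in ids:              # one independent launch per instrument
        launch_kernel(kernel_name, iid)
```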


>   I'd limit the numbers K and N
> for testing, before worrying about further decomposition to fit the
> device and driver limits which probably cannot cope with a 20K much
> less 2M job shape axis.
>
> I'd test on both GPU and CPU devices, including existing devices in
> your cluster. If your cluster isn't the latest generation of CPUs
> and/or GPUs, I'd also try to test on newer equipment; there could be
> dramatic performance improvements that would allow a much smaller
> number of new devices to meet or exceed a large pool of older ones...
>
>
> > 2.
> > Vertical aggregation:
> > I calculate the value of circa 150 nodes, each of which is a vector of
> > 2 million doubles defined as a weighted sum of the value of up to 8,000
> > instruments (with the weights being scalar): node_value = instr_value1
> > * k1 + instr_value2 * k2 + ... + instr_valueN * kn
> > Each of the 20,000 instruments can contribute to more than one of the
> > output 150 nodes.
> >
>
> This phase seems trivially parallelizable and vectorizable.  You can
> almost dismiss it while optimizing the phase 1 work and overall data
> transfers.
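
Agreed. For reference, phase 2 boils down to a (sparse) matrix product; in
NumPy terms, with toy sizes and a made-up weight matrix:

```python
import numpy as np

# Toy sizes: 4 instruments x 10 scenarios (real: 20,000 x 2,000,000).
instr_values = np.arange(40, dtype=np.float64).reshape(4, 10)

# Weight matrix W (nodes x instruments), zero where an instrument does
# not contribute to a node.  The real one is 150 x 20,000 with up to
# 8,000 nonzeros per row, so a scipy.sparse representation would help.
W = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 2.0, 1.0, 1.0]])

# node_value = k1*instr_value1 + k2*instr_value2 + ..., evaluated for
# all scenarios at once.
node_values = W @ instr_values       # shape: (nodes, scenarios)
```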
>
>
> Karl
>
>
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl