How the work is parallelized depends on the runner, but you should strive to keep your DoFns as simple as possible and avoid additional complexity. Generally a runner will parallelize work automatically by splitting the source the data is coming from so that each worker stays busy; for example, it may use signals such as CPU utilization to increase the number of bundles of work processed in parallel per worker. This lets people write DoFns that are as simple as possible without having to worry about how to make them run faster.
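For concreteness, here is a minimal sketch of such a simple DoFn (the names are illustrative, and this assumes the Dataflow SDK 1.x style used in the snippet quoted below):

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

// A deliberately simple per-element DoFn: no threads, no shared state.
// The runner decides how many instances run in parallel and how the
// input is split into bundles.
class UppercaseFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) {
    c.output(c.element().toUpperCase());
  }
}

// Applied like any other ParDo, e.g.:
// PCollection<String> upper = words.apply(ParDo.of(new UppercaseFn()));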
But there are practical considerations: work can only be split as finely as the number of elements, and remote calls may be much more efficient when done in batches (see the sketch after the quoted message below). This is case by case, and different strategies work better for different problems.

Also, as a clarifying point: is it that generateAndStoreGraphData(), generateAndStoreTimeSerieseData1(), and generateAndStoreTimeSerieseData1() need to be called for the same T in parallel (because of some internal requirement), or is it good enough that different T's go through the generateAndStore... transforms in parallel?

On Thu, May 5, 2016 at 11:09 AM, kaniska Mandal <[email protected]> wrote:
> Inside a CompositeTransformation --
>
> is it OK to spawn threads / use a CountDownLatch to perform multiple
> ParDo on the same data item at the same time,
> or are all the calls to the individual ParDos inherently parallelized?
>
> For example, I need to execute generateAndStoreGraphData(),
> generateAndStoreTimeSerieseData1(), generateAndStoreTimeSerieseData1()
> in parallel.
>
> static class MultiOps extends PTransform<PCollection<T>, PCollection<T>> {
>
>   public PCollection<T> apply(PCollection<T> dataSet) {
>
>     dataSet.apply(ParDo.of(generateAndStoreGraphData()));
>     dataSet.apply(ParDo.of(generateAndStoreTimeSerieseData1()));
>     dataSet.apply(ParDo.of(generateAndStoreTimeSerieseData1()));
>
>     return dataSet;
>   }
> }
>
> =====================
>
> Thanks
>
> Kaniska
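If batched remote calls do help in your case, one common pattern is to buffer elements in processElement and flush in finishBundle, so each bundle issues far fewer RPCs. A minimal sketch (RemoteService.writeBatch() and the batch size are hypothetical placeholders, and this again assumes the Dataflow SDK 1.x DoFn API):

import java.util.ArrayList;
import java.util.List;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

class BatchingWriteFn<T> extends DoFn<T, T> {
  private static final int MAX_BATCH_SIZE = 100;  // illustrative

  // Rebuilt per bundle in startBundle, so it is never serialized.
  private transient List<T> batch;

  @Override
  public void startBundle(Context c) {
    batch = new ArrayList<>();
  }

  @Override
  public void processElement(ProcessContext c) {
    batch.add(c.element());
    if (batch.size() >= MAX_BATCH_SIZE) {
      flush();
    }
    c.output(c.element());  // pass the element through unchanged
  }

  @Override
  public void finishBundle(Context c) {
    flush();  // write whatever is left over in this bundle
  }

  private void flush() {
    if (!batch.isEmpty()) {
      RemoteService.writeBatch(batch);  // hypothetical batched RPC
      batch.clear();
    }
  }
}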
