Re: Parallelism

Thomas Groh Wed, 25 May 2016 11:56:15 -0700

Each source (e.g. TextIO.Read.from("/foo/bar")) will currently be invoked
by only a single thread at a time; Multiple sources (e.g.
TextIO.Read.from("/foo/bar"); ... TextIO.Read.from("/foo/baz"); ...) will
be read from independently and in different threads. If you have a known
distribution of sources, splitting the read into multiple sources is a
workaround to produce additional parallelism in the InProcessPipelineRunner
(Flattening the produced PCollections together). Additionally, downstream
transforms prior to a GroupByKey will also be executed by a single thread
at a time, but independently; so if multiple transforms are applied to the
same PCollection, they will be executed in parallel).


I've added https://issues.apache.org/jira/browse/BEAM-310 to track the
feature to split input sources.

On Wed, May 25, 2016 at 8:41 AM, David Olsen <[email protected]>
wrote:

> At the moment I would need to read split data locally, to perform external
> calls, and then to aggregate results based on a particular key (e.g. per
> user) if needed.
>
> This seems to me the current InProcessPipelineRunner fulfill my
> requirement, but not sure if 'Read transforms with one thread' means
> reading underlying data (in my case i.e. to read split data locally) will
> use one single thread to go through all split data? If this is the case,
> anyway I can read splits with more than one thread or any workaround?
> (Eventually I will go with across machines pipeline runner; now I don't
> have enough resources so have to start from single machine.)
>
> Thanks for all the input. It's very useful!
>
> On 25 May 2016 at 02:41, Jean-Baptiste Onofré <[email protected]> wrote:
>
>> I second Thomas: thanks for the details explanation (I forgot the mention
>> the "unique" JVM ;)).
>>
>> Regards
>> JB
>>
>> On 05/24/2016 07:28 PM, Thomas Groh wrote:
>>
>>> More specifically, the InProcessPipelineRunner (soon to be renamed to
>>> the DirectRunner) will run on a single machine, with a number of threads
>>> based on the number of available processors in the JVM, fanning out work
>>> to these threads as appropriate; It will not perform any cross-process
>>> (including cross-machine) communication. No configuration is required to
>>> get this threading behavior, but the number of threads is also not
>>> currently configurable.
>>>
>>> Can you say more about what you require to be parallel? In the current
>>> implementation, Read transforms (and the Source that underlies them) are
>>> currently exercised by only one thread, as are PTransforms downstream of
>>> them prior to a GroupByKey, based on how work is scheduled. However, all
>>> transforms after a GroupByKey execute in parallel based on the number of
>>> available keys.
>>>
>>> On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>     Hi David,
>>>
>>>     if you use the InProcessPipelineRunner (the "new"
>>>     DirectPipelineRunner), than it can creates several threads.
>>>
>>>     Regards
>>>     JB
>>>
>>>
>>>     On 05/24/2016 04:38 PM, David Olsen wrote:
>>>
>>>         A naive question about DirectPipelineRunner: Is it possible to
>>>         execute DirectPipelineRunner with multiple threads/ instances
>>>         (across
>>>         machines) or the parallelism is only supported by runner such as
>>>         SparkPipelineRunner?
>>>
>>>         My requirement is to run pipeline in parallel, either threading
>>> or
>>>         multiple machines. And I just start to investigating Apache Beam.
>>>
>>>         When reading google dataflow doc, the options setting mention
>>> that
>>>         numWorkers can be configured for the instances to use (I
>>>         understand it's
>>>         still different from Apache Beam). However, searching Apache
>>>         Beam source
>>>         on github with the keyword 'numWorkers' doesn't come up related
>>>         source
>>>         snippet. So I am wondering if the only way to execute pipeline
>>>         process
>>>         in parallel is to use SparkPipelineRunner/ FlinkPipelineRunner
>>>         (meaning
>>>         I have to use Apache Beam + Spark/ Flink) or make use of Google
>>>         Cloud
>>>         Platform?
>>>
>>>         Thanks
>>>
>>>         [1].
>>>
>>> https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
>>>
>>>
>>>     --
>>>     Jean-Baptiste Onofré
>>>     [email protected] <mailto:[email protected]>
>>>     http://blog.nanthrax.net
>>>     Talend - http://www.talend.com
>>>
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>

Re: Parallelism

Reply via email to