Re: Parallelism

Jean-Baptiste Onofré Tue, 24 May 2016 11:41:47 -0700

I second Thomas: thanks for the detailed explanation (I forgot to mention the "unique" JVM ;)).


Regards
JB


On 05/24/2016 07:28 PM, Thomas Groh wrote:

More specifically, the InProcessPipelineRunner (soon to be renamed to
the DirectRunner) will run on a single machine, with a number of threads
based on the number of available processors in the JVM, fanning out work
to these threads as appropriate. It will not perform any cross-process
(including cross-machine) communication. No configuration is required to
get this threading behavior, but the number of threads is also not
currently configurable.
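
For illustration only, the thread-count behavior described above can be
sketched in plain Java. This is not the runner's code: the class and
method names are hypothetical, and it assumes that "available processors
in the JVM" corresponds to Runtime.availableProcessors().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not taken from the InProcessPipelineRunner.
class ProcessorSizedPool {

    // One worker thread per processor visible to the JVM (assumption).
    static int poolSize() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // Fans the work items out to the pool and sums their squares.
    // Everything stays inside one JVM: no cross-process communication.
    static long sumOfSquares(List<Integer> items) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize());
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int item : items) {
                final long value = item;
                results.add(pool.submit(() -> value * value));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that task is done
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads: " + poolSize());
        System.out.println("sum: " + sumOfSquares(Arrays.asList(1, 2, 3, 4)));
    }
}
```

The pool size is fixed at construction time here, which mirrors the point
that the thread count is derived rather than configured.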

Can you say more about what you require to be parallel? In the current
implementation, Read transforms (and the Source that underlies them) are
exercised by only one thread, as are PTransforms downstream of
them prior to a GroupByKey, based on how work is scheduled. However, all
transforms after a GroupByKey execute in parallel based on the number of
available keys.
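
A rough way to picture this scheduling, using only the Java standard
library rather than any Beam API (the class and method names below are
hypothetical): the input is read and grouped on a single thread, and
only the per-key work after the grouping step is spread across threads,
so the usable parallelism is bounded by the number of distinct keys.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical illustration; not how the runner is implemented.
class PerKeyParallelism {

    // Assumes every word is non-empty.
    static Map<Character, Integer> countByFirstLetter(List<String> words) {
        // Sequential stage: read and group by key on the calling thread.
        Map<Character, List<String>> grouped = words.stream()
                .collect(Collectors.groupingBy(w -> w.charAt(0)));

        // Parallel stage: one unit of work per key, run on the common
        // fork-join pool. With only two keys, at most two run at once.
        Map<Character, Integer> counts = new ConcurrentHashMap<>();
        grouped.entrySet().parallelStream()
                .forEach(e -> counts.put(e.getKey(), e.getValue().size()));

        // Sorted copy so the result prints in a stable order.
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        List<String> words =
                Arrays.asList("apple", "avocado", "beam", "bolt", "cat");
        System.out.println(countByFirstLetter(words));
    }
}
```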

On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <[email protected]> wrote:

    Hi David,

    if you use the InProcessPipelineRunner (the "new"
    DirectPipelineRunner), then it can create several threads.

    Regards
    JB


    On 05/24/2016 04:38 PM, David Olsen wrote:

        A naive question about DirectPipelineRunner: is it possible to
        execute DirectPipelineRunner with multiple threads or instances
        (across machines), or is parallelism only supported by runners
        such as SparkPipelineRunner?

        My requirement is to run a pipeline in parallel, using either
        threads or multiple machines. I have only just started
        investigating Apache Beam.

        The Google Dataflow documentation [1] mentions, under its options
        settings, that numWorkers can be configured to set the number of
        instances to use (I understand this is still different from Apache
        Beam). However, searching the Apache Beam source on GitHub for the
        keyword 'numWorkers' doesn't turn up any related source snippet. So
        I am wondering whether the only way to execute a pipeline in
        parallel is to use SparkPipelineRunner/FlinkPipelineRunner (meaning
        I have to use Apache Beam + Spark/Flink) or to make use of Google
        Cloud Platform?

        Thanks

        [1] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options


    --
    Jean-Baptiste Onofré
    [email protected]
    http://blog.nanthrax.net
    Talend - http://www.talend.com


--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
