I second Thomas: thanks for the detailed explanation (I forgot to
mention the "unique" JVM ;)).
Regards
JB
On 05/24/2016 07:28 PM, Thomas Groh wrote:
More specifically, the InProcessPipelineRunner (soon to be renamed to
the DirectRunner) will run on a single machine, with a number of threads
based on the number of available processors in the JVM, fanning out work
to these threads as appropriate; it will not perform any cross-process
(including cross-machine) communication. No configuration is required to
get this threading behavior, but the number of threads is also not
currently configurable.
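
To illustrate the threading model described above, here is a minimal sketch in plain Java. It is not the runner's actual code: the class name ProcessorSizedPool and the helper squareAll are hypothetical, and it only assumes the general pattern of sizing a thread pool from the number of processors available to the JVM and fanning work out to it, all within one process.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ProcessorSizedPool {
    // Hypothetical helper: squares each input on a pool sized to the JVM's processors.
    static List<Integer> squareAll(List<Integer> inputs) throws Exception {
        // The pool size comes from the JVM, with no user configuration.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Fan the work out: one task per input element.
            List<Future<Integer>> futures = new ArrayList<>();
            for (Integer n : inputs) {
                futures.add(pool.submit(() -> n * n));
            }
            // Collect results in submission order.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(squareAll(Arrays.asList(1, 2, 3, 4)));
    }
}
```

All tasks here share one JVM and one machine, which is the same constraint the runner has: more cores give more threads, but nothing crosses a process boundary.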
Can you say more about what you require to be parallel? In the current
implementation, Read transforms (and the Source that underlies them) are
exercised by only one thread, as are PTransforms downstream of
them prior to a GroupByKey, based on how work is scheduled. However, all
transforms after a GroupByKey execute in parallel based on the number of
available keys.
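
The scheduling behaviour above can be pictured with a small plain-Java sketch. This is only an analogy under stated assumptions, not Beam code: the class PerKeyParallelism and the method sumPerKey are hypothetical names. Elements are grouped on a single thread (standing in for the stages before the GroupByKey), and then each key's values are handed to a thread pool as a separate task, so the achievable parallelism is bounded by the number of distinct keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerKeyParallelism {
    // Hypothetical helper: sums the values of each key, one task per key.
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> elements)
            throws Exception {
        // Before the grouping: a single thread reads and groups all elements.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : elements) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // After the grouping: each key becomes an independent unit of work.
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            Map<String, Future<Integer>> pending = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
                List<Integer> values = g.getValue();
                pending.put(g.getKey(), pool.submit(() -> {
                    int sum = 0;
                    for (int v : values) {
                        sum += v;
                    }
                    return sum;
                }));
            }
            // Wait for every per-key task and gather the sums.
            Map<String, Integer> result = new TreeMap<>();
            for (Map.Entry<String, Future<Integer>> p : pending.entrySet()) {
                result.put(p.getKey(), p.getValue().get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

With only one distinct key, the second phase degenerates to a single task, which mirrors why the number of available keys matters for parallelism after a GroupByKey.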
On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <[email protected]> wrote:
Hi David,
if you use the InProcessPipelineRunner (the "new"
DirectPipelineRunner), then it can create several threads.
Regards
JB
On 05/24/2016 04:38 PM, David Olsen wrote:
A naive question about DirectPipelineRunner: Is it possible to
execute DirectPipelineRunner with multiple threads/instances (across
machines), or is parallelism only supported by runners such as
SparkPipelineRunner?
My requirement is to run a pipeline in parallel, using either threads or
multiple machines. I have just started investigating Apache Beam.
When reading the Google Dataflow documentation [1], I saw that the
numWorkers option can be configured to set the number of instances to
use (I understand it's still different from Apache Beam). However,
searching the Apache Beam source on GitHub for the keyword 'numWorkers'
doesn't turn up any related source snippet. So I am wondering if the
only way to execute a pipeline in parallel is to use
SparkPipelineRunner/FlinkPipelineRunner (meaning I have to use Apache
Beam + Spark/Flink) or to make use of Google Cloud Platform?
Thanks
[1] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com