Each source (e.g. TextIO.Read.from("/foo/bar")) will currently be invoked
by only a single thread at a time; Multiple sources (e.g.
TextIO.Read.from("/foo/bar"); ... TextIO.Read.from("/foo/baz"); ...) will
be read from independently and in different threads. If you have a known
distribution of sources, splitting the read into multiple sources is a
workaround to produce additional parallelism in the InProcessPipelineRunner
(Flattening the produced PCollections together). Additionally, downstream
transforms prior to a GroupByKey will also be executed by a single thread
at a time, but independently; so if multiple transforms are applied to the
same PCollection, they will be executed in parallel).I've added https://issues.apache.org/jira/browse/BEAM-310 to track the feature to split input sources. On Wed, May 25, 2016 at 8:41 AM, David Olsen <[email protected]> wrote: > At the moment I would need to read split data locally, to perform external > calls, and then to aggregate results based on a particular key (e.g. per > user) if needed. > > This seems to me the current InProcessPipelineRunner fulfill my > requirement, but not sure if 'Read transforms with one thread' means > reading underlying data (in my case i.e. to read split data locally) will > use one single thread to go through all split data? If this is the case, > anyway I can read splits with more than one thread or any workaround? > (Eventually I will go with across machines pipeline runner; now I don't > have enough resources so have to start from single machine.) > > Thanks for all the input. It's very useful! > > On 25 May 2016 at 02:41, Jean-Baptiste Onofré <[email protected]> wrote: > >> I second Thomas: thanks for the details explanation (I forgot the mention >> the "unique" JVM ;)). >> >> Regards >> JB >> >> On 05/24/2016 07:28 PM, Thomas Groh wrote: >> >>> More specifically, the InProcessPipelineRunner (soon to be renamed to >>> the DirectRunner) will run on a single machine, with a number of threads >>> based on the number of available processors in the JVM, fanning out work >>> to these threads as appropriate; It will not perform any cross-process >>> (including cross-machine) communication. No configuration is required to >>> get this threading behavior, but the number of threads is also not >>> currently configurable. >>> >>> Can you say more about what you require to be parallel? In the current >>> implementation, Read transforms (and the Source that underlies them) are >>> currently exercised by only one thread, as are PTransforms downstream of >>> them prior to a GroupByKey, based on how work is scheduled. However, all >>> transforms after a GroupByKey execute in parallel based on the number of >>> available keys. >>> >>> On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>> Hi David, >>> >>> if you use the InProcessPipelineRunner (the "new" >>> DirectPipelineRunner), than it can creates several threads. >>> >>> Regards >>> JB >>> >>> >>> On 05/24/2016 04:38 PM, David Olsen wrote: >>> >>> A naive question about DirectPipelineRunner: Is it possible to >>> execute DirectPipelineRunner with multiple threads/ instances >>> (across >>> machines) or the parallelism is only supported by runner such as >>> SparkPipelineRunner? >>> >>> My requirement is to run pipeline in parallel, either threading >>> or >>> multiple machines. And I just start to investigating Apache Beam. >>> >>> When reading google dataflow doc, the options setting mention >>> that >>> numWorkers can be configured for the instances to use (I >>> understand it's >>> still different from Apache Beam). However, searching Apache >>> Beam source >>> on github with the keyword 'numWorkers' doesn't come up related >>> source >>> snippet. So I am wondering if the only way to execute pipeline >>> process >>> in parallel is to use SparkPipelineRunner/ FlinkPipelineRunner >>> (meaning >>> I have to use Apache Beam + Spark/ Flink) or make use of Google >>> Cloud >>> Platform? >>> >>> Thanks >>> >>> [1]. >>> >>> https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options >>> >>> >>> -- >>> Jean-Baptiste Onofré >>> [email protected] <mailto:[email protected]> >>> http://blog.nanthrax.net >>> Talend - http://www.talend.com >>> >>> >>> >> -- >> Jean-Baptiste Onofré >> [email protected] >> http://blog.nanthrax.net >> Talend - http://www.talend.com >> > >
