Nope. A Beam pipeline goes through three distinct phases:

(1) Graph construction -- from Pipeline.create() until Pipeline.run(). During this phase, PCollections are just placeholders; their contents don't exist yet.

(2) Submission -- Pipeline.run() submits the pipeline to a runner for execution.

(3) Execution -- the runner runs the pipeline. This may continue after Pipeline.run() returns, depending on the runner. Some PCollections may get materialized during execution, but others may not (for example, if the runner chooses to fuse or optimize them away).
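As a minimal sketch of phase (1) versus phases (2)-(3) -- assuming the Beam Java SDK on the classpath, with the DirectRunner as the default:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class PhasesSketch {
  public static void main(String[] args) {
    // Graph construction begins here.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // 'lines' is only a placeholder node in the pipeline graph;
    // no data exists behind it at this point.
    PCollection<String> lines = p.apply(Create.of("hello", "world"));

    // Submission + execution: only now may the runner materialize anything,
    // and possibly asynchronously, depending on the runner.
    p.run();
  }
}
```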
So, in other words, there's no point in your main program at which a PCollection is guaranteed to have its contents available.

If you are attempting to debug a job, you might look into PAssert [1], which gives you a place to inspect the entire contents of a PCollection and do something with it. In the DirectPipelineRunner that code will run locally, so you'll see the output. But of course, in other runners it may unhelpfully print on some random worker somewhere. You can also write tests that assert on a PCollection's contents, and those will work on any runner. (This is how the @RunnableOnService tests that we are setting up for all runners work.)

We should get some documentation on testing put together on the Beam site, but in the meantime the info on the Dataflow site [2] roughly applies (DataflowAssert was renamed PAssert in Beam).

Hope that helps!

[1] https://github.com/apache/incubator-beam/blob/eb682a80c4091dce5242ebc8272f43f3461a6fc5/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java
[2] https://cloud.google.com/dataflow/pipelines/testing-your-pipeline

On Wed, May 18, 2016 at 10:48 AM, Jesse Anderson <[email protected]> wrote:

> Is there a way to get the PCollection's results in the executing process?
> In Spark this is using the collect method on an RDD.
>
> This would be for small amounts of data stored in a PCollection like:
>
> PCollection<Long> count = partition.apply(Count.globally());
> System.out.println(valuesincount);
>
> Thanks,
>
> Jesse
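To make the PAssert suggestion concrete for the Count.globally() case above, here is a rough sketch (assuming the Beam Java SDK; the input data is made up with Create.of for illustration):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CountAssertSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // Stand-in for the real 'partition' PCollection from the question.
    PCollection<String> partition = p.apply(Create.of("a", "b", "c"));
    PCollection<Long> count = partition.apply(Count.<String>globally());

    // PAssert checks the full contents of the PCollection during execution,
    // on whichever runner executes the pipeline -- no collect() needed.
    PAssert.that(count).containsInAnyOrder(3L);

    p.run();
  }
}
```

The assertion itself executes inside the pipeline, which is why it works even on runners where the data never comes back to the main program.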
