Oh, and I forgot the unified model!  For unbounded collections, you can't
ask to access the whole collection, since you'd be waiting forever ;-)
Thomas has some work in progress on making PAssert support per-window
assertions though, so it'll be able to handle that case in the future.
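In the meantime, plain PAssert already handles bounded (or globally windowed) test inputs. A minimal sketch, assuming hypothetical element values and a simple main-method harness rather than a full JUnit test:

```java
import java.util.Arrays;

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CountAssertSketch {
  public static void main(String[] args) {
    TestPipeline p = TestPipeline.create();

    // Hypothetical bounded input; in a real test this could come from
    // Create, TextIO, etc.
    PCollection<String> words =
        p.apply(Create.of(Arrays.asList("a", "b", "c")));

    PCollection<Long> count = words.apply(Count.globally());

    // PAssert inspects the full contents of the PCollection on whatever
    // runner executes the pipeline; three input elements, so the global
    // count should be 3.
    PAssert.that(count).containsInAnyOrder(3L);

    p.run();
  }
}
```

The assertion itself runs as part of the pipeline, which is what makes it portable across runners.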

On Thu, May 19, 2016 at 11:26 AM, Frances Perry <[email protected]> wrote:

> Nope. A Beam pipeline goes through three different phases:
> (1) Graph construction -- from Pipeline.create() until Pipeline.run().
> During this phase PCollections are just placeholders -- their contents don't
> exist.
> (2) Submission -- Pipeline.run() submits the pipeline to a runner for
> execution.
> (3) Execution -- The runner runs the pipeline. May continue after
> Pipeline.run() returns, depending on the runner. Some PCollections may get
> materialized during execution, but others may not (for example, if the
> runner chooses to fuse or optimize them away).
>
> So in other words, there's no place in your main program where a
> PCollection is guaranteed to have its contents available.
>
> If you are attempting to debug a job, you might look into PAssert [1],
> which gives you a place to inspect the entire contents of a PCollection and
> do something with it. In the DirectPipelineRunner that code will run
> locally, so you'll see the output. But of course, on other runners it may
> unhelpfully print on some random worker somewhere. You can also write
> tests that assert on a PCollection's contents, which will work on any
> runner. (This is how the @RunnableOnService tests that we are setting up
> for all runners work.)
>
> We should get some documentation on testing put together on the Beam site,
> but in the meantime the info on the Dataflow site [2] roughly applies
> (DataflowAssert was renamed PAssert in Beam).
>
> Hope that helps!
>
> [1]
> https://github.com/apache/incubator-beam/blob/eb682a80c4091dce5242ebc8272f43f3461a6fc5/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java
> [2] https://cloud.google.com/dataflow/pipelines/testing-your-pipeline
>
> On Wed, May 18, 2016 at 10:48 AM, Jesse Anderson <[email protected]>
> wrote:
>
>> Is there a way to get the PCollection's results in the executing process?
>> In Spark this is done with the collect method on an RDD.
>>
>> This would be for small amounts of data stored in a PCollection like:
>>
>> PCollection<Long> count = partition.apply(Count.globally());
>> // want to print the Long value inside count here
>> System.out.println(count);
>>
>> Thanks,
>>
>> Jesse
>>
>
>
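
For the quick println-style debugging Jesse asked about, one workaround is to log from inside a ParDo. This is only a sketch with hypothetical names, and the caveat from Frances's reply applies: the output lands wherever the DoFn executes, so it is really only useful with a local runner like the DirectRunner.

```java
import org.apache.beam.sdk.transforms.DoFn;

// Apply to any PCollection<Long>, e.g. the result of Count.globally():
//   count.apply(ParDo.of(new PrintFn()));
class PrintFn extends DoFn<Long, Long> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // On the DirectRunner this prints in your local process; on a
    // distributed runner it prints on some remote worker instead.
    System.out.println("count = " + c.element());
    // Pass the element through so downstream transforms still see it.
    c.output(c.element());
  }
}
```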
