I'll join Frances and add that even Spark warns about using collect() as it might not fit into Driver memory, and like you said it's only for small datasets - but small is relative... If you want some "printout" during development you could use ConsoleIO (for streaming) - both the Flink and Spark runners support this (and Beam doesn't AFAIK).
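For completeness, here's a rough sketch of the PAssert approach Frances describes below. The class names follow the Beam Java SDK of that era and the pipeline contents are made up, so treat it as illustrative rather than exact:

```java
// Illustrative sketch - assumes the Apache Beam Java SDK is on the classpath.
// Instead of trying to collect() results in the driver, assert on the
// PCollection's contents; the check runs wherever the runner materializes it.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CountCheck {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // Hypothetical input; in practice this would be your real source.
    PCollection<String> words = p.apply(Create.of("a", "b", "c"));
    PCollection<Long> count = words.apply(Count.globally());

    // The assertion is evaluated during execution, on whichever worker
    // the runner happens to place it.
    PAssert.that(count).containsInAnyOrder(3L);

    p.run();
  }
}
```

With the direct (local) runner this runs in-process, so a failed assertion surfaces in your terminal; on a distributed runner the same assertion still works, which is what makes it more portable than println-style debugging.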
On Thu, May 19, 2016 at 9:49 PM Frances Perry <[email protected]> wrote:

> Oh, and I forgot the unified model! For unbounded collections, you can't
> ask to access the whole collection, given you'll be waiting forever ;-)
> Thomas has some work in progress on making PAssert support per-window
> assertions though, so it'll be able to handle that case in the future.
>
> On Thu, May 19, 2016 at 11:26 AM, Frances Perry <[email protected]> wrote:
>
>> Nope. A Beam pipeline goes through three different phases:
>> (1) Graph construction -- from Pipeline.create() until Pipeline.run().
>> During this phase PCollections are just placeholders -- their contents
>> don't exist.
>> (2) Submission -- Pipeline.run() submits the pipeline to a runner for
>> execution.
>> (3) Execution -- The runner runs the pipeline. May continue after
>> Pipeline.run() returns, depending on the runner. Some PCollections may
>> get materialized during execution, but others may not (for example, if
>> the runner chooses to fuse or optimize them away).
>>
>> So in other words, there's no place in your main program where a
>> PCollection is guaranteed to have its contents available.
>>
>> If you are attempting to debug a job, you might look into PAssert [1],
>> which gives you a place to inspect the entire contents of a PCollection
>> and do something with it. In the DirectPipelineRunner that code will run
>> locally, so you'll see the output. But of course, in other runners it
>> may unhelpfully print on some random worker somewhere. But you can also
>> write tests that assert on its contents, which will work on any runner.
>> (This is how the @RunnableOnService tests work that we are setting up
>> for all runners.)
>>
>> We should get some documentation on testing put together on the Beam
>> site, but in the meantime the info on the Dataflow site [2] roughly
>> applies (DataflowAssert was renamed PAssert in Beam).
>>
>> Hope that helps!
>>
>> [1] https://github.com/apache/incubator-beam/blob/eb682a80c4091dce5242ebc8272f43f3461a6fc5/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java
>> [2] https://cloud.google.com/dataflow/pipelines/testing-your-pipeline
>>
>> On Wed, May 18, 2016 at 10:48 AM, Jesse Anderson <[email protected]> wrote:
>>
>>> Is there a way to get the PCollection's results in the executing
>>> process? In Spark this is done using the collect method on an RDD.
>>>
>>> This would be for small amounts of data stored in a PCollection like:
>>>
>>> PCollection<Long> count = partition.apply(Count.globally());
>>> System.out.println(valuesincount);
>>>
>>> Thanks,
>>>
>>> Jesse
