It looks like the Flink and Spark runners have their own implementations of ConsoleIO, but there's no SDK-level version. That generally goes against the goal of having broadly useful IO connectors, with optional runner-specific overrides. Is there a generalized version we could add?
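
To make the idea concrete, here is a rough sketch of what an SDK-level version could look like -- the names and details are hypothetical (and exact method signatures vary by SDK version), just a plain DoFn that prints each element, with runners still free to substitute a native implementation:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

public class ConsoleIO {

  public static <T> Write<T> write() {
    return new Write<>();
  }

  public static class Write<T> extends PTransform<PCollection<T>, PDone> {
    @Override
    public PDone apply(PCollection<T> input) {
      input.apply("PrintToConsole", ParDo.of(new DoFn<T, Void>() {
        @Override
        public void processElement(ProcessContext c) {
          // On a distributed runner this prints on a worker, not in the
          // submitting process -- which is exactly why runner-specific
          // overrides (like the existing Flink/Spark ones) would still matter.
          System.out.println(c.element());
        }
      }));
      return PDone.in(input.getPipeline());
    }
  }
}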
On Thu, May 19, 2016 at 12:51 PM, Amit Sela <[email protected]> wrote:
> I'll join Frances and add that even Spark warns about using collect() as
> it might not fit into Driver memory, and like you said it's only for small
> datasets - but small is relative...
> If you want some "printout" during development you could use ConsoleIO (for
> streaming) - both the Flink and Spark runners support this (and Beam doesn't,
> AFAIK).
>
> On Thu, May 19, 2016 at 9:49 PM Frances Perry <[email protected]> wrote:
>
>> Oh, and I forgot the unified model! For unbounded collections, you can't
>> ask to access the whole collection, given you'll be waiting forever ;-)
>> Thomas has some work in progress on making PAssert support per-window
>> assertions though, so it'll be able to handle that case in the future.
>>
>> On Thu, May 19, 2016 at 11:26 AM, Frances Perry <[email protected]> wrote:
>>
>>> Nope. A Beam pipeline goes through three different phases:
>>> (1) Graph construction -- from Pipeline.create() until Pipeline.run().
>>> During this phase PCollections are just placeholders -- their contents don't
>>> exist.
>>> (2) Submission -- Pipeline.run() submits the pipeline to a runner for
>>> execution.
>>> (3) Execution -- The runner runs the pipeline. May continue after
>>> Pipeline.run() returns, depending on the runner. Some PCollections may get
>>> materialized during execution, but others may not (for example, if the
>>> runner chooses to fuse or optimize them away).
>>>
>>> So in other words, there's no place in your main program where a
>>> PCollection is guaranteed to have its contents available.
>>>
>>> If you are attempting to debug a job, you might look into PAssert [1],
>>> which gives you a place to inspect the entire contents of a PCollection and
>>> do something with it. In the DirectPipelineRunner that code will run
>>> locally, so you'll see the output. But of course, in other runners it may
>>> unhelpfully print on some random worker somewhere. But you can also write
>>> tests that assert on its contents, which will work on any runner. (This is
>>> how the @RunnableOnService tests work that we are setting up for all
>>> runners.)
>>>
>>> We should get some documentation on testing put together on the Beam
>>> site, but in the meantime the info on the Dataflow site [2] roughly applies
>>> (DataflowAssert was renamed PAssert in Beam).
>>>
>>> Hope that helps!
>>>
>>> [1]
>>> https://github.com/apache/incubator-beam/blob/eb682a80c4091dce5242ebc8272f43f3461a6fc5/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java
>>> [2] https://cloud.google.com/dataflow/pipelines/testing-your-pipeline
>>>
>>> On Wed, May 18, 2016 at 10:48 AM, Jesse Anderson <[email protected]>
>>> wrote:
>>>
>>>> Is there a way to get the PCollection's results in the executing
>>>> process? In Spark this is done using the collect method on an RDD.
>>>>
>>>> This would be for small amounts of data stored in a PCollection like:
>>>>
>>>> PCollection<Long> count = partition.apply(Count.globally());
>>>> System.out.println(count);
>>>>
>>>> Thanks,
>>>>
>>>> Jesse
>>>>
>>>
>>>
>>
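
P.S. To tie the PAssert suggestion above back to Jesse's snippet, a minimal test would look roughly like this -- the input data, expected value, and variable names are just illustrative:

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

// In a test method:
TestPipeline p = TestPipeline.create();
PCollection<String> partition = p.apply(Create.of("a", "b", "c"));
PCollection<Long> count = partition.apply(Count.<String>globally());

// Runs on any runner; the assertion executes wherever the data lives.
PAssert.thatSingleton(count).isEqualTo(3L);
p.run();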
