I'll join Frances and add that even Spark warns about using collect() as it might not fit into Driver memory, and like you said it's only for small datasets - but small is relative... If you want some "printout" during development you could use ConsoleIO (for streaming) - both the Flink and Spark runners support this (and Beam doesn't AFAIK).
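For completeness, here's a rough sketch of the PAssert approach Frances describes below. The class names follow the Beam Java SDK of that era and the pipeline contents are made up, so treat it as illustrative rather than exact:

```java
// Illustrative sketch - assumes the Apache Beam Java SDK is on the classpath.
// Instead of trying to collect() results in the driver, assert on the
// PCollection's contents; the check runs wherever the runner materializes it.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CountCheck {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // Hypothetical input; in practice this would be your real source.
    PCollection<String> words = p.apply(Create.of("a", "b", "c"));
    PCollection<Long> count = words.apply(Count.globally());

    // The assertion is evaluated during execution, on whichever worker
    // the runner happens to place it.
    PAssert.that(count).containsInAnyOrder(3L);

    p.run();
  }
}
```

With the direct (local) runner this runs in-process, so a failed assertion surfaces in your terminal; on a distributed runner the same assertion still works, which is what makes it more portable than println-style debugging.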
On Thu, May 19, 2016 at 9:49 PM Frances Perry <[email protected]> wrote:

> Oh, and I forgot the unified model! For unbounded collections, you can't
> ask to access the whole collection, given you'll be waiting forever ;-)
> Thomas has some work in progress on making PAssert support per-window
> assertions though, so it'll be able to handle that case in the future.
>
> On Thu, May 19, 2016 at 11:26 AM, Frances Perry <[email protected]> wrote:
>
>> Nope. A Beam pipeline goes through three different phases:
>> (1) Graph construction -- from Pipeline.create() until Pipeline.run().
>> During this phase PCollections are just placeholders -- their contents
>> don't exist.
>> (2) Submission -- Pipeline.run() submits the pipeline to a runner for
>> execution.
>> (3) Execution -- The runner runs the pipeline. May continue after
>> Pipeline.run() returns, depending on the runner. Some PCollections may
>> get materialized during execution, but others may not (for example, if
>> the runner chooses to fuse or optimize them away).
>>
>> So in other words, there's no place in your main program where a
>> PCollection is guaranteed to have its contents available.
>>
>> If you are attempting to debug a job, you might look into PAssert [1],
>> which gives you a place to inspect the entire contents of a PCollection
>> and do something with it. In the DirectPipelineRunner that code will run
>> locally, so you'll see the output. But of course, in other runners it
>> may unhelpfully print on some random worker somewhere. But you can also
>> write tests that assert on its contents, which will work on any runner.
>> (This is how the @RunnableOnService tests work that we are setting up
>> for all runners.)
>>
>> We should get some documentation on testing put together on the Beam
>> site, but in the meantime the info on the Dataflow site [2] roughly
>> applies (DataflowAssert was renamed PAssert in Beam).
>>
>> Hope that helps!
>>
>> [1] https://github.com/apache/incubator-beam/blob/eb682a80c4091dce5242ebc8272f43f3461a6fc5/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java
>> [2] https://cloud.google.com/dataflow/pipelines/testing-your-pipeline
>>
>> On Wed, May 18, 2016 at 10:48 AM, Jesse Anderson <[email protected]> wrote:
>>
>>> Is there a way to get the PCollection's results in the executing
>>> process? In Spark this is done using the collect method on an RDD.
>>>
>>> This would be for small amounts of data stored in a PCollection like:
>>>
>>> PCollection<Long> count = partition.apply(Count.globally());
>>> System.out.println(valuesincount);
>>>
>>> Thanks,
>>>
>>> Jesse
