@Amit, yes, that's the reason I was looking for one. Students often want a quick sanity check on their processing. The workaround for now is to write the collection out to a file and then open it.
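A rough sketch of that file-based workaround, using the incubator-era Beam Java API (TextIO.Write was later restyled as TextIO.write()); the partition and pipeline names, the element type, and the output path are all illustrative:

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;
    import org.apache.beam.sdk.values.PCollection;

    // Count the elements, format the result as text, and dump it to a file.
    PCollection<Long> count = partition.apply(Count.<String>globally());
    count
        .apply(MapElements.via(new SimpleFunction<Long, String>() {
          @Override
          public String apply(Long value) {
            return Long.toString(value);
          }
        }))
        // Writes shards like /tmp/debug/count-00000-of-00001 to whatever
        // filesystem the runner can reach.
        .apply(TextIO.Write.to("/tmp/debug/count"));
    pipeline.run();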
On Fri, May 20, 2016, 12:15 AM Amit Sela <[email protected]> wrote:

> For the Spark runner this will print to the driver stdout, which in some
> cases is the console you're running the job from (depending on the
> execution mode - standalone/yarn-client), and in others is the driver's
> stdout log (yarn-cluster; not sure about Mesos).
>
> Generally, if such a PTransform is considered for the SDK, I think it
> should print to stdout in a centralized location (in Spark that's the
> driver node), and it should be capped at N elements to avoid OOM. I can
> definitely see this being useful for development, even if it's mostly for
> IDE use, and I think this is what Jesse was looking for, because debugging
> a PCollection's content in the IDE is useless unless you apply the
> runner's implementation of "collect" to see what it contains.
>
> If we're looking for something to print the content of a PCollection's
> partitions on the nodes, this could simply be a composite transform that
> iterates over the first N elements in the partition and logs their values.
>
> On Fri, May 20, 2016 at 8:56 AM Aljoscha Krettek <[email protected]>
> wrote:
>
>> I removed ConsoleIO in my latest PR since it was specific to Flink. It
>> was just a wrapper for a DoFn that prints to System.out, so it would be
>> super easy to implement in core Beam. I don't know, however, if this is
>> useful, since for most runners it would print to some file on the worker
>> where the DoFn is running. It is only useful when running inside an IDE.
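A minimal sketch of the transform Amit and Aljoscha describe - a pass-through ParDo that prints the first N elements of each bundle - written against the incubator-era Java API (PTransform.apply was later renamed expand, and DoFn later moved to @ProcessElement annotations). PrintFirstN is an illustrative name, not an SDK transform:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    /** Passes its input through, printing at most {@code limit} elements per bundle. */
    public class PrintFirstN<T> extends PTransform<PCollection<T>, PCollection<T>> {
      private final long limit;

      public PrintFirstN(long limit) {
        this.limit = limit;
      }

      @Override
      public PCollection<T> apply(PCollection<T> input) {
        final long n = limit;  // copy to a local so the DoFn doesn't capture the transform
        return input.apply(ParDo.of(new DoFn<T, T>() {
          private long printed;

          @Override
          public void startBundle(Context c) {
            printed = 0;  // reset the cap for each bundle
          }

          @Override
          public void processElement(ProcessContext c) {
            if (printed++ < n) {
              // On the direct runner this reaches your console; on a
              // distributed runner it lands in the stdout log of whichever
              // worker runs the bundle, as discussed above.
              System.out.println(c.element());
            }
            c.output(c.element());  // pass the element through unchanged
          }
        }));
      }
    }

Applying it is then a one-liner, e.g. words.apply(new PrintFirstN<String>(20)), with the caveats above about where the output actually lands.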
>> On Fri, 20 May 2016 at 06:06 Frances Perry <[email protected]> wrote:
>>
>>> It looks like Flink and Spark have their own implementations of
>>> ConsoleIO, but there's no SDK-level version. Generally that goes against
>>> the goal of having generally useful IO connectors, with potential
>>> runner-specific overrides. Is there a generalized version we could add?
>>>
>>> On Thu, May 19, 2016 at 12:51 PM, Amit Sela <[email protected]>
>>> wrote:
>>>
>>>> I'll join Frances and add that even Spark warns about using collect(),
>>>> since the result might not fit into driver memory, and like you said
>>>> it's only for small datasets - but small is relative...
>>>> If you want some "printout" during development, you could use ConsoleIO
>>>> (for streaming) - both the Flink and Spark runners support it (and Beam
>>>> doesn't, AFAIK).
>>>>
>>>> On Thu, May 19, 2016 at 9:49 PM Frances Perry <[email protected]> wrote:
>>>>
>>>>> Oh, and I forgot the unified model! For unbounded collections, you
>>>>> can't ask to access the whole collection, since you'd be waiting
>>>>> forever ;-) Thomas has some work in progress on making PAssert support
>>>>> per-window assertions, though, so it'll be able to handle that case in
>>>>> the future.
>>>>>
>>>>> On Thu, May 19, 2016 at 11:26 AM, Frances Perry <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Nope. A Beam pipeline goes through three different phases:
>>>>>> (1) Graph construction -- from Pipeline.create() until
>>>>>> Pipeline.run(). During this phase PCollections are just placeholders
>>>>>> -- their contents don't exist.
>>>>>> (2) Submission -- Pipeline.run() submits the pipeline to a runner for
>>>>>> execution.
>>>>>> (3) Execution -- the runner runs the pipeline. Execution may continue
>>>>>> after Pipeline.run() returns, depending on the runner. Some
>>>>>> PCollections may get materialized during execution, but others may
>>>>>> not (for example, if the runner chooses to fuse or optimize them
>>>>>> away).
>>>>>>
>>>>>> So in other words, there's no place in your main program where a
>>>>>> PCollection is guaranteed to have its contents available.
>>>>>>
>>>>>> If you are attempting to debug a job, you might look into PAssert
>>>>>> [1], which gives you a place to inspect the entire contents of a
>>>>>> PCollection and do something with it. In the DirectPipelineRunner
>>>>>> that code will run locally, so you'll see the output. But of course,
>>>>>> in other runners it may "helpfully" print on some random worker
>>>>>> somewhere. But you can also write tests that assert on its contents,
>>>>>> which will work on any runner. (This is how the @RunnableOnService
>>>>>> tests we are setting up for all runners work.)
>>>>>>
>>>>>> We should get some documentation on testing put together on the Beam
>>>>>> site, but in the meantime the info on the Dataflow site [2] roughly
>>>>>> applies (DataflowAssert was renamed PAssert in Beam).
>>>>>>
>>>>>> Hope that helps!
>>>>>>
>>>>>> [1] https://github.com/apache/incubator-beam/blob/eb682a80c4091dce5242ebc8272f43f3461a6fc5/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java
>>>>>> [2] https://cloud.google.com/dataflow/pipelines/testing-your-pipeline
>>>>>>
>>>>>> On Wed, May 18, 2016 at 10:48 AM, Jesse Anderson <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Is there a way to get a PCollection's results in the executing
>>>>>>> process? In Spark this is done with the collect method on an RDD.
>>>>>>>
>>>>>>> This would be for small amounts of data stored in a PCollection,
>>>>>>> like:
>>>>>>>
>>>>>>> PCollection<Long> count = partition.apply(Count.globally());
>>>>>>> System.out.println(valuesincount);
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jesse
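Putting Frances's suggestion next to Jesse's example, here is a minimal sketch of asserting on the count with PAssert instead of printing it; the partition and pipeline names and the expected value are illustrative, and Count.<String>globally() assumes string elements:

    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<Long> count = partition.apply(Count.<String>globally());

    // Assert on the materialized contents instead of printing them; this
    // works on any runner, and failures surface when the pipeline runs.
    PAssert.thatSingleton(count).isEqualTo(42L);

    pipeline.run();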
