I agree with Ufuk that it depends on how much the two subgraphs, and also future subgraphs, overlap. It is conceivable that the user will reuse subgraphs of an already computed data sink after calling collect(). Then we would also have to re-execute parts of the dataflow graph. I guess we can easily find examples supporting both cases. But of course, once we have checkpointing, we can manually checkpoint the future branching points.
But I'm also leaning more towards 1).

On Mon, Jan 19, 2015 at 11:19 AM, Ufuk Celebi <u...@apache.org> wrote:

> I think this question depends on how much both subgraphs overlap? But in
> general, I agree that the first approach seems more desirable from the
> runtime view (multiple consumers at the branch point).
>
> On Mon, Jan 19, 2015 at 10:59 AM, Robert Metzger <rmetz...@apache.org>
> wrote:
>
> > I would also execute the sinks immediately. I think its a corner case
> > because the sinks are usually the last thing in a plan and all print() or
> > collect() statements are earlier in the plan.
> >
> > print() should go to the client command line, yes.
> >
> > On Mon, Jan 19, 2015 at 1:42 AM, Stephan Ewen <se...@apache.org> wrote:
> >
> > > Hi there!
> > >
> > > With the upcoming more interactive extensions to the API (operations
> > > that go back to the client from a program and need to be eagerly
> > > evaluated) we need to define how different actions should behave.
> > >
> > > Currently, nothing gets executed until the "env.execute()" call is
> > > made. That allows to produce multiple data sources at the same time,
> > > which is a good feature.
> > >
> > > For certain operations, like the "count()" and "collect()" functions
> > > added in https://github.com/apache/flink/pull/210 , we need to trigger
> > > execution immediately.
> > >
> > > The open question is, how should this behave in connection with already
> > > defined data sinks:
> > >
> > > 1) Should all yet defined data sinks be executed as well?
> > > 2) Should only that immediate operation be executed and the data sinks
> > > be pending till a call to "env.execute()"
> > >
> > > I am somewhat leaning towards the first option right now, because I
> > > think that executing them later may force re-execution of larger parts
> > > of the plan.
> > >
> > > In addition: I think that the "print()" commands should go to the
> > > client command line. In that sense, they would behave like
> > > "collect().foreach(print)"
> > >
> > >
> > > Greetings,
> > > Stephan
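
To make the re-execution argument concrete, here is a rough toy model of the two options from the quoted mail. It is only a sketch and not Flink code: ToyEnv, add_sink, run_sinks_on_collect and source_runs are hypothetical names made up for illustration, and the model assumes that consumers within one job share the upstream result while a later job has to recompute it (no checkpointing).

```python
class ToyEnv:
    """Hypothetical stand-in for an execution environment; not Flink's API."""

    def __init__(self, run_sinks_on_collect):
        # True models option 1), False models option 2).
        self.run_sinks_on_collect = run_sinks_on_collect
        self.pending_sinks = []  # output lists of sinks defined but not yet executed
        self.source_runs = 0     # how often the shared upstream part was computed

    def _source(self):
        # Stands in for the (expensive) part of the plan before the branch point.
        self.source_runs += 1
        return [1, 2, 3]

    def add_sink(self, out):
        # Defining a sink is lazy: nothing runs yet.
        self.pending_sinks.append(out)

    def _flush_sinks(self, data):
        for out in self.pending_sinks:
            out.extend(data)
        self.pending_sinks.clear()

    def collect(self):
        # Eager action: triggers a job right away.
        data = self._source()
        if self.run_sinks_on_collect:
            # Option 1): pending sinks are extra consumers in the same job.
            self._flush_sinks(data)
        return data

    def execute(self):
        # Runs whatever sinks are still pending, as a separate job.
        if self.pending_sinks:
            self._flush_sinks(self._source())


def run(option1):
    env = ToyEnv(run_sinks_on_collect=option1)
    sink = []
    env.add_sink(sink)
    result = env.collect()
    env.execute()
    return result, sink, env.source_runs
```

Under these assumptions, run(True) computes the upstream part once, while run(False) computes it twice, because the sink is only executed by the later env.execute() call. The toy does not cover the case from my first paragraph, where the user reuses a subgraph after collect(); there, both options would have to re-execute without checkpointing.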