Rethink the "always copy" policy for streaming topologies

2015-10-02 Thread Stephan Ewen
Hi all! Now that we are coming to the next release, I wanted to make sure we finalize the decision on that point, because it would be nice to not break the behavior of the system afterwards. Right now, when tasks are chained together, the system always copies the elements between different tasks in t
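To make the trade-off in this message concrete, here is a minimal sketch in plain Python (not the system's actual API) of handing an element from one chained task to the next, with and without a copy. The names `Record`, `run_chain`, and `mutating_downstream` are hypothetical and exist only for this illustration.

```python
import copy


class Record:
    """Hypothetical mutable element type flowing through the chain."""

    def __init__(self, value):
        self.value = value


def run_chain(element, downstream, copy_between_tasks):
    """Hand `element` from an upstream task to a chained downstream task.

    Returns the object the upstream task still holds after the hand-over.
    """
    # With the "always copy" policy the downstream task gets its own object;
    # without it, both tasks share the same instance.
    handed_over = copy.deepcopy(element) if copy_between_tasks else element
    downstream(handed_over)
    return element


def mutating_downstream(record):
    # A downstream function that modifies its input in place.
    record.value += 100


with_copy = run_chain(Record(1), mutating_downstream, copy_between_tasks=True)
without_copy = run_chain(Record(1), mutating_downstream, copy_between_tasks=False)

print(with_copy.value)     # 1: upstream object untouched
print(without_copy.value)  # 101: downstream mutation is visible upstream
```

Copying costs one deep copy per element per chained hand-over, which is the overhead the thread proposes to avoid by default. Skipping the copy is safe only as long as functions do not mutate or retain their inputs.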

Re: Rethink the "always copy" policy for streaming topologies

2015-10-24 Thread Gyula Fóra
Hey guys, Have we disabled the default input copying after all? I don't remember seeing a Jira or PR for this (maybe I just missed it). And if not, do we want this in the 0.10 release? Cheers, Gyula On Fri, Oct 2, 2015 at 7:57 PM, Till Rohrmann wrote: > Do we know what kind of impact the non-

Re: Rethink the "always copy" policy for streaming topologies

2015-10-24 Thread Stephan Ewen
I don't recall that the default policy was changed. If we change it, it would be a good idea to change it for 0.10, or at the latest for 1.0. One thing I realized is that to get predictable behavior with chaining, we should not do the special case parallelism 1 chaining (meaning shuffle operations get cha

Re: Rethink the "always copy" policy for streaming topologies

2015-10-02 Thread Matthias J. Sax
+1 for disabling copy by default. On 10/02/2015 05:53 PM, Stephan Ewen wrote: > Hi all! > > Now that we are coming to the next release, I wanted to make sure we > finalize the decision on that point, because it would be nice to not break > the behavior of system afterwards. > > Right now, when tas

Re: Rethink the "always copy" policy for streaming topologies

2015-10-02 Thread Martin Neumann
It seems like I'm one of the few people that run into the mutable elements trap on the Batch API from time to time. At the moment I always clone when I'm not 100% sure to avoid hunting the bugs later. So far I was happy to learn that this is not a problem in Streaming, but that's just me. When wor
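The "mutable elements trap" and the defensive cloning described here can be sketched as follows. This is a plain Python illustration under assumed names (`reusing_source`, `collect`), not the Batch API itself: the source hands out the same mutable object on every step, so any function that buffers elements without cloning ends up holding several references to one object.

```python
import copy


def reusing_source(values):
    """Yield the same list object each time, overwritten with new content."""
    holder = [None]
    for v in values:
        holder[0] = v
        yield holder


def collect(elements, clone):
    """Buffer all elements, optionally cloning each one first."""
    buffered = []
    for e in elements:
        buffered.append(copy.copy(e) if clone else e)
    return buffered


aliased = collect(reusing_source([1, 2, 3]), clone=False)
cloned = collect(reusing_source([1, 2, 3]), clone=True)

print(aliased)  # [[3], [3], [3]]: every entry is the same reused object
print(cloned)   # [[1], [2], [3]]: each entry was cloned before buffering
```

Cloning whenever in doubt avoids the bug at the cost of an extra copy per element, which is the habit this message describes.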

Re: Rethink the "always copy" policy for streaming topologies

2015-10-02 Thread Stephan Ewen
@Martin: I think you were a user of the Batch API before we made the non-reuse mode the default mode. By now, when you use a GroupReduceFunction or a MapPartitionFunction or so, you need not do any cloning or copying. All functions that receive groups will always get fresh elements. This chaining

Re: Rethink the "always copy" policy for streaming topologies

2015-10-02 Thread Maximilian Michels
+1 Good idea. I think we can save quite some CPU cycles by not copying records.
> That is basically the behavior of the batch API, and there has so far never been an issue with that (people running into the trap of overwritten mutable elements).
As far as I know, this is only the case for chai

Re: Rethink the "always copy" policy for streaming topologies

2015-10-02 Thread Till Rohrmann
Do we know what kind of impact the non-reuse policy has? Maybe the serialization overhead is subsumed by other effects. But in general I'm OK with changing the default to non-copying. We just have to document this feature properly. On Oct 2, 2015 6:31 PM, "Maximilian Michels" wrote: > +1 Good id