I don't recall that the default policy was changed. If we change it, it would be a good idea to change it for 0.10, or at the latest for 1.0.
One thing I realized is that to get predictable behavior with chaining, we should not special-case chaining at parallelism 1 (meaning shuffle operations get chained when both sender and receiver have parallelism 1). This causes different chaining behavior with different parallelism, which can be an easy source of confusion when debugging a program. Parallelism 1 with repartitioning operators is probably mostly a debug setup anyway.

On Sat, Oct 24, 2015 at 6:35 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hey guys,
>
> Have we disabled the default input copying after all? I don't remember
> seeing a Jira or PR for this (maybe I just missed it).
>
> And if not, do we want this in the 0.10 release?
>
> Cheers,
> Gyula
>
> On Fri, Oct 2, 2015 at 7:57 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>
> > Do we know what kind of impact the non-reuse policy has? Maybe the
> > serialization overhead is subsumed by other effects.
> >
> > But in general I'm ok with changing the default to non copying. We just
> > have to document this feature properly.
> >
> > On Oct 2, 2015 6:31 PM, "Maximilian Michels" <m...@apache.org> wrote:
> >
> > > +1 Good idea. I think we can save quite some CPU cycles by not copying
> > > records.
> > >
> > > > That is basically the behavior of the batch API, and there has so far
> > > > never been an issue with that (people running into the trap of
> > > > overwritten mutable elements).
> > >
> > > As far as I know, this is only the case for chained operators?
> > >
> > > On Fri, Oct 2, 2015 at 6:15 PM, Matthias J. Sax <mj...@apache.org> wrote:
> > >
> > > > +1 for disabling copying by default
> > > >
> > > > On 10/02/2015 05:53 PM, Stephan Ewen wrote:
> > > > > Hi all!
> > > > >
> > > > > Now that we are coming to the next release, I wanted to make sure we
> > > > > finalize the decision on that point, because it would be nice to not
> > > > > break the behavior of the system afterwards.
> > > > >
> > > > > Right now, when tasks are chained together, the system always copies
> > > > > the elements between different tasks in the same chain.
> > > > >
> > > > > I think this policy was established under the assumption that copies
> > > > > do not cost anything, given our own test examples, which mainly use
> > > > > immutable types like Strings, boxed primitives, ..
> > > > >
> > > > > In practice, a lot of data types are actually quite expensive to copy.
> > > > >
> > > > > For example, a rather common data type in the event analysis of
> > > > > web-sources is JSON Object.
> > > > > Flink treats this as a generic type. Depending on its concrete
> > > > > implementation, Kryo may have to perform a serialization copy, which
> > > > > means encoding into bytes (JSON encoding, charset encoding) and
> > > > > decoding again.
> > > > >
> > > > > This has a massive impact on the out-of-the-box performance of the
> > > > > system.
> > > > > Given that, I was wondering whether we should set the default policy
> > > > > to "not copying".
> > > > >
> > > > > That is basically the behavior of the batch API, and there has so far
> > > > > never been an issue with that (people running into the trap of
> > > > > overwritten mutable elements).
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Stephan
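
For readers following the thread, the trade-off between copying and not copying inside a chain can be illustrated with a minimal sketch. This is plain Python, not Flink code: `run_chain`, `Buffering`, `anonymize`, and `demo` are hypothetical names made up for the illustration, and `copy.deepcopy` only stands in for whatever copy the serializer would perform. It shows the "trap of overwritten mutable elements" mentioned above: an operator that holds on to a record sees it change when a later operator in the same chain mutates it in place.

```python
import copy


class Buffering:
    """Hypothetical operator that remembers every record it forwards."""

    def __init__(self):
        self.seen = []

    def __call__(self, record):
        self.seen.append(record)
        return record


def anonymize(record):
    """Hypothetical downstream operator that mutates its input in place."""
    record["user"] = "***"
    return record


def run_chain(records, operators, copy_between_operators):
    """Pass each record through all operators in a single thread.

    When copy_between_operators is True, a deep copy is made at every
    hand-over between two operators (a stand-in for a serializer copy).
    """
    outputs = []
    for record in records:
        value = record
        for index, operator in enumerate(operators):
            if copy_between_operators and index > 0:
                value = copy.deepcopy(value)
            value = operator(value)
        outputs.append(value)
    return outputs


def demo(copy_between_operators):
    """Return what the buffering operator holds after the chain has run."""
    buffering = Buffering()
    records = [{"user": "alice"}, {"user": "bob"}]
    run_chain(records, [buffering, anonymize], copy_between_operators)
    return [record["user"] for record in buffering.seen]


if __name__ == "__main__":
    print("with copying:   ", demo(True))
    print("without copying:", demo(False))
```

With copying, the buffered records keep their original values at the cost of one deep copy per hand-over. Without copying, the buffered records are the same objects the downstream operator overwrote, which is the behavior that would need to be documented if "not copying" becomes the default.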