Thanks Matthias.
Maybe I should clarify, that I do not want to change the guarantees for the
enableObjectReuse mode, but for the disableObjectReuse mode.
The rules for the enableObjectReuse mode should remain the same.


2016-02-18 9:37 GMT+01:00 Matthias J. Sax <mj...@apache.org>:

> Hi,
>
> I like Fabian's proposal. The idea of object reuse is performance gain,
> and we should not sacrifice this. Even more important is that the rules
> are easy to understand!
>
> -Matthias
>
>
> On 02/17/2016 06:17 PM, Fabian Hueske wrote:
> > Hi,
> >
> >
> >
> > Flink's DataSet API features a configuration parameter called
> > enableObjectReuse(). If activated, Flink's runtime will create fewer
> > objects which results in better performance and lower garbage collection
> > overhead. Depending on whether the configuration switch is enabled or
> not,
> > user functions may or may not perform certain operations on objects they
> > receive from Flink or emit to Flink.
> >
> >
> >
> > At the moment, there are quite a few open issues and discussions going on
> > about the object reuse mode, including the JIRA issues FLINK-3333,
> > FLINK-1521, FLINK-3335, FLINK-3340, FLINK-3394, and FLINK-3291.
> >
> >
> >
> > IMO, the most important issue is FLINK-3333 which is about improving the
> > documentation of the object reuse mode. The current version [1] is
> > ambiguous and includes details about operator chaining which are hard to
> > understand and to reason about for users. Hence it is not very clear
> which
> > guarantees Flink gives for objects in user functions under which
> > conditions. This documentation needs to be improved and I think this
> should
> > happen together with the 1.0 release.
> >
> >
> >
> > Greg and Gabor proposed two new versions:
> >
> > 1. Greg's version [2]  improves and clarifies the current documentation
> > without significantly changing the semantics. It also discusses operator
> > chaining, but gives more details.
> > 2. Gabor's proposal [3] aims to make the discussion of object reuse
> > independent of operator chaining which I think is a very good idea
> because
> > it is not transparent to the user when function chaining happens. Gabor
> > formulated four questions to answer what users can do with and expect
> from
> > objects that they received or emitted from a function. In order to make
> the
> > answers to these questions independent of function chaining and still
> keep
> > the contracts as defined by the current documentation, we have to default
> > to rather restrictive rules. For instance, functions must always emit new
> > object instances in case of disabled object reuse mode. These strict
> rules
> > would for example also require DataSourceFunctions to copy all records
> > which they receive from an InputFormat (see FLINK-3335). IMO, the strict
> > guarantees make the disableObjectReuse mode harder to use and reason
> about
> > than the enableObjectReuse mode whereas the opposite should be the case.
> >
> >
> >
> > I would like to suggest a third option. Similar as Gabor, I think the
> rules
> > should be independent of function chaining and I would like to break it
> > down into a handful of easy rules. However, I think we should loosen up
> the
> > guarantees for user functions under disableObjectReuse mode a bit.
> >
> > Right now, the documentation states that under enableObjectReuse mode,
> > input objects are not changed across functions calls. Hence users can
> > remember these objects across functions calls and their value will not
> > change. I propose to give this guarantee only within functions calls and
> > only for objects which are not emitted. Hence, this rule only applies for
> > functions that can consume multiple values through an iterator such as
> > GroupReduce, CoGroup, or MapPartition. In object disableObjectReuse mode,
> > these functions are allowed to remember the values e.g., in a collection,
> > and their value will not change when the iterator is forwarded. Once the
> > function call returns, the values might change. Since  functions with
> > iterators cannot be directly chained, it will be safe to emit the same
> > object instance several times (hence FLINK-3335 would become invalid).
> >
> >
> >
> > The difference to the current guarantees is that input objects become
> > invalid after the function call returned. Since, the disableObjectReuse
> > mode was mainly introduced to allow for caching objects across iterator
> > calls within a GroupReduceFunction or CoGroupFunction (not across
> function
> > calls), I think this is a reasonable restriction.
> >
> >
> >
> > tl;dr;
> >
> > If we want to make the documentation of object reuse independent of
> > chaining we have to
> >
> > - EITHER, give tighter guarantees / be more restrictive than now and
> update
> > internals which might lead to performance regression. This would be
> in-line
> > with the current documentation but somewhat defeat the purpose of the
> > disabledObjectReuse mode, IMO.
> >
> > - OR, give weaker guarantees, which breaks with the current
> documentation,
> > but would not affect performance or be easier to follow for users, IMO.
> >
> >
> > Greg and Gabor, please correct me if I did not get your points right or
> > missed something.
> >
> > What do others think?
> >
> >
> > Fabian
> >
> >
> >
> > [1]
> >
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#object-reuse-behavior
> >
> > [2]
> >
> https://issues.apache.org/jira/browse/FLINK-3333?focusedCommentId=15139151
> >
> > [3]
> >
> https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg
> >
>
>

Reply via email to