Hi Martin,

Thank you for the feedback. Let me try to answer some of your concerns.

On 9 February 2016 at 15:35, Martin Neumann <mneum...@sics.se> wrote:

> During this year's FOSDEM Martin Junghans and I sat together and gathered
> some feedback for the Flink project. It is based on our personal experience
> as well as the feedback and questions from people we taught the system.
> This is going to be a longer email, therefore I have split things into
> categories:
>
> *Website and Documentation:*
>
> 1. *Out-dated Google Search results*: Google searches lead to outdated
>    web site versions (e.g. “flink transformations” or “flink iterations”
>    return the 0.7 version of the corresponding pages).

I'm not sure we can do much about this. I would suggest searching in the
documentation instead of relying on Google. There is a search box at the top
of every documentation page.

> 2. *Invalid Links on Website:* Links are confusing / broken (e.g. the
>    Gelly / ML links on the start page lead to the top of the feature page
>    (which starts with streaming)) *-> maybe this can be validated
>    automatically?*

That was a bug that was recently reported and fixed (see FLINK-3316). If you
find more of those, please report them by opening a JIRA issue or a pull
request.

> *Batch API:*
>
> 1. *.reduceGroup(GroupReduceFunction) and
>    .groupCombine(CombineGroupFunction):* In other functions such as
>    .flatMap(FlatMapFunction) the function call matches the naming of the
>    operator. This structure is quite convenient for new users since they
>    can make use of the autocompletion features of the IDE: basically start
>    typing the function call and you get the correct class. This does not
>    work for .reduceGroup() and .groupCombine() since the names are switched
>    around. *-> maybe the function can be renamed*

I agree this might be strange for new users, but I think it would be much
more annoying for existing users if we changed this. In my view, the issue is
not important enough to justify breaking the API.

> 2. *.print() and env.execute():* Often .print() is used for debugging and
>    developing programs, replacing regular data sinks. Such a project will
>    not run until the env.execute() is removed. It's very easy to forget to
>    add it back in once you change the .print() back to a proper sink. The
>    project will now compile fine but will not produce any output since
>    .execute() is missing. This is a very difficult bug to find, especially
>    since there is no warning or error when running the job. It’s common
>    that people use more than one .print() statement during debugging and
>    development. This can lead to confusion since each .print() forces the
>    program to execute, so the execution behavior is different than without
>    the print. This is especially important if the program contains
>    non-deterministic data generation (like generating IDs). In the stream
>    API .print() does not require removing .execute(); as a result the
>    behavior of the two interfaces is inconsistent.

This is indeed an issue that many users find hard to get used to. We have
changed the behavior of print() a couple of times before and I'm not sure it
would be wise to do so again. Actually, once a user understands the
difference between eager and lazy sinks, I think it's quite easy to avoid
mistakes.

> 3. *calling new when applying an operator, e.g. .reduceGroup(new
>    GroupReduceFunction()):* Some of the people I taught the APIs to were
>    confused by this. They knew it was a distributed system and they were
>    wondering where the constructor would actually be called. They expected
>    to hand a class to the function that would be initialized on each of
>    the worker nodes. *-> maybe have a section about this in the
>    documentation*

I'm not sure I understand the confusion with this one. The goal of high-level
APIs is to relieve users of having to think about distribution. The only
things they need to understand are the DataSet/DataStream abstractions and how
to create transformations on them.

> 4. *.project() loses type information / does not support .returns(..):*
>    The project transformation currently loses type information, which
>    affects chained calls with other transformations. One workaround is the
>    definition of an intermediate dataset. However, to be consistent with
>    other operators, project should support .returns() to define type
>    information if needed.

I'm not sure _why_ this is the case. Maybe someone who knows more can clarify
this one.

> *Stream API:*
>
> 1. *.keyBy():* Currently .keyBy() creates a KeyedDataStream but every
>    operator that consumes a KeyedDataStream produces a DataStream. This
>    means it is not possible to create a program that uses a keyBy()
>    followed by a sequence of transformations for each key without having
>    to reapply keyBy() after each of those operators. (This was a common
>    problem in my work for Ericsson and Spotify.)

I might be missing something here, but if you want to apply a transformation
on a keyed stream without changing the keys, isn't a map transformation
enough? Can you give an example of a case where you had this problem?

> 2. *split() operator with multiple output types:* It's common to have to
>    split a single stream into different streams. For example, a stream
>    containing different system events might need to be broken into a
>    stream for each type. The current split() operator requires all outputs
>    to have the same data type. In cases where there are no direct type
>    hierarchies the user needs to implement a wrapper type to make use of
>    this function. An operator similar to split that allows output streams
>    to have different types would greatly simplify those use cases.
>
> cheers Martin

Cheers,
-Vasia.
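
P.S. On the print()/execute() point, here is a rough sketch of what I mean by
eager and lazy sinks. It is a toy model in plain Java and not Flink code:
ToyEnv, addLazySink and the exception message are hypothetical names made up
for illustration, and the real behavior in Flink may differ in its details.
A lazy sink is only registered and runs when execute() is called, while an
eager sink such as print() triggers execution by itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: mimics lazy vs. eager sinks, not Flink's real classes.
class ToyEnv {
    private final List<Runnable> pendingSinks = new ArrayList<>();
    int executions = 0; // how many times the "job" has run

    // Lazy sink: only registered here; it runs when execute() is called.
    void addLazySink(List<String> data, List<String> target) {
        pendingSinks.add(() -> target.addAll(data));
    }

    // Eager sink: registers itself and triggers execution immediately.
    void print(List<String> data) {
        pendingSinks.add(() -> data.forEach(System.out::println));
        execute();
    }

    // Runs all pending sinks; complains if there is nothing new to run.
    void execute() {
        if (pendingSinks.isEmpty()) {
            throw new IllegalStateException("no new sinks defined since the last execution");
        }
        pendingSinks.forEach(Runnable::run);
        pendingSinks.clear();
        executions++;
    }
}

class SinkDemo {
    public static void main(String[] args) {
        ToyEnv env = new ToyEnv();
        List<String> data = Arrays.asList("a", "b");

        // Debugging style: the eager sink runs right away, no execute() needed.
        env.print(data);

        // Production style: the lazy sink does nothing until execute() is called.
        List<String> file = new ArrayList<>();
        env.addLazySink(data, file);
        System.out.println("before execute(): " + file.size() + " records written");
        env.execute();
        System.out.println("after execute(): " + file.size() + " records written");
    }
}
```

In this model, forgetting execute() after switching from print() to a lazy
sink leaves the sink registered but never run, which is the silent "no
output" situation you describe.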