During this year's FOSDEM, Martin Junghans and I sat together and gathered some feedback for the Flink project. It is based on our personal experience as well as on the feedback and questions from people we taught the system to. This is going to be a longer email, so I have split things into categories:

*Website and Documentation:*

1. *Outdated Google search results:* Google searches lead to outdated versions of the website (e.g. “flink transformations” or “flink iterations” return the 0.7 version of the corresponding pages).
2. *Invalid links on the website:* Some links are confusing or broken (e.g. the Gelly/ML links on the start page lead to the top of the feature page, which starts with streaming). *-> maybe this can be validated automatically?*

*Batch API:*

1. *.reduceGroup(GroupReduceFunction) and .groupCombine(CombineGroupFunction):* For other operators, such as .flatMap(FlatMapFunction), the method name matches the name of the function class. This is quite convenient for new users, since they can rely on the autocompletion of the IDE: start typing the method name and you get the correct class. This does not work for .reduceGroup() and .groupCombine(), since the two words are switched around. *-> maybe the functions can be renamed*
2. *.print() and env.execute():* Often .print() is used for debugging and developing programs, replacing the regular data sinks. Such a program will not run until env.execute() is removed. It is very easy to forget to add it back once you change the .print() back to a proper sink. The program then compiles fine but produces no output, since .execute() is missing. This is a very difficult bug to find, especially since there is no warning or error when running the job. It is also common to use more than one .print() statement during debugging and development. This can lead to confusion, since each .print() forces the program to execute, so the execution behavior differs from the same program without the prints. This is especially important if the program contains non-deterministic data generation (like generating IDs). In the stream API, .print() does not require removing .execute(), so the behavior of the two interfaces is inconsistent.
3. *Calling new when applying an operator, e.g. .reduceGroup(new GroupReduceFunction()):* Some of the people I taught the APIs to were confused by this. They knew it was a distributed system, and they were wondering where the constructor would actually be called. They expected to hand a class to the operator, which would then be instantiated on each of the worker nodes. *-> maybe have a section about this in the documentation*
4. *.project() loses type information / does not support .returns(..):* The project transformation currently loses type information, which affects chained calls with other transformations. One workaround is to define an intermediate dataset. However, to be consistent with other operators, project should support .returns() to supply type information if needed.

*Stream API:*

1. *.keyBy():* Currently .keyBy() creates a KeyedDataStream, but every operator that consumes a KeyedDataStream produces a DataStream. This means it is not possible to write a program that uses keyBy() followed by a sequence of transformations for each key without reapplying keyBy() after each of those operators. (This was a common problem in my work for Ericsson and Spotify.)
2. *split() operator with multiple output types:* It is common to have to split a single stream into several different streams. For example, a stream containing different system events might need to be broken into one stream per event type. The current split() operator requires all outputs to have the same data type. In cases where there is no direct type hierarchy, the user needs to implement a wrapper type to make use of this function. An operator similar to split that allows output streams to have different types would greatly simplify those use cases.

Cheers,
Martin
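
PS, as a rough illustration of Batch API point 3 (where the constructor is actually called): below is a minimal plain-Java sketch, not Flink code. It assumes the function object is created once in the client program and then shipped to the workers as a serialized copy. The names ConstructorDemo, MyFunction and ship are hypothetical, and ship only imitates the transfer with standard Java serialization inside a single JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ConstructorDemo {

    // Counts how often the MyFunction constructor runs in this JVM.
    static int constructorCalls = 0;

    // Hypothetical stand-in for a user function such as a GroupReduceFunction.
    static class MyFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        final String prefix;

        MyFunction(String prefix) {
            constructorCalls++;
            this.prefix = prefix;
        }

        String apply(String value) {
            return prefix + value;
        }
    }

    // Imitates shipping to a worker: serialize to bytes, then deserialize a copy.
    static MyFunction ship(MyFunction function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (MyFunction) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        MyFunction original = new MyFunction("hello "); // constructor runs here, on the "client"
        MyFunction copy = ship(original);               // the "worker" receives a deserialized copy

        System.out.println(copy.apply("flink")); // prints "hello flink"
        System.out.println(constructorCalls);    // prints 1
        System.out.println(copy != original);    // prints true
    }
}
```

The copy carries the state that was set in the constructor (here the prefix), but the constructor itself is not invoked again during deserialization. Under the assumption above, that is why `new` appears in the client program even though the function later runs on the workers.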