During this year's FOSDEM, Martin Junghans and I sat together and gathered
some feedback for the Flink project. It is based on our personal experience
as well as the feedback and questions from people we taught the system to.
This is going to be a longer email, so I have split things into
categories:


*Website and Documentation:*

   1. *Outdated Google search results*: Google searches lead to outdated
   versions of the website (e.g. “flink transformations” or “flink iterations”
   return the 0.7 version of the corresponding pages).
   2. *Invalid links on the website: *Links are confusing or broken (e.g. the
   Gelly / ML links on the start page lead to the top of the feature page,
   which starts with streaming). *-> maybe this can be validated
   automatically?*


*Batch API:*

   1. *.reduceGroup(GroupReduceFunction) and
   .groupCombine(CombineGroupFunction): *In other functions such as
   .flatMap(FlatMapFunction) the method name matches the name of the
   function class. This is quite convenient for new users since they can
   rely on the IDE's autocompletion: start typing the method name and you
   get the correct class. This does not work for .reduceGroup() and
   .groupCombine() since the word order is switched around. *-> maybe the
   functions can be renamed*
   2. *.print() and env.execute(): *Often .print() is used for debugging
   and developing programs, replacing regular data sinks. Such a program will
   not run until the env.execute() call is removed. It's very easy to forget
   to add it back in once you change the .print() back to a proper sink. The
   program will then compile fine but will not produce any output since
   .execute() is missing. This is a very difficult bug to find, especially
   since there is no warning or error when running the job. It's also common
   that people use more than one .print() statement during debugging and
   development. This can lead to confusion since each .print() forces the
   program to execute, so the execution behavior differs from a run without
   the prints. This is especially important if the program contains
   non-deterministic data generation (like generating IDs). In the stream
   API, .print() does not require removing .execute(), so the behavior of
   the two interfaces is inconsistent.
   3. *Calling new when applying an operator, e.g. .reduceGroup(new
   GroupReduceFunction()): *Some of the people I taught the APIs to were
   confused by this. They knew it was a distributed system and they were
   wondering where the constructor would actually be called. They expected to
   hand a class to the function that would be instantiated on each of the
   worker nodes. *-> maybe have a section about this in the documentation*
   4. *.project() loses type information / does not support .returns(..): *The
   project transformation currently loses type information, which affects
   chained calls with other transformations. One workaround is to define
   an intermediate dataset. However, to be consistent with other operators,
   project should support .returns() to specify the type information if needed.
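
To illustrate the question in item 3 (where does the constructor run?), here
is a minimal plain-Java sketch with no Flink dependency. It assumes the usual
pattern for serializable function objects: the object is constructed once in
the client program, turned into bytes, and rebuilt from those bytes on the
receiving side. The names ShipFunctionSketch, Prefixer and roundTrip are
hypothetical and only stand in for a user function and the shipping step;
this is not Flink's actual code path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ShipFunctionSketch {

    // Counts constructor calls in this JVM (hypothetical helper for the demo).
    static int constructorCalls = 0;

    // Stand-in for a user function; hypothetical, not a Flink class.
    static class Prefixer implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String prefix;

        Prefixer(String prefix) {
            constructorCalls++;
            this.prefix = prefix;
        }

        String apply(String value) {
            return prefix + value;
        }
    }

    // Serializes the object ("client" side) and rebuilds it from the bytes
    // ("worker" side). Both halves run in one JVM here for simplicity.
    static Object roundTrip(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Prefixer onClient = new Prefixer("id-");          // constructor runs here
        Prefixer onWorker = (Prefixer) roundTrip(onClient);
        System.out.println(onWorker.apply("42"));         // prints "id-42"
        System.out.println(constructorCalls);             // prints "1"
    }
}
```

Java deserialization does not call the constructor of a Serializable class
again, so the counter stays at 1 while the field value still arrives in the
copy. Under the assumption above, that is the behavior a documentation section
on this topic would need to explain.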


*Stream API:*

   1. *.keyBy(): *Currently .keyBy() creates a KeyedDataStream but every
   operator that consumes a KeyedDataStream produces a DataStream. This means
   it is not possible to create a program that uses a keyBy() followed by a
   sequence of transformations for each key without having to reapply keyBy()
   after each of those operators. (This was a common problem in my work for
   Ericsson and Spotify.)
   2. *split() operator with multiple output types: *It's common to have to
   split a single stream into several different streams. For example, a stream
   containing different system events might need to be broken into one stream
   for each type. The current split() operator requires all outputs to have
   the same data type. In cases where there is no direct type hierarchy, the
   user needs to implement a wrapper type to make use of this function. An
   operator similar to split that allows output streams to have different
   types would greatly simplify those use cases.
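
The wrapper-type workaround from item 2 can be sketched in plain Java,
without any Flink dependency. Plain lists stand in for streams, and all class
and method names (SplitWrapperSketch, LoginEvent, ErrorEvent, SystemEvent,
logins, errors) are hypothetical. The sketch only shows how much boilerplate
the user has to write when the event types share no common supertype.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitWrapperSketch {

    // Two unrelated event types (hypothetical examples).
    static final class LoginEvent {
        final String user;
        LoginEvent(String user) { this.user = user; }
    }

    static final class ErrorEvent {
        final int code;
        ErrorEvent(int code) { this.code = code; }
    }

    // Wrapper so both types fit into one stream: exactly one field is non-null.
    static final class SystemEvent {
        final LoginEvent login;
        final ErrorEvent error;

        private SystemEvent(LoginEvent login, ErrorEvent error) {
            this.login = login;
            this.error = error;
        }

        static SystemEvent of(LoginEvent e) { return new SystemEvent(e, null); }
        static SystemEvent of(ErrorEvent e) { return new SystemEvent(null, e); }
    }

    // After splitting on the wrapper, each side still has to be unwrapped.
    static List<LoginEvent> logins(List<SystemEvent> events) {
        List<LoginEvent> result = new ArrayList<>();
        for (SystemEvent e : events) {
            if (e.login != null) {
                result.add(e.login);
            }
        }
        return result;
    }

    static List<ErrorEvent> errors(List<SystemEvent> events) {
        List<ErrorEvent> result = new ArrayList<>();
        for (SystemEvent e : events) {
            if (e.error != null) {
                result.add(e.error);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<SystemEvent> mixed = Arrays.asList(
                SystemEvent.of(new LoginEvent("alice")),
                SystemEvent.of(new ErrorEvent(500)),
                SystemEvent.of(new LoginEvent("bob")));
        System.out.println(logins(mixed).size()); // prints "2"
        System.out.println(errors(mixed).size()); // prints "1"
    }
}
```

An operator that emits differently typed outputs directly would make both the
wrapper class and the unwrapping step unnecessary.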


cheers Martin
