Hi Martin,

thank you for the feedback. Let me try to answer some of your concerns.


On 9 February 2016 at 15:35, Martin Neumann <mneum...@sics.se> wrote:

> During this year's FOSDEM Martin Junghans and I sat together and gathered
> some feedback for the Flink project. It is based on our personal experience
> as well as the feedback and questions from people we taught the system.
> This is going to be a longer email, therefore I have split things into
> categories:
>
>
> *Website and Documentation:*
>
>    1. *Out-dated Google Search results*: Google searches lead to outdated
>    web site versions (e.g. “flink transformations” or “flink iterations”
>    return the 0.7 version of the corresponding pages).
>

I'm not sure we can do much about this. I would suggest searching in the
documentation instead of relying on Google.
There is a search box at the top of all documentation pages.



>    2. *Invalid Links on Website: *Links are confusing / broken (e.g. the
>    Gelly / ML links on the start page lead to the top of the feature page,
>    which starts with streaming). *-> maybe this can be validated
>    automatically?*
>
>
That was a bug that was recently reported and fixed (see FLINK-3316). If you
find more of those, please report them by opening a JIRA issue or a pull
request.



>
> *Batch API:*
>
>    1. *.reduceGroup(GroupReduceFunction) and
>    .groupCombine(CombineGroupFunction): *In other functions such as
>    .flatMap(FlatMapFunction) the function call matches the naming of the
>    operator. This structure is quite convenient for new users since they
>    can make use of the autocompletion features of the IDE: basically start
>    typing the function call and you get the correct class. This does not
>    work for .reduceGroup() and .groupCombine() since the names are switched
>    around. *-> maybe the function can be renamed*
>

I agree this might be strange for new users, but I think it will be much
more annoying for existing users if we change this. In my view, this case is
not important enough to justify breaking the API.



>    2. *.print() and env.execute(): *Often .print() is used for debugging
>    and developing programs, replacing regular data sinks. Such a project
>    will not run until the env.execute() is removed. It's very easy to
>    forget to add it back in once you change the .print() back to a proper
>    sink. The project will now compile fine but will not produce any output
>    since .execute() is missing. This is a very difficult bug to find,
>    especially since there is no warning or error when running the job.
>    It’s common that people use more than one .print() statement during
>    debugging and development. This can lead to confusion since each
>    .print() forces the program to execute, so the execution behavior is
>    different than without the print. This is especially important if the
>    program contains non-deterministic data generation (like generating
>    IDs). In the stream API, .print() does not require removing .execute();
>    as a result, the behavior of the two interfaces is inconsistent.
>

This is indeed an issue that many users find hard to get used to. We have
changed the behavior of print() a couple of times before and I'm not sure
it would be wise to do so again. Actually, once a user understands the
difference between eager and lazy sinks, I think it's quite easy to avoid
mistakes.
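
To make the eager/lazy distinction concrete, here is a toy model in plain
Java. It is only a sketch and not Flink code: ToyEnv, addLazySink() and
jobsRun are hypothetical names, and the model assumes the behavior described
in this thread (a lazy sink waits for execute(), while print() triggers an
execution by itself).

```java
import java.util.ArrayList;
import java.util.List;

class LazyVsEagerDemo {

    // Toy stand-in for an execution environment; not a Flink class.
    static class ToyEnv {
        private final List<String> pendingSinks = new ArrayList<>();
        int jobsRun = 0;

        // Lazy sink: only registered, nothing runs until execute().
        void addLazySink(String name) {
            pendingSinks.add(name);
        }

        // Eager sink: registers itself and triggers execution immediately.
        void print(String name) {
            pendingSinks.add(name);
            run();
        }

        // Runs all pending sinks; complains if there is nothing to run.
        void execute() {
            if (pendingSinks.isEmpty()) {
                throw new IllegalStateException("no new sinks since last execution");
            }
            run();
        }

        private void run() {
            jobsRun++;
            pendingSinks.clear();
        }
    }

    public static void main(String[] args) {
        ToyEnv forgotten = new ToyEnv();
        forgotten.addLazySink("file");          // execute() is never called
        System.out.println(forgotten.jobsRun);  // prints 0

        ToyEnv debugging = new ToyEnv();
        debugging.print("a");                   // first execution
        debugging.print("b");                   // second execution
        System.out.println(debugging.jobsRun);  // prints 2
    }
}
```

In this model, the forgotten execute() shows up as zero runs without any
error, and two print() calls show up as two separate runs, which are the two
surprises described in the quoted point.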



>    3. *calling new when applying an operator, e.g. .reduceGroup(new
>    GroupReduceFunction()): *Some of the people I taught the APIs to were
>    confused by this. They knew it was a distributed system and they were
>    wondering where the constructor would actually be called. They expected
>    to hand a class to the function that would be initialized on each of the
>    worker nodes. *-> maybe have a section about this in the documentation*
>

I'm not sure I understand the confusion with this one. The goal of
high-level APIs is to relieve users of having to think about
distribution. The only things they need to understand are the
DataSet/DataStream abstractions and how to create transformations on them.
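
For anyone still wondering where the constructor runs: as far as I know, the
function object is created once in the client program, then serialized and
shipped to the workers, where it is deserialized. The plain-Java sketch below
imitates that round trip. PrefixFunction and ship() are hypothetical names
for illustration only, not Flink classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ShipFunctionDemo {

    // Hypothetical stand-in for a user function such as a GroupReduceFunction.
    static class PrefixFunction implements Serializable {
        private static final long serialVersionUID = 1L;

        // Static, so it is not part of the serialized state.
        static int constructorCalls = 0;

        private final String prefix;

        PrefixFunction(String prefix) {
            constructorCalls++;
            this.prefix = prefix;
        }

        String apply(String value) {
            return prefix + value;
        }
    }

    // Imitates client -> worker shipping with a serialization round trip.
    static PrefixFunction ship(PrefixFunction function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PrefixFunction) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        PrefixFunction onClient = new PrefixFunction("id-"); // constructor runs here
        PrefixFunction onWorker = ship(onClient);            // no constructor call
        System.out.println(onWorker.apply("42"));            // prints id-42
        System.out.println(PrefixFunction.constructorCalls); // prints 1
    }
}
```

The copy keeps the state set in the constructor, but the constructor itself
is not called again during deserialization. If per-worker initialization is
needed, the rich function variants offer an open() method for that.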


>    4. *.project() loses type information / does not support .returns(..):
>    *The project transformation currently loses type information, which
>    affects chained calls with other transformations. One workaround is the
>    definition of an intermediate dataset. However, to be consistent with
>    other operators, project should support .returns() to define the type
>    information if needed.
>
>
I'm not sure _why_ this is the case. Maybe someone who knows more can
clarify this one.



>
> *Stream API:*
>
>    1. *.keyBy(): *Currently .keyBy() creates a KeyedDataStream but every
>    operator that consumes a KeyedDataStream produces a DataStream. This
>    means it is not possible to create a program that uses a keyBy()
>    followed by a sequence of transformations for each key without having to
>    reapply keyBy() after each of those operators. (This was a common
>    problem in my work for Ericsson and Spotify.)
>

I might be missing something here, but if you want to apply a
transformation on a keyed stream without changing the keys, isn't a map
transformation enough? Can you give an example of a case where you had
this problem?



>    2. *split() operator with multiple output types: *It's common to have to
>    split a single stream into different streams. For example, a stream
>    containing different system events might need to be broken into a stream
>    for each type. The current split() operator requires all outputs to have
>    the same data type. In cases where there are no direct type hierarchies
>    the user needs to implement a wrapper type to make use of this function.
>    An operator similar to split that allows output streams to have
>    different types would greatly simplify those use cases.
>
>
> cheers Martin
>

Cheers,
-Vasia.
