I can elaborate on the project(...) method:

".returns()" is there to supply TypeInformation in cases where the system
cannot determine it. In the case of "project()", the system can perfectly
determine the output type info from the input and the projection.

To just get a typed result, I would use Java's generic method syntax,
which lets you avoid defining an intermediate variable:

DataSet<Tuple3<Long, String, Integer>> input = ...;

Tuple2<Long, Integer> aTuple =
    input.<Tuple2<Long, Integer>>project(0, 2).collect().get(0);
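
For anyone unfamiliar with that syntax, here is a minimal plain-Java sketch
(no Flink involved) of why the explicit type argument helps in a chained
call. The class TypeWitnessDemo and its project() helper are hypothetical
stand-ins for illustration only, not Flink's API; the sketch assumes a
generic method whose type parameter appears only in the return type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypeWitnessDemo {

    // Hypothetical stand-in for a projection: R occurs only in the return
    // type, so the compiler needs a target type or an explicit type argument.
    @SuppressWarnings("unchecked")
    static <R> List<R> project(List<List<Object>> rows, int field) {
        List<R> out = new ArrayList<>();
        for (List<Object> row : rows) {
            out.add((R) row.get(field)); // unchecked cast, the caller picks R
        }
        return out;
    }

    static int firstAge() {
        List<List<Object>> rows = new ArrayList<>();
        rows.add(Arrays.<Object>asList(1L, "alice", 30));
        rows.add(Arrays.<Object>asList(2L, "bob", 25));

        // Inside a chain there is no target type, so without <Integer> the
        // compiler would infer R as Object and .intValue() would not compile.
        return TypeWitnessDemo.<Integer>project(rows, 2).get(0).intValue();
    }

    public static void main(String[] args) {
        System.out.println(firstAge());
    }
}
```

The alternative is to assign the result of project() to a typed variable
first, which is the intermediate variable mentioned above.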


Greetings,
Stephan


On Tue, Feb 9, 2016 at 7:54 PM, Vasiliki Kalavri <vasilikikala...@gmail.com>
wrote:

> Hi Martin,
>
> thank you for the feedback. Let me try to answer some of your concerns.
>
>
> On 9 February 2016 at 15:35, Martin Neumann <mneum...@sics.se> wrote:
>
> > During this year's FOSDEM Martin Junghans and I sat together and gathered
> > some feedback for the Flink project. It is based on our personal
> > experience as well as the feedback and questions from people we taught
> > the system to. This is going to be a longer email, therefore I have split
> > things into categories:
> >
> >
> > *Website and Documentation:*
> >
> >    1. *Out-dated Google Search results*: Google searches lead to outdated
> >    web site versions (e.g. “flink transformations” or “flink iterations”
> >    return the 0.7 version of the corresponding pages).
> >
>
> I'm not sure we can do much about this. I would suggest searching in the
> documentation instead of relying on Google.
> There is a search box on the top of all documentation pages.
>
>
>
> >    2. *Invalid Links on Website: *Links are confusing / broken (e.g. the
> >    Gelly / ML links on the start page lead to the top of the feature page,
> >    which starts with streaming). *-> maybe this can be validated
> >    automatically?*
> >
> >
> That was a bug that was recently reported and fixed (see FLINK-3316). If you
> find more of those, please report them by opening a JIRA or Pull Request.
>
>
>
> >
> > *Batch API:*
> >
> >    1. *.reduceGroup(GroupReduceFunction) and
> >    .groupCombine(CombineGroupFunction): *In other functions such as
> >    .flatMap(FlatMapFunction) the function call matches the naming of the
> >    operator. This structure is quite convenient for new users since they
> >    can make use of the autocompletion features of the IDE: basically start
> >    typing the function call and you get the correct class. This does not
> >    work for .reduceGroup() and .groupCombine() since the names are switched
> >    around. *-> maybe the function can be renamed*
> >
>
> I agree this might be strange for new users, but I think it will be much
> more annoying for existing users if we change this. In my view, this case is
> not important enough to justify breaking the API.
>
>
>
> >    2. *.print() and env.execute(): *Often .print() is used for debugging
> >    and developing programs, replacing regular data sinks. Such a project
> >    will not run until the env.execute() is removed. It's very easy to
> >    forget to add it back in once you change the .print() back to a proper
> >    sink. The project will now compile fine but will not produce any output
> >    since .execute() is missing. This is a very difficult bug to find,
> >    especially since there is no warning or error when running the job.
> >    It’s common that people use more than one .print() statement during
> >    debugging and development. This can lead to confusion since each
> >    .print() forces the program to execute, so the execution behavior is
> >    different than without the print. This is especially important if the
> >    program contains non-deterministic data generation (like generating
> >    IDs). In the stream API, .print() does not require removing .execute();
> >    as a result, the behavior of the two interfaces is inconsistent.
> >
>
> This is indeed an issue that many users find hard to get used to. We have
> changed the behavior of print() a couple of times before and I'm not sure
> it would be wise to do so again. Actually, once a user understands the
> difference between eager and lazy sinks, I think it's quite easy to avoid
> mistakes.
>
>
>
> >    3. *calling new when applying an operator, e.g. .reduceGroup(new
> >    GroupReduceFunction()): *Some of the people I taught the APIs to were
> >    confused by this. They knew it was a distributed system and they were
> >    wondering where the constructor would actually be called. They expected
> >    to hand a class to the function that would be initialized on each of
> >    the worker nodes. *-> maybe have a section about this in the
> >    documentation*
> >
>
> I'm not sure I understand the confusion with this one. The goal of
> high-level APIs is to relieve the users from having to think about
> distribution. The only thing they need to understand is the
> DataSet/DataStream abstractions and how to create transformations on them.
>
>
> >    4. *.project() loses type information / does not support
> >    .returns(..): *The project transformation currently loses type
> >    information, which affects chained calls with other transformations.
> >    One workaround is the definition of an intermediate dataset. However,
> >    to be consistent with other operators, project should support
> >    .returns() to define type information if needed.
> >
> >
> I'm not sure _why_ this is the case. Maybe someone who knows more can
> clarify this one.
>
>
>
> >
> > *Stream API:*
> >
> >    1. *.keyBy(): *Currently .keyBy() creates a KeyedDataStream but every
> >    operator that consumes a KeyedDataStream produces a DataStream. This
> >    means it is not possible to create a program that uses a keyBy()
> >    followed by a sequence of transformations for each key without having
> >    to reapply keyBy() after each of those operators. (This was a common
> >    problem in my work for Ericsson and Spotify.)
> >
>
> I might be missing something here, but if you want to apply a
> transformation on a keyed stream without changing the keys, isn't a map
> transformation enough? Can you give an example of a case where you had
> this problem?
>
>
>
> >    2. *split() operator with multiple output types: *It's common to have
> >    to split a single stream into different streams. For example, a stream
> >    containing different system events might need to be broken into a
> >    stream for each type. The current split() operator requires all outputs
> >    to have the same data type. In cases where there are no direct type
> >    hierarchies, the user needs to implement a wrapper type to make use of
> >    this function. An operator similar to split that allows output streams
> >    to have different types would greatly simplify those use cases.
> >
> >
> > cheers Martin
> >
>
> Cheers,
> -Vasia.
>
