Hi devs,

As described in FLIP-131 [1], we intend to deprecate and eventually
remove the DataSet API in favour of the DataStream API for both
bounded/batch and unbounded/streaming jobs. Ideally, bounded DataStream
programs should stay in the same performance ballpark as equivalent
DataSet programs.

One idea for achieving this is to sort the input of keyed operators
and replace the StateBackend with a simplified one. In other words,
you can see it as a switch from hash-based aggregations with quite
costly StateBackends (RocksDB) to sort-based aggregations that keep
their state purely in memory. You can find more details in FLIP-140 [2].
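
To make the hash-based vs sort-based distinction concrete, here is a
minimal sketch in plain Java. It does not use any Flink API, and all
names in it (SortVsHashAggregation, Event, hashAggregate,
sortAggregate) are hypothetical and only serve the illustration. It is
not the implementation proposed in the FLIP, just the general idea: the
hash-based variant keeps one state entry per distinct key until the
input ends, while the sort-based variant only ever holds the state of
the key it is currently processing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, no Flink APIs. All names are hypothetical.
class SortVsHashAggregation {

    /** A keyed record with a numeric value to be summed per key. */
    static final class Event {
        final String key;
        final long value;

        Event(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hash-based: one state entry per distinct key lives until the input ends. */
    static Map<String, Long> hashAggregate(List<Event> input) {
        Map<String, Long> state = new HashMap<>();
        for (Event e : input) {
            state.merge(e.key, e.value, Long::sum);
        }
        return state;
    }

    /** Sort-based: after sorting by key, only the current key's sum is held. */
    static Map<String, Long> sortAggregate(List<Event> input) {
        List<Event> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Event e) -> e.key));

        // The output map stands in for emitting results downstream.
        Map<String, Long> output = new LinkedHashMap<>();
        String currentKey = null;
        long currentSum = 0;
        for (Event e : sorted) {
            if (currentKey != null && !currentKey.equals(e.key)) {
                // The previous key is complete: emit it and drop its state.
                output.put(currentKey, currentSum);
                currentSum = 0;
            }
            currentKey = e.key;
            currentSum += e.value;
        }
        if (currentKey != null) {
            output.put(currentKey, currentSum);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Event> input = new ArrayList<>();
        input.add(new Event("b", 1));
        input.add(new Event("a", 2));
        input.add(new Event("b", 3));
        input.add(new Event("a", 4));
        input.add(new Event("c", 5));

        // Both variants compute the same per-key sums.
        System.out.println(hashAggregate(input));
        System.out.println(sortAggregate(input));
    }
}
```

The sort-based variant only works because the input is bounded: the
sort has to see all records before the first result can be emitted.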

The FLIP contains some open questions on which I'd really appreciate
input from the community. Some of them are:

 1. How should we sort/group keys? What representation of the key
    should we use? Should we sort on the binary form, or should we
    depend on Comparators being available?
 2. Where in the stack should we apply the sorting (this is rather a
    discussion about internals)?
 3. How should we deal with custom implementations of StreamOperators?
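
As a rough illustration of the trade-off behind question 1, the sketch
below sorts int keys once by their serialized bytes and once with a
Comparator. It is plain Java with hypothetical names (BinaryKeySort,
serialize, compareBinary), and the 4-byte big-endian encoding is only
an assumed stand-in for a real key serializer. The point is that
sorting on the binary form still puts equal keys next to each other,
which is all grouping needs, but the resulting order is not necessarily
the logical order of the keys. It also assumes that equal keys always
serialize to identical bytes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: plain Java, no Flink APIs. All names are hypothetical.
class BinaryKeySort {

    /** Serializes an int key as 4 big-endian bytes (assumed encoding). */
    static byte[] serialize(int key) {
        return ByteBuffer.allocate(4).putInt(key).array();
    }

    static int deserialize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    /** Unsigned lexicographic byte comparison; knows nothing about the key type. */
    static int compareBinary(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Sorts keys by their serialized form only. */
    static List<Integer> sortOnBinaryForm(List<Integer> keys) {
        List<byte[]> serialized = new ArrayList<>();
        for (int k : keys) {
            serialized.add(serialize(k));
        }
        serialized.sort(BinaryKeySort::compareBinary);

        List<Integer> result = new ArrayList<>();
        for (byte[] b : serialized) {
            result.add(deserialize(b));
        }
        return result;
    }

    /** Sorts keys by their logical order, which requires a Comparator for the type. */
    static List<Integer> sortWithComparator(List<Integer> keys) {
        List<Integer> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        keys.add(3);
        keys.add(-1);
        keys.add(3);
        keys.add(2);
        keys.add(-1);

        System.out.println(sortOnBinaryForm(keys));   // grouped, but not in numeric order
        System.out.println(sortWithComparator(keys)); // grouped and in numeric order
    }
}
```

For the keys 3, -1, 3, 2, -1 the binary sort yields 2, 3, 3, -1, -1
(negative ints have the highest leading byte), while the Comparator
yields -1, -1, 2, 3, 3. Both are fine for grouping; they differ only if
something downstream relies on the key order.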

I am really looking forward to all your feedback!

Best,

Dawid

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741&src=contextnavpagetreemode

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-140%3A+Introduce+bounded+style+execution+for+keyed+streams
