Hi devs,

As described in FLIP-131 [1], we intend to deprecate and remove the DataSet API in the future in favour of the DataStream API for both bounded/batch and unbounded/streaming jobs. Ideally, bounded DataStream programs should stay in the same performance ballpark as equivalent DataSet programs.
One idea for getting there is to sort the input of keyed operators and to replace the StateBackend with a simplified one. In other words, you could see it as a switch from hash-based aggregations backed by quite costly StateBackends (RocksDB) to sort-based aggregations kept purely in memory. You can find more details in FLIP-140 [2].

The FLIP contains some open questions on which I'd really appreciate input from the community. Some of them are:

1. How to sort/group keys? What representation of the key should we use? Should we sort on the binary form, or should we depend on Comparators being available?
2. Where in the stack should we apply the sorting? (This is rather a discussion about internals.)
3. How should we deal with custom implementations of StreamOperators?

I am really looking forward to all your feedback!

Best,

Dawid

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741&src=contextnavpagetreemode
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-140%3A+Introduce+bounded+style+execution+for+keyed+streams
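
To make the hash-based vs sort-based contrast concrete, here is a minimal sketch in plain Python. It is not Flink code and not what FLIP-140 specifies: the function names hash_based_sum and sort_based_sum and the sample records are made up for illustration, and a plain dict merely stands in for a keyed state backend.

```python
from itertools import groupby
from operator import itemgetter


def hash_based_sum(records):
    """Streaming-style aggregation: state for every key is held until the end."""
    state = {}  # stand-in for a keyed state backend (assumption, not Flink API)
    for key, value in records:
        state[key] = state.get(key, 0) + value
    return state


def sort_based_sum(records):
    """Bounded-style aggregation: sort by key, then hold state for one key at a time."""
    # The sort step that would run before the keyed operator.
    ordered = sorted(records, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        current = 0  # single-key state, dropped as soon as the key changes
        for _, value in group:
            current += value
        yield key, current


records = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]

# Both strategies produce the same per-key sums.
assert dict(sort_based_sum(records)) == hash_based_sum(records)
```

In this sketch the sort-based variant only needs equal keys to end up next to each other, not any meaningful order between different keys, which is part of what question 1 is about.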