Etienne Chauchot created BEAM-8470: -------------------------------------- Summary: Create a new Spark runner based on Spark Structured streaming framework Key: BEAM-8470 URL: https://issues.apache.org/jira/browse/BEAM-8470 Project: Beam Issue Type: Improvement Components: runner-spark Reporter: Etienne Chauchot Assignee: Etienne Chauchot
h1. Why is it worth creating a new runner based on structured streaming: Because this new framework brings: * Unified batch and streaming semantics: * no more RDD/DStream distinction, as in Beam (only PCollection) * Better state management: * incremental state instead of saving all each time * No more synchronous saving delaying computation: per batch and partition delta file saved asynchronously + in-memory hashmap synchronous put/get * Schemas in datasets: * The dataset knows the structure of the data (fields) and can optimize later on * Schemas in PCollection in Beam * New Source API * Very close to Beam bounded source and unbounded sources h1. Why make a new runner from scratch? * Structured streaming framework is very different from the RDD/Dstream framework h1. We hope to gain * More up to date runner in terms of libraries: leverage new features * Leverage learnt practices from the previous runners * Better performance thanks to the DAG optimizer (catalyst) and by simplifying the code. * Simplify the code and ease the maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005)