[ https://issues.apache.org/jira/browse/BEAM-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443663#comment-17443663 ]
Yun Tang commented on BEAM-8470: -------------------------------- It seems the new SparkStructuredStreamingRunner still does not support streaming mode, does it mean we cannot run nexmark with spark structed streaming? [~echauchot] > Create a new Spark runner based on Spark Structured streaming framework > ----------------------------------------------------------------------- > > Key: BEAM-8470 > URL: https://issues.apache.org/jira/browse/BEAM-8470 > Project: Beam > Issue Type: New Feature > Components: runner-spark > Reporter: Etienne Chauchot > Assignee: Etienne Chauchot > Priority: P2 > Labels: structured-streaming > Fix For: 2.18.0 > > Time Spent: 16h > Remaining Estimate: 0h > > h1. Why is it worth creating a new runner based on structured streaming: > Because this new framework brings: > * Unified batch and streaming semantics: > * no more RDD/DStream distinction, as in Beam (only PCollection) > * Better state management: > * incremental state instead of saving all each time > * No more synchronous saving delaying computation: per batch and partition > delta file saved asynchronously + in-memory hashmap synchronous put/get > * Schemas in datasets: > * The dataset knows the structure of the data (fields) and can optimize > later on > * Schemas in PCollection in Beam > * New Source API > * Very close to Beam bounded source and unbounded sources > h1. Why make a new runner from scratch? > * Structured streaming framework is very different from the RDD/Dstream > framework > h1. We hope to gain > * More up to date runner in terms of libraries: leverage new features > * Leverage learnt practices from the previous runners > * Better performance thanks to the DAG optimizer (catalyst) and by > simplifying the code. > * Simplify the code and ease the maintenance > -- This message was sent by Atlassian Jira (v8.20.1#820001)