[jira] [Created] (BEAM-8470) Create a new Spark runner based on Spark Structured streaming framework

Etienne Chauchot (Jira) Thu, 24 Oct 2019 06:28:10 -0700

Etienne Chauchot created BEAM-8470:
--------------------------------------

             Summary: Create a new Spark runner based on Spark Structured 
streaming framework
                 Key: BEAM-8470
                 URL: https://issues.apache.org/jira/browse/BEAM-8470
             Project: Beam
          Issue Type: Improvement
          Components: runner-spark
            Reporter: Etienne Chauchot
            Assignee: Etienne Chauchot



h1. Why is it worth creating a new runner based on structured streaming:

Because this new framework brings:
 * Unified batch and streaming semantics:
 * no more RDD/DStream distinction, as in Beam (only PCollection)


 * Better state management:
 * incremental state instead of saving all each time
 * No more synchronous saving delaying computation: per batch and partition 
delta file saved asynchronously + in-memory hashmap synchronous put/get


 * Schemas in datasets:
 * The dataset knows the structure of the data (fields) and can optimize later 
on
 * Schemas in PCollection in Beam


 * New Source API
 * Very close to Beam bounded source and unbounded sources

h1. Why make a new runner from scratch?
 * Structured streaming framework is very different from the RDD/Dstream 
framework

h1. We hope to gain
 * More up to date runner in terms of libraries: leverage new features
 * Leverage learnt practices from the previous runners
 * Better performance thanks to the DAG optimizer (catalyst) and by simplifying 
the code.
 * Simplify the code and ease the maintenance

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (BEAM-8470) Create a new Spark runner based on Spark Structured streaming framework

Reply via email to