[jira] [Work logged] (BEAM-8470) Create a new Spark runner based on Spark Structured streaming framework

ASF GitHub Bot (Jira) Fri, 08 Nov 2019 05:44:05 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-8470?focusedWorklogId=340493&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340493
 ]


ASF GitHub Bot logged work on BEAM-8470:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Nov/19 13:42
            Start Date: 08/Nov/19 13:42
    Worklog Time Spent: 10m 
      Work Description: echauchot commented on issue #9866: [BEAM-8470] Create 
a new Spark runner based on Spark Structured streaming framework
URL: https://github.com/apache/beam/pull/9866#issuecomment-551813147
 
 
   @aromanenko-dev FYI, it is expected that the unit tests are failing. I just 
realized that the tests were not properly configured after the merge of the 2 
modules (the wrong pipelineOptions were used) and after the changes to 
PipelineResult (testMode was not set to true in the pipelineOptions, so the 
tests did not wait for PAssert). As a consequence, all unit tests passed no 
matter what. I fixed that in the last commit, but the PipelineResult tests now 
fail. @RyanSkraba, who authored this part, will take a look at it when he has 
time. 
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 340493)
    Time Spent: 6.5h  (was: 6h 20m)

> Create a new Spark runner based on Spark Structured streaming framework
> -----------------------------------------------------------------------
>
>                 Key: BEAM-8470
>                 URL: https://issues.apache.org/jira/browse/BEAM-8470
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Major
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> h1. Why is it worth creating a new runner based on structured streaming?
> Because this new framework brings:
>  * Unified batch and streaming semantics:
>  ** No more RDD/DStream distinction, as in Beam (only PCollection)
>  * Better state management:
>  ** Incremental state instead of saving everything each time
>  ** No more synchronous saving that delays computation: a delta file per 
> batch and partition is saved asynchronously, with synchronous put/get on an 
> in-memory hashmap
>  * Schemas in datasets:
>  ** The dataset knows the structure of the data (fields) and can optimize 
> later on
>  ** Schemas in PCollection in Beam
>  * New Source API:
>  ** Very close to Beam bounded and unbounded sources
> h1. Why make a new runner from scratch?
>  * The structured streaming framework is very different from the RDD/DStream 
> framework
> h1. We hope to gain
>  * A more up-to-date runner in terms of libraries: leverage new features
>  * The benefit of practices learned from the previous runners
>  * Better performance thanks to the DAG optimizer (Catalyst) and simpler 
> code
>  * Simpler code and easier maintenance
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
