[ https://issues.apache.org/jira/browse/SAMZA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746352#comment-15746352 ]

Jagadish commented on SAMZA-1041:
---------------------------------

[~jagadish1...@gmail.com]

> Multi-stage feature for Samza
> -----------------------------
>
>                 Key: SAMZA-1041
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1041
>             Project: Samza
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>
> Samza provides a powerful framework for users to implement and deploy stream 
> processors. One of the core concepts in Samza is a processor, which is 
> deployed individually as a job. While a single job is sufficient to perform 
> some basic stream processing, we have seen users apply Samza to more complex 
> problems involving multi-stage jobs or pipelines. These range from jobs that 
> require a separate repartitioner to co-partition streams for a join, to more 
> complicated pipelines in which the stages are purposely decoupled like 
> microservices. Historically, users would create a separate job for each stage 
> of the pipeline and deploy them manually. A critical part of this manual 
> deployment is creating and modifying the intermediate streams to have the 
> appropriate configuration, in particular the partition count. This deployment 
> model has proven to be tedious and error-prone because:
> 1. Stream creation is a manual process. If the streams are not pre-created 
> with the appropriate configurations, it can lead to unexpected behavior in 
> the pipeline. For example, a join can fail because keys are not routed 
> to a common Task.
> 2. Job deployment is a manual process. Each stage needs to be deployed 
> separately, even though the stages are often deployed on the same cadence.
> 3. Configuration is associated with a processor, which makes it more 
> difficult to reuse the processor. Configurations like task inputs and 
> container count can vary for the same processor depending on the context 
> (pipeline) in which it is executed. 
> 4. There is no early validation to detect a misconfigured pipeline. Instead, 
> users tend to notice that something is wrong long after the initial 
> deployment by looking at metrics. 
> 5. There are common preconditions for processors (e.g., co-partitioned or 
> deduplicated inputs) that could be handled automatically in a system that has 
> the “whole picture” of streams and processors.
>
> Our goal is to alleviate these issues by introducing a 
> yet-to-be-properly-named feature which we will call “pipelines” for now. The 
> pipeline feature will allow users to easily compose a collection of 
> processors and streams into a directed acyclic graph (DAG) and manage them as 
> one unit. A key part of this feature is automatic runtime creation of 
> intermediate and output streams. It also enables richer validation, 
> simplified deployment, and a foundation for many performance and ease-of-use 
> features. For example, repartitioners could be automatically injected where 
> needed and processors could be colocated on the same container for 
> performance.
>
> Note that this feature differs from SAMZA-914, mostly in scope and 
> simplicity. SAMZA-914 focuses on composing operators into a logical flow 
> that is executed as one processor. That entire flow is scaled out uniformly 
> by adding containers. By contrast, this pipeline feature provides isolation 
> and independent scaling of the processors.
>
> Many of the details have yet to be worked out, but a design doc will be 
> posted here soon. 
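
To make the proposed model a bit more concrete, below is a purely hypothetical sketch in plain Java. None of these classes (Pipeline, ProcessorSpec, StreamSpec) exist in Samza, and the stream names and partition counts are made up for illustration. It only shows the shape of the idea: processors and the streams between them are declared as one DAG, so the framework can create intermediate streams with the right partition counts, validate the whole pipeline up front, and deploy every stage as one unit.

// Hypothetical, illustrative sketch only -- none of these classes exist in Samza.
// It models the proposal's core idea: declare processors and the streams between
// them as a DAG, so the framework (not the user) can create intermediate streams
// with the right partition counts and validate the pipeline before deployment.
import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {

  /** A named stream with an explicit partition count (hypothetical model class). */
  record StreamSpec(String name, int partitions) {}

  /** A processing stage that reads one set of streams and writes another. */
  record ProcessorSpec(String name, List<StreamSpec> inputs, List<StreamSpec> outputs) {}

  /** A pipeline is a list of processors whose streams form a DAG. */
  static class Pipeline {
    private final List<ProcessorSpec> processors = new ArrayList<>();

    Pipeline addProcessor(ProcessorSpec p) {
      processors.add(p);
      return this;
    }

    /**
     * Early validation of the kind item 4 above asks for. This toy check only
     * verifies that each processor's input streams share a partition count, so a
     * join can route equal keys to the same task; a real implementation would
     * check much more.
     */
    void validate() {
      for (ProcessorSpec p : processors) {
        long distinctPartitionCounts =
            p.inputs().stream().mapToInt(StreamSpec::partitions).distinct().count();
        if (distinctPartitionCounts > 1) {
          throw new IllegalStateException(
              "Processor " + p.name() + " joins streams with mismatched partition counts");
        }
      }
    }

    /** Stand-in for "create intermediate streams, then deploy every stage as one unit". */
    void deploy() {
      validate();
      for (ProcessorSpec p : processors) {
        for (StreamSpec out : p.outputs()) {
          System.out.printf("creating stream %s with %d partitions%n", out.name(), out.partitions());
        }
        System.out.println("submitting job for processor " + p.name());
      }
    }
  }

  public static void main(String[] args) {
    // Example pipeline: a repartitioner co-partitions two inputs, then a joiner consumes them.
    StreamSpec pageViews = new StreamSpec("page-views", 8);
    StreamSpec adClicks = new StreamSpec("ad-clicks", 4);
    StreamSpec pageViewsByMember = new StreamSpec("page-views-by-member", 16);
    StreamSpec adClicksByMember = new StreamSpec("ad-clicks-by-member", 16);
    StreamSpec joined = new StreamSpec("views-joined-with-clicks", 16);

    new Pipeline()
        .addProcessor(new ProcessorSpec("repartitioner",
            List.of(pageViews, adClicks),
            List.of(pageViewsByMember, adClicksByMember)))
        .addProcessor(new ProcessorSpec("joiner",
            List.of(pageViewsByMember, adClicksByMember),
            List.of(joined)))
        .deploy();
  }
}

In this toy model, the mismatch check in validate() is the kind of early validation item 4 asks for, and deploy() stands in for automatic creation of the intermediate streams that item 1 complains about having to create by hand.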


