[ https://issues.apache.org/jira/browse/SAMZA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746352#comment-15746352 ]
Jagadish commented on SAMZA-1041: --------------------------------- [~jagadish1...@gmail.com] > Multi-stage feature for Samza > ----------------------------- > > Key: SAMZA-1041 > URL: https://issues.apache.org/jira/browse/SAMZA-1041 > Project: Samza > Issue Type: New Feature > Affects Versions: 0.12.0 > Reporter: Jake Maes > Assignee: Jake Maes > > Samza provides a powerful framework for users to implement and deploy stream > processors. One of the core concepts in Samza is a processor, which is > deployed individually as a job. While a single job is sufficient to perform > some basic stream processing, we have seen users apply Samza to more complex > problems involving multi-stage jobs or pipelines. These range from jobs that > require a separate repartitioner to co-partition streams for a join, to more > complicated pipelines in which the stages are purposely decoupled like > microservices. Historically, users would create a separate job for each stage > of the pipeline and deploy them manually. A critical part of this manual > deployment is creating and modifying the intermediate streams to have the > appropriate configuration, in particular; partition count. This deployment > model has proven to be tedious and error-prone because: > 1. Stream creation is a manual process. If the streams are not pre-created > with the appropriate configurations, it can lead to unexpected behavior in > the pipeline. For example failure to join because keys are not being routed > to a common Task. > 2. Job deployment is a manual process. Each stage needs to be deployed > separately, even though they are often deployed in the same cadence. > 3. Configuration is associated with a processor, which makes it more > difficult to reuse the processor. Configurations like task inputs and > container count can vary for the same processor depending on the context > (pipeline) in which it is executed. > 4. There is no early validation to detect a misconfigured pipeline. Instead > users tend to notice that something is wrong long after the initial > deployment by looking at metrics. > 5. There are common preconditions for processors (e.g. co-partitioned, or > deduplicated inputs) that could be handled automatically in a system that has > the “whole picture” of streams and processors. > Our goal is to alleviate these issues by introducing a > yet-to-be-properly-named feature which we will call “pipelines” for now. The > pipeline feature will allow users to easily compose a collection of > processors and streams into a directed acyclic graph (DAG) and manage them as > one unit. A key part of this feature is automatic runtime creation of > intermediate and output streams. It also enables richer validation, > simplified deployment, and a foundation for many performance and ease-of-use > features. For example, repartitioners could be automatically injected where > needed and processors could be colocated on the same container for > performance. > Note that this feature is not the same as SAMZA-914, mostly in terms of scope > and simplicity. SAMZA-914 focuses on composing operators into a logical flow > that is executed as one processor. That entire flow is scaled out uniformly > by adding containers. By contrast, this pipeline feature provides isolation > and independent scaling of the processors. > Many of the details have yet to be worked out, but a design doc will be > posted here soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)