[
https://issues.apache.org/jira/browse/SAMZA-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288668#comment-14288668
]
Yi Pan (Data Infrastructure) commented on SAMZA-516:
----------------------------------------------------
[~jkreps], thanks for the answers! It is indeed a subtle tradeoff. Your
argument about bottleneck identification and management makes sense. My
worry is that we may not be able to completely remove the need for separate
jobs for a query, so I still want to raise a few concerns/questions for
discussion:
# bouncing the containers
{quote}
what happens if I change the query and restart first A then B, then there will
be a period of time where the old B is getting input from the new A...
{quote}
Even with a single-job layout, we probably cannot completely rule out the above
possibility. Say there are 4 containers, each running a single task. With
re-partition + group-by, container.1 may take outputs from container.2. When we
bounce container.2 with the new query, container.1, still running the old
query, may start receiving outputs from the new container.2.
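
To make the scenario concrete, here is a rough sketch of a rolling bounce. It is
a toy model, not Samza code: the function names are made up for illustration,
and it assumes that after the re-partition every container may send output to
every other container. It counts the sender/receiver pairs where a new-query
container feeds an old-query container after each restart.

```python
from itertools import permutations


def mixed_version_edges(versions):
    """Return (sender, receiver) pairs where a new-query container feeds an old-query one.

    `versions` maps a container name to the query version it currently runs.
    Assumption: every container may send re-partitioned output to every other one.
    """
    return [
        (src, dst)
        for src, dst in permutations(sorted(versions), 2)
        if versions[src] == "new" and versions[dst] == "old"
    ]


def rolling_bounce(containers):
    """Bounce containers one at a time and record the mixed pairs after each step."""
    versions = {c: "old" for c in containers}
    history = []
    for c in containers:
        versions[c] = "new"  # restart this container with the new query
        history.append((c, mixed_version_edges(versions)))
    return history


containers = ["container.1", "container.2", "container.3", "container.4"]
for bounced, edges in rolling_bounce(containers):
    print(f"after bouncing {bounced}: {len(edges)} new->old pairs")
```

In this model the list of mixed pairs only becomes empty once the last container
has been bounced, so the window exists in a single-job layout too.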
# input partitions are not big enough / auto-scaling: what if the first
processing stage (re-partition) reads topic1, which has 2 partitions in total,
while the second stage (group-by) needs to read topic2, which has 15
partitions? I guess that we will have to insert a re-partition job to make sure
topic1' is also divided into 15 partitions. That would also mean that if
auto-scaling is used, instead of changing the partitions just for an
intermediate job, we would need to change the partitions for all topics in the
pipeline. The other approach is to allow inconsistent partition assignments
to containers in the job, which makes the job harder to track and loses the
beauty of this single-job solution: simple and homogeneous tasks.
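
A small sketch of the mismatch, under the assumption that input partitions are
grouped into tasks by partition id (one task per partition number across all
input topics). The helper names are hypothetical and the model is deliberately
simplified; it only shows which tasks end up without a topic1 partition.

```python
from collections import defaultdict


def group_by_partition(partition_counts):
    """Assign each (topic, partition) pair to the task with the same partition id.

    `partition_counts` maps a topic name to its number of partitions.
    Simplified model: one task per partition id, shared by all input topics.
    """
    tasks = defaultdict(list)
    for topic, count in sorted(partition_counts.items()):
        for p in range(count):
            tasks[p].append((topic, p))
    return dict(tasks)


def tasks_missing_topic(tasks, topic):
    """Return the ids of tasks that have no partition of `topic` assigned."""
    return [
        task_id
        for task_id, inputs in sorted(tasks.items())
        if all(t != topic for t, _ in inputs)
    ]


# topic1 has 2 partitions, topic2 has 15: the tasks are not homogeneous.
tasks = group_by_partition({"topic1": 2, "topic2": 15})
missing = tasks_missing_topic(tasks, "topic1")
print(f"{len(tasks)} tasks, {len(missing)} of them have no topic1 input")

# After a hypothetical re-partition of topic1 into a 15-partition topic1',
# every task gets one partition of each input.
even = group_by_partition({"topic1'": 15, "topic2": 15})
print(f"tasks without topic1' input: {tasks_missing_topic(even, chr(116) + 'opic1' + chr(39))}")
```

With matching partition counts every task looks the same, which is the property
the single-job layout relies on; the cost is that the partition count has to be
kept consistent across every topic in the pipeline.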
I will think about it further.
> Support standalone Samza jobs
> -----------------------------
>
> Key: SAMZA-516
> URL: https://issues.apache.org/jira/browse/SAMZA-516
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.9.0
> Reporter: Chris Riccomini
> Assignee: Chris Riccomini
> Attachments: DESIGN-SAMZA-516-0.md, DESIGN-SAMZA-516-0.pdf
>
>
> Samza currently supports two modes of operation out of the box: local and
> YARN. With local mode, a single Java process starts the JobCoordinator,
> creates a single container, and executes it locally. All partitions are
> processed within this container. With YARN, a YARN grid is required to
> execute the Samza job. In addition, SAMZA-375 introduces a patch to run Samza
> in Mesos.
> There have been several requests lately to be able to run Samza jobs without
> any resource manager (YARN, Mesos, etc), but still run it in a distributed
> fashion.
> The goal of this ticket is to design and implement a samza-standalone module,
> which will:
> # Support executing a single Samza job in one or more containers.
> # Support failover, in cases where a machine is lost.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)