[ 
https://issues.apache.org/jira/browse/SAMZA-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288668#comment-14288668
 ] 

Yi Pan (Data Infrastructure) commented on SAMZA-516:
----------------------------------------------------

[~jkreps], thanks for the answers! It is indeed a subtle tradeoff. Your 
argument on bottleneck identification and management really makes sense. My 
worry is that we may not be able to completely remove the need for separate jobs 
for a query. So I still want to throw out some of my concerns/questions here 
for discussion:
# bouncing the containers
{quote}
what happens if I change the query and restart first A then B, then there will 
be a period of time where the old B is getting input from the new A...
{quote}
Even with a single-job layout, we probably cannot completely rule out the above 
possibility. Say there are 4 containers, each running a single task. With the 
re-partition + group-by, container.1 may take outputs from container.2. When we 
bounce container.2 with the new query, container.1, still running the old query, 
may start receiving outputs from the new container.2.
# input partitions are not big enough / auto-scaling: what if the first stage 
of processing (re-partition) takes topic1 with a total of 2 partitions, while the 
second stage (group-by) needs to take topic2 with a total of 15 
partitions? I guess that we would have to insert a re-partition job to make sure 
topic1' is also divided into 15 partitions. That would also mean that if 
auto-scaling is used, instead of changing the partitions just for an 
intermediate job, we would need to change the partitions for all topics in the 
pipeline. The other approach is to allow inconsistent partition assignments 
to containers in the job, which makes them difficult to track and loses the beauty 
of this single-job solution: simple and homogeneous tasks.
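
To make the two concerns above concrete, here is a rough sketch in plain Python. It is only an illustration, not Samza code: the function names (mixed_version_edges, partition_for, misaligned_keys) and the toy byte-sum partitioner are assumptions made up for this example, and it assumes the re-partition step may route any container's output to any other container.

```python
from itertools import product


def mixed_version_edges(num_containers, bounced):
    """List (producer, consumer) container pairs where a producer already
    running the new query feeds a consumer still running the old query.

    Hypothetical model: re-partitioning lets every container send output
    to every other container in the same job.
    """
    containers = range(1, num_containers + 1)
    return [
        (producer, consumer)
        for producer, consumer in product(containers, containers)
        if producer != consumer
        and producer in bounced
        and consumer not in bounced
    ]


def partition_for(key, num_partitions):
    """Toy deterministic partitioner (byte sum modulo partition count).

    This is an assumption for illustration, not the partitioner that
    Kafka or Samza actually uses.
    """
    return sum(key.encode("utf-8")) % num_partitions


def misaligned_keys(keys, upstream_partitions, downstream_partitions):
    """Return the keys that map to different partition numbers in the two
    topics, i.e. keys a task owning 'partition i of both topics' would
    not see consistently."""
    return [
        key
        for key in keys
        if partition_for(key, upstream_partitions)
        != partition_for(key, downstream_partitions)
    ]


if __name__ == "__main__":
    # Concern 1: 4 containers, only container 2 bounced so far.
    print(mixed_version_edges(4, {2}))
    # Concern 2: topic1 with 2 partitions vs. topic2 with 15 partitions.
    print(misaligned_keys(["a", "b"], 2, 15))
    # With matching partition counts the keys line up.
    print(misaligned_keys(["a", "b"], 15, 15))
```

In this toy model, the mixed-version pairs only disappear once every container has been bounced, and the keys only line up when both topics have the same partition count, which is why the intermediate topic1' would need 15 partitions as well.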

I will think about it further.


> Support standalone Samza jobs
> -----------------------------
>
>                 Key: SAMZA-516
>                 URL: https://issues.apache.org/jira/browse/SAMZA-516
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>         Attachments: DESIGN-SAMZA-516-0.md, DESIGN-SAMZA-516-0.pdf
>
>
> Samza currently supports two modes of operation out of the box: local and 
> YARN. With local mode, a single Java process starts the JobCoordinator, 
> creates a single container, and executes it locally. All partitions are 
> processed within this container.  With YARN, a YARN grid is required to 
> execute the Samza job. In addition, SAMZA-375 introduces a patch to run Samza 
> in Mesos.
> There have been several requests lately to be able to run Samza jobs without 
> any resource manager (YARN, Mesos, etc), but still run it in a distributed 
> fashion.
> The goal of this ticket is to design and implement a samza-standalone module, 
> which will:
> # Support executing a single Samza job in one or more containers.
> # Support failover, in cases where a machine is lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
