[ 
https://issues.apache.org/jira/browse/SAMZA-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288553#comment-14288553
 ] 

Jay Kreps commented on SAMZA-516:
---------------------------------

Hey [~yipan] yes exactly.

Let me try to argue that the cons are not so bad!

1. It's true that you can't bounce the individual parts of the query. But my
argument is that the user only cares about the final output, and the final
output will only be produced if all subcomponents are active. If job A feeds
job B, then bouncing A will effectively stop B, since there will be no input;
and bouncing B will effectively stop A, since its output will just queue up
and no final results will be produced. I think you can actually flip this
argument around: if you want to restart or redeploy your job, which will be
common, it will be much simpler if there is just one process to bounce and you
don't have to think about dependencies (e.g. if I change the query and restart
first A and then B, there will be a period of time where the old B is getting
input from the new A...).

2. Ah, but let me argue that this is an advantage! The problem with sizing A
and B separately is that users never do a good job of sizing, so either A or B
will be a bottleneck. If we give them both a shared pool of containers, then
whether the A part or the B part is more active will not matter. To make this
stronger and more general: I claim that if you have N containers to use, giving
them all to the combined job A+B will always give at least as much throughput
as splitting them up and giving some to A and some to B, no matter how
accurately you estimate the split (with perfect estimation the two would be
equal, but that is unlikely).

3. This is true. However, I think the same argument as in (2) applies. If you
segregate A and B, the total memory doesn't change, but now you create a
bottleneck from your division unless you divide perfectly.

See if you come to the same conclusions I did. It is actually quite subtle. I
thought about it and I think the trade-off is that if you isolate parts of a
single computation (like a query) into separate processes, the following things
are true: (1) you always have fewer resources for the bottleneck component
unless you estimate perfectly, in which case you have the same resources, but
(2) you protect one part from the other. So separating makes sense if you are
writing job A and I am writing job B and they have different purposes: then
separating protects us, at the cost of perhaps slightly worse utilization. But
if A and B are both part of the same logical query, then you can't really
protect them from each other and we just reduce utilization.
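
To make the throughput claim in (2) concrete, here is a rough sketch under a
deliberately simplified model. It is not how Samza actually schedules work.
The assumptions are mine: every container does one unit of work per second,
each message costs cost_a units in stage A and cost_b units in stage B, and
there is no queueing or coordination overhead. The function names are
hypothetical and exist only for this illustration.

```python
def split_throughput(k, n, cost_a, cost_b):
    """Messages/sec when k containers run only A and n - k run only B.

    The pipeline can go no faster than its slower stage.
    """
    return min(k / cost_a, (n - k) / cost_b)


def pooled_throughput(n, cost_a, cost_b):
    """Messages/sec when all n containers run A and B for each message.

    Every container spends cost_a + cost_b units per message, so no
    stage can starve the other.
    """
    return n / (cost_a + cost_b)


def best_split(n, cost_a, cost_b):
    """Whole-container split (containers given to A) with the best throughput."""
    return max(range(1, n), key=lambda k: split_throughput(k, n, cost_a, cost_b))


if __name__ == "__main__":
    n, cost_a, cost_b = 10, 1.0, 3.0  # assumed example numbers
    print("pooled:", pooled_throughput(n, cost_a, cost_b))
    for k in range(1, n):
        print("A gets", k, "->", split_throughput(k, n, cost_a, cost_b))
```

In this model the split case peaks when k / cost_a equals (n - k) / cost_b,
i.e. at k = n * cost_a / (cost_a + cost_b), where it matches the pooled figure
exactly. Any other split, including the nearest whole number of containers
when the ideal k is fractional, comes out lower.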

> Support standalone Samza jobs
> -----------------------------
>
>                 Key: SAMZA-516
>                 URL: https://issues.apache.org/jira/browse/SAMZA-516
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>         Attachments: DESIGN-SAMZA-516-0.md, DESIGN-SAMZA-516-0.pdf
>
>
> Samza currently supports two modes of operation out of the box: local and 
> YARN. With local mode, a single Java process starts the JobCoordinator, 
> creates a single container, and executes it locally. All partitions are 
> processed within this container.  With YARN, a YARN grid is required to 
> execute the Samza job. In addition, SAMZA-375 introduces a patch to run Samza 
> in Mesos.
> There have been several requests lately to be able to run Samza jobs without 
> any resource manager (YARN, Mesos, etc), but still run it in a distributed 
> fashion.
> The goal of this ticket is to design and implement a samza-standalone module, 
> which will:
> # Support executing a single Samza job in one or more containers.
> # Support failover, in cases where a machine is lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
