[
https://issues.apache.org/jira/browse/SAMZA-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288553#comment-14288553
]
Jay Kreps commented on SAMZA-516:
---------------------------------
Hey [~yipan] yes exactly.
Let me try to argue that the cons are not so bad!
1. It's true that you can't bounce the individual parts of the query. But my
argument is that the user only cares about the final output, and the final
output will only be produced if all subcomponents are active. If job A feeds
job B then bouncing A will effectively stop B since there will be no input, and
bouncing B will effectively stop A since its output will just queue up and no
final results will be produced. I think you can actually flip this argument
around--if you want to restart or redeploy your job, which will be common, it
will be much simpler if there is just one process to bounce and you don't have
to think about dependencies (e.g. what happens if I change the query and
restart first A then B? There will be a period of time where the old B is
getting input from the new A...).
2. Ah, but let me argue that this is an advantage! The problem in separately
sizing A and B is that the users never do a good job of sizing so either A or B
will be a bottleneck. If we give them both a shared pool of containers then
whether the A part or the B part is more active will not matter. To make this
stronger and more general, I claim that if you have N containers to use, giving
them all to the combined job A+B will always give at least as much throughput as
splitting them up and giving some to A and some to B, no matter how accurately
you estimate the split (perfect estimation would be equal, but that is
unlikely).
3. This is true. However I think the same argument as in (2) applies. If you
segregate A and B the total memory doesn't change but now you create a
bottleneck from your division unless you divide perfectly.
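The claim in (2) can be checked with a toy model. This is only a sketch under
assumptions that are not in the discussion above: containers are perfectly
interchangeable, each message costs a fixed number of container-seconds in
stage A (cost_a) and in stage B (cost_b), and there is no coordination
overhead. The function names and numbers are hypothetical.

```python
from fractions import Fraction


def pooled_throughput(n, cost_a, cost_b):
    """Messages/sec when all n containers run the combined A+B job.

    Each message needs cost_a + cost_b container-seconds in total, and any
    container can do either part of the work.
    """
    return Fraction(n) / (cost_a + cost_b)


def split_throughput(n, k, cost_a, cost_b):
    """Messages/sec when k containers run A and n - k run B.

    The pipeline runs at the rate of its slower stage.
    """
    rate_a = Fraction(k) / cost_a
    rate_b = Fraction(n - k) / cost_b
    return min(rate_a, rate_b)


# Hypothetical numbers: 10 containers, A costs 3 container-seconds per
# message, B costs 2.
n, cost_a, cost_b = 10, 3, 2
pooled = pooled_throughput(n, cost_a, cost_b)
for k in range(1, n):
    split = split_throughput(n, k, cost_a, cost_b)
    # In this model the split never beats the pool.
    assert split <= pooled
    print(f"A={k} B={n - k}: split={split} pooled={pooled}")
```

With these numbers the pooled job handles 2 messages/sec, and the split only
matches that at exactly 6 containers for A and 4 for B; every other split is
bottlenecked on one side. What the model leaves out is isolation between the
two parts.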
See if you come to the same conclusions I did. It is actually quite subtle. I
thought about it and I think the trade-off is that if you isolate parts of a
single computation (like a query) into separate processes the following things
are true: (1) you always have fewer resources for the bottleneck component
unless you estimate perfectly, in which case you have the same resources, but
(2) you protect the one part from the other. So separating
makes sense if you are writing job A and I am writing job B and they have
different purposes: separating protects us from each other, at the cost of
perhaps slightly worse utilization. But if A and B are both part of the same
logical query then you can't really protect them from each other and we just
reduce utilization.
> Support standalone Samza jobs
> -----------------------------
>
> Key: SAMZA-516
> URL: https://issues.apache.org/jira/browse/SAMZA-516
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.9.0
> Reporter: Chris Riccomini
> Assignee: Chris Riccomini
> Attachments: DESIGN-SAMZA-516-0.md, DESIGN-SAMZA-516-0.pdf
>
>
> Samza currently supports two modes of operation out of the box: local and
> YARN. With local mode, a single Java process starts the JobCoordinator,
> creates a single container, and executes it locally. All partitions are
> processed within this container. With YARN, a YARN grid is required to
> execute the Samza job. In addition, SAMZA-375 introduces a patch to run Samza
> in Mesos.
> There have been several requests lately to be able to run Samza jobs without
> any resource manager (YARN, Mesos, etc), but still run it in a distributed
> fashion.
> The goal of this ticket is to design and implement a samza-standalone module,
> which will:
> # Support executing a single Samza job in one or more containers.
> # Support failover, in cases where a machine is lost.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)