[
https://issues.apache.org/jira/browse/SAMZA-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288526#comment-14288526
]
Yi Pan (Data Infrastructure) commented on SAMZA-516:
----------------------------------------------------
[~jkreps], just trying to get clearer on your comments regarding the number of
jobs per query.
{quote}
Even if it repartitions many times can't you do that all in one job that just
has a big switch statement over the inputs?
{quote}
If I have a query that involves a re-partition followed by a group-by on the
re-partitioned key, a task in the first re-partition job reads a partition of
topic1 (partitioned by keyA), re-partitions the messages by keyB, and sends them
to topic2. A task in the second group-by job then consumes a partition of topic2
and performs the calculation with the assurance that all messages with
keyB="value1" are in the same partition. If I were to implement that in a single
job, each task would need to take both topics as input and, based on the input
topic, perform a different part of the query logic (see the sketch after the
pros/cons list below). Is this the "big switch statement" you referred to? It is
an interesting concept, and I can see both good and bad things here:
# Pros:
## It becomes much easier if there is some state that needs to be shared across
the whole query
## It is also easier to identify the most backlogged topics across the whole
query, since each task subscribes to all topics involved in every stage of the
query
# Cons:
## It would not be possible to bounce and move the "group-by" process without
impacting the "re-partition" process
## If different stages of the query require different numbers of containers, it
won't be easy to adjust the number of containers used for "group-by"
individually
## It can also cause memory issues when a complex query with multiple
sub-queries has to be condensed into the same task
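For concreteness, a minimal sketch of the "big switch statement" single-job idea could look like the following, assuming the standard StreamTask API; the topic names topic1/topic2 and the extractKeyB()/updateGroupBy() helpers are hypothetical placeholders, not an existing design:
{code:java}
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// One task subscribes to both topic1 and topic2 and branches on the input
// stream: re-partition logic for topic1, group-by logic for topic2.
public class BigSwitchTask implements StreamTask {
  private static final SystemStream TOPIC2 = new SystemStream("kafka", "topic2");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    if ("topic1".equals(stream)) {
      // Re-partition stage: re-key by keyB and send to topic2.
      Object keyB = extractKeyB(envelope.getMessage());
      collector.send(new OutgoingMessageEnvelope(TOPIC2, keyB, envelope.getMessage()));
    } else if ("topic2".equals(stream)) {
      // Group-by stage: all messages with the same keyB land in the same partition.
      updateGroupBy(envelope.getKey(), envelope.getMessage());
    }
  }

  private Object extractKeyB(Object message) { /* hypothetical key extraction */ return message; }
  private void updateGroupBy(Object key, Object message) { /* hypothetical aggregation */ }
}
{code}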
Since we cannot restrict what kind of query state the user will generate, I
would rather plan for the case where I can isolate the query tasks. For simple
queries, the query planner can most likely put all operators in a single job.
Just one reminder: a 4th solution [~criccomini] and I discussed is not to run
the daemon process, but rather to run the JobCoordinator manually from a
separate box, write the container assignment once to ZK, and start containers
on all assigned boxes. Deploying more jobs may then simply mean running the
JobCoordinator once for each job.
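As an illustration of that 4th option, the one-shot "write the assignment to ZK" step could look roughly like this, using the plain ZooKeeper client; the znode path and the JSON assignment payload are made-up placeholders for illustration, not an existing Samza API:
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical one-shot run of a "JobCoordinator" that publishes the
// container-to-task assignment to ZooKeeper, so containers started on the
// assigned boxes can read it at startup.
public class WriteAssignmentOnce {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });

    // Example assignment: container id -> task names (made-up format).
    String assignment = "{\"container-0\": [\"task-0\", \"task-1\"], "
        + "\"container-1\": [\"task-2\", \"task-3\"]}";

    // Write it once (parent znodes assumed to already exist); containers on
    // the assigned boxes read this znode when they start.
    zk.create("/samza/my-job/assignment", assignment.getBytes("UTF-8"),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    zk.close();
  }
}
{code}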
> Support standalone Samza jobs
> -----------------------------
>
> Key: SAMZA-516
> URL: https://issues.apache.org/jira/browse/SAMZA-516
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.9.0
> Reporter: Chris Riccomini
> Assignee: Chris Riccomini
> Attachments: DESIGN-SAMZA-516-0.md, DESIGN-SAMZA-516-0.pdf
>
>
> Samza currently supports two modes of operation out of the box: local and
> YARN. With local mode, a single Java process starts the JobCoordinator,
> creates a single container, and executes it locally. All partitions are
> processed within this container. With YARN, a YARN grid is required to
> execute the Samza job. In addition, SAMZA-375 introduces a patch to run Samza
> in Mesos.
> There have been several requests lately to be able to run Samza jobs without
> any resource manager (YARN, Mesos, etc), but still run them in a distributed
> fashion.
> The goal of this ticket is to design and implement a samza-standalone module,
> which will:
> # Support executing a single Samza job in one or more containers.
> # Support failover, in cases where a machine is lost.