[
https://issues.apache.org/jira/browse/SAMZA-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288526#comment-14288526
]
Yi Pan (Data Infrastructure) commented on SAMZA-516:
----------------------------------------------------
[~jkreps], just trying to get clearer on your comments regarding the number of
jobs per query.
{quote}
Even if it repartitions many times can't you do that all in one job that just
has a big switch statement over the inputs?
{quote}
If I have a query that involves a re-partition followed by a group-by on the
re-partitioned key, a task in the first re-partition job reads a partition of
topic1 (partitioned by keyA), re-partitions the messages by keyB, and sends them
to topic2. A task in the second group-by job then consumes a partition of topic2
and performs the calculation with the assurance that all messages with
keyB="value1" are in the same partition. If I were to implement that in a single
job, each task would need to take both topics as input and, based on the input
topic, perform a different part of the query logic (see the sketch after the
pros/cons list below). Is this the "big switch statement" you referred to? It is
an interesting concept, and I can see both good and bad things here:
# Pros:
## It becomes much easier if there is some state that needs to be shared across
the whole query
## It is also easier to identify the most backlogged topics across the whole
query, since each task subscribes to all topics involved in every stage of the
query
# Cons:
## It would not be possible to bounce and move the "group-by" process without
impacting the "re-partition" process
## If different stages of the query require different numbers of containers, it
won't be easy to adjust the number of containers used for "group-by"
individually
## It can also cause memory issues when a complex query with multiple
sub-queries has to be condensed into the same task
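For concreteness, a minimal sketch of the "big switch statement" single-job idea could look like the following, assuming the standard StreamTask API; the topic names topic1/topic2 and the extractKeyB()/updateGroupBy() helpers are hypothetical placeholders, not an existing design:
{code:java}
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// One task subscribes to both topic1 and topic2 and branches on the input
// stream: re-partition logic for topic1, group-by logic for topic2.
public class BigSwitchTask implements StreamTask {
  private static final SystemStream TOPIC2 = new SystemStream("kafka", "topic2");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    if ("topic1".equals(stream)) {
      // Re-partition stage: re-key by keyB and send to topic2.
      Object keyB = extractKeyB(envelope.getMessage());
      collector.send(new OutgoingMessageEnvelope(TOPIC2, keyB, envelope.getMessage()));
    } else if ("topic2".equals(stream)) {
      // Group-by stage: all messages with the same keyB land in the same partition.
      updateGroupBy(envelope.getKey(), envelope.getMessage());
    }
  }

  private Object extractKeyB(Object message) { /* hypothetical key extraction */ return message; }
  private void updateGroupBy(Object key, Object message) { /* hypothetical aggregation */ }
}
{code}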
Since we cannot restrict what kind of query state the user will generate, I
would rather plan for the case where I can isolate the query tasks. For simple
queries, the query planner can most likely put all operators in a single job.
Just one reminder: a 4th solution [~criccomini] and I discussed is not to run
the daemon process, but rather to run the JobCoordinator manually from a
separate box, write the container assignment once to ZK, and start containers
on all assigned boxes. Deploying more jobs may then simply mean running the
JobCoordinator once for each job.
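As an illustration of that 4th option, the one-shot "write the assignment to ZK" step could look roughly like this, using the plain ZooKeeper client; the znode path and the JSON assignment payload are made-up placeholders for illustration, not an existing Samza API:
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical one-shot run of a "JobCoordinator" that publishes the
// container-to-task assignment to ZooKeeper, so containers started on the
// assigned boxes can read it at startup.
public class WriteAssignmentOnce {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });

    // Example assignment: container id -> task names (made-up format).
    String assignment = "{\"container-0\": [\"task-0\", \"task-1\"], "
        + "\"container-1\": [\"task-2\", \"task-3\"]}";

    // Write it once (parent znodes assumed to already exist); containers on
    // the assigned boxes read this znode when they start.
    zk.create("/samza/my-job/assignment", assignment.getBytes("UTF-8"),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    zk.close();
  }
}
{code}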
> Support standalone Samza jobs
> -----------------------------
>
> Key: SAMZA-516
> URL: https://issues.apache.org/jira/browse/SAMZA-516
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.9.0
> Reporter: Chris Riccomini
> Assignee: Chris Riccomini
> Attachments: DESIGN-SAMZA-516-0.md, DESIGN-SAMZA-516-0.pdf
>
>
> Samza currently supports two modes of operation out of the box: local and
> YARN. With local mode, a single Java process starts the JobCoordinator,
> creates a single container, and executes it locally. All partitions are
> processed within this container. With YARN, a YARN grid is required to
> execute the Samza job. In addition, SAMZA-375 introduces a patch to run Samza
> in Mesos.
> There have been several requests lately to be able to run Samza jobs without
> any resource manager (YARN, Mesos, etc), but still run them in a distributed
> fashion.
> The goal of this ticket is to design and implement a samza-standalone module,
> which will:
> # Support executing a single Samza job in one or more containers.
> # Support failover, in cases where a machine is lost.