Yes, we’re using yarn. Will add yarn.container.count to our config and restart 
the job.

Thanks for the info. Exactly what we needed.

-James

On May 16, 2014 at 1:24:28 PM, Chris Riccomini ([email protected]) wrote:

Hey James,  

Your understanding is more or less correct.  

Physically, your StreamTasks run in SamzaContainers (what is referred to  
as a TaskRunner in the docs, unfortunately). Each container will be  
responsible for one or more StreamTasks, with each StreamTask mapping to  
one partition from the input streams. When consuming from Kafka, the  
container will setup one thread per broker (not per partition), and funnel  
the messages into a queue. The main event loop (inside the container)  
takes messages from this queue, and hands them to the appropriate stream  
task. Thus, only one StreamTask is ever processing a message at a given  
time, per-container.  

To increase parallelism, you need to run more containers. If you're using  
yarn, this can be done with the yarn.container.count setting.  

As a concrete example, consider a case where you have two input streams:  
IS1 with 4 partitions, and IS2 with 2 partitions. You will then have 4  
StreamTasks processing messages; this is determined by max(4, 2). The  
first two StreamTasks will get messages from both IS1 and IS2, and the  
last two StreamTasks will get messages only from IS1. If you run a single  
container, all four of these tasks will be in the container, and will  
therefore only ever process one message at a time. If you want to increase  
parallelism, you would increase your container count. If you set  
yarn.container.count=2, then two StreamTasks would run in the first  
container, and two in the second. Thus, you have a bounded max  
parallelism, which is the max partition count from your input streams. In  
this example, you can have up to four containers.  

Cheers,  
Chris  

On 5/16/14 9:31 AM, "James Jory" <[email protected]> wrote:  

>Getting started with Samza and have a question about concurrency. Looking  
>to confirm my understanding of how concurrency works with the Samza event  
>loop and StreamTasks.  
>  
>http://samza.incubator.apache.org/learn/documentation/0.7.0/container/even  
>t-loop.html  
>  
>Assuming a (Kafka) inbound stream for a task has multiple partitions, the  
>TaskRunner will setup a consumer for each partition in a separate thread.  
>However, the messages from these consumers are funneled into a single  
>message queue managed by the event loop. This essentially results in a  
>single message being processed at-a-time across all StreamTask instances.  
>In other words, StreamTasks will never process separate messages  
>concurrently.  
>  
>If my understanding is correct, is there a way to have Samza process  
>messages concurrently across StreamTasks for the job?  
>  
>Thanks!  
>James  

Reply via email to