Hey James,

Your understanding is more or less correct.

Physically, your StreamTasks run in SamzaContainers (what is referred to
as a TaskRunner in the docs, unfortunately). Each container will be
responsible for one or more StreamTasks, with each StreamTask mapping to
one partition from the input streams. When consuming from Kafka, the
container will setup one thread per broker (not per partition), and funnel
the messages into a queue. The main event loop (inside the container)
takes messages from this queue, and hands them to the appropriate stream
task. Thus, only one StreamTask is ever processing a message at a given
time, per-container.

To increase parallelism, you need to run more containers. If you're using
yarn, this can be done with the yarn.container.count setting.

As a concrete example, consider a case where you have two input streams:
IS1 with 4 partitions, and IS2 with 2 partitions. You will then have 4
StreamTasks processing messages; this is determined by max(4, 2). The
first two StreamTasks will get messages from both IS1 and IS2, and the
last two StreamTasks will get messages only from IS1. If you run a single
container, all four of these tasks will be in the container, and will
therefore only ever process one message at a time. If you want to increase
parallelism, you would increase your container count. If you set
yarn.container.count=2, then two StreamTasks would run in the first
container, and two in the second. Thus, you have a bounded max
parallelism, which is the max partition count from your input streams. In
this example, you can have up to four containers.

Cheers,
Chris

On 5/16/14 9:31 AM, "James Jory" <[email protected]> wrote:

>Getting started with Samza and have a question about concurrency. Looking
>to confirm my understanding of how concurrency works with the Samza event
>loop and StreamTasks.
>
>http://samza.incubator.apache.org/learn/documentation/0.7.0/container/even
>t-loop.html
>
>Assuming a (Kafka) inbound stream for a task has multiple partitions, the
>TaskRunner will setup a consumer for each partition in a separate thread.
>However, the messages from these consumers are funneled into a single
>message queue managed by the event loop. This essentially results in a
>single message being processed at-a-time across all StreamTask instances.
>In other words, StreamTasks will never process separate messages
>concurrently. 
>
>If my understanding is correct, is there a way to have Samza process
>messages concurrently across StreamTasks for the job?
>
>Thanks!
>James

Reply via email to