[ 
https://issues.apache.org/jira/browse/SAMZA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062855#comment-14062855
 ] 

Martin Kleppmann commented on SAMZA-335:
----------------------------------------

A small additional comment: a typical Kafka deployment will replicate every 
message to three brokers, so even if you co-locate the Samza container on the 
same machine as the Kafka broker with the partition you're writing to, the 
message still has to be sent over the network twice to the two replicas. So 
locality of output streams can reduce network traffic at most by 33% for data 
with a replication factor of 3. (That argument doesn't apply to input streams.)

There is another kind of locality that I feel may be more useful to consider: 
preserving durable state stores across container restarts. At the moment, every 
time you restart a Samza job, any state that it has built up must be 
reconstructed from the changelog, which may take a while if there's a lot of 
state. As an optimization, in order to make job restarts faster, it might be 
possible to reuse any on-disk state store across container restarts, and skip 
the changelog restore in that case. This would require bringing the new 
container up on the same machine as it was before the restart. However, this 
probably violates YARN's model, so I'm not sure how feasible or desirable this 
would be.

> Locality for Samza containers
> -----------------------------
>
>                 Key: SAMZA-335
>                 URL: https://issues.apache.org/jira/browse/SAMZA-335
>             Project: Samza
>          Issue Type: Improvement
>          Components: container
>            Reporter: Chinmay Soman
>
> [Documenting my discussion with Jakob]
> There doesn't seem to be any 'data' locality for deploying containers. Today, 
> this is done purely based on the free resources. It'll be nice to have some 
> kind of locality with respect to the input streams. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to