[
https://issues.apache.org/jira/browse/SAMZA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062855#comment-14062855
]
Martin Kleppmann commented on SAMZA-335:
----------------------------------------
A small additional comment: a typical Kafka deployment will replicate every
message to three brokers, so even if you co-locate the Samza container on the
same machine as the Kafka broker with the partition you're writing to, the
message still has to be sent over the network twice to the two replicas. So
locality of output streams can reduce network traffic at most by 33% for data
with a replication factor of 3. (That argument doesn't apply to input streams.)
There is another kind of locality that I feel may be more useful to consider:
preserving durable state stores across container restarts. At the moment, every
time you restart a Samza job, any state that it has built up must be
reconstructed from the changelog, which may take a while if there's a lot of
state. As an optimization, in order to make job restarts faster, it might be
possible to reuse any on-disk state store across container restarts, and skip
the changelog restore in that case. This would require bringing the new
container up on the same machine as it was before the restart. However, this
probably violates YARN's model, so I'm not sure how feasible or desirable this
would be.
> Locality for Samza containers
> -----------------------------
>
> Key: SAMZA-335
> URL: https://issues.apache.org/jira/browse/SAMZA-335
> Project: Samza
> Issue Type: Improvement
> Components: container
> Reporter: Chinmay Soman
>
> [Documenting my discussion with Jakob]
> There doesn't seem to be any 'data' locality for deploying containers. Today,
> this is done purely based on the free resources. It'll be nice to have some
> kind of locality with respect to the input streams.
--
This message was sent by Atlassian JIRA
(v6.2#6252)