[jira] [Commented] (SAMZA-335) Locality for Samza containers

Chris Riccomini (JIRA) Tue, 15 Jul 2014 14:25:46 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062673#comment-14062673
 ]


Chris Riccomini commented on SAMZA-335:
---------------------------------------

Two comments here:

# YARN, itself, has very spotty support for host-level resource requests. I 
have never gotten it working. See YARN-1412, YARN-371, YARN-2027
# What exactly do we mean by locality? I think we mean, "put a container on the 
machine that minimizes network utilization caused by said container."

For (2), there are two potential things (that I can think of) that will 
minimize network. One is to reduce input bytes over the wire, and the other is 
to reduce output bytes of the wire. For Samza jobs, input bytes can come from 
input streams, and output bytes can come from output bytes. All remaining 
network traffic is out of band, and unknown to Samza (e.g. querying a remote DB 
or web service).

One strategy that I can think of for reducing input bytes would be to take a 
container and look at all input partitions that it has, grouped by machine. For 
each machine, sum up all the bytes/sec across all the partitions on each 
machine. The machine with the highest bytes/sec is the one that the container 
should be located on.

You could take this strategy farther, and also try and move more input 
partitions onto a single broker. The extreme example of this would be, take all 
input stream partitions, and move them on to one broker. This would eliminate 
all input bytes over the wire for the container. There are a ton of 
implications to this, most of which are probably bad, but it's still a 
potential optimization of the strategy.

Coming up with a strategy for optimizing output bytes is a bit trickier. There 
are two styles of output here, one is messages that are keyed, and the other 
one is messages that aren't keyed. Keyed messages *must* be sent to the 
partition they're destined for. In an evenly distributed key space, this would 
mean that each Samza container will write evenly to all partitions. If these 
partitions are evenly distributed across a grid, then there is no optimization 
that can be done (without moving partitions, as described above).

If the outgoing messages are *not* keyed, though, there is potential 
optimization. In such a case, code inside the container could opportunistically 
force the outgoing message to go to a local partition of the outgoing stream. 
Since the message was never keyed, the partitioning is assumed to not matter 
(other than for fault tolerance and parallelism).

To be clear, I'm not proposing that we actually do any of this. :) Just 
thinking through potential strategies.

> Locality for Samza containers
> -----------------------------
>
>                 Key: SAMZA-335
>                 URL: https://issues.apache.org/jira/browse/SAMZA-335
>             Project: Samza
>          Issue Type: Improvement
>          Components: container
>            Reporter: Chinmay Soman
>
> [Documenting my discussion with Jakob]
> There doesn't seem to be any 'data' locality for deploying containers. Today, 
> this is done purely based on the free resources. It'll be nice to have some 
> kind of locality with respect to the input streams. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SAMZA-335) Locality for Samza containers

Reply via email to