[
https://issues.apache.org/jira/browse/SAMZA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062764#comment-14062764
]
Raul Castro Fernandez commented on SAMZA-334:
---------------------------------------------
Adding some comments on the "auto-scaling" problem:
I think one important question is what factors contribute to bottlenecks in a
Samza cluster. So far, the examples seem to suggest that more input data means
more computational cost. This is generally true, but it is not the only reason
why there might be bottlenecks in the cluster. If it were the only reason, then
you could make assumptions about your incoming traffic and make decisions
(offline or online) accordingly, which would make the problem simpler.
Let me illustrate why this is not enough:
- Even if the data of a given topic is evenly distributed across the tasks,
there might still be processing skew. This is the case for jobs whose
computation depends on the dataset: think of joins, or select predicates that
have a higher selectivity for a given range of keys (see the sketch after this
list). One could argue that with a good partitioning scheme this should be
minimal. That is true as long as the job is stateless, i.e. all your
computation depends only on the input data. When the computation is stateful,
there is another source of problems.
- Another, more pathological case is that of a highly skewed key. For this
case, there is not much one can do, I think.
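To make the first point concrete, a stateful, join-like task could look like
the sketch below (store name and types are made up, this is not from the
ticket). Its per-message cost grows with what has accumulated in the local
store, i.e. with the dataset, not with the input rate alone:

{code:java}
// Illustrative only: a stateful "join"-like task whose per-message cost
// depends on how much related state has accumulated, not just on input volume.
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueIterator;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class DatasetDependentTask implements StreamTask, InitableTask {
  private KeyValueStore<String, String> store;

  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    store = (KeyValueStore<String, String>) context.getStore("join-store");
  }

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String key = (String) envelope.getKey();
    // Range scan over everything stored under this key prefix: the work done
    // here is proportional to the dataset, so two tasks receiving the same
    // message rate can have very different CPU cost.
    KeyValueIterator<String, String> matches = store.range(key, key + "\uffff");
    try {
      while (matches.hasNext()) {
        matches.next(); // join / predicate evaluation would go here
      }
    } finally {
      matches.close();
    }
    store.put(key + ":" + envelope.getOffset(), (String) envelope.getMessage());
  }
}
{code}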
What the above means is that it is very difficult to understand the resource
requirements of a given job/task when both the input data and the dataset can
contribute in unpredictable ways to the task's resource consumption. There is
of course research in this area: people doing performance modeling use
stochastic models and Markov processes to try to "predict" the performance of
a given job/process/system. Their goal is to be "proactive". The main
assumption in this kind of work is that you have lots of data available and you
can characterize your jobs very well. This is in general not the case when
users can write arbitrary jobs.
So what options are there?
One option is connected to the "predictable" case that Chris describes. It is
true that with an approximate characterization of job resource usage, the
cluster resources, and a goal (such as minimizing makespan -> good for ops, or
minimizing per-job processing time -> good for users), one can model the
problem as a scheduling one, and there are plenty of strategies and solutions
for producing reasonably good plans. However, the strong assumption here is
that task resource consumption is known. I would not assume this, although
previous comments seem to focus only on input data; am I missing something? Do
you expect a job's computational requirements not to depend on the dataset? I
think this is difficult for two reasons: i) users can write *arbitrary*
programs whose computation requirements depend on the dataset, and ii) even in
the particular case of SQL, where this is a well-understood problem, cost
estimation is still far from exact.
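Just for reference, if per-task cost really were known, this reduces to classic
scheduling; even a simple longest-processing-time-first heuristic like the toy
sketch below (all names made up) gives reasonably good makespans:

{code:java}
// Toy sketch of the "predictable" route: greedy LPT assignment of tasks with
// known estimated costs to containers, aiming at a small makespan.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class LptScheduler {
  /** taskCost: taskName -> estimated cost; returns containerId -> assigned tasks. */
  public static Map<Integer, List<String>> assign(Map<String, Double> taskCost,
                                                  int numContainers) {
    // Containers kept in a min-heap ordered by current total load: [id, load].
    PriorityQueue<double[]> containers =
        new PriorityQueue<>(numContainers, Comparator.comparingDouble((double[] c) -> c[1]));
    Map<Integer, List<String>> assignment = new HashMap<>();
    for (int i = 0; i < numContainers; i++) {
      containers.add(new double[]{i, 0.0});
      assignment.put(i, new ArrayList<String>());
    }

    // Consider the most expensive tasks first.
    List<Map.Entry<String, Double>> tasks = new ArrayList<>(taskCost.entrySet());
    tasks.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

    for (Map.Entry<String, Double> task : tasks) {
      double[] leastLoaded = containers.poll();      // currently cheapest container
      assignment.get((int) leastLoaded[0]).add(task.getKey());
      leastLoaded[1] += task.getValue();             // account for the new task
      containers.add(leastLoaded);
    }
    return assignment;
  }
}
{code}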
The other option is connected to the "unpredictable" case that Chris discusses.
I believe this is the one that makes more sense. In particular, I believe it
makes sense to just be "reactive" as opposed to "proactive": you just run your
jobs/tasks, observe how they behave, and then do something. There is no
assumption about the resource utilization of any task.
So how would one implement such a thing?
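At a high level I imagine something like the skeleton below, where all the
interesting decisions live in the policy and mechanism it delegates to (none of
these interfaces exist in Samza, and the 60 s period is just a placeholder):

{code:java}
// Skeleton of a reactive scaler: observe -> decide -> act, in a loop.
// None of these interfaces exist in Samza; they just mark the seams.
import java.util.Map;

public class ReactiveScaler implements Runnable {
  interface MetricsSource { Map<String, Double> snapshotPerTaskLoad(); }
  interface ScalingPolicy { int desiredContainers(Map<String, Double> load, int current); }
  interface ScalingMechanism { int currentContainers(); void resizeTo(int containers); }

  private final MetricsSource metrics;
  private final ScalingPolicy policy;
  private final ScalingMechanism mechanism;

  public ReactiveScaler(MetricsSource metrics, ScalingPolicy policy, ScalingMechanism mechanism) {
    this.metrics = metrics;
    this.policy = policy;
    this.mechanism = mechanism;
  }

  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      Map<String, Double> load = metrics.snapshotPerTaskLoad();  // observe
      int current = mechanism.currentContainers();
      int desired = policy.desiredContainers(load, current);     // decide
      if (desired != current) {
        mechanism.resizeTo(desired);                             // act
      }
      try {
        Thread.sleep(60_000);                                    // evaluation period
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}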
I think there are two important and distinct problems to solve: one is the
policy that decides when to perform scale-out or scale-in actions; the other is
the actual mechanism to execute such actions. So, policy and mechanism:
- Policy:
There are a number of things to decide here. Is it better to use the cluster
evenly or to minimize the execution time of jobs?
How do we detect bottlenecks? One can look at system metrics (memory or CPU
utilization), but it may also be useful to look at application-level metrics;
these will map to system resources, but they may provide more accurate
information about what is going on. Some application-level metrics would be the
number of times the state is updated per input message, or the general
selectivity of the task, i.e. how many messages are produced per input message.
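As an example of surfacing those application-level signals, a task could expose
them as ordinary counters next to the system metrics; everything below (counter
names, output stream) is made up for illustration:

{code:java}
// Illustrative instrumentation: count input messages, output messages and
// state updates so a policy can derive "selectivity" and "state updates per
// input message" for each task. Names are placeholders.
import org.apache.samza.config.Config;
import org.apache.samza.metrics.Counter;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class InstrumentedTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "output-topic");

  private Counter messagesIn;
  private Counter messagesOut;
  private Counter stateUpdates;

  public void init(Config config, TaskContext context) {
    messagesIn = context.getMetricsRegistry().newCounter("autoscale", "messages-in");
    messagesOut = context.getMetricsRegistry().newCounter("autoscale", "messages-out");
    stateUpdates = context.getMetricsRegistry().newCounter("autoscale", "state-updates");
  }

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    messagesIn.inc();
    // ... application logic; each store.put() would also do stateUpdates.inc() ...
    collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
    messagesOut.inc();
    // A policy can periodically sample these counters: selectivity is roughly
    // messages-out / messages-in, and state-updates / messages-in captures
    // how write-heavy the task is against its local store.
  }
}
{code}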
- Mechanism:
Probably the first decision to make is how the system should scale. One
strategy for facing bottlenecks is to scale up (that is, increase the
container's resources); another is to scale out (add a new container and split
the work). The opposites would be scale down and scale in, respectively. I
would always go for scale-out and scale-in. The reason is that it should be
easier to reason about this than about scaling up a container. In particular,
increasing the resources of a given container may not have a linear performance
impact, while a good scale-out strategy should...
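As a back-of-the-envelope illustration of the scale-out step itself: pick the
hottest container, add a fresh container id, and hand it roughly half of that
container's tasks. In Samza this would ultimately mean regenerating the
task-to-container assignment and restarting containers; all names below are
illustrative:

{code:java}
// Toy scale-out step: split the hottest container's task set with a new one.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScaleOutStep {
  /** Returns a new assignment with one extra container taking half of hotContainer's tasks. */
  public static Map<Integer, List<String>> scaleOut(Map<Integer, List<String>> assignment,
                                                    int hotContainer) {
    // Copy the assignment so the running job's view is untouched until we act on it.
    Map<Integer, List<String>> next = new HashMap<>();
    for (Map.Entry<Integer, List<String>> entry : assignment.entrySet()) {
      next.put(entry.getKey(), new ArrayList<>(entry.getValue()));
    }

    List<String> hotTasks = next.get(hotContainer);
    int newContainer = Collections.max(next.keySet()) + 1;

    // Move the second half of the hot container's tasks to the new container.
    int half = hotTasks.size() / 2;
    List<String> moved = new ArrayList<>(hotTasks.subList(half, hotTasks.size()));
    hotTasks.subList(half, hotTasks.size()).clear();
    next.put(newContainer, moved);
    return next;
  }
}
{code}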
Now, there is one more problem to take into consideration here. When one
controls resources and cluster "slots", the mechanism is just a matter of:
"Task A is a bottleneck, scale it out into Task A.1 and Task A.2, and put Task
A.1 on another machine". I am not sure what guarantees YARN provides for
choosing the machine where a newly scaled-out task will run. Also, if
containers provided full resource isolation, the problem would be easier; but
since this is not the case, one may not only need to force the scaled-out task
to execute on a different node, but also to choose which particular node, e.g.
according to the target resource requirements [not sure if this is very clear].
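On the YARN side, my (possibly incomplete) understanding is that the AM can at
least ask for a container on a specific host with a given capability through
the AMRMClient, along the lines of the sketch below (host, sizes and priority
are placeholders). Whether that is enough without real resource isolation is
exactly the open question above:

{code:java}
// Sketch: request an extra container on a specific node via the YARN AM/RM client.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class NodeTargetedRequest {
  public static void requestOn(AMRMClient<ContainerRequest> amClient, String host) {
    Resource capability = Resource.newInstance(2048 /* MB */, 2 /* vcores */);
    Priority priority = Priority.newInstance(0);
    // relaxLocality = false asks the ResourceManager not to fall back to other
    // nodes; even so, this is a request for capacity on that node, not a
    // guarantee of isolation from whatever else runs there.
    ContainerRequest request = new ContainerRequest(
        capability, new String[]{host}, null /* racks */, priority, false /* relaxLocality */);
    amClient.addContainerRequest(request);
  }
}
{code}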
All in all, I think this is a hard problem, but it may help to start by
questioning the assumptions and deciding on some of those tradeoffs. In the
end, it may turn out not to be that complex.
> Need for asymmetric container config
> ------------------------------------
>
> Key: SAMZA-334
> URL: https://issues.apache.org/jira/browse/SAMZA-334
> Project: Samza
> Issue Type: Improvement
> Components: container
> Affects Versions: 0.8.0
> Reporter: Chinmay Soman
>
> The current (and upcoming) partitioning scheme(s) suggest that there might be
> a skew in the amount of data ingested and computation performed across
> different containers for a given Samza job. This directly affects the amount
> of resources required by a container - which today are completely symmetric.
> Case A] Partitioning on Kafka partitions
> For instance, consider a partitioner job which reads data from different
> Kafka topics (having different partition layouts). In this case, it's possible
> that a lot of topics have a smaller number of Kafka partitions. Consequently
> the containers processing these partitions would need more resources than
> those responsible for the higher numbered partitions.
> Case B] Partitioning based on Kafka topics
> Even in this case, it's very easy for some containers to be doing more work
> than others - leading to a skew in resource requirements.
> Today, the container config is based on the requirements for the worst (doing
> the most work) container. Needless to say, this leads to resource wastage. A
> better approach needs to consider what is the true requirement per container
> (instead of per job).