[ 
https://issues.apache.org/jira/browse/SAMZA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062764#comment-14062764
 ] 

Raul Castro Fernandez commented on SAMZA-334:
---------------------------------------------

Adding some comments on the "auto-scaling" problem:

I think one important question is which factors contribute to bottlenecks in a 
Samza cluster. So far, the examples seem to suggest that more input data means 
more computational cost. This is in general true, but it is not the only reason 
why there might be bottlenecks in the cluster. If it were the only reason, then 
one could make assumptions about the incoming traffic and make decisions 
(offline or online) accordingly, which would make the problem simpler.

I will try to illustrate next why this is not enough:

- Even if the data of a given topic is evenly distributed across the tasks, 
there might still be some processing skew. This is the case for jobs whose 
computation depends on the dataset: think of joins, or select predicates that 
have a higher selectivity for a given range of keys (a rough sketch follows 
after these two points). One could argue that with a good partitioning scheme 
this should be minimal. That is true as long as the job is stateless, i.e. all 
the computation depends only on the input data. When the computation is 
stateful, there is another source of problems.

- Another, more pathological case is that of a highly skewed key. For this 
case, I don't think there is much one can do.
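
To make the first point concrete, here is a minimal sketch of a Samza-style 
task whose per-message cost depends on the state it has accumulated for the 
incoming key, so two tasks receiving the same message rate can consume very 
different resources. The store and stream names are made up; this is just an 
illustration, not a proposal:

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    // Hypothetical stream-table join: the cost of process() depends on the
    // state accumulated per key, not only on the input rate.
    public class SkewedJoinTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> table; // changelog-backed local state

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // "join-table" is a made-up store name for this sketch.
        table = (KeyValueStore<String, String>) context.getStore("join-table");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String key = (String) envelope.getKey();
        // Lookup cost and output selectivity depend on the data seen so far:
        // a hot key, or a key range the predicate favours, produces more work.
        String row = table.get(key);
        if (row != null) {
          collector.send(new OutgoingMessageEnvelope(
              new SystemStream("kafka", "joined-output"), key + "|" + row));
        }
        table.put(key, (String) envelope.getMessage());
      }
    }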

What the above means is that it is very difficult to understand the resource 
requirements of a given job/task when both the input data and the dataset can 
contribute in an unpredictable way to the task's resource consumption. There is 
of course research work in this area: people do performance modeling with 
stochastic models and Markov processes to try to "predict" the performance of a 
given job/process/system. Their goal is to be "proactive". The main assumption 
for this kind of work is that you have lots of data available and you can 
characterize your jobs very well. This is in general not the case when users 
can write arbitrary jobs.

So what options are there?

One option is connected to the "predictable" case that Chris describes. It is 
true that with an approximate characterization of job resource usage, the 
cluster resources, and a goal (such as maximizing makespan -> good for ops, or 
minimizing processing time -> good for users), one can model the problem as a 
scheduling one, and there are plenty of strategies and solutions to achieve 
reasonably good plans here. However, the strong assumption is that the task 
resource consumption is known. I would not assume this, although previous 
comments seem to focus only on input data; am I missing something? Do you 
expect job computational requirements not to depend on datasets? I think this 
is difficult for two reasons: i) users can write *arbitrary* programs whose 
computation requirements depend on the dataset, and ii) even in the particular 
case of SQL, where this is a well understood problem, it remains hard in 
practice.
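
Just to spell out what "model the problem as a scheduling one" could look like 
under that strong assumption: if per-task resource consumption were known, even 
a simple first-fit-decreasing heuristic would pack tasks into containers 
reasonably well. The sketch below is generic bin packing, nothing Samza 
specific, and the whole thing stands or falls with the accuracy of the demand 
values:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // First-fit-decreasing packing of tasks into equally sized containers.
    public class FirstFitPacker {
      public static List<List<Double>> pack(List<Double> taskDemands,
                                            double containerCapacity) {
        List<Double> demands = new ArrayList<Double>(taskDemands);
        Collections.sort(demands, Collections.reverseOrder()); // largest first
        List<List<Double>> containers = new ArrayList<List<Double>>();
        List<Double> free = new ArrayList<Double>(); // remaining capacity
        for (double d : demands) {
          int target = -1;
          for (int i = 0; i < free.size(); i++) {
            if (free.get(i) >= d) { target = i; break; } // first one that fits
          }
          if (target < 0) {                              // open a new container
            containers.add(new ArrayList<Double>());
            free.add(containerCapacity);
            target = containers.size() - 1;
          }
          containers.get(target).add(d);
          free.set(target, free.get(target) - d);
        }
        return containers;
      }
    }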

The other option is connected to the "unpredictable" case that Chris discusses. 
I believe this is the one that makes more sense. In particular, I believe it 
makes sense to be "reactive" as opposed to "proactive". That is, you just run 
your jobs/tasks, observe how they behave, and then do something. Crucially, 
there is no assumption about the resource utilization of any task.
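
At its core, being reactive is a control loop: observe, decide, act. A very 
rough skeleton could look like the following; every interface here is 
hypothetical, nothing like this exists in Samza today, and the 30 second 
interval is arbitrary:

    // Hypothetical interfaces; nothing here exists in Samza today.
    interface MetricsSource { Snapshot latest(); }             // observe
    interface ScalingPolicy { Action evaluate(Snapshot s); }   // decide
    interface ScalingMechanism { void apply(Action a); }       // act

    class Snapshot { /* per-task counters, CPU, memory, consumer lag, ... */ }
    enum Action { NONE, SCALE_OUT, SCALE_IN }

    class ReactiveController implements Runnable {
      private final MetricsSource metrics;
      private final ScalingPolicy policy;
      private final ScalingMechanism mechanism;

      ReactiveController(MetricsSource m, ScalingPolicy p, ScalingMechanism x) {
        this.metrics = m;
        this.policy = p;
        this.mechanism = x;
      }

      @Override
      public void run() {
        // No prior assumption about any task's resource usage: just observe
        // the running tasks periodically and react.
        while (!Thread.currentThread().isInterrupted()) {
          Action action = policy.evaluate(metrics.latest());
          if (action != Action.NONE) {
            mechanism.apply(action);
          }
          try {
            Thread.sleep(30000); // arbitrary evaluation interval
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    }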

So how would one implement such a thing?

I think there are two important and distinct problems to solve: one is the 
policy that decides when to perform scale-out or scale-in actions; the other is 
the actual mechanism to execute such actions. So, policy and mechanism:

- Policy:
There are a number of things to decide here. Is it better to use the cluster 
evenly or to minimize the execution time of jobs? How do we detect bottlenecks? 
One can look at system metrics (memory or CPU utilization), but it may also be 
useful to look at application-level metrics (these will map to system 
resources, but they may provide more accurate information about what is going 
on). Some application-level metrics would be the number of times the state is 
updated per input message, or the general selectivity of the task, i.e. how 
many messages are produced per input message (a rough sketch of such a policy 
follows after these two points).

- Mechanism:
Probably the first decision to make is how the system needs to scale. One 
strategy to deal with bottlenecks is to scale up (that is, increase the 
container's resources); another is to scale out (add a new container and split 
the work). The opposites are scale-down and scale-in, respectively. I would 
always go for scale-out and scale-in. The reason is that it should be easier to 
reason about this than about scaling up a container. In particular, increasing 
the resources of a given container may not have a linear performance impact, 
while a good scale-out strategy should.
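
To make the policy side a bit more concrete, here is a rough sketch of how the 
application-level signals mentioned above (state updates per input message, 
task selectivity, output per input message) could feed a scale-out decision. 
The thresholds and counter names are invented; real values would come from the 
Samza metrics system and would need tuning:

    // Toy policy based on application-level counters sampled over a window.
    // All thresholds are made up; real counters would come from the Samza
    // metrics system.
    public class AppLevelScalingPolicy {
      private final double maxOutputPerInput;      // messages produced per input
      private final double maxStateWritesPerInput; // store writes per input
      private final long maxAvgProcessNs;          // average process() latency

      public AppLevelScalingPolicy(double maxOutputPerInput,
                                   double maxStateWritesPerInput,
                                   long maxAvgProcessNs) {
        this.maxOutputPerInput = maxOutputPerInput;
        this.maxStateWritesPerInput = maxStateWritesPerInput;
        this.maxAvgProcessNs = maxAvgProcessNs;
      }

      // True if the observed window suggests the task is a bottleneck.
      public boolean shouldScaleOut(long messagesIn, long messagesOut,
                                    long stateWrites, long totalProcessNs) {
        if (messagesIn == 0) {
          return false;                            // idle task, nothing to do
        }
        double outputPerInput = (double) messagesOut / messagesIn;
        double writesPerInput = (double) stateWrites / messagesIn;
        long avgProcessNs = totalProcessNs / messagesIn;
        return outputPerInput > maxOutputPerInput
            || writesPerInput > maxStateWritesPerInput
            || avgProcessNs > maxAvgProcessNs;
      }
    }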

Now, there is one more problem to take into consideration here. When one 
controls resources and cluster "slots", the mechanism is just a matter of: 
"Task A is a bottleneck, split it into Task A.1 and Task A.2, and put Task A.1 
on another machine". I am not sure what guarantees YARN provides for choosing 
the machine where a newly scaled-out task will run. Also, if containers 
provided full resource isolation, the problem would be easier, but since this 
is not the case, one may not only need to force the scaled-out task to execute 
on a different node, but also choose which particular node, e.g. according to 
the target resource requirements [not sure if this is very clear].
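
Regarding choosing the node: as far as I know, YARN's AMRMClient lets the 
application master express node preferences when requesting a container, and 
treats them as hints unless locality relaxation is disabled. A rough sketch of 
what requesting a container for a scaled-out task on a specific node might look 
like (node name, sizes and priority are placeholders, and the allocate/launch 
loop is omitted):

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ScaleOutContainerRequest {
      public static void requestContainerOn(String nodeName) {
        AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
        amClient.init(new YarnConfiguration());
        amClient.start();

        // Placeholder sizing for the new (split) task's container.
        Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
        Priority priority = Priority.newInstance(0);

        // relaxLocality=false asks the scheduler to honour the node strictly;
        // with the default (true) the node list is only a hint.
        ContainerRequest request = new ContainerRequest(
            capability,
            new String[] { nodeName }, // preferred node(s)
            null,                      // racks
            priority,
            false);                    // relaxLocality

        amClient.addContainerRequest(request);
        // ... allocate() heartbeat loop and container launch via NMClient
        // omitted; the Samza AM would own this client in practice.
      }
    }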

All in all, I think this is a hard problem, but it may help to start by 
questioning the assumptions and deciding on some of those tradeoffs. In the 
end, it may turn out not to be that complex.

> Need for asymmetric container config
> ------------------------------------
>
>                 Key: SAMZA-334
>                 URL: https://issues.apache.org/jira/browse/SAMZA-334
>             Project: Samza
>          Issue Type: Improvement
>          Components: container
>    Affects Versions: 0.8.0
>            Reporter: Chinmay Soman
>
> The current (and upcoming) partitioning scheme(s) suggest that there might be 
> a skew in the amount of data ingested and computation performed across 
> different containers for a given Samza job. This directly affects the amount 
> of resources required by a container - which today are completely symmetric.
> Case A] Partitioning on Kafka partitions
> For instance, consider a partitioner job which reads data from different 
> Kafka topics (having different partition layouts). In this case, it's possible 
> that a lot of topics have a smaller number of Kafka partitions. Consequently 
> the containers processing these partitions would need more resources than 
> those responsible for the higher numbered partitions. 
> Case B] Partitioning based on Kafka topics
> Even in this case, it's very easy for some containers to be doing more work 
> than others - leading to a skew in resource requirements.
> Today, the container config is based on the requirements of the worst (doing 
> the most work) container. Needless to say, this leads to resource wastage. A 
> better approach needs to consider the true requirement per container 
> (instead of per job).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
