The design goal of the CapacityScheduler is to maximize the utilization of cluster 
resources; it does not try to allocate shares fairly amongst all of the users 
present in the system.

The user limit states how many concurrent users can use the slots in a queue. 
These limits are elastic in nature: since there is no preemption, as slots get 
freed up, new tasks will be allotted those slots until the user limit is met.
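
For illustration, the user limit is set per queue through the 
minimum-user-limit-percent property in capacity-scheduler.xml. This is only a 
sketch: the queue name "default" is just an example and the exact property name 
can vary slightly between releases, so check the template shipped with your 
version.

  <!-- capacity-scheduler.xml (sketch): each active user in the "default"
       queue is guaranteed at least 25% of the queue's slots; once four or
       more users are active, no single user can take more than 25%, so the
       queue effectively supports up to 4 concurrent users. -->
  <property>
    <name>mapred.capacity-scheduler.queue.default.minimum-user-limit-percent</name>
    <value>25</value>
  </property>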

For your requirement, you could submit the large jobs to a queue which has a max 
task limit set, so your long-running jobs don't take up the whole of the cluster 
capacity, and submit the shorter, smaller jobs to a fast-moving queue with 
something like a 10% user limit, which allows up to 10 concurrent users per 
queue. A sketch of that layout follows.
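
As a sketch only (the queue names "slow" and "fast" below are placeholders 
invented for this example), the two queues are declared in mapred-site.xml:

  <!-- mapred-site.xml (sketch): one queue for the long-running jobs,
       one fast-moving queue for the short jobs -->
  <property>
    <name>mapred.queue.names</name>
    <value>slow,fast</value>
  </property>

A job is then routed to a queue via mapred.job.queue.name; for example, from 
Hive the short queries could be sent to the fast queue by running 
"SET mapred.job.queue.name=fast;" before the query.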

The actual distribution of capacity across longer/shorter jobs depends on your 
workload.


On 4/30/11 1:14 AM, "Rosanna Man" <rosa...@auditude.com> wrote:

Hi Sreekanth,

Thank you very much for your clarification. Setting the max task limits on 
queues will work, but can we do something with the max user limit? Is it 
preemptible also? We are exploring the possibility of running the queries as 
different users with the capacity scheduler to maximize the use of the 
resources.

Basically, our goal is to maximize use of the resources (mappers and reducers) 
while still providing a fair share to the short tasks when a big task is 
running. How do you normally achieve that?

Thanks,
Rosanna

On 4/28/11 8:09 PM, "Sreekanth Ramakrishnan" <sreer...@yahoo-inc.com> wrote:

Hi

Currently the CapacityScheduler does not have preemption. So basically, when 
Job1 starts finishing and freeing up slots, Job2's tasks will start getting 
scheduled. One way you can prevent queue capacities from being elastic in 
nature is by setting max task limits on the queues. That way your Job1 will 
never exceed the first queue's capacity; an example is sketched below.
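
As a rough sketch, assuming the first queue is named "queue1" (a placeholder) 
and a release that supports a hard maximum on queue capacity (the exact 
property names differ between versions, so check the capacity-scheduler.xml 
template shipped with your distribution):

  <!-- capacity-scheduler.xml (sketch): give queue1 a 50% share and cap it
       at 50%, so it cannot grow elastically into idle capacity even when
       the other queues have no work -->
  <property>
    <name>mapred.capacity-scheduler.queue.queue1.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.queue1.maximum-capacity</name>
    <value>50</value>
  </property>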




On 4/28/11 11:48 PM, "Rosanna Man" <rosa...@auditude.com> wrote:

Hi all,

We are using the capacity scheduler to schedule resources among different queues 
for 1 user (hadoop) only. We have set the queues to have an equal share of the 
resources. However, when the 1st job starts in the first queue and consumes all 
the resources, the 2nd job, started in the 2nd queue, is starved of reducers 
until the first job finishes. A lot of processing gets stuck while a large 
query is executing.

We are using Hive on 0.20.2 in Amazon AWS. We tried to use the Fair Scheduler 
before, but it gives an error when the mapper produces no output (which is fine 
in our use cases).

Can anyone give us some advice?

Thanks,
Rosanna


--
Sreekanth Ramakrishnan
