Re: Not enough free slots to run the job

2017-10-27 Thread Fabian Hueske
Hi David,

that's correct. A TM is a single process. A slot is just a virtual concept
within the TM process; its slice of the program runs in multiple threads.
Apart from managed memory (which is split into chunks and assigned to
slots), all other resources (CPU, heap, network, disk) are not isolated
and are freely shared among all threads.

The DataSet API operators operate almost exclusively on managed memory.
Heap memory is only used for in-flight data, not to store larger amounts
of data. So having unused slots leaves some of the configured managed
memory unused.
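
For illustration, a minimal flink-conf.yaml sketch of this model (the key
names follow Flink 1.x of that era and the values are placeholders, so
please check the documentation of your version):

    # One TM process; its 4 slots share the same JVM heap, CPU, network, disk
    taskmanager.numberOfTaskSlots: 4
    # Total JVM heap of the TM process
    taskmanager.heap.mb: 4096
    # Fraction of the remaining heap reserved as managed memory, i.e. the
    # part that is split into chunks and assigned to slots
    taskmanager.memory.fraction: 0.7

With 4 slots, each slot receives roughly a quarter of the managed memory,
so leaving a slot unused leaves that share idle.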

Best, Fabian

2017-10-27 3:44 GMT+02:00 David Dreyfus :

> Hello,
>
> I know this is an older thread, but ...
>
> If some slots are left empty it doesn't necessarily mean that machine
> resources are wasted. Some managed memory might be unavailable, but CPU,
> heap memory, network, and disk are shared across slots. To the extent
> there are multiple operators executing within a slot, multiple threads
> are executing and consuming those resources. It's not clear what the
> actual performance degradation would be, if any. Correct?
>
> David
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


Re: Not enough free slots to run the job

2017-10-26 Thread David Dreyfus
Hello,

I know this is an older thread, but ...

If some slots are left empty it doesn't necessarily mean that machine
resources are wasted. Some managed memory might be unavailable, but CPU,
heap memory, network, and disk are shared across slots. To the extent there
are multiple operators executing within a slot, multiple threads are
executing and consuming those resources. It's not clear what the actual
performance degradation would be, if any. Correct?

David 



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Not enough free slots to run the job

2016-03-21 Thread Ovidiu-Cristian MARCU
Thanks, very clear :)!

best,
Ovidiu
> On 21 Mar 2016, at 16:31, Robert Metzger  wrote:
> 
> Hi,
> 
> let's say you have 10 TaskManagers with 2 slots each. In total you have 20 
> slots available.
> Starting a job with parallelism=18 allows you to restart immediately if one 
> TaskManager fails.
> Now, regarding your questions:
> Q1: yes, using fewer slots than available reduces the likelihood of running 
> into "not enough slots" issues. And yes, you will not use all available 
> resources in your cluster, and hence lose some performance.
> Q2: It depends. In the example above, the job would restart. As long as there 
> are enough slots available, jobs will restart.
> 
> 
> On Mon, Mar 21, 2016 at 3:30 PM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
> Hi Robert,
> 
> I am not sure I understand, so please confirm whether I read your 
> suggestions correctly:
> - use fewer slots than the available slot capacity, to avoid issues such as 
> a TaskManager not contributing its slots because of problems registering 
> the TM;
> (this means I will lose some performance by not using all the available 
> capacity)
> - if a job fails because it loses a TaskManager (and its slots), the job 
> will not restart even if free slots are available.
> (in this case the ‘spare slots’ will not help, right? Losing a TM means 
> the job will fail, with no recovery)
> 
> Thanks!
> 
> Best,
> Ovidiu
> 
> 
>> On 21 Mar 2016, at 14:15, Robert Metzger <rmetz...@apache.org> wrote:
>> 
>> Hi Ovidiu,
>> 
>> right now the scheduler in Flink will not use more slots than requested.
>> To avoid issues on recovery, we usually recommend users to keep some spare 
>> slots (run the job with p=15 on a cluster with slots=20). I agree that it 
>> would make sense to add a flag which allows a job to grab more slots if 
>> they are available. The problem with that, however, is that jobs currently 
>> cannot change their parallelism. So if a job fails, it cannot downscale to 
>> restart on the remaining slots.
>> That's why the spare slots approach is currently the only way to go.
>> 
>> Regards,
>> Robert
>> 
>> On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>> Hi,
>> 
>> For the situation where a program specifies a maximum parallelism (so it 
>> is supposed to use all available task slots), it is possible that one of 
>> the task managers is not registered for various reasons.
>> In this case the job will fail with "not enough free slots to run the job".
>> 
>> To me this means the scheduler is limited to statically assigning tasks 
>> to the task slots the job is configured for.
>> 
>> Instead, I would like to be able to specify a minimum parallelism for a 
>> job, but also the possibility to dynamically use more task slots if 
>> additional task slots are available.
>> Another use case: if during the execution of a job we lose one node (and 
>> so some task slots), and the minimum parallelism is still ensured, the 
>> job should recover and continue its execution instead of just failing.
>> 
>> Is it possible to make such changes?
>> 
>> Best,
>> Ovidiu
>> 
> 
> 



Re: Not enough free slots to run the job

2016-03-21 Thread Robert Metzger
Hi,

let's say you have 10 TaskManagers with 2 slots each. In total you have 20
slots available.
Starting a job with parallelism=18 allows you to restart immediately if one
TaskManager fails.
Now, regarding your questions:
Q1: yes, using fewer slots than available reduces the likelihood of running
into "not enough slots" issues. And yes, you will not use all available
resources in your cluster, and hence lose some performance.
Q2: It depends. In the example above, the job would restart. As long as
there are enough slots available, jobs will restart.
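
As a concrete sketch of the spare-slots idea (Java DataSet API; the 18/20
numbers come from the example above, and the job logic is a placeholder):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class SpareSlotsExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();
            // Cluster has 10 TMs x 2 slots = 20 slots; request only 18,
            // so losing one TM (2 slots) still leaves enough to restart.
            env.setParallelism(18);
            env.fromElements(1, 2, 3)
                    .map(new MapFunction<Integer, Integer>() {
                        @Override
                        public Integer map(Integer value) {
                            return value * 2;
                        }
                    })
                    .print(); // print() triggers execution eagerly
        }
    }

Equivalently, the parallelism can be set at submission time with
bin/flink run -p 18 <job-jar>, leaving the program itself unchanged.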


On Mon, Mar 21, 2016 at 3:30 PM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi Robert,
>
> I am not sure I understand, so please confirm whether I read your
> suggestions correctly:
> - use fewer slots than the available slot capacity, to avoid issues such
> as a TaskManager not contributing its slots because of problems
> registering the TM;
> (this means I will lose some performance by not using all the available
> capacity)
> - if a job fails because it loses a TaskManager (and its slots), the job
> will not restart even if free slots are available.
> (in this case the ‘spare slots’ will not help, right? Losing a TM means
> the job will fail, with no recovery)
>
> Thanks!
>
> Best,
> Ovidiu
>
>
> On 21 Mar 2016, at 14:15, Robert Metzger  wrote:
>
> Hi Ovidiu,
>
> right now the scheduler in Flink will not use more slots than requested.
> To avoid issues on recovery, we usually recommend users to keep some spare
> slots (run the job with p=15 on a cluster with slots=20). I agree that it
> would make sense to add a flag which allows a job to grab more slots if
> they are available. The problem with that, however, is that jobs currently
> cannot change their parallelism. So if a job fails, it cannot downscale to
> restart on the remaining slots.
> That's why the spare slots approach is currently the only way to go.
>
> Regards,
> Robert
>
> On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Hi,
>>
>> For the situation where a program specifies a maximum parallelism (so it
>> is supposed to use all available task slots), it is possible that one of
>> the task managers is not registered for various reasons.
>> In this case the job will fail with "not enough free slots to run the job".
>>
>> To me this means the scheduler is limited to statically assigning tasks
>> to the task slots the job is configured for.
>>
>> Instead, I would like to be able to specify a minimum parallelism for a
>> job, but also the possibility to dynamically use more task slots if
>> additional task slots are available.
>> Another use case: if during the execution of a job we lose one node (and
>> so some task slots), and the minimum parallelism is still ensured, the
>> job should recover and continue its execution instead of just failing.
>>
>> Is it possible to make such changes?
>>
>> Best,
>> Ovidiu
>
>
>
>


Re: Not enough free slots to run the job

2016-03-21 Thread Ovidiu-Cristian MARCU
Hi Robert,

I am not sure I understand, so please confirm whether I read your 
suggestions correctly:
- use fewer slots than the available slot capacity, to avoid issues such as 
a TaskManager not contributing its slots because of problems registering the TM;
(this means I will lose some performance by not using all the available 
capacity)
- if a job fails because it loses a TaskManager (and its slots), the job 
will not restart even if free slots are available.
(in this case the ‘spare slots’ will not help, right? Losing a TM means 
the job will fail, with no recovery)

Thanks!

Best,
Ovidiu


> On 21 Mar 2016, at 14:15, Robert Metzger  wrote:
> 
> Hi Ovidiu,
> 
> right now the scheduler in Flink will not use more slots than requested.
> To avoid issues on recovery, we usually recommend users to keep some spare 
> slots (run the job with p=15 on a cluster with slots=20). I agree that it 
> would make sense to add a flag which allows a job to grab more slots if 
> they are available. The problem with that, however, is that jobs currently 
> cannot change their parallelism. So if a job fails, it cannot downscale to 
> restart on the remaining slots.
> That's why the spare slots approach is currently the only way to go.
> 
> Regards,
> Robert
> 
> On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
> Hi,
> 
> For the situation where a program specifies a maximum parallelism (so it is 
> supposed to use all available task slots), it is possible that one of the 
> task managers is not registered for various reasons.
> In this case the job will fail with "not enough free slots to run the job".
> 
> To me this means the scheduler is limited to statically assigning tasks to 
> the task slots the job is configured for.
> 
> Instead, I would like to be able to specify a minimum parallelism for a job, 
> but also the possibility to dynamically use more task slots if additional 
> task slots are available.
> Another use case: if during the execution of a job we lose one node (and so 
> some task slots), and the minimum parallelism is still ensured, the job 
> should recover and continue its execution instead of just failing.
> 
> Is it possible to make such changes?
> 
> Best,
> Ovidiu
> 



Re: Not enough free slots to run the job

2016-03-21 Thread Robert Metzger
Hi Ovidiu,

right now the scheduler in Flink will not use more slots than requested.
To avoid issues on recovery, we usually recommend users to keep some spare
slots (run the job with p=15 on a cluster with slots=20). I agree that it
would make sense to add a flag which allows a job to grab more slots if
they are available. The problem with that, however, is that jobs currently
cannot change their parallelism. So if a job fails, it cannot downscale to
restart on the remaining slots.
That's why the spare slots approach is currently the only way to go.
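
As a hedged sketch of the recovery side (flink-conf.yaml keys as in Flink
1.x; the values are placeholders): a restart attempt only succeeds if the
job's full parallelism is available in free slots again, which is exactly
what the spare slots provide.

    # Retry the job after a failure such as a lost TaskManager. Each
    # attempt needs the job's full original parallelism in free slots,
    # because a failed job cannot downscale onto fewer slots.
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 10 s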

Regards,
Robert

On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi,
>
> For the situation where a program specifies a maximum parallelism (so it
> is supposed to use all available task slots), it is possible that one of
> the task managers is not registered for various reasons.
> In this case the job will fail with "not enough free slots to run the job".
>
> To me this means the scheduler is limited to statically assigning tasks
> to the task slots the job is configured for.
>
> Instead, I would like to be able to specify a minimum parallelism for a
> job, but also the possibility to dynamically use more task slots if
> additional task slots are available.
> Another use case: if during the execution of a job we lose one node (and
> so some task slots), and the minimum parallelism is still ensured, the
> job should recover and continue its execution instead of just failing.
>
> Is it possible to make such changes?
>
> Best,
> Ovidiu


Not enough free slots to run the job

2016-03-18 Thread Ovidiu-Cristian MARCU
Hi,

For the situation where a program specifies a maximum parallelism (so it is 
supposed to use all available task slots), it is possible that one of the 
task managers is not registered for various reasons.
In this case the job will fail with "not enough free slots to run the job".

To me this means the scheduler is limited to statically assigning tasks to 
the task slots the job is configured for.

Instead, I would like to be able to specify a minimum parallelism for a job, 
but also the possibility to dynamically use more task slots if additional 
task slots are available.
Another use case: if during the execution of a job we lose one node (and so 
some task slots), and the minimum parallelism is still ensured, the job 
should recover and continue its execution instead of just failing.

Is it possible to make such changes?

Best,
Ovidiu