Re: Flink Exception - assigned slot container was removed

2018-11-27 Thread Andrey Zagrebin
Hi

It can also be that some task managers fail; pending checkpoints are then
declined and their slots are removed.
Could you also have a look at the task manager logs?
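
If the job runs on YARN, the aggregated container logs can be pulled with something
like the following (the application id below is only a placeholder derived from the
container name in your exception):

    yarn logs -applicationId application_1234567_0003 | grep -iE "error|exception|killed"

Any task manager that died or was killed around the time of the failure should show
up there.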

Best,
Andrey


Re: Flink Exception - assigned slot container was removed

2018-11-26 Thread qi luo
This is weird. Could you paste your entire exception trace here?


Re: Flink Exception - assigned slot container was removed

2018-11-26 Thread Flink Developer
In addition, after the Flink job has failed with the above exception, it is unable
to recover from the previous checkpoint. Is this the expected behavior? How can the
job be recovered successfully from this state?
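
One way to recover manually is from a retained (externalized) checkpoint. A minimal
sketch, assuming externalized checkpoints were enabled in the job (env is the
StreamExecutionEnvironment; the checkpoint path and jar name below are placeholders,
not your actual values):

    // in the job, before env.execute(): keep completed checkpoints after failure/cancellation
    env.getCheckpointConfig().enableExternalizedCheckpoints(
            CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

and then resubmit from the last completed checkpoint's metadata path:

    flink run -s s3://my-bucket/checkpoints/<job-id>/chk-1234/_metadata my-job.jar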


Re: Flink Exception - assigned slot container was removed

2018-11-26 Thread Flink Developer
Thanks for the suggestion, Qi. I tried increasing slot.idle.timeout to 3600000 (1 hour),
but it seems I still encountered the issue. Does this mean that if a slot or "Flink
worker" has not processed items for 1 hour, it will be removed?

Would any other Flink configuration properties help with this? (An illustrative
flink-conf.yaml sketch follows the list below.)

slot.request.timeout
web.timeout
heartbeat.interval
heartbeat.timeout
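
All of these live in flink-conf.yaml and are given in milliseconds. A sketch with
purely illustrative values (tune to your environment; these are not recommendations):

    slot.idle.timeout: 3600000     # how long an unused slot is kept before it is released
    slot.request.timeout: 600000   # how long a pending slot request may wait before failing
    heartbeat.interval: 10000      # how often JobManager/TaskManager heartbeats are sent
    heartbeat.timeout: 120000      # how long before a missing heartbeat marks the peer as dead
    web.timeout: 120000            # timeout for web/REST operations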


Re: Flink Exception - assigned slot container was removed

2018-11-25 Thread 罗齐
Hi,

It looks like some of your slots were freed during job execution (possibly due to
being idle for too long). AFAIK the exception is thrown when a pending slot request
is removed. You can try increasing “slot.idle.timeout” to mitigate this issue (the
default is 50000 ms; try 3600000 or higher).
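
If the cluster is started per-job on YARN, the setting can also be passed as a
dynamic property instead of editing flink-conf.yaml; a sketch (value illustrative,
my-job.jar a placeholder):

    flink run -m yarn-cluster -yD slot.idle.timeout=3600000 my-job.jar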

Regards,
Qi


Flink Exception - assigned slot container was removed

2018-11-25 Thread Flink Developer
Hi, I have a Flink application that sources from a Kafka topic (400 partitions) and
sinks to S3 using BucketingSink, with RocksDB-backed checkpointing every 2 minutes.
The app runs with parallelism 400 so that each worker handles one partition. This is
on Flink 1.5.2. The Flink cluster uses 10 task managers with 40 slots each.
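
For reference, here is a rough sketch of that setup using Flink 1.5.x APIs. The topic
name, broker address, bucket paths and group id are placeholders, and the Kafka 0.11
connector is an assumption, not necessarily what the actual job uses:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

    public class KafkaToS3Job {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(400);                    // one slot per Kafka partition
            env.enableCheckpointing(2 * 60 * 1000);     // checkpoint every 2 minutes
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints"));

            Properties kafkaProps = new Properties();
            kafkaProps.setProperty("bootstrap.servers", "broker:9092");
            kafkaProps.setProperty("group.id", "my-consumer-group");

            DataStream<String> stream = env.addSource(
                    new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), kafkaProps));

            stream.addSink(new BucketingSink<String>("s3://my-bucket/output"));

            env.execute("kafka-to-s3");
        }
    }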

After running for a few days straight, it encounters a Flink exception:
org.apache.flink.util.FlinkException: The assigned slot 
container_1234567_0003_01_09_1 was removed.

This causes the Flink job to fail. It seems odd to me, and I am unsure what causes
it. During this time I also see some checkpoints stating "checkpoint was declined
(tasks not ready)". At this point the job is unable to recover and fails. Does this
happen if a slot or worker is not doing processing for X amount of time? Would I
need to increase the following Flink config properties when creating the Flink
cluster in YARN?

slot.idle.timeout
slot.request.timeout
web.timeout
heartbeat.interval
heartbeat.timeout

Any help would be greatly appreciated.