Hi

It can also be that some task managers failed; in that case pending 
checkpoints are declined and their slots are removed.
Could you also have a look into the task manager logs?
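
If the cluster runs on YARN, the TaskManager logs should be available from 
the TaskManager pages in the web UI, or, once the application has finished, 
with something like:

    yarn logs -applicationId <application id>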

Best,
Andrey

> On 26 Nov 2018, at 12:19, qi luo <luoqi...@gmail.com> wrote:
> 
> This is weird. Could you paste your entire exception trace here?
> 
>> On Nov 26, 2018, at 4:37 PM, Flink Developer <developer...@protonmail.com> wrote:
>> 
>> In addition, after the Flink job has failed from the above exception, it is 
>> unable to recover from the previous checkpoint. Is this the expected 
>> behavior? How can the job be recovered successfully from this?
>> 
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Monday, November 26, 2018 12:30 AM, Flink Developer <developer...@protonmail.com> wrote:
>> 
>>> Thanks for the suggestion, Qi. I tried increasing slot.idle.timeout to 
>>> 3600000, but the issue still occurred. Does this mean that if a slot or 
>>> "Flink worker" has not processed items for 1 hour, it will be removed?
>>> 
>>> Would any other Flink configuration properties help with this?
>>> 
>>> slot.request.timeout
>>> web.timeout
>>> heartbeat.interval
>>> heartbeat.timeout
>>> 
>>> 
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Sunday, November 25, 2018 6:56 PM, 罗齐 <luoqi...@bytedance.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> It looks like some of your slots were freed during the job execution 
>>>> (possibly due to being idle for too long). AFAIK the exception is thrown 
>>>> when a pending slot request is removed. You can try increasing 
>>>> “slot.idle.timeout” to mitigate this issue (the default is 50000 ms; try 
>>>> 3600000 or higher).
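>>>> 
>>>> For example, in flink-conf.yaml (values are in milliseconds; apart from 
>>>> the slot.idle.timeout value above, the numbers below are only 
>>>> illustrative):
>>>> 
>>>> slot.idle.timeout: 3600000
>>>> slot.request.timeout: 600000
>>>> heartbeat.interval: 10000
>>>> heartbeat.timeout: 180000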
>>>> 
>>>> Regards,
>>>> Qi
>>>> 
>>>>> On Nov 26, 2018, at 7:36 AM, Flink Developer <developer...@protonmail.com> wrote:
>>>>> 
>>>>> Hi, I have a Flink application sourcing from a Kafka topic (400 
>>>>> partitions) and sinking to S3 using BucketingSink, with RocksDB 
>>>>> checkpointing every 2 minutes. The app runs with parallelism 400 so that 
>>>>> each worker handles one partition. This is on Flink 1.5.2. The cluster 
>>>>> uses 10 task managers with 40 slots each.
>>>>> 
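>>>>> For context, the pipeline is roughly of this shape (a minimal sketch 
>>>>> only; the topic name, S3 paths, Kafka properties and the exact consumer 
>>>>> class are placeholders that depend on the actual setup):
>>>>> 
>>>>> import java.util.Properties;
>>>>> 
>>>>> import org.apache.flink.api.common.serialization.SimpleStringSchema;
>>>>> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>>>>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>>>> import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
>>>>> import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
>>>>> 
>>>>> public class KafkaToS3Job {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         StreamExecutionEnvironment env =
>>>>>             StreamExecutionEnvironment.getExecutionEnvironment();
>>>>>         env.setParallelism(400);                // one slot per Kafka partition
>>>>>         env.enableCheckpointing(2 * 60 * 1000); // checkpoint every 2 minutes
>>>>>         // RocksDB state backend, checkpoints stored on S3 (placeholder URI)
>>>>>         env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints"));
>>>>> 
>>>>>         Properties props = new Properties();
>>>>>         props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
>>>>>         props.setProperty("group.id", "my-consumer-group");   // placeholder
>>>>> 
>>>>>         env.addSource(new FlinkKafkaConsumer011<String>(
>>>>>                 "my-topic", new SimpleStringSchema(), props))
>>>>>            .addSink(new BucketingSink<String>("s3://my-bucket/output"));
>>>>> 
>>>>>         env.execute("kafka-to-s3");
>>>>>     }
>>>>> }
>>>>> 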
>>>>> After running for a few days straight, it encounters a Flink exception:
>>>>> org.apache.flink.util.FlinkException: The assigned slot 
>>>>> container_1234567_0003_01_000009_1 was removed.
>>>>> 
>>>>> This causes the Flink job to fail. It is odd to me, and I am unsure what 
>>>>> causes it. Also, during this time, I see some checkpoints stating 
>>>>> "checkpoint was declined (tasks not ready)". At this point, the job is 
>>>>> unable to recover and fails. Does this happen if a slot or worker is not 
>>>>> doing any processing for X amount of time? Would I need to increase the 
>>>>> following Flink config properties when creating the Flink cluster in 
>>>>> YARN?
>>>>> 
>>>>> slot.idle.timeout
>>>>> slot.request.timeout
>>>>> web.timeout
>>>>> heartbeat.interval
>>>>> heartbeat.timeout
>>>>> 
>>>>> Any help would be greatly appreciated.
>>>>> 
>>> 
>> 
> 
