Re: Stream job failed after increasing number retained checkpoints

Piotr Nowojski Tue, 09 Jan 2018 23:55:30 -0800

Hi,

Increasing Akka’s timeouts is rarely a solution to any problem: it either does 
not help or just masks the issue, making it less visible. But yes, it is 
possible to bump the limits: 
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#distributed-coordination-via-akka
 


I don’t think state.checkpoints.num-retained was designed to handle such a 
large number of retained checkpoints, so there may be some known or unknown 
limitations. Stefan, do you know anything in this regard?

In parallel, as with any other Akka timeout, you should track down its root 
cause. This one warning line doesn’t tell us much. Where does it come from? 
The client log? The job manager log? The task manager log? Please search on 
the opposite side of the timed-out connection for a possible root cause of 
the timeout, including:
- possible errors/exceptions/warnings
- long GC pauses or other blocking operations (possibly long unnatural gaps in 
the logs)
- machine health (CPU usage, disks usage, network connections)
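
To make the "gaps in the logs" check less manual, here is a minimal Python 
sketch (not part of Flink) that reports consecutive timestamped log lines that 
are further apart than a threshold. It assumes every log entry starts with a 
timestamp of the form 2018-01-09 16:38:00,123; adjust the pattern if your 
logging configuration differs. The function name find_log_gaps and the 
10-second default threshold are illustrative choices only.

```python
import re
from datetime import datetime, timedelta

# Assumption: each log entry begins with "YYYY-MM-DD HH:MM:SS,mmm".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_log_gaps(lines, threshold=timedelta(seconds=10)):
    """Return (previous_ts, next_ts, gap) for every pair of consecutive
    timestamped lines whose distance exceeds the threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            # Continuation lines (e.g. stack traces) carry no timestamp.
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps
```

Calling find_log_gaps(open(path)) on a job manager or task manager log file 
returns the suspicious intervals. A gap only shows that nothing was logged 
during that time, so it is a hint to look at GC logs and machine metrics for 
the same interval, not proof of a pause.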

Piotrek

> On 9 Jan 2018, at 16:38, Jose Miguel Tejedor Fernandez 
> <jose.fernan...@rovio.com> wrote:
> 
> Hello,
> 
> I have several stream jobs running (v. 1.3.1) in production which always 
> fail after a fixed period of around 30h of execution. This is the WARN 
> trace before the failure:
> 
> Association with remote system 
> [akka.tcp://fl...@ip-10-1-51-134.cloud-internal.acme.com:39876] has failed, 
> address is now gated for [5000] ms. Reason: [Association failed with 
> [akka.tcp://fl...@ip-10-1-51-134.cloud-internal.acme.com:39876]] Caused by: [No 
> response from remote for outbound association. Handshake timed out after 
> [20000 ms].
> 
> The main change made to the job configuration was to increase 
> state.checkpoints.num-retained from 1 to 2880. I am using asynchronous 
> RocksDB snapshots to persist the state. (I attach some screenshots with 
> the checkpoint conf from the web UI.)
> 
> Could my assumption be correct that the increase of checkpoints.num-retained 
> is causing the problem? Is there any known issue regarding this?
> Besides, is there any way to increase the Akka handshake timeout from the 
> current 20000 ms to a higher value? I thought it might be convenient to 
> increase the timeout to 1 minute instead.
> 
> BR
> 
> 
> <Screen Shot 2018-01-09 at 17.35.25.png><Screen Shot 2018-01-09 at 
> 17.35.18.png><Screen Shot 2018-01-09 at 17.35.00.png>
