Re: Flink 1.8.3 GC issues

2020-11-12 Thread Aljoscha Krettek
Created an issue for this: https://issues.apache.org/jira/browse/BEAM-11251 On 11.11.20 19:09, Aljoscha Krettek wrote: Hi, nice work on debugging this! We need the synchronized block in the source because the call to reader.advance() (via the invoker) and reader.getCurrent() (via

Re: Flink 1.8.3 GC issues

2020-11-11 Thread Aljoscha Krettek
Hi, nice work on debugging this! We need the synchronized block in the source because the call to reader.advance() (via the invoker) and reader.getCurrent() (via emitElement()) must be atomic with respect to state. We cannot advance the reader state, not emit that record but still checkpoint

Re: Flink 1.8.3 GC issues

2020-10-23 Thread Piotr Nowojski
Hi Josson, Thanks for great investigation and coming back to use. Aljoscha, could you help us here? It looks like you were involved in this original BEAM-3087 issue. Best, Piotrek pt., 23 paź 2020 o 07:36 Josson Paul napisał(a): > @Piotr Nowojski @Nico Kruber > > An update. > > I am able

Re: Flink 1.8.3 GC issues

2020-10-22 Thread Josson Paul
@Piotr Nowojski @Nico Kruber An update. I am able to figure out the problem code. A change in the Apache Beam code is causing this problem. Beam introduced a lock on the “emit” in Unbounded Source. The lock is on the Flink’s check point lock. Now the same lock is used by Flink’s timer

Re: Flink 1.8.3 GC issues

2020-09-14 Thread Piotr Nowojski
Hi Josson, The TM logs that you attached are only from a 5 minutes time period. Are you sure they are encompassing the period before the potential failure and after the potential failure? It would be also nice if you would provide the logs matching to the charts (like the one you were providing

Re: Flink 1.8.3 GC issues

2020-09-11 Thread Piotr Nowojski
Hi Josson, Have you checked the logs as Nico suggested? At 18:55 there is a dip in non-heap memory, just about when the problems started happening. Maybe you could post the TM logs? Have you tried updating JVM to a newer version? Also it looks like the heap size is the same between 1.4 and 1.8,

Re: Flink 1.8.3 GC issues

2020-09-10 Thread Nico Kruber
What looks a bit strange to me is that with a running job, the SystemProcessingTimeService should actually not be collected (since it is still in use)! My guess is that something is indeed happening during that time frame (maybe job restarts?) and I would propose to check your logs for

Re: Flink 1.8.3 GC issues

2020-09-10 Thread Piotr Nowojski
Hi Josson, Thanks again for the detailed answer, and sorry that I can not help you with some immediate answer. I presume that jvm args for 1.8 are the same? Can you maybe post what exactly has crashed in your cases a) and b)? Re c), in the previously attached word document, it looks like Flink

Re: Flink 1.8.3 GC issues

2020-09-09 Thread Piotr Nowojski
Hi Josson, Thanks for getting back. What are the JVM settings and in particular GC settings that you are using (G1GC?)? It could also be an issue that in 1.4 you were just slightly below the threshold of GC issues, while in 1.8, something is using a bit more memory, causing the GC issues to

Re: Flink 1.8.3 GC issues

2020-09-08 Thread Josson Paul
Hi Piotr, 2) SystemProcessingTimeService holds the HeapKeyedStateBackend and HeapKeyedStateBackend has lot of Objects and that is filling the Heap 3) I am not using Flink Kafka Connector. But we are using Apache Beam kafka connector. There is a change in the Apache Beam version. But the

Re: Flink 1.8.3 GC issues

2020-09-03 Thread Piotr Nowojski
Hi Josson, 2. Are you sure that all/vast majority of those objects are pointing towards SystemProcessingTimeService? And is this really the problem of those objects? Are they taking that much of the memory? 3. It still could be Kafka's problem, as it's likely that between 1.4 and 1.8.x we bumped

Re: Flink 1.8.3 GC issues

2020-09-03 Thread Josson Paul
1) We are in the process of migrating to Flink 1.11. But it is going to take a while before we can make everything work with the latest version. Meanwhile since this is happening in production I am trying to solve this. 2) Finalizae class is pointing to

Re: Flink 1.8.3 GC issues

2020-09-03 Thread Piotr Nowojski
Hi, Have you tried using a more recent Flink version? 1.8.x is no longer supported, and latest versions might not have this issue anymore. Secondly, have you tried backtracking those references to the Finalizers? Assuming that Finalizer is indeed the class causing problems. Also it may well be