[jira] [Commented] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655070#comment-17655070 ]

Thomas Wozniakowski commented on FLINK-30562:
---------------------------------------------

[^flink-asf-30562-clean.zip] I've produced a (relatively) simple project here that reproduces the problem. Please let me know if you have any questions.

> CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-30562
>                 URL: https://issues.apache.org/jira/browse/FLINK-30562
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream, Library / CEP
>    Affects Versions: 1.16.0, 1.15.3
>        Environment: Problem observed in:
> Production: Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and sinking to AWS SQS
> Local: Completely local MiniCluster-based test with no external sinks or sources
>            Reporter: Thomas Wozniakowski
>            Priority: Major
>         Attachments: flink-asf-30562-clean.zip
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to raise this while I am investigating, to see if anyone has suggestions to help me narrow down the problem.)
>
> We are encountering an issue where our streaming Flink job has stopped working correctly since Flink 1.15.3. The problem is also present on Flink 1.16.0. The Keyed CEP operators that our job uses no longer emit Patterns reliably, but critically *this only happens when parallelism is set to a value greater than 1*.
>
> Our local build tests were previously set up using in-JVM `MiniCluster` instances, or dockerised Flink clusters, all with a parallelism of 1, so this problem was not caught, and it caused an outage when we upgraded the cluster version in production.
>
> Observing the job in the Flink console in production, I can see that events are *arriving* at the Keyed CEP operators, but no Pattern events are being emitted by any of the operators. Furthermore, all the reported Watermark values are zero, though I don't know if that is a red herring, as Watermark reporting seems to have changed since 1.14.x.
>
> I am currently attempting to create a stripped-down version of our streaming job to demonstrate the problem, but this is quite tricky to set up. In the meantime I would appreciate any hints that could point me in the right direction.
>
> I have isolated the problem to the Keyed CEP operator by removing our real sinks and sources from the failing test. I am still seeing the erroneous behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # The CEP operator processes the events
> # Output is captured in another list for assertions
>
> My best guess at the moment is that this is something to do with Watermark emission. There seem to have been changes related to watermark alignment; perhaps this has caused some kind of regression in the CEP library? To reiterate, *this problem only occurs with parallelism of 2 or more. Setting the parallelism to 1 immediately fixes the issue.*

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
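The zero-Watermark observation above can be illustrated outside Flink: an operator with several input channels tracks one watermark per channel and advances its own watermark only to the minimum of them, so a single parallel input that never emits a watermark holds the operator at Long.MIN_VALUE indefinitely, and event-time work such as CEP matching never fires. A minimal stdlib-Java sketch of that combining rule (illustrative only; this is not Flink's actual watermark-valve code):

```java
import java.util.Arrays;

/** Toy model: a multi-input operator's watermark is the minimum over its input channels. */
public class WatermarkMin {

    static long combinedWatermark(long[] channelWatermarks) {
        return Arrays.stream(channelWatermarks).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        // Two channels advancing normally, one channel that has never emitted a watermark.
        long[] oneIdleChannel = {1_000L, 2_000L, Long.MIN_VALUE};
        long[] allActive = {1_000L, 2_000L, 1_500L};

        // The idle channel pins the operator watermark at Long.MIN_VALUE,
        // so event-time timers (and therefore CEP matches) never fire.
        System.out.println(combinedWatermark(oneIdleChannel) == Long.MIN_VALUE); // true
        System.out.println(combinedWatermark(allActive)); // 1000
    }
}
```

In real jobs the usual mitigation for a permanently idle parallel input is marking it idle (e.g. `WatermarkStrategy#withIdleness`); whether that is relevant to this regression is exactly what the ticket is trying to establish.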
[jira] [Updated] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Wozniakowski updated FLINK-30562:
----------------------------------------
    Attachment: flink-asf-30562-clean.zip
[jira] [Updated] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Wozniakowski updated FLINK-30562:
----------------------------------------
    Component/s: API / DataStream
[jira] [Updated] (FLINK-30562) CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Wozniakowski updated FLINK-30562:
----------------------------------------
    Summary: CEP Operator misses patterns on SideOutputs and parallelism >1 since 1.15.x+  (was: Patterns are not emitted with parallelism >1 since 1.15.x+)
[jira] [Commented] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655011#comment-17655011 ]

Thomas Wozniakowski commented on FLINK-30562:
---------------------------------------------

Hi [~bgeng777],

I've made some progress in narrowing down the problem. I am still working on producing a reproducible code snippet I can share, but the problem is definitely related to *Side Outputs*.

For context, we use Side Outputs to route events to different CEP operators depending on a Customer ID value (different customers are interested in different CEP sequences). We previously used the {{.split()}} operator before it was deprecated.

We set up the side outputs with a call like this (I have dramatically simplified the code, but the problem still occurs with the code in this form):

{code:java}
streamWithSideOutputs = stream.process(new BrandedSideOutputFunction());

// Where the side output function is:
public static class BrandedSideOutputFunction extends ProcessFunction<PlatformEvent, PlatformEvent> {

    private final OutputTag<PlatformEvent> outputTag =
            new OutputTag<>("RED_BRAND", TypeInformation.of(PlatformEvent.class));

    @Override
    public void processElement(PlatformEvent value, Context ctx, Collector<PlatformEvent> out) {
        ctx.output(outputTag, value);
        out.collect(value);
    }
}
{code}

You'll note that this side output function only outputs to one, hardcoded side output. The real code is more complex but, as I say, the problem still occurs with the code as written above.

With this {{.process(...)}} call upstream of the CEP operators, and the {{parallelism}} set to a value greater than 1, the Patterns fail to be detected roughly one third of the time. Note that this happens whether I connect the CEP operator to the *main* {{DataStream}} or to a side output via {{.getSideOutput(tag)}}.

If the {{parallelism}} is set to 1, or if I remove the side-output-generating {{.process(...)}} call and connect the CEP operator directly to the existing {{DataStream}}, the Patterns are detected 100% of the time.

There seems to be something up with the interaction between side outputs, parallelism and the CEP operator in Flink 1.15.0+. I will keep working on producing a project I can share reproducing this problem, but hopefully this gives you something to go on?
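One way to make sense of the "roughly one third of the time" observation above: after a keyBy, all events for a given key are routed to exactly one parallel subtask via key groups, so which subtask hosts a given customer's CEP state is effectively a hash draw. The sketch below is a simplified, illustrative version of Flink's key-group assignment (real Flink additionally applies a murmur hash on top of {{key.hashCode()}}; the class and key names here are hypothetical):

```java
public class KeyGroupSketch {

    // A key maps to one of maxParallelism key groups...
    static int keyGroupFor(Object key, int maxParallelism) {
        // Real Flink: murmurHash(key.hashCode()) % maxParallelism. Simplified here.
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // ...and each key group maps to exactly one operator subtask.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // Flink's default for small parallelism
        int parallelism = 3;
        for (String customerId : new String[]{"RED_BRAND", "BLUE_BRAND", "GREEN_BRAND"}) {
            int kg = keyGroupFor(customerId, maxParallelism);
            System.out.println(customerId + " -> subtask " + subtaskFor(kg, maxParallelism, parallelism));
        }
    }
}
```

With parallelism 3, each key lands on one of three subtasks; if only some subtasks' watermarks ever advance, a fixed fraction of keys would see no matches, which is consistent with (though not proof of) the failure rate reported above.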
[jira] [Commented] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654511#comment-17654511 ]

Thomas Wozniakowski commented on FLINK-30562:
---------------------------------------------

Hi [~bgeng777], thanks for the quick response.

Your demo is roughly the same as the one I'm trying to set up to reproduce the issue in a compact way. I will use it for guidance to see if I can get something useful available. My experiments are showing:

*Flink version 1.14.3, parallelism: any*
CEP operators produce expected output

*Flink versions 1.15.x+, parallelism: 1*
CEP operators produce expected output

*Flink versions 1.15.x+, parallelism: 2+*
CEP operators produce no output at all

It's worth noting that we did not change any code related to our CEP usage between these tests; we simply updated the library versions. We are using more pattern constraints than exist in your test file, and I'm wondering if it might be related to one of those. For example, we use {{.within(...)}} and {{.times(...)}} on most of our Pattern definitions.
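To make the {{.times(...)}}/{{.within(...)}} constraints mentioned above concrete: {{times(n)}} requires n matching events for a key, and {{within(d)}} requires the whole sequence to fit inside an event-time window of length d. A self-contained Java approximation of that check for a single key (illustrative only; this is not the CEP library's NFA implementation, and the class name is made up):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy check: have `times` events for one key arrived within a `windowMs` span? */
public class TimesWithinSketch {

    private final int times;
    private final long windowMs;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    TimesWithinSketch(int times, long windowMs) {
        this.times = times;
        this.windowMs = windowMs;
    }

    /** Feed one event timestamp; returns true when a complete match is emitted. */
    boolean onEvent(long ts) {
        timestamps.addLast(ts);
        // Drop events too old to be part of any within-window match.
        while (ts - timestamps.peekFirst() > windowMs) {
            timestamps.removeFirst();
        }
        if (timestamps.size() >= times) {
            timestamps.clear(); // reset after a match, roughly like a skip-past-last-event strategy
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TimesWithinSketch matcher = new TimesWithinSketch(3, 10_000L);
        System.out.println(matcher.onEvent(0L));      // false
        System.out.println(matcher.onEvent(4_000L));  // false
        System.out.println(matcher.onEvent(8_000L));  // true: 3 events within 10s
        System.out.println(matcher.onEvent(50_000L)); // false: buffer was reset
    }
}
```

The point of the model is that both constraints depend on event time: if the operator's watermark never advances, the real CEP operator cannot conclude anything about {{within}} windows, which ties this comment back to the watermark speculation in the description.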
[jira] [Updated] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Wozniakowski updated FLINK-30562:
----------------------------------------
    Description: The previous description, with the following sentence appended: "To reiterate, *this problem only occurs with parallelism of 2 or more. Setting the parallelism to 1 immediately fixes the issue*"
[jira] [Created] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+
Thomas Wozniakowski created FLINK-30562:
-------------------------------------------

             Summary: Patterns are not emitted with parallelism >1 since 1.15.x+
                 Key: FLINK-30562
                 URL: https://issues.apache.org/jira/browse/FLINK-30562
             Project: Flink
          Issue Type: Bug
          Components: Library / CEP
    Affects Versions: 1.15.3, 1.16.0
         Environment: Problem observed in:
Production: Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and sinking to AWS SQS
Local: Completely local MiniCluster based test with no external sinks or sources
            Reporter: Thomas Wozniakowski

(Original issue description as quoted in the comments above.)
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247999#comment-17247999 ]

Thomas Wozniakowski commented on FLINK-19970:
---------------------------------------------

[~dwysakowicz] I've emailed you the code + JAR. Please give me a shout here or over email if you need anything else from me.

> State leak in CEP Operators (expired events/keys not removed from state)
> -------------------------------------------------------------------------
>
>                 Key: FLINK-19970
>                 URL: https://issues.apache.org/jira/browse/FLINK-19970
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / CEP
>    Affects Versions: 1.11.2
>        Environment: Flink 1.11.2 run using the official docker containers in AWS ECS Fargate.
> 1 Job Manager, 1 Taskmanager with 2 vCPUs and 8GB memory
>            Reporter: Thomas Wozniakowski
>            Priority: Critical
>         Attachments: image-2020-11-04-11-35-12-126.png, screenshot-1.png, screenshot-2.png, screenshot-3.png
>
> We have been observing instability in our production environment recently, seemingly related to state backends. We ended up building a load-testing environment to isolate factors, and have discovered that the CEP library appears to have some serious problems with state expiry.
>
> h2. Job Topology
> Source: Kinesis (standard connector) -> keyBy() and forward to...
> CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward output to...
> Sink: SQS (custom connector)
>
> The CEP Patterns in the test look like this:
> {code:java}
> Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent())
>     .times(20)
>     .subtype(ScanEvent.class)
>     .within(Duration.minutes(30));
> {code}
>
> h2. Taskmanager Config
> {code:java}
> taskmanager.numberOfTaskSlots: $numberOfTaskSlots
> taskmanager.data.port: 6121
> taskmanager.rpc.port: 6122
> taskmanager.exit-on-fatal-akka-error: true
> taskmanager.memory.process.size: $memoryProcessSize
> taskmanager.memory.jvm-metaspace.size: 256m
> taskmanager.memory.managed.size: 0m
> jobmanager.rpc.port: 6123
> blob.server.port: 6130
> rest.port: 8081
> web.submit.enable: true
> fs.s3a.connection.maximum: 50
> fs.s3a.threads.max: 50
> akka.framesize: 250m
> akka.watch.threshold: 14
> state.checkpoints.dir: s3://$savepointBucketName/checkpoints
> state.savepoints.dir: s3://$savepointBucketName/savepoints
> state.backend: filesystem
> state.backend.async: true
> s3.access-key: $s3AccessKey
> s3.secret-key: $s3SecretKey
> {code}
> (the substitutions are controlled by Terraform).
>
> h2. Tests
> h4. Test 1 (No key rotation)
> 8192 actors (different keys) emitting 1 Scan Event every 10 minutes, indefinitely. Actors (keys) never rotate in or out.
>
> h4. Test 2 (Constant key rotation)
> 8192 actors that produce 2 Scan events 10 minutes apart, then retire and never emit again. The setup creates new actors (keys) as soon as one finishes, so we always have 8192. This test constantly rotates the key space.
>
> h2. Results
> For both tests, the state size (checkpoint size) grows unbounded and linearly, well past the 30-minute threshold that should have caused old keys or events to be discarded from the state. In the chart below, the left (steep) half is the 24 hours we ran Test 1; the right (shallow) half is Test 2. My understanding is that the checkpoint size should level off after ~45 minutes or so, then stay constant.
>
> !image-2020-11-04-11-35-12-126.png!
>
> Could someone please assist us with this? Unless we have dramatically misunderstood how the CEP library is supposed to function, this seems like a pretty severe bug.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
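The expectation stated in the Results section (state levelling off once events fall outside the 30-minute window) can be made concrete with a small retention model: per-key buffers should be pruned of events older than the window, and fully expired keys dropped entirely, so steady-state state size is bounded by the events per key inside one window. A hedged sketch of that expected behaviour (illustrative only; this is not Flink's shared-buffer implementation, and the class name is made up):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the state retention the reporter expects from within(30 minutes). */
public class ExpectedStateBound {

    static final long WINDOW_MS = 30 * 60 * 1_000L;

    final Map<String, Deque<Long>> stateByKey = new HashMap<>();

    void onEvent(String key, long ts) {
        Deque<Long> buf = stateByKey.computeIfAbsent(key, k -> new ArrayDeque<>());
        buf.addLast(ts);
        // Expected behaviour: anything older than the window is dropped...
        while (!buf.isEmpty() && ts - buf.peekFirst() > WINDOW_MS) {
            buf.removeFirst();
        }
        // ...and fully expired keys are removed entirely (what Test 2 exercises).
        if (buf.isEmpty()) {
            stateByKey.remove(key);
        }
    }

    long totalBufferedEvents() {
        return stateByKey.values().stream().mapToLong(Deque::size).sum();
    }

    public static void main(String[] args) {
        ExpectedStateBound model = new ExpectedStateBound();
        // One key emitting every 10 minutes for 24 hours (145 events): at most
        // 4 events (0, 10, 20, 30 minutes back) can sit inside a 30-minute window.
        for (long t = 0; t <= 24 * 60; t += 10) {
            model.onEvent("actor-1", t * 60_000L);
        }
        System.out.println(model.totalBufferedEvents()); // 4, not 145
    }
}
```

Under this model, Test 1's state should plateau after one window length per key, and Test 2's should additionally shed retired keys; the linear growth in the attached chart is what makes the reporter suspect a leak rather than expected retention.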
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245902#comment-17245902 ] Thomas Wozniakowski commented on FLINK-19970: - Ok, I am working on this now. It's going to take me a while to cut down everything to get it into one self-contained JAR, but I'll send it over as soon as I'm done.
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245327#comment-17245327 ] Thomas Wozniakowski commented on FLINK-19970: - Ok, I've built another version of our app, this time using every artefact built from your branch (all parts of Flink). The bug is still visible (state still grows unbounded). I'm going to try to get approval internally to send you a cut-down version of our app that exhibits this behaviour. Do you just need a job JAR that you can start on a cluster to see the effect? Also, it would be great if there were a slightly more confidential way to send you the code; it's not hyper-sensitive or anything, but I'd rather not post it on a public JIRA ticket.
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244094#comment-17244094 ] Thomas Wozniakowski commented on FLINK-19970: - Ah, sorry, I mixed up the `key` and `id` values when I was reading your code. This is a screenshot from IntelliJ of the imported libraries from my test JAR: !screenshot-3.png! I pushed your branch to our internal Artifactory under the snapshot version. I'm going to try pushing again with a more specific version name to make sure I'm not somehow pulling in snapshot versions from somewhere else, but getting this far was already extremely painful due to getting the Maven build to work nicely with our Artifactory. Would it be possible to stick a temporary log line in an init function for one of the CEP operators that I could look out for, to confirm 100% that my remote test cluster is running the right branch version?
[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-19970: Attachment: screenshot-3.png
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244083#comment-17244083 ] Thomas Wozniakowski commented on FLINK-19970: - Hey [~dwysakowicz] Am I correct in saying that your scenario above matches my scenario #2 from the original description? That is to say, the "constant key rotation" scenario, where keys only have one or two events before they go dormant and should be cleaned up? The test I ran above was scenario #1, the "no key rotation" one: just the same keys emitting events over and over forever. The equivalent in your test would be to hold the `id` value in your generated events constant. I will rerun with the branch version on scenario #2 (to match your test setup) and see how it behaves.
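The distinction between the two scenarios can be sketched as a pair of key generators (hypothetical names, not the actual test harness): scenario #1 assigns each slot a permanent key, while scenario #2 retires a slot's key after two events and replaces it with a fresh one.

```java
// Hypothetical sketch of the two load shapes: fixed keys vs. rotating keys.
public class KeyRotation {
    // Scenario #1 (no rotation): slot i always emits under the same key.
    static String fixedKey(int slot) {
        return "actor-" + slot;
    }

    // Scenario #2 (constant rotation): a key emits twice, then the slot is
    // reassigned a fresh key, so the overall key space keeps turning over.
    static String rotatingKey(int slot, long eventsEmittedBySlot) {
        long generation = eventsEmittedBySlot / 2; // retire after 2 events
        return "actor-" + slot + "-gen-" + generation;
    }

    public static void main(String[] args) {
        System.out.println(rotatingKey(7, 0)); // actor-7-gen-0
        System.out.println(rotatingKey(7, 1)); // actor-7-gen-0 (same key, 2nd event)
        System.out.println(rotatingKey(7, 2)); // actor-7-gen-1 (slot rotated)
    }
}
```

Holding `id` constant in a generated-event test, as suggested above, collapses the second shape into the first.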
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244047#comment-17244047 ] Thomas Wozniakowski commented on FLINK-19970: - Hey [~dwysakowicz], Just to confirm, it's just the `flink-cep` module that needs to be replaced in my job JAR? No other Flink libraries need to be swapped out, and it's ok to run the actual cluster on the release version?
[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-19970: Attachment: screenshot-2.png
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243915#comment-17243915 ] Thomas Wozniakowski commented on FLINK-19970: - Results over a longer period, state still growing linearly: !screenshot-2.png!
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243327#comment-17243327 ] Thomas Wozniakowski commented on FLINK-19970: - !screenshot-1.png! Unfortunately, it doesn't appear to have stopped the state leak. I haven't let the test run for the full period, but you can see the size continue to grow past the 30-45 minute mark where it should start discarding events. I'll keep it running for the full 24 hours to be representative, but I think we might need to keep digging.
[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Wozniakowski updated FLINK-19970:
----------------------------------------
    Attachment: screenshot-1.png
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243202#comment-17243202 ]

Thomas Wozniakowski commented on FLINK-19970:
---------------------------------------------

Ok, our load test is running now, after a surprisingly painful setup process importing the branch code. I'm re-running scenario #1 above and will post the results after a few hours (it should be clear whether the problem is fixed).
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242478#comment-17242478 ]

Thomas Wozniakowski commented on FLINK-19970:
---------------------------------------------

For some reason I can't compile your branch, I get:

{code}
[ERROR] COMPILATION ERROR :
[ERROR] /home/jamalarm/src/open/flink/flink-libraries/flink-cep/src/test/java/org/apache/flink/cep/nfa/NFAITCase.java:[39,19] package javafx.util does not exist
[INFO] 1 error
{code}

Any ideas?
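For context on the error above: `javafx.util` ships with Oracle JDK 8 but was decoupled from the JDK in Java 11, so a test importing `javafx.util.Pair` won't compile on a newer JDK. Inside the Flink codebase, `org.apache.flink.api.java.tuple.Tuple2` would be the idiomatic replacement; failing that, a tiny local stand-in works. The class below is a hypothetical sketch, assuming only `getKey()`/`getValue()` are used:

```java
// Minimal drop-in replacement for javafx.util.Pair (hypothetical;
// assumes the test only calls the constructor and the two getters).
public final class Pair<K, V> {
    private final K key;
    private final V value;

    public Pair(K key, V value) {
        this.key = key;
        this.value = value;
    }

    public K getKey() {
        return key;
    }

    public V getValue() {
        return value;
    }
}
```

Swapping the import to this class removes the JavaFX dependency without touching the test logic.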
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242459#comment-17242459 ]

Thomas Wozniakowski commented on FLINK-19970:
---------------------------------------------

Do I need to rebuild our job JAR using the CEP library from your branch? Or is this a TaskManager-side fix?
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242433#comment-17242433 ]

Thomas Wozniakowski commented on FLINK-19970:
---------------------------------------------

Hey [~dwysakowicz] - great to hear you have a candidate fix! For us to test it, I would need to package your branch as a Docker container. Our load-testing environment runs exclusively on containers, and it would be... a profound headache to try to make it run any other way. We can publish the image to an internal ECR repo on our side; that shouldn't be too hard. Is there a relatively straightforward way to check your branch out and package it up as an image locally?

Thanks
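One possible answer to the packaging question above: the official Flink images are built from the Dockerfiles in the apache/flink-docker repository, which fetch a distribution tarball via a `FLINK_TGZ_URL` argument. A rough, unverified sequence (branch name, tags, and the ECR coordinates are placeholders, not taken from this thread) might be:

```shell
# Hypothetical procedure sketch -- <fix-branch>, <account>, <region> are placeholders.
# 1. Build the candidate branch and produce a distribution tarball.
git clone https://github.com/apache/flink.git && cd flink
git checkout <fix-branch>
mvn clean install -DskipTests
# The built distribution lands under flink-dist/target/.

# 2. Reuse the official Dockerfiles from https://github.com/apache/flink-docker,
#    pointing FLINK_TGZ_URL at the locally built tarball instead of the
#    release download, then push to the internal registry.
docker build -t <account>.dkr.ecr.<region>.amazonaws.com/flink:fix-branch .
docker push <account>.dkr.ecr.<region>.amazonaws.com/flink:fix-branch
```

The exact Dockerfile arguments may differ between Flink versions, so treat this as a starting point rather than a recipe.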
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241445#comment-17241445 ]

Thomas Wozniakowski commented on FLINK-19970:
---------------------------------------------

Hi [~wind_ljy],

No, we didn't identify a fix. We're not really familiar with the Flink codebase and our team is pretty small, so our plan was to wait until [~dwysakowicz] was finished with the 1.12.0 release and had some time to look at the issue. We would obviously be very grateful if you were able to spare some time to dig into this. Our production system is effectively a ticking time bomb at the moment with this issue.
[jira] [Commented] (FLINK-19293) RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore
[ https://issues.apache.org/jira/browse/FLINK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237395#comment-17237395 ]

Thomas Wozniakowski commented on FLINK-19293:
---------------------------------------------

Hi [~AHeise]

Sorry, I meant to update this ticket but forgot. We spent some time isolating the behaviour and realised this actually has nothing to do with RocksDB. It's a bug in the CEP library where the state grows endlessly; I raised another ticket about it here: https://issues.apache.org/jira/browse/FLINK-19970

Do you want me to go ahead and close this one? I'm not sure there's work to do here.

> RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19293
>                 URL: https://issues.apache.org/jira/browse/FLINK-19293
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / CEP, Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.10.1
>            Reporter: Thomas Wozniakowski
>            Priority: Major
>         Attachments: Screenshot 2020-09-18 at 13.58.30.png
>
> Hi Guys,
> I am seeing some strange behaviour that may be a bug, or may just be intended.
> We are running a Flink job on a 1.10.1 cluster with 1 JobManager and 2 TaskManagers, parallelism 4. The job itself is simple:
> # Source: Kinesis connector reading from a single-shard stream
> # CEP: ~25 Keyed CEP Pattern operators watching the event stream for different kinds of behaviour. They all have ".withinSeconds()" applied. Nothing is set up to grow endlessly.
> # Sink: Single operator writing messages to SQS (custom code)
> We are seeing the checkpoint size grow constantly until the job is restarted using a savepoint/restore. The size continues to grow past the point where the ".withinSeconds()" limits should cause old data to be discarded. The growth is also out of proportion to the general platform growth (which is actually trending down at the moment due to COVID).
> I've attached a snapshot from our monitoring dashboard below. You can see the huge drops in state_size on a savepoint/restore.
> Our state configuration is as follows:
> Backend: RocksDB
> Mode: EXACTLY_ONCE
> Max Concurrent: 1
> Externalised Checkpoints: RETAIN_ON_CANCELLATION
> Async: TRUE
> Incremental: TRUE
> TTL Compaction Filter enabled: TRUE
> We are worried that the CEP library may be leaking state somewhere, leaving some objects not cleaned up. Unfortunately I can't share one of these checkpoints with the community due to the sensitive nature of the data contained within, but if anyone has suggestions for how I could analyse the checkpoints to look for leaks, please let me know.
> Thanks in advance for the help
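On the question of analysing retained checkpoints for leaks: as a crude first pass (not CEP-specific, and assuming the filesystem checkpoints can be copied somewhere readable), comparing per-directory byte totals between two retained checkpoints at least shows where the growth concentrates. A minimal sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class CheckpointSizer {
    // Sum file sizes under each immediate subdirectory of a checkpoint
    // directory; running this against two successive retained checkpoints
    // and diffing the maps shows which part of the state is growing.
    static Map<String, Long> sizeByTopLevelDir(Path root) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(f -> {
                Path rel = root.relativize(f);
                String top = rel.getNameCount() > 1 ? rel.getName(0).toString() : "(root)";
                sizes.merge(top, f.toFile().length(), Long::sum);
            });
        }
        return sizes;
    }
}
```

This won't name the leaking operator on its own, but paired with the operator IDs from the job graph it narrows down where to look.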
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232913#comment-17232913 ]

Thomas Wozniakowski commented on FLINK-19970:
---------------------------------------------

Hey [~dwysakowicz] - thanks for the update. Please let us know if there's anything we can do to help. Happy to test a branch version of Flink in our load-test environment when it is available. The only issue is that we use the Docker images, so I'd need some way to build a branch Docker image.
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227329#comment-17227329 ] Thomas Wozniakowski commented on FLINK-19970: - [~dwysakowicz] Please give me a shout if there's any more diagnostic info I can attach to assist with debugging this. This one is blocking us quite severely so I'm more than happy to help any way I can. > State leak in CEP Operators (expired events/keys not removed from state) > > > Key: FLINK-19970 > URL: https://issues.apache.org/jira/browse/FLINK-19970 > Project: Flink > Issue Type: Bug > Components: Library / CEP >Affects Versions: 1.11.2 > Environment: Flink 1.11.2 run using the official docker containers in > AWS ECS Fargate. > 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory >Reporter: Thomas Wozniakowski >Priority: Critical > Attachments: image-2020-11-04-11-35-12-126.png > > > We have been observing instability in our production environment recently, > seemingly related to state backends. We ended up building a load testing > environment to isolate factors and have discovered that the CEP library > appears to have some serious problems with state expiry. > h2. Job Topology > Source: Kinesis (standard connector) -> keyBy() and forward to... > CEP: Array of simple Keyed CEP Pattern operators (details below) -> forward > output to... > Sink: SQS (custom connector) > The CEP Patterns in the test look like this: > {code:java} > Pattern.begin(SCANS_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) > .times(20) > .subtype(ScanEvent.class) > .within(Duration.minutes(30)); > {code} > h2. 
[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-19970:
[jira] [Commented] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226022#comment-17226022 ] Thomas Wozniakowski commented on FLINK-19970: - Hi [~dwysakowicz], Sorry for not including that info in the original post. We are using event time with a custom watermarking strategy based on an average of the last 10 events' timestamps + a constant buffer of 15 minutes. The watermarking strategy is working just fine. Test #2 is actually still running and I can see the Low Watermark of the CEP operators is 1604492129185 (15 minutes ago) as expected. Note that this setup is also producing matches just fine (with increased frequency of event emission). If the watermarks weren't being correctly assigned then we would never see matches coming out the other end of the CEP operators, right?
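The watermarking arithmetic described in the comment above (average of the last 10 event timestamps, held back by a constant 15-minute buffer) can be sketched in plain Java. The class and method names below are hypothetical, and a real implementation would live inside Flink's WatermarkGenerator interface rather than a standalone class:

```java
import java.util.ArrayDeque;

// Hedged sketch of the described strategy: watermark = average of the last
// 10 event timestamps minus a constant 15-minute buffer. Names are
// illustrative; this is not the actual job's code.
public class AveragingWatermarkSketch {
    private static final int WINDOW_SIZE = 10;
    private static final long BUFFER_MS = 15L * 60L * 1000L; // 15 minutes

    private final ArrayDeque<Long> recent = new ArrayDeque<>();

    /** Record an event timestamp (ms) and return the resulting watermark (ms). */
    public long onEvent(long eventTimestampMs) {
        recent.addLast(eventTimestampMs);
        if (recent.size() > WINDOW_SIZE) {
            recent.removeFirst(); // keep only the most recent 10 timestamps
        }
        long sum = 0L;
        for (long ts : recent) {
            sum += ts;
        }
        long average = sum / recent.size();
        return average - BUFFER_MS; // watermark trails the average event time
    }
}
```

With a steady event rate this keeps the operator's low watermark roughly 15 minutes behind event time, consistent with the observed low watermark sitting 15 minutes in the past while matches still flow.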
[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-19970:
[jira] [Updated] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
[ https://issues.apache.org/jira/browse/FLINK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-19970:
[jira] [Created] (FLINK-19970) State leak in CEP Operators (expired events/keys not removed from state)
Thomas Wozniakowski created FLINK-19970: --- Summary: State leak in CEP Operators (expired events/keys not removed from state) Key: FLINK-19970 URL: https://issues.apache.org/jira/browse/FLINK-19970 Project: Flink Issue Type: Bug Components: Library / CEP Affects Versions: 1.11.2 Environment: Flink 1.11.2 run using the official docker containers in AWS ECS Fargate. 1 Job Manager, 1 Taskmanager with 2vCPUs and 8GB memory Reporter: Thomas Wozniakowski Attachments: image-2020-11-04-11-35-12-126.png
[jira] [Created] (FLINK-19293) RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore
Thomas Wozniakowski created FLINK-19293: --- Summary: RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore Key: FLINK-19293 URL: https://issues.apache.org/jira/browse/FLINK-19293 Project: Flink Issue Type: Bug Components: Library / CEP, Runtime / Checkpointing Affects Versions: 1.10.1 Reporter: Thomas Wozniakowski Attachments: Screenshot 2020-09-18 at 13.58.30.png Hi Guys, I am seeing some strange behaviour that may be a bug, or may just be intended. We are running a Flink job on a 1.10.1 cluster with 1 JobManager and 2 TaskManagers, parallelism 4. The job itself is simple: # Source: kinesis connector reading from a single shard stream # CEP: ~25 CEP Keyed Pattern operators watching the event stream for different kinds of behaviour. They all have ".withinSeconds()" applied. Nothing is set up to grow endlessly. # Sink: Single operator writing messages to SQS (custom code) We are seeing the checkpoint size grow constantly until the job is restarted using a savepoint/restore. The size continues to grow past the point that the ".withinSeconds()" limits should cause old data to be discarded. The growth is also out of proportion to the general platform growth (which is actually trending down at the moment due to COVID). I've attached a snapshot from our monitoring dashboard below. You can see the huge drops in state_size on a savepoint/restore. Our state configuration is as follows: Backend: RocksDB Mode: EXACTLY_ONCE Max Concurrent: 1 Externalised Checkpoints: RETAIN_ON_CANCELLATION Async: TRUE Incremental: TRUE TTL Compaction Filter enabled: TRUE We are worried that the CEP library may be leaking state somewhere, leaving some objects not cleaned up. Unfortunately I can't share one of these checkpoints with the community due to the sensitive nature of the data contained within, but if anyone has any suggestions for how I could analyse the checkpoints to look for leaks, please let me know.
Thanks in advance for the help.
[jira] [Commented] (FLINK-16588) Add Disk Space metrics to TaskManagers
[ https://issues.apache.org/jira/browse/FLINK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064889#comment-17064889 ] Thomas Wozniakowski commented on FLINK-16588: - I think that's a fair enough assessment. I'll bug Amazon to see if I can get them to add the disk space metric to Fargate externally. > Add Disk Space metrics to TaskManagers > -- > > Key: FLINK-16588 > URL: https://issues.apache.org/jira/browse/FLINK-16588 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.10.0 >Reporter: Thomas Wozniakowski >Priority: Minor > > Hi, > We have recently switched to the RocksDB state backend. We are scraping > Taskmanager metrics from the REST endpoints to watch for memory and CPU > issues, but we currently have no good way to get the remaining disk space, so > we have no way of knowing when RocksDB is going to run out of space for state > storage. > How plausible is it to add something like a {{State.FreeStorageBytes}} metric > or something similar to the standard TaskManager metrics set? > Thanks,
[jira] [Commented] (FLINK-16588) Add Disk Space metrics to TaskManagers
[ https://issues.apache.org/jira/browse/FLINK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063342#comment-17063342 ] Thomas Wozniakowski commented on FLINK-16588: - So we're running them in AWS Fargate. The containers in Fargate are configured on your behalf and you can't change the local storage. You also can't access the {{docker}} commands to get information out of them. AWS does not provide any visible metrics from outside about remaining disk space, so it unfortunately looks like this information will have to come from within...
[jira] [Commented] (FLINK-16588) Add Disk Space metrics to TaskManagers
[ https://issues.apache.org/jira/browse/FLINK-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063201#comment-17063201 ] Thomas Wozniakowski commented on FLINK-16588: - Hey [~gjy], I take your point, but as a user of Flink I try to stay as close as possible to the vanilla versions (best documented, best supported, etc). For us, that means using the official Flink docker images. On Amazon, AWS does not provide any way to observe the remaining disk space from OUTSIDE a container, so that only leaves us one option: monitor from inside. From our perspective that can be achieved 2 ways: # Fork the official docker image and add something like nagios to it # Upgrade Flink so it's included in the official docker image by default We obviously prefer the second option because then we're not maintaining our own image :)
[jira] [Created] (FLINK-16588) Add Disk Space metrics to TaskManagers
Thomas Wozniakowski created FLINK-16588: --- Summary: Add Disk Space metrics to TaskManagers Key: FLINK-16588 URL: https://issues.apache.org/jira/browse/FLINK-16588 Project: Flink Issue Type: Improvement Components: Runtime / Metrics Affects Versions: 1.10.0 Reporter: Thomas Wozniakowski
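Until such a metric exists, one stop-gap is to read usable space from inside the container with the plain JDK File API. This is a hedged sketch, not Flink's API surface: exposing the value as a Flink Gauge on the operator's metric group is only hinted at in comments, since that wiring depends on the job's setup.

```java
import java.io.File;

// Stop-gap sketch: read usable disk bytes from inside the container using
// only the JDK. In a Flink job this value could be exposed as a Gauge<Long>
// registered on the operator's metric group; that wiring is assumed, not shown.
public class FreeDiskSpace {
    /** Usable bytes on the filesystem containing the given path (0 on error). */
    public static long usableBytes(String path) {
        return new File(path).getUsableSpace();
    }

    public static void main(String[] args) {
        // In a real deployment, point this at the RocksDB local state directory.
        System.out.println("usable bytes: " + usableBytes("."));
    }
}
```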
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051086#comment-17051086 ] Thomas Wozniakowski commented on FLINK-16142: - Hey [~xintongsong], we set the metaspace to 256m and that seemed to do the trick > Memory Leak causes Metaspace OOM error on repeated job submission > - > > Key: FLINK-16142 > URL: https://issues.apache.org/jira/browse/FLINK-16142 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission >Affects Versions: 1.10.0 >Reporter: Thomas Wozniakowski >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof > > > Hi Guys, > We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our > use-case exactly (RocksDB state backend running in a containerised cluster). > Unfortunately, it seems like there is a memory leak somewhere in the job > submission logic. We are getting this error: > {code:java} > 2020-02-18 10:22:10,020 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME > switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) > at java.net.URLClassLoader$1.run(URLClassLoader.java:363) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:362) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at > org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > at > org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27) > at > org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398) > at > org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.(AwsSdkMetrics.java:359) > at > org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728) > at > org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660) > at > org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652) > at > org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611) > at > org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606) > at > org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534) > at > 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528) > at > org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439) > at > org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389) > at > org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279) > at > org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686) > at > org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287) > at > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100) > at > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63) > {code} > (The only change in the above text is the OPERATOR_NAME text where I removed > some of the internal specifics of our system). > This will reliably happen on a fresh cluster after submitting and cancelling > our job 3 times. > We are using the presto-s3 plugin, the CEP library and the Kinesis connector. > Please let me know what other diagnostics would be useful. > Tom -- This message was sent by Atlassian Jira (v8.3.4#803005)
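For reference, the "set the metaspace to 256m" workaround mentioned in the comment above maps to a single entry in flink-conf.yaml. The key below is from the Flink 1.10 memory model; treat this as a sketch and verify it against the memory configuration docs for your Flink version.

```yaml
# flink-conf.yaml: raise the TaskManager JVM metaspace limit
# (default in Flink 1.10 is considerably lower, which triggers the OOM
#  described in this ticket after repeated job submissions)
taskmanager.memory.jvm-metaspace.size: 256m
```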
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045467#comment-17045467 ] Thomas Wozniakowski commented on FLINK-16142: - Yeah, increasing the metaspace seems to resolve the problems. I think you might be right. Class unloading doesn't seem to happen as aggressively as it is needed, i.e. if you want to load some classes and there's no space, it doesn't trigger a stop-the-world style mega-GC that would free up the space.
[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043658#comment-17043658 ] Thomas Wozniakowski edited comment on FLINK-16142 at 2/24/20 4:36 PM: -- Hey [~sewen] - we've applied the fix on our build by excluding: {code:groovy} exclude 'com/amazonaws/jmx/SdkMBeanRegistrySupport*' exclude 'com/masabi/pattern/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*' exclude 'org/apache/flink/kinesis/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*' {code} The results have been interesting. It certainly seems to _help_ the problem, but we're still seeing OOM errors in our builds where jobs are rapidly started and stopped. Looking at the TaskManager metrics, I can see that the classes actually *are* being unloaded now (after 10 runs, 60,000 loaded, 52,000 unloaded), but it seems to happen on a bit of a delay, as if something stays alive and hangs onto the classes for a few seconds after the job exits. I've attached another heap dump, could you help us track down what might be causing this final wrinkle? Feels like we're close here! Edit: I somewhat stupidly gave the second heap dump the same name. It's the one that's uploaded more recently, sorry about that!
[jira] [Updated] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-16142: Attachment: java_pid1.hprof
[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042027#comment-17042027 ] Thomas Wozniakowski edited comment on FLINK-16142 at 2/21/20 5:14 PM: -- [~sewen] I don't have access to the Kinesis source code as it's a library, but I added that line to the SQS sink, as it will also be executed on the TaskManager alongside the Kinesis source (my test is only running on one TaskManager). Unfortunately it did not prevent the OOM error.
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042027#comment-17042027 ] Thomas Wozniakowski commented on FLINK-16142: - [~sewen] I don't have access to the Kinesis Source code as it's a library, but I added that line to the SQS sink, as it's also going to be executed on the TaskManager alongside the Kinesis sink (my test is only running on one taskmanager). Unfortunately it did not prevent the OOM error.
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041933#comment-17041933 ] Thomas Wozniakowski commented on FLINK-16142: - Just in case it is relevant, given we're talking about relocating classes. When we initially implemented this system it was intended to run on EMR. Due to the truly insane number of JARs AWS puts on the classpath (over 200) I had to painstakingly move a few things around in order to prevent clashes. We actually don't run on EMR anymore but the relocation is still there. It uses the Gradle Shadow plugin, config here: {code:groovy} shadowJar { archiveName = "pattern-detector-realtime.jar" relocate('com.amazonaws', 'com.masabi.pattern.shaded.com.amazonaws') { exclude 'com.amazonaws.handlers.*' exclude 'com.amazonaws.services.sqs.QueueUrlHandler' exclude 'com.amazonaws.services.sqs.internal.SQSRequestHandler' exclude 'com.amazonaws.services.sqs.MessageMD5ChecksumHandler' } exclude 'amazon-kinesis-producer-native-binaries/**' exclude 'cacerts/*' } {code} > Memory Leak causes Metaspace OOM error on repeated job submission > - > > Key: FLINK-16142 > URL: https://issues.apache.org/jira/browse/FLINK-16142 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission >Affects Versions: 1.10.0 >Reporter: Thomas Wozniakowski >Priority: Blocker > Fix For: 1.10.1, 1.11.0 > > Attachments: Leak-GC-root.png, java_pid1.hprof > > > Hi Guys, > We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our > use-case exactly (RocksDB state backend running in a containerised cluster). > Unfortunately, it seems like there is a memory leak somewhere in the job > submission logic. We are getting this error: > {code:java} > 2020-02-18 10:22:10,020 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME > switched from RUNNING to FAILED. 
> java.lang.OutOfMemoryError: Metaspace
> 	at java.lang.ClassLoader.defineClass1(Native Method)
> 	at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> 	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> 	at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> 	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> 	at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> 	at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> 	at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> 	at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> 	at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> 	at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> 	at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text, where I removed some of the internal specifics of our system.)
> This will reliably happen on a fresh cluster after submitting and cancelling our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom
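The trace above dies inside the shaded AWS SDK's JMX registration (`SdkMBeanRegistrySupport.registerMetricAdminMBean`), which points at a classic Metaspace-leak mechanism: the JVM-global platform MBeanServer holds a strong reference to every registered MBean, so if the MBean's class was defined by the job's `ChildFirstClassLoader`, that one reference pins the whole loader and all of its classes until someone explicitly unregisters. A minimal, hypothetical sketch of the mechanism (the `LeakDemo`/`DemoMBean` names and the ObjectName are illustrative, not from the Flink or AWS code):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class LeakDemo {

    // Management interface and implementation; in the real leak these would be
    // classes loaded by a per-job ChildFirstClassLoader.
    public interface DemoMBean {
        int getCount();
    }

    public static class Demo implements DemoMBean {
        public int getCount() { return 42; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("com.example:type=LeakDemo");

        // Registration stores a strong reference in the JVM-global MBeanServer.
        // That reference reaches Demo.class, and through it the defining
        // classloader, keeping all of that loader's classes in Metaspace.
        server.registerMBean(new StandardMBean(new Demo(), DemoMBean.class), name);
        System.out.println(server.isRegistered(name)); // prints true

        // Only an explicit unregister releases the reference; nothing does this
        // automatically when a job is cancelled.
        server.unregisterMBean(name);
        System.out.println(server.isRegistered(name)); // prints false
    }
}
```

Unless the cancellation path unregisters such MBeans, every job submission leaves one more unreclaimable classloader behind, which matches the "classes are never unloaded" observations later in this thread.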
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041919#comment-17041919 ] Thomas Wozniakowski commented on FLINK-16142: - [~pnowojski] We load the S3 plugin via this: https://github.com/docker-flink/docker-flink/pull/94 which I actually contributed myself. Currently in production we use a different method, but we're replatforming our Flink usage to Docker, so this is how we do it now. It's therefore being loaded into the {{plugins}} directory properly and not being put into {{lib}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041816#comment-17041816 ] Thomas Wozniakowski edited comment on FLINK-16142 at 2/21/20 12:48 PM: --- Hi [~sewen] I've attached the heap dump. It was actually surprisingly straightforward to take it and get it out of the container. Apologies for not getting it done sooner. was (Author: jamalarm): Hi [~sewen] I've attached the heap dump. It was actually surprisingly straightforward to take it and get it out of the container. Apologies for not getting it done sooner. [^java_pid1.hprof]
[jira] [Updated] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-16142: Attachment: java_pid1.hprof
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041816#comment-17041816 ] Thomas Wozniakowski commented on FLINK-16142: - Hi [~sewen] I've attached the heap dump. It was actually surprisingly straightforward to take it and get it out of the container. Apologies for not getting it done sooner. [^java_pid1.hprof]
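The comment doesn't say how the heap dump was taken; one container-friendly way (a sketch of one possible approach, not necessarily what was done here) is to trigger it from inside the JVM via the HotSpot diagnostic MXBean, which is equivalent to `jmap -dump:live,format=b,file=...` but needs no matching external `jmap` binary. The output path is an assumption; pick somewhere the `flink` user can write, and note the file must not already exist:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);

        // Unique path: dumpHeap fails with IOException if the file exists.
        String path = "/tmp/heapdump_" + System.nanoTime() + ".hprof";

        // live=true forces a full GC first, so only reachable objects
        // (e.g. leaked classloaders and their classes) end up in the dump.
        diag.dumpHeap(path, true);
        System.out.println("dump written to " + path);
    }
}
```

The resulting `.hprof` file can then be copied out with `docker cp` and opened in Eclipse MAT or VisualVM to walk the GC-root path of a leaked `ChildFirstClassLoader`.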
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041683#comment-17041683 ] Thomas Wozniakowski commented on FLINK-16142: - Hi [~sewen], here is the first chunk of the logs with all the config parts:
{code}
Starting Task Manager
config file:
jobmanager.rpc.address: pattern-detector-e2e-jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.memory.process.size: 1568m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1
jobmanager.execution.failover-strategy: region
blob.server.port: 6124
query.server.port: 6125
Starting taskexecutor as a console application on host 1ef836eff98e.
2020-02-21 08:46:50,418 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -
2020-02-21 08:46:50,422 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  Preconfiguration:
2020-02-21 08:46:50,423 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  TM_RESOURCES_JVM_PARAMS extraction logs:
 - Loading configuration property: jobmanager.rpc.address, pattern-detector-e2e-jobmanager
 - Loading configuration property: jobmanager.rpc.port, 6123
 - Loading configuration property: jobmanager.heap.size, 1024m
 - Loading configuration property: taskmanager.memory.process.size, 1568m
 - Loading configuration property: taskmanager.numberOfTaskSlots, 2
 - Loading configuration property: parallelism.default, 1
 - Loading configuration property: jobmanager.execution.failover-strategy, region
 - Loading configuration property: blob.server.port, 6124
 - Loading configuration property: query.server.port, 6125
 - The derived from fraction jvm overhead memory (156.800mb (164416719 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
BASH_JAVA_UTILS_EXEC_RESULT:-Xmx536870902 -Xms536870902 -XX:MaxDirectMemorySize=268435458 -XX:MaxMetaspaceSize=100663296
 TM_RESOURCES_DYNAMIC_CONFIGS extraction logs:
 - Loading configuration property: jobmanager.rpc.address, pattern-detector-e2e-jobmanager
 - Loading configuration property: jobmanager.rpc.port, 6123
 - Loading configuration property: jobmanager.heap.size, 1024m
 - Loading configuration property: taskmanager.memory.process.size, 1568m
 - Loading configuration property: taskmanager.numberOfTaskSlots, 2
 - Loading configuration property: parallelism.default, 1
 - Loading configuration property: jobmanager.execution.failover-strategy, region
 - Loading configuration property: blob.server.port, 6124
 - Loading configuration property: query.server.port, 6125
 - The derived from fraction jvm overhead memory (156.800mb (164416719 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
BASH_JAVA_UTILS_EXEC_RESULT:-D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=134217730b -D taskmanager.memory.network.min=134217730b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=536870920b -D taskmanager.cpu.cores=2.0 -D taskmanager.memory.task.heap.size=402653174b -D taskmanager.memory.task.off-heap.size=0b
2020-02-21 08:46:50,423 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -
2020-02-21 08:46:50,424 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  Starting TaskManager (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
2020-02-21 08:46:50,425 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  OS current user: flink
2020-02-21 08:46:50,426 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  Current Hadoop/Kerberos user: <no hadoop dependency found>
2020-02-21 08:46:50,426 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.242-b08
2020-02-21 08:46:50,426 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  Maximum heap size: 512 MiBytes
2020-02-21 08:46:50,427 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  JAVA_HOME: /usr/local/openjdk-8
2020-02-21 08:46:50,427 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  No Hadoop Dependency available
2020-02-21 08:46:50,428 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -  JVM Options:
2020-02-21 08:46:50,428 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -     -XX:+UseG1GC
2020-02-21 08:46:50,428 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -     -Xmx536870902
2020-02-21 08:46:50,428 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  -     -Xms536870902
2020-02-21 08:46:50,429 INFO  org.apache.flink.runtime.taskexe
{code}
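The derived JVM flags in that log are worth a sanity check, because they explain why only three submissions are enough: the default configuration caps Metaspace at 100663296 bytes, i.e. exactly 96 MiB. This arithmetic sketch just recomputes the figures from the log (the 10 MiB-per-submission leak rate is a hypothetical round number, not a measured value; Flink's real derivation lives elsewhere):

```java
public class MemBudget {
    public static void main(String[] args) {
        long maxMetaspace = 100_663_296L; // -XX:MaxMetaspaceSize from the log
        long maxHeap      = 536_870_902L; // -Xmx from the log
        long maxDirect    = 268_435_458L; // -XX:MaxDirectMemorySize from the log
        long mib          = 1024L * 1024L;

        System.out.println(maxMetaspace / mib + " MiB metaspace"); // prints "96 MiB metaspace"
        System.out.println((maxHeap + mib - 1) / mib + " MiB heap (rounded up)");
        System.out.println((maxDirect + mib - 1) / mib + " MiB direct (rounded up)");

        // Hypothetical: if each submission pins ~10 MiB of class metadata,
        // the TaskManager can survive only this many submissions:
        long leakPerSubmission = 10L * mib;
        System.out.println(maxMetaspace / leakPerSubmission + " submissions until OOM");
    }
}
```

With a budget that small, any per-submission classloader retention hits the wall almost immediately, whereas on 1.9.x the same leak could hide behind a much larger (or unbounded) Metaspace.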
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041156#comment-17041156 ] Thomas Wozniakowski commented on FLINK-16142: - Hey [~sewen] we are using the official Flink Docker containers, with no explicit JVM overrides. Whatever the default is, we're using. It might not be relevant, but our Job JAR is compiled using the ECJ compiler and not the standard JDK compiler. Would that matter?
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041059#comment-17041059 ] Thomas Wozniakowski commented on FLINK-16142: - [~arvid heise] I managed to get hold of those metaspace stats you asked for from inside the docker container. For reference, the easiest way I found to actually achieve this is to docker exec your way into the container, install sdkman, and then do "sdk install java 8.0.242-open". This seems to give you a jmap command that is compatible with the running JVM.
{code:java}
root@27da7e6b6873:/usr/local/openjdk-8# jmap -clstats 1
Attaching to process ID 1, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
finding class loader instances ..done.
computing per loader stat ..done.
please wait.. computing liveness.liveness analysis may be inaccurate ...
class_loader  classes  bytes    parent_loader  alive?  type
<bootstrap>   2466     4237013  null           live    <internal>
0xe0201820    0        0        0xe0170730     dead    java/util/ResourceBundle$RBClassLoader@0x00010007c028
0xe0201a20    18       30389    null           dead    sun/misc/Launcher$ExtClassLoader@0x0001f6b0
0xe0d88ed8    1        1455     0xe0170730     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe130bd30    1        864      0xe1027c38     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c982c0    1        1455     0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe201e008    1        1457     0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c620c8    1        866      0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffb1f0    1        866      0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0d60cc0    1        1455     0xe0170730     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0cec4d0    1        864      0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ff9fe0    1        864      0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffafe0    1        1455     0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0cec0e8    1        864      0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe130c118    1        864      0xe1027c38     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe201ce28    1        864      0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0ced2e0    1        864      0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe130c500    1        864      0xe1027c38     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0d272e0    1        1457     null           dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xffe25bd0    1        1455     null           dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0cecef8    1        864      0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffa3c8    1        864      0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c98af0    1        1457     0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe1027138    1        873      0xe0399898     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe12da110    1        1455     0xe1027c38     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe130c370    1        864      0xe1027c38     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe201ca40    1        864      0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0dda298    1        1456     0xe0d60c60     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe12da368    1        1457     0xe1027c38     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0ced088    1        864      0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe1ffb5b8    1        1457     0xe1fa86d8     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe0c98c80    1        1455     0xe0790330     dead    sun/reflect/DelegatingClassLoader@0x00019c70
0xe201d648    1
{code}
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041005#comment-17041005 ] Thomas Wozniakowski commented on FLINK-16142: - On further investigation, I can see that this issue (of classes not being unloaded) actually exists on our 1.9.2 deployment as well; it just doesn't seem to cause the OOM error there (presumably because the limit is higher). Restarting my job against our remote cluster and polling the Status.JVM.ClassLoader.ClassesLoaded metric shows the number increasing by roughly 3000 each time. Classes appear never to be unloaded...
> Memory Leak causes Metaspace OOM error on repeated job submission
> -
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
> Issue Type: Bug
> Components: Client / Job Submission
> Affects Versions: 1.10.0
> Reporter: Thomas Wozniakowski
> Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our use-case exactly (RocksDB state backend running in a containerised cluster). Unfortunately, it seems like there is a memory leak somewhere in the job submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text, where I removed some of the internal specifics of our system.)
> This will reliably happen on a fresh cluster after submitting and cancelling our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040885#comment-17040885 ] Thomas Wozniakowski edited comment on FLINK-16142 at 2/20/20 11:54 AM: --- Here is a threaddump (just the last one this time, before the OOM): {code} THREAD: CardsPerDevice[MTA/HIGH]{3} -> Sink: SQS: pattern-detector-e2e-test-signal-queue (1/1) (RUNNABLE) CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34 THREAD: CloseableReaperThread (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: DestroyJavaVM (RUNNABLE) CCL:null THREAD: Finalizer (WAITING) CCL:null THREAD: Flink Netty Server (0) Thread 0 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: Flink-MetricRegistry-thread-1 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: Hashed wheel timer #1 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: IOManager reader thread #1 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: IOManager writer thread #1 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O boss #3 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O boss #9 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O server boss #12 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O server boss #6 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #1 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #10 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #11 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #2 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #4 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #5 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #7 (RUNNABLE) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: New I/O worker #8 (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: OutputFlusher for Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (TIMED_WAITING) CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34 THREAD: Reference Handler (WAITING) CCL:null THREAD: Signal Dispatcher (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (1/1) (RUNNABLE) CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34 THREAD: Timer-0 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: Timer-1 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: Timer-5 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-akka.actor.default-dispatcher-2 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-akka.actor.default-dispatcher-3 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-akka.actor.default-dispatcher-4 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-akka.actor.default-dispatcher-5 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-akka.remote.default-remote-dispatcher-15 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-akka.remote.default-remote-dispatcher-6 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-metrics-2 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-metrics-akka.remote.default-remote-dispatcher-3 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-metrics-akka.remote.default-remote-dispatcher-4 (TIMED_WAITING) 
CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-metrics-scheduler-1 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: flink-scheduler-1 (TIMED_WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 THREAD: pool-3-thread-1 (WAITING) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 {code} I only see two classloaders at play here: sun.misc.Launcher$AppClassLoader@75b84c92 and org.apache.flink.util.ChildFirstClassLoader@1a5d4e34. That looks OK, right? The AppClassLoader is just the default Flink one, and the ChildFirstClassLoader is the one for the currently running job? Edit: No luck with IdleConnectionReaper.shutdown(), unfortunately. Still hitting the OOM.
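As a sanity check on the "only two classloaders" reading, the distinct CCL values can be counted mechanically from the dump text. A throwaway sketch (the class name and sample string here are mine, not from the job):

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DistinctLoaders {

    /** Collect the distinct CCL:<loader> tokens from a thread dump like the one above. */
    static Set<String> distinctLoaders(String dump) {
        Set<String> loaders = new TreeSet<>();
        Matcher m = Pattern.compile("CCL:(\\S+)").matcher(dump);
        while (m.find()) {
            loaders.add(m.group(1));
        }
        return loaders;
    }

    public static void main(String[] args) {
        // Abbreviated sample in the same shape as the dump above.
        String sample = "THREAD: a (RUNNABLE) CCL:sun.misc.Launcher$AppClassLoader@75b84c92 "
                + "THREAD: b (WAITING) CCL:org.apache.flink.util.ChildFirstClassLoader@1a5d4e34 "
                + "THREAD: c (WAITING) CCL:null";
        System.out.println(distinctLoaders(sample));
    }
}
```

If more than one ChildFirstClassLoader@... value ever shows up after repeated submissions, that would point at stale job classloaders still being referenced by live threads.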
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040239#comment-17040239 ] Thomas Wozniakowski commented on FLINK-16142: - After first run {code:json} [ { "id": "Status.JVM.ClassLoader.ClassesLoaded", "min": 10385.0, "max": 10385.0, "avg": 10385.0, "sum": 10385.0 }, { "id": "Status.JVM.ClassLoader.ClassesUnloaded", "min": 0.0, "max": 0.0, "avg": 0.0, "sum": 0.0 } ] {code} After second run {code:json} [ { "id": "Status.JVM.ClassLoader.ClassesLoaded", "min": 13063.0, "max": 13063.0, "avg": 13063.0, "sum": 13063.0 }, { "id": "Status.JVM.ClassLoader.ClassesUnloaded", "min": 67.0, "max": 67.0, "avg": 67.0, "sum": 67.0 } ] {code} After third run {code:json} [ { "id": "Status.JVM.ClassLoader.ClassesLoaded", "min": 15506.0, "max": 15506.0, "avg": 15506.0, "sum": 15506.0 }, { "id": "Status.JVM.ClassLoader.ClassesUnloaded", "min": 67.0, "max": 67.0, "avg": 67.0, "sum": 67.0 } ] {code} Definitely seems like the classes aren't being unloaded?
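For what it's worth, the per-submission growth can be read straight off those samples. A throwaway sketch in plain Java (the numbers are copied from the metric responses above; the class name is mine):

```java
import java.util.Arrays;
import java.util.List;

public class ClassLoadGrowth {

    /** Difference between consecutive Status.JVM.ClassLoader.ClassesLoaded samples. */
    static long[] deltas(List<Long> samples) {
        long[] d = new long[samples.size() - 1];
        for (int i = 1; i < samples.size(); i++) {
            d[i - 1] = samples.get(i) - samples.get(i - 1);
        }
        return d;
    }

    public static void main(String[] args) {
        // ClassesLoaded after the first, second and third job run, from above.
        System.out.println(Arrays.toString(deltas(List.of(10385L, 13063L, 15506L))));
        // -> [2678, 2443]: roughly 2.5k classes retained per submission,
        //    while ClassesUnloaded only ever reaches 67.
    }
}
```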
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040163#comment-17040163 ] Thomas Wozniakowski commented on FLINK-16142: - OK, I put the following in the open() method of our custom SQS sink (just because it's somewhere in the job where it's easy to run arbitrary code):
{code:java}
Thread.getAllStackTraces().keySet().stream()
        .sorted(Comparator.comparing(Thread::getName))
        .forEach(thread -> System.out.printf("THREAD: %s (%s)%n",
                thread.getName(), thread.getState()));
{code}
The only bits I have changed are the blocks of XXXs, which are client-specific names that I can't post. They're just the names of CEP operators in the job. First run output: {code:java} THREAD: XX -> Sink: SQS: pattern-detector-e2e-test-signal-queue (1/1) (RUNNABLE) THREAD: CloseableReaperThread (WAITING) THREAD: DestroyJavaVM (RUNNABLE) THREAD: Finalizer (WAITING) THREAD: Flink Netty Server (0) Thread 0 (RUNNABLE) THREAD: Flink-MetricRegistry-thread-1 (TIMED_WAITING) THREAD: Hashed wheel timer #1 (TIMED_WAITING) THREAD: IOManager reader thread #1 (WAITING) THREAD: IOManager writer thread #1 (WAITING) THREAD: New I/O boss #3 (RUNNABLE) THREAD: New I/O boss #9 (RUNNABLE) THREAD: New I/O server boss #12 (RUNNABLE) THREAD: New I/O server boss #6 (RUNNABLE) THREAD: New I/O worker #1 (RUNNABLE) THREAD: New I/O worker #10 (RUNNABLE) THREAD: New I/O worker #11 (RUNNABLE) THREAD: New I/O worker #2 (RUNNABLE) THREAD: New I/O worker #4 (RUNNABLE) THREAD: New I/O worker #5 (RUNNABLE) THREAD: New I/O worker #7 (RUNNABLE) THREAD: New I/O worker #8 (RUNNABLE) THREAD: OutputFlusher for Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (TIMED_WAITING) THREAD: Reference Handler (WAITING) THREAD: Signal Dispatcher (RUNNABLE) THREAD: Source: Kinesis: pattern_detector_test_stream ->
Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (1/1) (RUNNABLE) THREAD: Timer-0 (TIMED_WAITING) THREAD: Timer-1 (TIMED_WAITING) THREAD: flink-akka.actor.default-dispatcher-2 (TIMED_WAITING) THREAD: flink-akka.actor.default-dispatcher-3 (WAITING) THREAD: flink-akka.actor.default-dispatcher-4 (WAITING) THREAD: flink-akka.actor.default-dispatcher-5 (WAITING) THREAD: flink-akka.remote.default-remote-dispatcher-6 (TIMED_WAITING) THREAD: flink-akka.remote.default-remote-dispatcher-7 (WAITING) THREAD: flink-metrics-2 (TIMED_WAITING) THREAD: flink-metrics-akka.remote.default-remote-dispatcher-3 (WAITING) THREAD: flink-metrics-akka.remote.default-remote-dispatcher-4 (TIMED_WAITING) THREAD: flink-metrics-scheduler-1 (TIMED_WAITING) THREAD: flink-scheduler-1 (TIMED_WAITING) THREAD: pool-3-thread-1 (TIMED_WAITING) {code} Second run output: {code:java} THREAD: XX -> Sink: SQS: pattern-detector-e2e-test-signal-queue (1/1) (RUNNABLE) THREAD: CloseableReaperThread (WAITING) THREAD: DestroyJavaVM (RUNNABLE) THREAD: Finalizer (WAITING) THREAD: Flink Netty Server (0) Thread 0 (RUNNABLE) THREAD: Flink-MetricRegistry-thread-1 (TIMED_WAITING) THREAD: Hashed wheel timer #1 (TIMED_WAITING) THREAD: IOManager reader thread #1 (WAITING) THREAD: IOManager writer thread #1 (WAITING) THREAD: New I/O boss #3 (RUNNABLE) THREAD: New I/O boss #9 (RUNNABLE) THREAD: New I/O server boss #12 (RUNNABLE) THREAD: New I/O server boss #6 (RUNNABLE) THREAD: New I/O worker #1 (RUNNABLE) THREAD: New I/O worker #10 (RUNNABLE) THREAD: New I/O worker #11 (RUNNABLE) THREAD: New I/O worker #2 (RUNNABLE) THREAD: New I/O worker #4 (RUNNABLE) THREAD: New I/O worker #5 (RUNNABLE) THREAD: New I/O worker #7 (RUNNABLE) THREAD: New I/O worker #8 (RUNNABLE) THREAD: OutputFlusher for Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], Quorum 
Based Timestamps [DEVICE_TIME]) (TIMED_WAITING) THREAD: Reference Handler (WAITING) THREAD: Signal Dispatcher (RUNNABLE) THREAD: Source: Kinesis: pattern_detector_test_stream -> Remove Unwanted Events -> (Quorum Based Timestamps [SERVER_TIME] -> Filter NULL keys [deviceId], Quorum Based Timestamps [DEVICE_TIME]) (1/1) (RUNNABLE) THREAD: Timer-0 (TIMED_WAITING) THREAD: Timer-1 (TIMED_WAITING) THREAD: flink-akka.actor.default-dispatcher-2 (TIMED_WAITING) THREAD: flink-akka.actor.default-dispatcher-3 (WAITING) THREAD: flink-akka.actor.default-dispatcher-4 (WAITING) THREAD: flink-akka.actor.default-dispatcher-5 (WAITING) THREAD: flink-akka.remote.default-remote-dispatcher-6 (WAITING) THREAD: flink-akka.remote.default-remote-dispatcher-7 (TIMED_WAITING) THREAD: flink-metrics-2 (TIMED_WAITING) THREAD: flink-metrics-akka.remote.default-remote-dispatcher-3 (TIMED_WAITING) THREAD: flink-metrics-akka.re
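The snippet above only prints name and state; the dumps further up the thread also tag each thread with its context classloader (the CCL column). A standalone sketch of that variant (the class name is mine, not from the job):

```java
import java.util.Comparator;

public class ThreadDumpWithLoaders {

    /** Render all live threads as "THREAD: name (STATE) CCL:loader" lines. */
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        Thread.getAllStackTraces().keySet().stream()
                .sorted(Comparator.comparing(Thread::getName))
                .forEach(t -> sb.append(String.format("THREAD: %s (%s) CCL:%s%n",
                        t.getName(), t.getState(), t.getContextClassLoader())));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

A ChildFirstClassLoader that still appears as some thread's context classloader after the job has been cancelled would be a strong hint that the job classloader is being pinned.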
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040146#comment-17040146 ] Thomas Wozniakowski commented on FLINK-16142: - [~kevin.cyj] it would be quite difficult for me to put arbitrary JARs in the lib folder, as we're running entirely on the official docker containers and the app is specifically set up to run in docker (it reads everything from environment variables, etc.). I'm going to try to get the main method of the JAR to print all the live threads to the logs at the point where the job starts; hopefully that should give some insight. I can post the code of our custom SQS sink, but it's really only about 4 lines. [~pnowojski] similar problem with attaching a memory profiler: the official Flink docker images don't expose a port for attaching to the JVM, so I don't think that one will be possible. I'm trying to work out if there are any other useful metrics I could monitor on the REST endpoints from outside the container to help debug this.
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039260#comment-17039260 ] Thomas Wozniakowski commented on FLINK-16142: - Explicitly calling `.shutdown()` on the SQS client in the Sink did not prevent the OOM Metaspace error. I'm trying to monitor what's happening from the outside using metrics, looking at Status.JVM.Threads.Count I can see that the thread count is climbing a little bit as I keep restarting the jobs, but only by a few threads (~90 in total at rest, ~100 after one submission, ~110 after two, fails on three). Would you expect a small number of threads to cause this issue?
[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039227#comment-17039227 ] Thomas Wozniakowski commented on FLINK-16142: - Hi Guys, Thanks for the speedy response. It's going to be a little tricky to get at the heap dumps, as our application is very specifically written to only run on Flink running in docker. Do you know if there are any configuration options that we could pass in from the outside that would encourage the dockerised Flink to dump some more useful information to the logs when the OOM-metaspace error occurs? I'll take a look at our connectors. We are currently using: - Kinesis source, official build from Maven (now that it is published there). Prior to 1.10.0 we built it from source internally. - Custom SQS sink. Basically just dumps events straight onto a queue using the AWS SDK (version 1, blocking). This code has not changed since it was originally written (for Flink 1.3). We do not explicitly shut down the threads that presumably run in the background for the SQS SDK. I will have a look and see if there's a way we can explicitly close them when the job shuts down. Any other ideas of places I can look, give me a shout. Tom
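On the question of coaxing more diagnostics out of a dockerised Flink when the Metaspace OOM fires: JVM flags can be passed through {{env.java.opts}}, and Flink 1.10's memory model exposes the Metaspace budget directly. A hedged sketch (the dump path is a hypothetical volume mount; the two option keys are standard Flink 1.10 configuration):

```yaml
# flink-conf.yaml -- diagnostic sketch, not tuned production settings.
# Capture a heap dump on any OutOfMemoryError; the dump also contains the
# classloader references that pin Metaspace. /dumps is a hypothetical mount.
env.java.opts: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps"

# Extra Metaspace headroom while investigating; if there is a genuine leak,
# the error just comes back after a few more submit/cancel cycles.
taskmanager.memory.jvm-metaspace.size: 512m
```

Analysing the resulting dump for duplicate `ChildFirstClassLoader` instances (one surviving per cancelled job would confirm a classloader leak) is the usual next step.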
[jira] [Created] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission
Thomas Wozniakowski created FLINK-16142:
---
Summary: Memory Leak causes Metaspace OOM error on repeated job submission
Key: FLINK-16142
URL: https://issues.apache.org/jira/browse/FLINK-16142
Project: Flink
Issue Type: Bug
Components: Client / Job Submission
Affects Versions: 1.10.0
Reporter: Thomas Wozniakowski

Hi Guys, We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our use-case exactly (RocksDB state backend running in a containerised cluster). Unfortunately, it seems like there is a memory leak somewhere in the job submission logic. We are getting this error:

{code:java}
2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
java.lang.OutOfMemoryError: Metaspace
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
    at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
    at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
    at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
    at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
    at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
    at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
    at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
{code}

(The only change in the above text is the OPERATOR_NAME text where I removed some of the internal specifics of our system). This will reliably happen on a fresh cluster after submitting and cancelling our job 3 times. We are using the presto-s3 plugin, the CEP library and the Kinesis connector. Please let me know what other diagnostics would be useful.
Tom -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-14812) Add custom libs to Flink classpath with an environment variable.
[ https://issues.apache.org/jira/browse/FLINK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027328#comment-17027328 ] Thomas Wozniakowski commented on FLINK-14812: - FYI - I am implementing a slightly inelegant version of this in the official docker-flink images via the entry point script. https://github.com/docker-flink/docker-flink/pull/94 Though if this were to be handled inside Flink (preferable) it could be removed in the future. > Add custom libs to Flink classpath with an environment variable. > > > Key: FLINK-14812 > URL: https://issues.apache.org/jira/browse/FLINK-14812 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Deployment / Scripts >Reporter: Eui Heo >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > To use plugin library you need to add it to the flink classpath. The > documentation explains to put the jar file in the lib path. > https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter > However, to deploy metric-enabled Flinks on a kubernetes cluster, we have the > burden of creating and managing another container image. It would be more > efficient to add the classpath using environment variables inside the > constructFlinkClassPath function in the config.sh file. > In particular, it seems inconvenient for me to create separate images to use > the jars, even though the /opt/flink/opt of the stock image already contains > them. > For example, there are metrics libs and file system libs: > flink-azure-fs-hadoop-1.9.1.jar > flink-s3-fs-hadoop-1.9.1.jar > flink-metrics-prometheus-1.9.1.jar > flink-metrics-influxdb-1.9.1.jar -- This message was sent by Atlassian Jira (v8.3.4#803005)
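The entrypoint-script workaround linked above amounts to globbing extra jar directories into a classpath fragment before handing control to Flink's own scripts. A rough sketch of the idea (the EXTRA_LIB_DIRS variable and the build_extra_classpath name are illustrative, not what the docker-flink PR or constructFlinkClassPath actually use):

```shell
# Build a colon-separated classpath fragment from every *.jar found in the
# directories listed (colon-separated) in EXTRA_LIB_DIRS. Paths containing
# spaces are not handled -- mirroring typical entrypoint-script simplicity.
build_extra_classpath() {
    result=""
    for dir in $(printf '%s' "${EXTRA_LIB_DIRS:-/opt/flink/opt}" | tr ':' ' '); do
        for jar in "$dir"/*.jar; do
            [ -e "$jar" ] || continue   # skip the unexpanded glob for empty dirs
            result="$result:$jar"
        done
    done
    printf '%s\n' "${result#:}"         # strip the leading colon
}
```

An image entrypoint could then append the function's output to the classpath it exports; handling this inside constructFlinkClassPath in config.sh, as the ticket proposes, would remove the need for any wrapper script.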
[jira] [Closed] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski closed FLINK-10960. --- Resolution: Workaround Release Note: It seems this issue was as David described, due to restoring into a smaller state machine than had existed before > CEP: Job Failure when .times(2) is used > --- > > Key: FLINK-10960 > URL: https://issues.apache.org/jira/browse/FLINK-10960 > Project: Flink > Issue Type: Bug > Components: CEP >Affects Versions: 1.6.2 >Reporter: Thomas Wozniakowski >Priority: Critical > > Hi Guys, > Encountered a strange one today. We use the CEP library in a configurable way > where we plug a config file into the Flink Job JAR and it programmatically > sets up a bunch of CEP operators matching the config file. > I encountered a strange bug when I was testing with some artificially low > numbers in our testing environment today. The CEP code we're using (modified > slightly) is: > {code:java} > Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) > .times(config.getNumberOfUniqueEvents()) > .where(uniquenessCheckOnAlreadyMatchedEvents()) > .within(seconds(config.getWithinSeconds())); > {code} > When using the {{numberOfUniqueEvents: 2}}, I started seeing the following > error killing the job whenever a match was detected: > {quote} > java.lang.RuntimeException: Exception occurred while processing valve output > watermark: > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265) > at > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) > at > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184) > at > 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 > does not exist in the NFA. NFA has states [Final State $endState$ [ > ]), Normal State eventSequence [ > StateTransition(TAKE, from eventSequenceto $endState$, with condition), > StateTransition(IGNORE, from eventSequenceto eventSequence, with > condition), > ]), Start State eventSequence:0 [ > StateTransition(TAKE, from eventSequence:0to eventSequence, with > condition), > ])] > at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144) > at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270) > at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244) > at > org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389) > at > org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293) > at > org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251) > at > org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746) > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262) > {quote} > Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. > Changing it back to 2 brought the problem back. It seems to be specifically > related to the value of 2. 
> This is not a blocking issue for me because we typically use much higher > numbers than this in production anyway, but I figured you guys might want to > know about this issue. > Let me know if you need any more information. > Tom -- This message was sent by Atlassian JIRA (v7.6.3#76005)
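The release note above ("restoring into a smaller state machine") can be made concrete from the NFA dump in the quoted error: a {{.times(n)}} stage named {{eventSequence}} appears to expand into suffixed states {{eventSequence:0 ... eventSequence:n-2}} plus a final unsuffixed {{eventSequence}} state. That naming scheme is inferred from the error output, not from Flink source; modelling it shows why a computation savepointed in {{eventSequence:2}} under a larger n has nowhere to land after restoring with {{.times(2)}}:

```shell
# Toy model (inferred from the error output, see above) of the state names a
# .times(n) stage produces: NAME:0 .. NAME:(n-2), then the final unsuffixed NAME.
nfa_states() {   # usage: nfa_states NAME N
    name=$1; n=$2
    out=""
    i=0
    while [ "$i" -le $((n - 2)) ]; do
        out="$out$name:$i "
        i=$((i + 1))
    done
    printf '%s%s\n' "$out" "$name"
}

nfa_states eventSequence 5   # includes eventSequence:2, so a savepoint may reference it
nfa_states eventSequence 2   # no eventSequence:2 -> restore fails with the quoted error
```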
[jira] [Updated] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10960: Priority: Critical (was: Major)
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718899#comment-16718899 ] Thomas Wozniakowski commented on FLINK-10960: - Ok, so a way to reproduce this should be:
1. Set up a job with, say {{.times(5)}}
2. Send through 4 events that satisfy the pattern
3. Stop and savepoint the job
4. Change the config to {{.times(3)}}
5. Restore the job from a savepoint?
And it should blow up because state 4 no longer exists? I'll try and write an E2E test to reproduce this. If I understand correctly this should only happen when restoring a job where the {{.times(n)}} has *decreased*, it shouldn't have a problem when it has increased?
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
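The numbered reproduction recipe maps onto the standard CLI flow; a hedged sketch (job id, savepoint paths and the jar name are placeholders, flags as in the 1.6-era {{flink}} CLI):

```
# 1-2. submit the .times(5) job and feed it 4 events that satisfy the pattern
# 3.  cancel with a savepoint
flink cancel -s /savepoints <jobId>
# 4.  rebuild or reconfigure the job so the pattern uses .times(3)
# 5.  restore -- expect the "State ... does not exist in the NFA" failure
flink run -s /savepoints/savepoint-<id> modified-job.jar
```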
[jira] [Comment Edited] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718899#comment-16718899 ] Thomas Wozniakowski edited comment on FLINK-10960 at 12/12/18 12:11 PM
[jira] [Comment Edited] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718899#comment-16718899 ] Thomas Wozniakowski edited comment on FLINK-10960 at 12/12/18 12:11 PM: Ok, so a way to reproduce this should be: 1. Set up a job with, say {{.times(5)}} 2. Send through 4 events that satisfy the pattern 3. Stop and savepoint the job 4. Change the config to {{.times(3)}} 5. Restore the job from a savepoint? And it should blow up because state 4 no longer exists? I'll try and write an E2E test to reproduce this. If I understand correctly this should only happen when restoring a job where the {{.times( ... )}} has *decreased*, it shouldn't have a problem when it has increased? was (Author: jamalarm): Ok, so a way to reproduce this should be: 1. Set up a job with, say {{.times(5)}} 2. Send through 4 events that satisfy the pattern 3. Stop and savepoint the job 4. Change the config to {{.times(3)}} 5. Restore the job from a savepoint? And it should blow up because state 4 no longer exists? I'll try and write an E2E test to reproduce this. If I understand correctly this should only happen when restoring a job where the {{.times(n)}} has *decreased*, it shouldn't have a problem when it has increased? > CEP: Job Failure when .times(2) is used > --- > > Key: FLINK-10960 > URL: https://issues.apache.org/jira/browse/FLINK-10960 > Project: Flink > Issue Type: Bug > Components: CEP >Affects Versions: 1.6.2 >Reporter: Thomas Wozniakowski >Priority: Critical > > Hi Guys, > Encountered a strange one today. We use the CEP library in a configurable way > where we plug a config file into the Flink Job JAR and it programmatically > sets up a bunch of CEP operators matching the config file. > I encountered a strange bug when I was testing with some artificially low > numbers in our testing environment today. 
The CEP code we're using (modified > slightly) is: > {code:java} > Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) > .times(config.getNumberOfUniqueEvents()) > .where(uniquenessCheckOnAlreadyMatchedEvents()) > .within(seconds(config.getWithinSeconds())); > {code} > When using the {{numberOfUniqueEvents: 2}}, I started seeing the following > error killing the job whenever a match was detected: > {quote} > ava.lang.RuntimeException: Exception occurred while processing valve output > watermark: > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265) > at > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) > at > org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184) > at > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 > does not exist in the NFA. 
NFA has states [Final State $endState$ [ > ]), Normal State eventSequence [ > StateTransition(TAKE, from eventSequenceto $endState$, with condition), > StateTransition(IGNORE, from eventSequenceto eventSequence, with > condition), > ]), Start State eventSequence:0 [ > StateTransition(TAKE, from eventSequence:0to eventSequence, with > condition), > ])] > at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144) > at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270) > at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244) > at > org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389) > at > org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293) > at > org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251) > at > org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746) > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler
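The state-naming scheme visible in the quoted NFA dumps can be sketched as a toy model (plain Java; this is an inference from the error messages in this thread, not Flink's actual NFA code): {{.times(n)}} appears to compile to numbered states name:0 through name:(n-2), one unnumbered looping state, and $endState$, so a savepoint holding a pointer to a high-numbered state cannot resolve after the pattern is recompiled with a smaller n.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of the CEP state names seen in the NFA dumps in this thread.
// NOT Flink's implementation; the naming scheme is inferred from the quoted
// errors (eventSequence:0 and eventSequence for .times(2); purchaseSequence:0
// through purchaseSequence:3 plus purchaseSequence for .times(5)).
public class TimesStateModel {

    // States generated for a pattern named `name` with .times(n), per the
    // dumps: name:0 .. name:(n-2), one unnumbered looping state, $endState$.
    static Set<String> statesFor(String name, int n) {
        Set<String> states = new LinkedHashSet<>();
        for (int i = 0; i <= n - 2; i++) {
            states.add(name + ":" + i);
        }
        states.add(name);          // unnumbered looping state
        states.add("$endState$");  // final state
        return states;
    }

    public static void main(String[] args) {
        // .times(5) yields eventSequence:0 .. eventSequence:3; .times(3) stops at :1.
        Set<String> before = statesFor("eventSequence", 5);
        Set<String> after = statesFor("eventSequence", 3);
        // A partial match savepointed under .times(5) can reference a numbered
        // state that no longer exists once the job restarts with .times(3),
        // consistent with the "State X does not exist in the NFA" failures.
        System.out.println(before.contains("eventSequence:3")); // true
        System.out.println(after.contains("eventSequence:3"));  // false
    }
}
```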
[jira] [Commented] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718816#comment-16718816 ] Thomas Wozniakowski commented on FLINK-10960: - Hi [~dawidwys], I'm hoping you might be able to point me in the right direction, we just had this same error in production with different parameters. Below is the error I got: {code:text} java.lang.RuntimeException: Exception occurred while processing valve output watermark: at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265) at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputStreamStatus(StatusWatermarkValve.java:152) at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:188) at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.flink.util.FlinkRuntimeException: State purchaseSequence:4 does not exist in the NFA. 
NFA has states [Normal State purchaseSequence [ StateTransition(TAKE, from purchaseSequenceto $endState$, with condition), StateTransition(IGNORE, from purchaseSequenceto purchaseSequence, with condition), ]), Final State $endState$ [ ]), Normal State purchaseSequence:2 [ StateTransition(TAKE, from purchaseSequence:2to purchaseSequence:1, with condition), StateTransition(IGNORE, from purchaseSequence:2to purchaseSequence:2, with condition), ]), Start State purchaseSequence:3 [ StateTransition(TAKE, from purchaseSequence:3to purchaseSequence:2, with condition), ]), Normal State purchaseSequence:0 [ StateTransition(TAKE, from purchaseSequence:0to purchaseSequence, with condition), StateTransition(IGNORE, from purchaseSequence:0to purchaseSequence:0, with condition), ]), Normal State purchaseSequence:1 [ StateTransition(TAKE, from purchaseSequence:1to purchaseSequence:0, with condition), StateTransition(IGNORE, from purchaseSequence:1to purchaseSequence:1, with condition), ])] at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144) at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270) at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244) at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389) at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293) at org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251) at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746) at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262) {code} I think this is something to do with savepoint restores. 
In this case we were making a config change that stopped the job with a savepoint, then started it again with slightly different parameters. One of these changed {{.times(8)}} to {{.times(5)}} on one of our CEP operators. Our automated build process has E2E tests for this exact case, so I don't think it's a limitation in Flink. I'm a bit at a loss here to work out what the problem is. We got the job running by deleting the savepoint and starting the job from scratch. What does that stacktrace suggest to you? I'm running lots of local tests with variants of 1. Stopping job with savepoint 2. Changing config (driving changes in CEP.Pattern()) operators 3. Restarting the job from the savepoint but I haven't managed to recreate the error locally...
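The stack traces in this thread all fail inside NFA.isStartState while a watermark advances (i.e. while timing out partial matches), which fits the savepoint theory: a restored partial match still points at purchaseSequence:4, a state the NFA recompiled with {{.times(5)}} no longer contains. A minimal sketch of that lookup failure (hypothetical plain Java, not Flink's code; state names taken from the dump above):

```java
import java.util.Set;

// Hypothetical sketch of the check that produces
// "State purchaseSequence:4 does not exist in the NFA." -- not Flink's code.
public class StaleStateLookup {

    // State names of the NFA recompiled with .times(5), as listed in the dump.
    static final Set<String> RECOMPILED_NFA = Set.of(
            "purchaseSequence:3", "purchaseSequence:2", "purchaseSequence:1",
            "purchaseSequence:0", "purchaseSequence", "$endState$");

    // Resolve a state name restored from the savepoint; throws when the name
    // is stale, mirroring the FlinkRuntimeException in the stack trace.
    static void lookup(String restoredStateName) {
        if (!RECOMPILED_NFA.contains(restoredStateName)) {
            throw new RuntimeException(
                    "State " + restoredStateName + " does not exist in the NFA.");
        }
    }

    public static void main(String[] args) {
        lookup("purchaseSequence:3"); // fine: still part of the .times(5) NFA
        try {
            // Restored from a .times(8) savepoint, where :4 did exist.
            lookup("purchaseSequence:4");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```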
[jira] [Updated] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10960: Description: Hi Guys, Encountered a strange one today. We use the CEP library in a configurable way where we plug a config file into the Flink Job JAR and it programmatically sets up a bunch of CEP operators matching the config file. I encountered a strange bug when I was testing with some artificially low numbers in our testing environment today. The CEP code we're using (modified slightly) is: {code:java} Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) .times(config.getNumberOfUniqueEvents()) .where(uniquenessCheckOnAlreadyMatchedEvents()) .within(seconds(config.getWithinSeconds())); {code} When using the {{numberOfUniqueEvents: 2}}, I started seeing the following error killing the job whenever a match was detected: {quote} ava.lang.RuntimeException: Exception occurred while processing valve output watermark: at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265) at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184) at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 does not exist in the NFA. 
NFA has states [Final State $endState$ [ ]), Normal State eventSequence [ StateTransition(TAKE, from eventSequenceto $endState$, with condition), StateTransition(IGNORE, from eventSequenceto eventSequence, with condition), ]), Start State eventSequence:0 [ StateTransition(TAKE, from eventSequence:0to eventSequence, with condition), ])] at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144) at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270) at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244) at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389) at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293) at org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251) at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746) at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262) {quote} Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. Changing it back to 2 brought the problem back. It seems to be specifically related to the value of 2. This is not a blocking issue for me because we typically use much higher numbers than this in production anyway, but I figured you guys might want to know about this issue. Let me know if you need any more information. Tom
[jira] [Commented] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694686#comment-16694686 ] Thomas Wozniakowski commented on FLINK-10960: - Hey [~dawidwys], It's tricky to provide the example because the CEP usage that's producing the error is internal business logic from my company which I can't share publicly. The 2 states thing was my mistake, I was trying to strip everything specific to our code out of the error message (replacing all instances of "purchase" with "event" and I missed some, apologies for the confusion). The exception message does correspond to the linked pattern. I will have a think about how I might be able to provide a more reproducible example. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10960) CEP: Job Failure when .times(2) is used
[ https://issues.apache.org/jira/browse/FLINK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10960: Description: Hi Guys, Encountered a strange one today. We use the CEP library in a configurable way where we plug a config file into the Flink Job JAR and it programmatically sets up a bunch of CEP operators matching the config file. I encountered a strange bug when I was testing with some artificially low numbers in our testing environment today. The CEP code we're using (modified slightly) is:
{{ Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) .times(config.getNumberOfUniqueEvents()) .where(uniquenessCheckOnAlreadyMatchedEvents()) .within(seconds(config.getWithinSeconds())); }}
When using {{ numberOfUniqueEvents: 2 }}, I started seeing the following error killing the job whenever a match was detected:
{quote}
java.lang.RuntimeException: Exception occurred while processing valve output watermark:
 at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
 at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
 at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
 at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
 at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
 at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 does not exist in the NFA.
NFA has states [Final State $endState$ [ ]), Normal State eventSequence [ StateTransition(TAKE, from eventSequence to $endState$, with condition), StateTransition(IGNORE, from eventSequence to purchaseSequence, with condition), ]), Start State purchaseSequence:0 [ StateTransition(TAKE, from eventSequence:0 to purchaseSequence, with condition), ])]
 at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
 at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
 at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
 at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
 at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
 at org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
 at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
 at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
 at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
{quote}
Changing the config to {{numberOfUniqueEvents: 3}} fixed the problem. Changing it back to 2 brought the problem back. It seems to be specifically related to the value of 2. This is not a blocking issue for me because we typically use much higher numbers than this in production anyway, but I figured you guys might want to know about this issue. Let me know if you need any more information. Tom
[jira] [Created] (FLINK-10960) CEP: Job Failure when .times(2) is used
Thomas Wozniakowski created FLINK-10960: --- Summary: CEP: Job Failure when .times(2) is used Key: FLINK-10960 URL: https://issues.apache.org/jira/browse/FLINK-10960 Project: Flink Issue Type: Bug Components: CEP Affects Versions: 1.6.2 Reporter: Thomas Wozniakowski Hi Guys, Encountered a strange one today. We use the CEP library in a configurable way where we plug a config file into the Flink Job JAR and it programmatically sets up a bunch of CEP operators matching the config file. I encountered a strange bug when I was testing with some artificially low numbers in our testing environment today. The CEP code we're using (modified slightly) is:
{{ Pattern.begin(EVENT_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) .times(config.getNumberOfUniqueEvents()) .where(uniquenessCheckOnAlreadyMatchedEvents()) .within(seconds(config.getWithinSeconds())); }}
When using {{ numberOfUniqueEvents: 2 }}, I started seeing the following error killing the job whenever a match was detected:
{quote}
java.lang.RuntimeException: Exception occurred while processing valve output watermark:
 at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:265)
 at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189)
 at org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111)
 at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:184)
 at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
 at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: State eventSequence:2 does not exist in the NFA.
NFA has states [Final State $endState$ [ ]), Normal State eventSequence [ StateTransition(TAKE, from eventSequence to $endState$, with condition), StateTransition(IGNORE, from eventSequence to purchaseSequence, with condition), ]), Start State purchaseSequence:0 [ StateTransition(TAKE, from eventSequence:0 to purchaseSequence, with condition), ])]
 at org.apache.flink.cep.nfa.NFA.isStartState(NFA.java:144)
 at org.apache.flink.cep.nfa.NFA.isStateTimedOut(NFA.java:270)
 at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:244)
 at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.advanceTime(AbstractKeyedCEPPatternOperator.java:389)
 at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.onEventTime(AbstractKeyedCEPPatternOperator.java:293)
 at org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:251)
 at org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:128)
 at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:746)
 at org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:262)
{quote}
Changing the config to {{ numberOfUniqueEvents: 3 }} fixed the problem. Changing it back to 2 brought the problem back. It seems to be specifically related to the value of 2. This is not a blocking issue for me because we typically use much higher numbers than this in production anyway, but I figured you guys might want to know about this issue. Let me know if you need any more information. Tom -- This message was sent by Atlassian JIRA (v7.6.3#76005)
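For readers skimming the pattern above, here is a minimal, self-contained sketch of the semantics it encodes: collect N *unique* events inside a time window, emit a match, and then skip past the matched events so they cannot seed the next match. This is illustrative stdlib Java only, not the Flink CEP API; the class and method names (`SimpleCepMatcher`, `onEvent`) are invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Flink-free sketch of the reported pattern's semantics; names are hypothetical.
public class SimpleCepMatcher {
    private static final class Entry {
        final String value; final long ts;
        Entry(String value, long ts) { this.value = value; this.ts = ts; }
    }

    private final int numberOfUniqueEvents; // mirrors config.getNumberOfUniqueEvents()
    private final long windowMillis;        // mirrors within(seconds(...))
    private final Deque<Entry> buffer = new ArrayDeque<>();

    public SimpleCepMatcher(int numberOfUniqueEvents, long windowMillis) {
        this.numberOfUniqueEvents = numberOfUniqueEvents;
        this.windowMillis = windowMillis;
    }

    /** Returns true when this event completes a match of N unique events. */
    public boolean onEvent(String value, long ts) {
        // within(...): drop buffered events that have fallen out of the window
        while (!buffer.isEmpty() && ts - buffer.peekFirst().ts > windowMillis) {
            buffer.removeFirst();
        }
        // uniquenessCheckOnAlreadyMatchedEvents(): ignore values already buffered
        for (Entry e : buffer) {
            if (e.value.equals(value)) return false;
        }
        buffer.addLast(new Entry(value, ts));
        if (buffer.size() == numberOfUniqueEvents) {
            buffer.clear(); // skipPastLastEvent(): matched events are consumed
            return true;
        }
        return false;
    }
}
```

Nothing here reproduces the NFA bug itself (that lives in Flink's state bookkeeping for `.times(2)`); the sketch only clarifies what the failing pattern was asking for.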
[jira] [Commented] (FLINK-10570) State grows unbounded when "within" constraint not applied
[ https://issues.apache.org/jira/browse/FLINK-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674424#comment-16674424 ] Thomas Wozniakowski commented on FLINK-10570: - Hi [~dawidwys], I notice this is scheduled for inclusion in 1.6.3, but it looks like the PR was merged several days before 1.6.2 was released. Just wanted to check if this fix might have snuck into 1.6.2 before it went out? Otherwise we're good to wait on 1.6.3, but it would be super handy if the fix was available now :) > State grows unbounded when "within" constraint not applied > -- > > Key: FLINK-10570 > URL: https://issues.apache.org/jira/browse/FLINK-10570 > Project: Flink > Issue Type: Bug > Components: CEP >Affects Versions: 1.6.1 >Reporter: Thomas Wozniakowski >Assignee: Dawid Wysakowicz >Priority: Major > Labels: pull-request-available > Fix For: 1.6.3, 1.7.0 > > > We have been running some failure monitoring using the CEP library. Simple > stuff that should probably have been implemented with a window, rather than > CEP, but we had already set the project up to use CEP elsewhere and it was > trivial to add this. > We ran the following pattern (on 1.4.2): > {code:java} > begin(PURCHASE_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) > .subtype(PurchaseEvent.class) > .times(100) > {code} > and then flat selected the responses if the failure ratio was over a certain > threshold. > With 1.6.1, the state size of the CEP operator for this pattern grows > unbounded, and eventually destroys the job with an OOM exception. We have > many CEP operators in this job but all the rest use a "within" call. > In 1.4.2, it seems events would be discarded once they were no longer in the > 100 most recent, now it seems they are held onto indefinitely. > We have a workaround (we're just going to add a "within" call to force the > CEP operator to discard old events), but it would be useful if we could have > the old behaviour back. 
> Please let me know if I can provide any more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10570) State grows unbounded when "within" constraint not applied
Thomas Wozniakowski created FLINK-10570: --- Summary: State grows unbounded when "within" constraint not applied Key: FLINK-10570 URL: https://issues.apache.org/jira/browse/FLINK-10570 Project: Flink Issue Type: Bug Components: CEP Affects Versions: 1.6.1 Reporter: Thomas Wozniakowski We have been running some failure monitoring using the CEP library. Simple stuff that should probably have been implemented with a window, rather than CEP, but we had already set the project up to use CEP elsewhere and it was trivial to add this. We ran the following pattern (on 1.4.2): {code:java} begin(PURCHASE_SEQUENCE, AfterMatchSkipStrategy.skipPastLastEvent()) .subtype(PurchaseEvent.class) .times(100) {code} and then flat selected the responses if the failure ratio was over a certain threshold. With 1.6.1, the state size of the CEP operator for this pattern grows unbounded, and eventually destroys the job with an OOM exception. We have many CEP operators in this job but all the rest use a "within" call. In 1.4.2, it seems events would be discarded once they were no longer in the 100 most recent, now it seems they are held onto indefinitely. We have a workaround (we're just going to add a "within" call to force the CEP operator to discard old events), but it would be useful if we could have the old behaviour back. Please let me know if I can provide any more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
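The retention difference described above can be illustrated without Flink. In this hypothetical sketch (names invented, not Flink code), a buffer with no time bound grows with every event, which is the unbounded-state behavior reported for 1.6.1 when `within` is omitted, while a time-bounded buffer prunes expired entries as new events arrive, which is what the `within(...)` workaround achieves:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: why omitting a time bound lets per-key state grow forever.
public class BoundedEventBuffer {
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private final long windowMillis; // <= 0 models "no within(...)": unbounded

    public BoundedEventBuffer(long windowMillis) { this.windowMillis = windowMillis; }

    public void add(long ts) {
        if (windowMillis > 0) {
            // prune partial matches that can no longer complete inside the window
            while (!timestamps.isEmpty() && ts - timestamps.peekFirst() > windowMillis) {
                timestamps.removeFirst();
            }
        }
        timestamps.addLast(ts);
    }

    public int size() { return timestamps.size(); }
}
```

With one event every 100 ms, the unbounded variant holds every event ever seen, while a 1-second window caps retention at roughly eleven entries regardless of how long the stream runs.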
[jira] [Commented] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638747#comment-16638747 ] Thomas Wozniakowski commented on FLINK-10475: - Sure - I can update the docs. I'll say that it's recommended to use *3.5.4-beta* or *3.4.13*. Sound reasonable? > Standalone HA - Leader election is not triggered on loss of leader (ZK > 3.5.3-beta only) > --- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.6.1, 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Minor > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). > Happy to see that the issue of jobgraphs hanging around forever has been > resolved in standalone/zookeeper HA mode, but now I'm seeing a different > issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > Please give me a shout if I can provide any more useful information > EDIT > Jobmanager logs attached below. You can see that I brought up a fresh > cluster, one JM was elected leader (no taskmanagers or actual jobs in this > case). I then let the cluster sit there for half an hour or so, before > killing the leader. The log files are snapshotted maybe half an hour after > that. You can see that a second election was never triggered. > In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup > previously worked with 1.4.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Summary: Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only) (was: Standalone HA - Leader election is not triggered on loss of leader) > Standalone HA - Leader election is not triggered on loss of leader (ZK > 3.5.3-beta only) > --- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.6.1, 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Blocker > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). > Happy to see that the issue of jobgraphs hanging around forever has been > resolved in standalone/zookeeper HA mode, but now I'm seeing a different > issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > Please give me a shout if I can provide any more useful information > EDIT > Jobmanager logs attached below. You can see that I brought up a fresh > cluster, one JM was elected leader (no taskmanagers or actual jobs in this > case). I then let the cluster sit there for half an hour or so, before > killing the leader. The log files are snapshotted maybe half an hour after > that. You can see that a second election was never triggered. > In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup > previously worked with 1.4.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Priority: Minor (was: Blocker) > Standalone HA - Leader election is not triggered on loss of leader (ZK > 3.5.3-beta only) > --- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.6.1, 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Minor > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). > Happy to see that the issue of jobgraphs hanging around forever has been > resolved in standalone/zookeeper HA mode, but now I'm seeing a different > issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > Please give me a shout if I can provide any more useful information > EDIT > Jobmanager logs attached below. You can see that I brought up a fresh > cluster, one JM was elected leader (no taskmanagers or actual jobs in this > case). I then let the cluster sit there for half an hour or so, before > killing the leader. The log files are snapshotted maybe half an hour after > that. You can see that a second election was never triggered. > In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup > previously worked with 1.4.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636033#comment-16636033 ] Thomas Wozniakowski commented on FLINK-10475: - Aha - so it appears to be the version of Zookeeper. Using *3.5.3-beta* causes the silent no-failover, using *3.5.4-beta* works as intended. Maybe worth adding a client side check to refuse to start if connecting to a *3.5.3-beta* quorum? > Standalone HA - Leader election is not triggered on loss of leader > -- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.6.1, 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Blocker > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). > Happy to see that the issue of jobgraphs hanging around forever has been > resolved in standalone/zookeeper HA mode, but now I'm seeing a different > issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > Please give me a shout if I can provide any more useful information > EDIT > Jobmanager logs attached below. You can see that I brought up a fresh > cluster, one JM was elected leader (no taskmanagers or actual jobs in this > case). I then let the cluster sit there for half an hour or so, before > killing the leader. The log files are snapshotted maybe half an hour after > that. You can see that a second election was never triggered. > In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup > previously worked with 1.4.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
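The client-side check suggested in the comment above could look roughly like the sketch below. This is hypothetical code, not anything that exists in Flink; how the server's version string would actually be obtained from the quorum is out of scope here, and `ZkVersionGuard` is an invented name.

```java
// Hypothetical sketch of a startup guard against the known-broken ZK version.
public final class ZkVersionGuard {
    static final String BROKEN_VERSION = "3.5.3-beta";

    /** Returns true if the reported ZooKeeper server version is safe to use. */
    public static boolean isSupported(String serverVersion) {
        return serverVersion != null && !serverVersion.startsWith(BROKEN_VERSION);
    }

    /** Refuse to start against a quorum that silently breaks failover. */
    public static void checkOrFail(String serverVersion) {
        if (!isSupported(serverVersion)) {
            throw new IllegalStateException(
                "ZooKeeper " + serverVersion + " silently breaks leader-election "
                + "failover; use 3.5.4-beta or 3.4.13 instead.");
        }
    }
}
```

Failing fast with an explicit error would turn the silent no-failover described in this ticket into an immediately diagnosable startup failure.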
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Description: Hey Guys, Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. Please give me a shout if I can provide any more useful information EDIT Jobmanager logs attached below. You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered. In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup previously worked with 1.4.3. was: Hey Guys, Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. Please give me a shout if I can provide any more useful information EDIT Jobmanager logs attached below. 
You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered. In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup previously worked with 1.4.3. > Standalone HA - Leader election is not triggered on loss of leader > -- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.6.1, 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Blocker > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). > Happy to see that the issue of jobgraphs hanging around forever has been > resolved in standalone/zookeeper HA mode, but now I'm seeing a different > issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > Please give me a shout if I can provide any more useful information > EDIT > Jobmanager logs attached below. You can see that I brought up a fresh > cluster, one JM was elected leader (no taskmanagers or actual jobs in this > case). I then let the cluster sit there for half an hour or so, before > killing the leader. The log files are snapshotted maybe half an hour after > that. You can see that a second election was never triggered. > In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup > previously worked with 1.4.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Affects Version/s: 1.6.1 > Standalone HA - Leader election is not triggered on loss of leader > -- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.6.1, 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Blocker > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4. Happy to see that the issue of > jobgraphs hanging around forever has been resolved in standalone/zookeeper HA > mode, but now I'm seeing a different issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > Please give me a shout if I can provide any more useful information > EDIT > Jobmanager logs attached below. You can see that I brought up a fresh > cluster, one JM was elected leader (no taskmanagers or actual jobs in this > case). I then let the cluster sit there for half an hour or so, before > killing the leader. The log files are snapshotted maybe half an hour after > that. You can see that a second election was never triggered. > In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup > previously worked with 1.4.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Description: Hey Guys, Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. Please give me a shout if I can provide any more useful information EDIT Jobmanager logs attached below. You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered. In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup previously worked with 1.4.3. was: Hey Guys, Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. Please give me a shout if I can provide any more useful information EDIT Jobmanager logs attached below. 
You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered. In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup previously worked with 1.4.3. > Standalone HA - Leader election is not triggered on loss of leader > -- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.6.1, 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Blocker > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). > Happy to see that the issue of jobgraphs hanging around forever has been > resolved in standalone/zookeeper HA mode, but now I'm seeing a different > issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > Please give me a shout if I can provide any more useful information > EDIT > Jobmanager logs attached below. You can see that I brought up a fresh > cluster, one JM was elected leader (no taskmanagers or actual jobs in this > case). I then let the cluster sit there for half an hour or so, before > killing the leader. The log files are snapshotted maybe half an hour after > that. You can see that a second election was never triggered. > In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup > previously worked with 1.4.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Description: Hey Guys, Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. Please give me a shout if I can provide any more useful information EDIT Jobmanager logs attached below. You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered. In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup previously worked with 1.4.3. was: Hey Guys, Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. The logs of the remaining job managers were full of this: {quote} 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not retrieve the redirect address. 
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [1 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
 at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
 at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
 at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
 at akka.dispatch.OnComplete.internal(Future.scala:258)
 at akka.dispatch.OnComplete.internal(Future.scala:256)
 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
 at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
 at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
 at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
 at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
 at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
 at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
 at java.lang.Thread.run(Thread.java:745)
{quote}
Please give me a shout if I can provide any more useful information. Jobmanager logs attached below. You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered. In case it's useful, our zookeeper quorum
[jira] [Commented] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635441#comment-16635441 ] Thomas Wozniakowski commented on FLINK-10475: - [~till.rohrmann] Thanks for the response - I've added the three JM logs in the description, together with a bit more info. > Standalone HA - Leader election is not triggered on loss of leader > -- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Blocker > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4. Happy to see that the issue of > jobgraphs hanging around forever has been resolved in standalone/zookeeper HA > mode, but now I'm seeing a different issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > The logs of the remaining job managers were full of this: > {quote} > 2018-10-01 15:35:44,558 ERROR > org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not > retrieve the redirect address. > java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: > Ask timed out on > [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after > [1 ms]. Sender[null] sent message of type > "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". 
> {quote}
> Please give me a shout if I can provide any more useful information
> Jobmanager logs attached below. You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup previously worked with 1.4.3.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Description: Hey Guys, Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. The logs of the remaining job managers were full of this: {quote} 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not retrieve the redirect address. java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [1 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". 
{quote}
Please give me a shout if I can provide any more useful information.
Jobmanager logs attached below. You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered.
In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup previously worked with 1.4.3.
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Attachment: t1.log t2.log t3.log > Standalone HA - Leader election is not triggered on loss of leader > -- > > Key: FLINK-10475 > URL: https://issues.apache.org/jira/browse/FLINK-10475 > Project: Flink > Issue Type: Bug >Affects Versions: 1.5.4 >Reporter: Thomas Wozniakowski >Priority: Blocker > Attachments: t1.log, t2.log, t3.log > > > Hey Guys, > Just testing the new bugfix release of 1.5.4. Happy to see that the issue of > jobgraphs hanging around forever has been resolved in standalone/zookeeper HA > mode, but now I'm seeing a different issue. > It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of > zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new > version. I then proceeded to kill the leading jobmanager to test the failover. > The remaining jobmanagers never triggered a leader election, and simply got > stuck. > The logs of the remaining job managers were full of this: > {quote} > 2018-10-01 15:35:44,558 ERROR > org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not > retrieve the redirect address. > java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: > Ask timed out on > [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after > [1 ms]. Sender[null] sent message of type > "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". 
> {quote}
> Please give me a shout if I can provide any more useful information
[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634203#comment-16634203 ] Thomas Wozniakowski commented on FLINK-10184: - [~till.rohrmann] I've now tested the fix on 1.5.4. It seems to have fixed the job graph problem, but I'm encountering another blocking issue on HA failover (leader election not triggering at all). I've raised FLINK-10475 to track it. > HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel > -- > > Key: FLINK-10184 > URL: https://issues.apache.org/jira/browse/FLINK-10184 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination >Affects Versions: 1.5.2, 1.6.0 >Reporter: Thomas Wozniakowski >Priority: Blocker > > We have encountered a blocking issue when upgrading our cluster to 1.5.2. > It appears that, when jobs are cancelled manually (in our case with a > savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} > node. > This means that, if you start a job, cancel it, restart it, cancel it, etc. > You will end up with many job graphs stored in zookeeper, but none of the > corresponding blobs in the Flink HA directory. > When a HA failover occurs, the newly elected leader retrieves all of those > old JobGraph objects from Zookeeper, then goes looking for the corresponding > blobs in the HA directory. The blobs are not there so the JobManager explodes > and the process dies. > At this point the cluster has to be fully stopped, the zookeeper jobgraphs > cleared out by hand, and all the jobmanagers restarted. > I can see the following line in the JobManager logs: > {quote} > 2018-08-20 16:17:20,776 INFO > org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - > Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper. > {quote} > But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is > still very much there. 
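The workaround described above (fully stop the cluster, clear the stale jobgraph znodes out of ZooKeeper by hand, restart the jobmanagers) can be scripted. The sketch below is hypothetical: the /flink/default chroot, the jobgraphs node layout, and the zkCli.sh invocation are assumptions for a default standalone HA setup, and the script only prints the commands unless DO_IT=yes is set:

```shell
#!/bin/sh
# Hypothetical cleanup for stale job graphs left behind in ZooKeeper
# (the FLINK-10184 symptom). All paths here are assumptions for a default
# standalone HA install; check the high-availability.zookeeper.path.*
# settings in your flink-conf.yaml before touching anything.
ZK_CLI="${ZK_CLI:-zkCli.sh -server zk1:2181}"
ZK_JOBGRAPHS="${ZK_JOBGRAPHS:-/flink/default/jobgraphs}"
STALE_JOB="${1:-4e9a5a9d70ca99dbd394c35f8dfeda65}"   # job id from the report

# Dry-run guard: deleting znodes under a live cluster is destructive,
# so only execute when DO_IT=yes; otherwise print what would run.
run() {
    if [ "${DO_IT:-no}" = "yes" ]; then "$@"; else echo "DRY-RUN: $*"; fi
}

# 1. With all jobmanagers stopped, list the job graphs ZooKeeper still holds.
run $ZK_CLI ls "$ZK_JOBGRAPHS"
# 2. Remove the graph whose blob is missing from the HA storage directory
#    ("deleteall" on ZooKeeper 3.5+ zkCli; older builds used "rmr").
run $ZK_CLI deleteall "$ZK_JOBGRAPHS/$STALE_JOB"
```

With DO_IT unset this only prints the two zkCli invocations, which doubles as a cheap way to verify the paths before running them for real.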
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Description: Hey Guys, Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. The logs of the remaining job managers were full of this: {quote} 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not retrieve the redirect address. java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [1 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". 
{quote}
[jira] [Updated] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Wozniakowski updated FLINK-10475: Description: Hey Guys, Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. The logs of the remaining job managers were full of this: {quote} 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not retrieve the redirect address. java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [1 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". 
{quote}
Please give me a shout if I can provide any more useful information.
[jira] [Created] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader
Thomas Wozniakowski created FLINK-10475: --- Summary: Standalone HA - Leader election is not triggered on loss of leader Key: FLINK-10475 URL: https://issues.apache.org/jira/browse/FLINK-10475 Project: Flink Issue Type: Bug Affects Versions: 1.5.4 Reporter: Thomas Wozniakowski Hey Guys, Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue. It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover. The remaining jobmanagers never triggered a leader election, and simply got stuck. The logs of the remaining job managers were full of this: ``` 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not retrieve the redirect address. java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [1 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage". 
```
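For context, the standalone ZooKeeper HA mode being tested in this report is driven by a handful of flink-conf.yaml keys that must be set on every jobmanager. A minimal sketch follows; the host names and storage path are placeholders, not values from this cluster:

```yaml
# Standalone ZooKeeper HA (flink-conf.yaml). Hosts and paths are placeholders.
high-availability: zookeeper
# the 3-node quorum the reporter describes would list its three hosts here
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# must be a directory every jobmanager can reach (HDFS, NFS, S3, ...);
# job graph blobs are stored here while ZooKeeper holds only pointers
high-availability.storageDir: file:///shared/flink/ha/
high-availability.cluster-id: /default
```

With this in place the jobmanagers elect a leader through ZooKeeper; the bug reported here is that the surviving jobmanagers never start a new election after the leader dies.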
[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621772#comment-16621772 ] Thomas Wozniakowski commented on FLINK-10184: - Hey [~till.rohrmann], It's kind of non-trivial for me to test the fixes, as our cluster is currently running the non-hadoop 1.4.3 build. As far as I can see the only snapshot builds available contain hadoop, so I didn't know if the tests would be representative. I was waiting on the official release binaries before spending time testing. I can have a go at testing from a local maven build, but I've had significant trouble wrestling with maven on the Flink codebase in the past (trying to build locally). If you could point me at a branch (say for the 1.5 release) and let me know what maven command I should use to build it with no hadoop, and scala 2.11, then I would be very grateful. I could then use those binaries for testing. Tom
[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609299#comment-16609299 ]

Thomas Wozniakowski commented on FLINK-10184:
---------------------------------------------

Hey [~till.rohrmann] - this may sound like a silly question, but I'm not actually sure of the best way to deploy your fix. Should I check out the branch, do a Maven build, and deploy the cluster using those artifacts? Do I also need to rebuild my job jar against a patched version of the Flink dependency? Apologies - we're only set up to pull binaries from the Apache servers at the moment, so this isn't obvious to me.

Tom

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10184
>                 URL: https://issues.apache.org/jira/browse/FLINK-10184
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>             Fix For: 1.6.1, 1.7.0, 1.5.4
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} node.
> This means that if you start a job, cancel it, restart it, cancel it, etc., you will end up with many job graphs stored in Zookeeper, but none of the corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those old JobGraph objects from Zookeeper, then goes looking for the corresponding blobs in the HA directory. The blobs are not there, so the JobManager explodes and the process dies.
> At this point the cluster has to be fully stopped, the Zookeeper jobgraphs cleared out by hand, and all the JobManagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper, the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is still very much there.
[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605882#comment-16605882 ]

Thomas Wozniakowski commented on FLINK-10184:
---------------------------------------------

Hi [~till.rohrmann],

Happy to help out with testing the fix. I'll keep an eye on the Flink blog for the next release (or you can @ me if you need quicker feedback), deploy it to our testing environments, and re-run my tests. The bug was easy enough to reproduce in my experience.

Tom
[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605485#comment-16605485 ]

Thomas Wozniakowski commented on FLINK-10184:
---------------------------------------------

Hi [~till.rohrmann],

Yes - we are running 3/3/3 Zookeeper/JobManager/TaskManager nodes in standalone mode. Please let me know if you need any more info. This issue is currently blocking us and I'm more than happy to assist as much as I can in fixing it.

Tom
[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587293#comment-16587293 ]

Thomas Wozniakowski commented on FLINK-10184:
---------------------------------------------

I'm just combing through the Zookeeper logs to see if there's anything that might be helpful. I'm going to dump anything that looks a bit odd here:

{quote}
2018-08-21 10:27:05,657 [myid:160] - INFO [ProcessThread(sid:160 cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when processing sessionid:0x75066362001c type:create cxid:0x6 zxid:0x200fe txntype:-1 reqpath:n/a Error Path:/flink/cluster_one/leaderlatch/rest_server_lock Error:KeeperErrorCode = NoNode for /flink/cluster_one/leaderlatch/rest_server_lock

2018-08-21 10:27:05,938 [myid:160] - INFO [ProcessThread(sid:160 cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when processing sessionid:0x75066362001c type:create cxid:0x24 zxid:0x20104 txntype:-1 reqpath:n/a Error Path:/flink/cluster_one/leaderlatch/resource_manager_lock Error:KeeperErrorCode = NoNode for /flink/cluster_one/leaderlatch/resource_manager_lock

2018-08-21 10:27:05,944 [myid:160] - INFO [ProcessThread(sid:160 cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when processing sessionid:0x75066362001c type:create cxid:0x29 zxid:0x20105 txntype:-1 reqpath:n/a Error Path:/flink/cluster_one/leaderlatch/dispatcher_lock Error:KeeperErrorCode = NoNode for /flink/cluster_one/leaderlatch/dispatcher_lock
{quote}

{quote}
2018-08-21 10:28:35,032 [myid:160] - INFO [ProcessThread(sid:160 cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when processing sessionid:0x75066362001c type:create cxid:0xde zxid:0x20145 txntype:-1 reqpath:n/a Error Path:/flink/cluster_one/checkpoints/a5d03dfb348783950c006fe8d6e73fc5/061/ada19912-8c78-4f15-b1ef-f0acc5011559 Error:KeeperErrorCode = NodeExists for /flink/cluster_one/checkpoints/a5d03dfb348783950c006fe8d6e73fc5/061/ada19912-8c78-4f15-b1ef-f0acc5011559

2018-08-21 10:28:35,184 [myid:160] - INFO [ProcessThread(sid:160 cport:-1)::PrepRequestProcessor@880] - Got user-level KeeperException when processing sessionid:0x75066362001c type:create cxid:0xe2 zxid:0x20146 txntype:-1 reqpath:n/a Error Path:/flink/cluster_one/leaderlatch/a5d03dfb348783950c006fe8d6e73fc5/job_manager_lock Error:KeeperErrorCode = NoNode for /flink/cluster_one/leaderlatch/a5d03dfb348783950c006fe8d6e73fc5/job_manager_lock
{quote}
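For skimming dumps like the one above, the interesting bits are just the error code and the znode path of each user-level KeeperException. A throwaway parser along these lines can summarise them; the regex is tailored to this exact log format (an assumption - other ZooKeeper versions may format lines differently) and the sample lines are abbreviated copies of the ones quoted here:

```python
import re

# Extract (error code, znode path) pairs from ZooKeeper server log lines.
# Pattern assumes the "Error:KeeperErrorCode = <Code> for <path>" suffix
# seen in the quoted logs.
KEEPER_ERR = re.compile(r"Error:KeeperErrorCode = (\w+) for (\S+)")

log = (
    "2018-08-21 10:27:05,657 [myid:160] - INFO ... "
    "Error:KeeperErrorCode = NoNode for "
    "/flink/cluster_one/leaderlatch/rest_server_lock\n"
    "2018-08-21 10:28:35,032 [myid:160] - INFO ... "
    "Error:KeeperErrorCode = NodeExists for "
    "/flink/cluster_one/checkpoints/a5d03dfb348783950c006fe8d6e73fc5"
    "/061/ada19912-8c78-4f15-b1ef-f0acc5011559\n"
)

matches = KEEPER_ERR.findall(log)
for code, path in matches:
    print(code, path)
```

Note that "Got user-level KeeperException" entries are logged at INFO by the server and are often benign (e.g. {{NoNode}} during leader-latch creation races), so on their own they don't prove anything is wrong.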
[jira] [Commented] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587280#comment-16587280 ]

Thomas Wozniakowski commented on FLINK-10184:
---------------------------------------------

We also performed a Zookeeper upgrade as part of our cluster upgrade (from {{3.5.3-beta}} to {{3.5.4-beta}}). I have just rerun the tests: the bug is reproducible against both versions of Zookeeper, so this does not appear to be the culprit.
[jira] [Updated] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Wozniakowski updated FLINK-10184:
----------------------------------------
    Affects Version/s: 1.6.0
[jira] [Comment Edited] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587119#comment-16587119 ]

Thomas Wozniakowski edited comment on FLINK-10184 at 8/21/18 7:52 AM:
----------------------------------------------------------------------

Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be {{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}} or {{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove}}

was (Author: jamalarm):

Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be {{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}} or {{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove(java.lang.String, org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.RemoveCallback)}}
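Until a fix lands, the manual cleanup described in this ticket (deleting stale jobgraph znodes by hand) can be made less error-prone by first reconciling the two sides of the mismatch: job IDs registered under the Zookeeper {{jobgraphs}} node versus job IDs that actually have blobs in the HA storage directory. The helper below is hypothetical - the sample IDs are taken from the logs in this thread, and in practice the two lists would come from a Zookeeper client (e.g. {{ls}} in {{zkCli.sh}}) and a listing of the HA directory:

```python
# Hypothetical reconciliation check for this failure mode: job graphs that
# exist in ZooKeeper but have no backing blob in HA storage are exactly the
# entries that crash a recovering JobManager, and are safe candidates for
# manual removal.

def stale_job_graphs(zk_job_ids, ha_blob_job_ids):
    """Return job IDs registered in ZooKeeper with no blob in HA storage."""
    return sorted(set(zk_job_ids) - set(ha_blob_job_ids))


# Sample data: the cancelled job from the JobManager log, plus a healthy one.
zk_job_ids = [
    "4e9a5a9d70ca99dbd394c35f8dfeda65",   # cancelled, znode left behind
    "a5d03dfb348783950c006fe8d6e73fc5",   # running, blob present
]
ha_blob_job_ids = ["a5d03dfb348783950c006fe8d6e73fc5"]

stale = stale_job_graphs(zk_job_ids, ha_blob_job_ids)
print(stale)  # ['4e9a5a9d70ca99dbd394c35f8dfeda65']
```

Deleting only the IDs this reports (rather than the whole {{jobgraphs}} node) avoids taking down job graphs that are still healthy.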