[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.
[ https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise updated FLINK-22053:
--------------------------------
    Description: 
If there are fewer splits than the parallelism, the job fails with:

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) ~[akka-actor_2.11-2.5.21.jar:2.5.21]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) ~[akka-actor_2.11-2.5.21.jar:2.5.21]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) ~[scala-library-2.11.12.jar:?]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) ~[akka-actor_2.11-2.5.21.jar:2.5.21]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) ~[scala-library-2.11.12.jar:?]
    ... 12 more
{noformat}

To reproduce:

{noformat}
@Test
public void testRecoveryWithFinishedSplit() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    env.fromSequence(0, 100);
    env.execute();
}
{noformat}
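The failure mode is easy to model outside Flink. Below is a minimal, hypothetical sketch (class and method names are invented, not Flink's actual enumerator code) of chunking the closed range [from, to] across more readers than there are numbers: the trailing chunk comes out with from > to, which is exactly what the precondition rejects.

```java
// Hypothetical sketch, NOT Flink's implementation: naively dividing the
// closed range [from, to] across `parallelism` readers. When there are
// fewer numbers than readers, the last chunks start past `to`, producing
// a split whose 'from' exceeds its 'to'.
public class SplitSketch {
    static long[][] naiveSplits(long from, long to, int parallelism) {
        long count = to - from + 1;                          // closed range: [0, 10] has 11 numbers
        long per = (count + parallelism - 1) / parallelism;  // ceiling division
        long[][] splits = new long[parallelism][2];
        for (int i = 0; i < parallelism; i++) {
            long start = from + i * per;
            long end = Math.min(start + per - 1, to);        // end stops at `to`, start does not
            splits[i] = new long[] {start, end};
        }
        return splits;
    }

    public static void main(String[] args) {
        // 11 numbers spread over 12 readers, mirroring the reproducer.
        long[][] splits = naiveSplits(0, 10, 12);
        for (long[] s : splits) {
            System.out.println(s[0] + " -> " + s[1]);
        }
        // The 12th split comes out as 11 -> 10, i.e. from > to.
    }
}
```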
[jira] [Assigned] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.
[ https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise reassigned FLINK-22053:
-----------------------------------
    Assignee: (was: Arvid Heise)

> NumberSequenceSource causes fatal exception when less splits than parallelism.
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-22053
>                 URL: https://issues.apache.org/jira/browse/FLINK-22053
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Arvid Heise
>            Priority: Major
>
> If there are fewer splits than the parallelism, the job fails with:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>     at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     [...]
>     ... 12 more
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.
[ https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise updated FLINK-22053:
--------------------------------
    Description: 
If there are fewer splits than the parallelism, the job fails with:

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    [...]
    ... 12 more
{noformat}

  was:
If a checkpoint happens after the only split is processed, the split is checkpointed with (from > to). Upon recovery, this split causes an exception in the coordinator and a subsequent fatal exception.
[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.
[ https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise updated FLINK-22053:
--------------------------------
    Description: 
If there are fewer splits than the parallelism, the job fails with:

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    [...]
    ... 12 more
{noformat}

To reproduce:

{noformat}
@Test
public void testRecoveryWithFinishedSplit() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(12);
    env.fromSequence(0, 10);
    env.execute();
}
{noformat}
[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.
[ https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise updated FLINK-22053:
--------------------------------
    Description: 
If there are fewer splits than the parallelism, the job fails with:

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    [...]
    ... 12 more
{noformat}

To reproduce:

{noformat}
@Test
public void testLessSplitsThanParallelism() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(12);
    env.fromSequence(0, 10);
    env.execute();
}
{noformat}
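For illustration, one way a sequence enumerator can avoid producing invalid splits is to cap the split count at the number of elements, so parallelism in excess of the element count simply yields fewer splits. This is a hypothetical sketch only (invented names, not necessarily the fix that was actually merged for this ticket):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mitigation sketch: never create more splits than there are
// elements in the closed range [from, to], so every split satisfies from <= to.
public class SafeSplitter {
    static List<long[]> splits(long from, long to, int parallelism) {
        long count = to - from + 1;
        int numSplits = (int) Math.min(parallelism, count); // cap at element count
        List<long[]> result = new ArrayList<>();
        long start = from;
        for (int i = 0; i < numSplits; i++) {
            // Spread the remainder so chunk sizes differ by at most one.
            long size = count / numSplits + (i < count % numSplits ? 1 : 0);
            result.add(new long[] {start, start + size - 1});
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // 11 numbers, 12 requested readers: only 11 splits, all valid.
        for (long[] s : splits(0, 10, 12)) {
            System.out.println(s[0] + " -> " + s[1]);
        }
    }
}
```

With this scheme, subtasks beyond the split count would simply receive no split and finish immediately instead of crashing the coordinator.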
[jira] [Resolved] (FLINK-21945) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise resolved FLINK-21945.
---------------------------------
    Fix Version/s: 1.13.0
       Resolution: Fixed

Merged into master as 89fde844d05125c49a9bbb9d0676cd1babb3b222..af5719993dd8d5164e03171810bbd709523d0927.

> Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-21945
>                 URL: https://issues.apache.org/jira/browse/FLINK-21945
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.0
>            Reporter: Arvid Heise
>            Assignee: Arvid Heise
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
[jira] [Created] (FLINK-22053) Recovery of a completed split in NumberSequenceSource causes fatal exception.
Arvid Heise created FLINK-22053:
-----------------------------------

             Summary: Recovery of a completed split in NumberSequenceSource causes fatal exception.
                 Key: FLINK-22053
                 URL: https://issues.apache.org/jira/browse/FLINK-22053
             Project: Flink
          Issue Type: Bug
          Components: API / Core
    Affects Versions: 1.12.2, 1.13.0
            Reporter: Arvid Heise

If a checkpoint happens after the only split is processed, the split is checkpointed with (from > to). Upon recovery, this split causes an exception in the coordinator and a subsequent fatal exception.

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
    [...]
    ... 12 more
{noformat}
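The recovery scenario described above can be sketched in isolation. The following is a hypothetical model (checkpoint/restoreSplit are invented names, not Flink's API): once the reader has emitted the last number of its split, the remaining range recorded in a checkpoint is empty with from == to + 1, and rebuilding a split from that state hits the same precondition shown in the stack trace.

```java
// Hypothetical model of the recovery failure: a checkpoint of a fully
// consumed split records an empty remaining range, and restoring it
// trips the 'from' <= 'to' precondition.
public class FinishedSplitSketch {
    // Remaining work of a split [from, to] is [nextToEmit, to]; after the
    // last element has been emitted, nextToEmit == to + 1.
    static long[] checkpoint(long from, long to, long nextToEmit) {
        return new long[] {nextToEmit, to};
    }

    // Stand-in for rebuilding a split on recovery, with the same
    // precondition the real code enforces.
    static void restoreSplit(long from, long to) {
        if (from > to) {
            throw new IllegalArgumentException("'from' must be <= 'to'");
        }
    }

    public static void main(String[] args) {
        long[] state = checkpoint(0, 100, 101);  // split [0, 100] fully processed
        try {
            restoreSplit(state[0], state[1]);    // fails: 101 > 100
        } catch (IllegalArgumentException e) {
            System.out.println("recovery failed: " + e.getMessage());
        }
    }
}
```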
[jira] [Commented] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.
[ https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312892#comment-17312892 ]

Arvid Heise commented on FLINK-22053:
-------------------------------------
Merged into master as a93251b4599e7c7d77ec6f0796825a93224eb010, merged into 1.12 as 945684114092f590e6cb90b78ce7fe4ccc7ada6c.

> NumberSequenceSource causes fatal exception when less splits than parallelism.
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-22053
>                 URL: https://issues.apache.org/jira/browse/FLINK-22053
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Arvid Heise
>            Assignee: Arvid Heise
>            Priority: Major
>              Labels: pull-request-available
>
> If there are fewer splits than the parallelism, the job fails with:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>     at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     [...]
>     ... 12 more
> {noformat}
> To reproduce:
> {noformat}
> @Test
> public void testLessSplitsThanParallelism() throws Exception {
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     env.setParallelism(12);
>     env.fromSequence(0, 10);
>     env.execute();
> }
> {noformat}
[jira] [Resolved] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.
[ https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-22053. - Fix Version/s: 1.12.3 1.13.0 Resolution: Fixed > NumberSequenceSource causes fatal exception when less splits than parallelism. > -- > > Key: FLINK-22053 > URL: https://issues.apache.org/jira/browse/FLINK-22053 > Project: Flink > Issue Type: Bug > Components: API / Core >Affects Versions: 1.12.2, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0, 1.12.3 > > > If more splits than > {noformat} > Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to' > at > org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) > ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.(NumberSequenceSource.java:148) > ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) > ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > 
org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180) > ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > ~[akka-actor_2.11-2.5.21.jar:2.5.21] > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > ~[akka-actor_2.11-2.5.21.jar:2.5.21] > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > ~[scala-library-2.11.12.jar:?] > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > ~[akka-actor_2.11-2.5.21.jar:2.5.21] > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) > ~[scala-library-2.11.12.jar:?] > ... 
12 more > {noformat} > To reproduce > {noformat} > @Test > public void testLessSplitsThanParallelism() throws Exception { > StreamExecutionEnvironment env = > StreamExecutionEnvironment.getExecutionEnvironment(); > env.setParallelism(12); > env.fromSequence(0, 10); > env.execute(); > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
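The arithmetic behind the failure can be illustrated without Flink at all. The following is a hypothetical sketch (class and method names invented, not taken from NumberSequenceSource): naively dividing the 11-element range [0, 10] into 12 even splits rounds the per-split size down to zero, producing an empty split whose start exceeds its end — exactly the `'from' must be <= 'to'` precondition the stack trace reports.

```java
// Hypothetical sketch, NOT Flink's actual implementation.
public class SplitMath {

    /** Naive even division; may produce start > end when numSplits > number of elements. */
    static long[][] naiveSplits(long from, long to, int numSplits) {
        long elements = to - from + 1;
        long[][] splits = new long[numSplits][2];
        for (int i = 0; i < numSplits; i++) {
            splits[i][0] = from + i * elements / numSplits;
            // Integer division rounds to 0 per split; the end can undershoot the start.
            splits[i][1] = from + (i + 1) * elements / numSplits - 1;
        }
        return splits;
    }

    /** The obvious guard: never create more splits than there are elements. */
    static long[][] cappedSplits(long from, long to, int numSplits) {
        long elements = to - from + 1;
        return naiveSplits(from, to, (int) Math.min(numSplits, elements));
    }
}
```

With `naiveSplits(0, 10, 12)` the very first split is already `[0, -1]`, tripping the precondition; capping the split count at the element count sidesteps it.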
[jira] [Assigned] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
[ https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise reassigned FLINK-22081: --- Assignee: Chen Qin (was: Prem Santosh) > Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin > --- > > Key: FLINK-22081 > URL: https://issues.apache.org/jira/browse/FLINK-22081 > Project: Flink > Issue Type: Bug > Components: FileSystems >Reporter: Chen Qin >Assignee: Chen Qin >Priority: Minor > Labels: pull-request-available > Fix For: 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.11.0, 1.11.1, 1.11.2, > 1.11.3, 1.11.4, 1.12.0, 1.12.1, 1.12.2, 1.13.0, 1.12.3 > > Attachments: image (13).png > > > Using Flink 1.11.2, I added the flink-s3-fs-hadoop jar in the plugins dir, but I am seeing > checkpoint paths like > {{s3://my_app/__ENTROPY__/app_name-staging/flink/checkpoints/e10f47968ae74934bd833108d2272419/chk-3071}} > which means the entropy injection key is not being resolved. After some > debugging I found that in the > [EntropyInjector|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/EntropyInjector.java#L97] > we check whether the given file system is of type {{SafetyNetWrapperFileSystem}} > and, if so, unwrap its delegate, but we don't check for > {{[ClassLoaderFixingFileSystem|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/PluginFileSystemFactory.java#L65]}} > directly in the getEntropyFs method, which is the file system's type when the S3 file system > dependencies are added as a plugin. > > Repro steps: > Flink 1.11.2 with flink-s3-fs-hadoop as a plugin and entropy injection > key _entropy_ turned on; > observe the checkpoint dir with the entropy marker not removed. 
> s3a://xxx/dev/checkpoints/_entropy_/xenon/event-stream-splitter/jobid/chk-5/ > compare to the marker being removed when running Flink 1.9.1 > s3a://xxx/dev/checkpoints/xenon/event-stream-splitter/jobid/chk-5/ > Add some logging to getEntropyFs and observe that it returns null because the passed-in > parameter is not {{SafetyNetWrapperFileSystem}} but > {{ClassLoaderFixingFileSystem}} > Apply the patch, build a release and run the same job; the issue is resolved, as the attachment > shows > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
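The unwrapping logic at fault can be sketched with stand-in types. The wrapper names below mirror the Flink classes mentioned above, but these minimal fakes are invented here; the real EntropyInjector operates on Flink's FileSystem hierarchy:

```java
// Hedged sketch of the bug: when the S3 filesystem is loaded as a plugin,
// the outermost wrapper is the class-loader-fixing one, which the pre-fix
// lookup never unwraps, so the entropy-aware filesystem is never found.
public class EntropySketch {
    interface Fs {}
    static class EntropyAwareFs implements Fs {}
    static class SafetyNetWrapperFs implements Fs {
        final Fs delegate;
        SafetyNetWrapperFs(Fs d) { delegate = d; }
    }
    static class ClassLoaderFixingFs implements Fs {
        final Fs delegate;
        ClassLoaderFixingFs(Fs d) { delegate = d; }
    }

    /** Pre-fix behavior: only looks through the safety-net wrapper. */
    static EntropyAwareFs before(Fs fs) {
        if (fs instanceof SafetyNetWrapperFs) {
            fs = ((SafetyNetWrapperFs) fs).delegate;
        }
        return fs instanceof EntropyAwareFs ? (EntropyAwareFs) fs : null;
    }

    /** Post-fix behavior: also looks through the plugin class-loader wrapper. */
    static EntropyAwareFs after(Fs fs) {
        if (fs instanceof SafetyNetWrapperFs) {
            fs = ((SafetyNetWrapperFs) fs).delegate;
        }
        if (fs instanceof ClassLoaderFixingFs) {
            fs = ((ClassLoaderFixingFs) fs).delegate;
        }
        return fs instanceof EntropyAwareFs ? (EntropyAwareFs) fs : null;
    }
}
```

With a plugin-wrapped filesystem, `before` returns null (entropy key stays unresolved), while `after` finds the entropy-aware delegate.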
[jira] [Commented] (FLINK-13550) Support for CPU FlameGraphs in new web UI
[ https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313104#comment-17313104 ] Arvid Heise commented on FLINK-13550: - Merged into master for 1.13 as e9385051cd2ac7110f02b361ac503d9153441f9f..12a99e8fe84c28fb250028b4fde4025ec9dc00c9. Thanks [~dmvk] and [~afedulov] for your contributions! > Support for CPU FlameGraphs in new web UI > - > > Key: FLINK-13550 > URL: https://issues.apache.org/jira/browse/FLINK-13550 > Project: Flink > Issue Type: New Feature > Components: Runtime / REST, Runtime / Web Frontend >Reporter: David Morávek >Assignee: Alexander Fedulov >Priority: Major > Labels: pull-request-available > > For better insight into a running job, it would be useful to have the ability > to render a CPU flame graph for a particular job vertex. > Flink already has a stack-trace sampling mechanism in place, so it should be > straightforward to implement. > This should be done by implementing a new endpoint in the REST API, which would > sample the stack trace the same way as the current BackPressureTracker does, only > with a different sampling rate and length of sampling. > [Here|https://www.youtube.com/watch?v=GUNDehj9z9o] is a little demo of the > feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
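The sampling approach described in the ticket can be sketched in plain Java. This is an illustrative stand-in, not the actual REST handler; it captures a thread's stack repeatedly and counts identical frame paths, which is the folded format flame graphs render as frame widths:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: repeatedly snapshot a thread's stack and tally identical
// root-first "a;b;c;" paths. A flame-graph renderer turns these counts
// directly into frame widths.
public class FlameSample {
    static Map<String, Integer> sample(Thread thread, int samples, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            StringBuilder path = new StringBuilder();
            StackTraceElement[] frames = thread.getStackTrace();
            // Build the path root-first, the usual folded-stack convention.
            for (int f = frames.length - 1; f >= 0; f--) {
                path.append(frames[f].getClassName())
                    .append('.')
                    .append(frames[f].getMethodName())
                    .append(';');
            }
            counts.merge(path.toString(), 1, Integer::sum);
            try {
                Thread.sleep(intervalMillis); // sampling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return counts;
    }
}
```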
[jira] [Closed] (FLINK-13552) Render vertex FlameGraph in web UI
[ https://issues.apache.org/jira/browse/FLINK-13552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise closed FLINK-13552. --- Release Note: Directly resolved in parent ticket. Resolution: Done > Render vertex FlameGraph in web UI > -- > > Key: FLINK-13552 > URL: https://issues.apache.org/jira/browse/FLINK-13552 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Web Frontend >Reporter: David Morávek >Assignee: David Morávek >Priority: Major > > Add a new FlameGraph tab in "vertex detail" page, that will actively poll > flame graph endpoint and render it using d3 library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-13551) Add vertex FlameGraph REST endpoint
[ https://issues.apache.org/jira/browse/FLINK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise closed FLINK-13551. --- Release Note: Directly resolved in parent ticket. Resolution: Fixed > Add vertex FlameGraph REST endpoint > --- > > Key: FLINK-13551 > URL: https://issues.apache.org/jira/browse/FLINK-13551 > Project: Flink > Issue Type: Sub-task > Components: Runtime / REST >Reporter: David Morávek >Priority: Major > > Add a new endpoint that returns data for flame graph rendering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-13550) Support for CPU FlameGraphs in new web UI
[ https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-13550. - Fix Version/s: 1.13.0 Release Note: Flink now offers flame graphs for each node in the job graph. Please enable this experimental feature by setting the respective configuration flag rest.flamegraph.enabled. Resolution: Fixed > Support for CPU FlameGraphs in new web UI > - > > Key: FLINK-13550 > URL: https://issues.apache.org/jira/browse/FLINK-13550 > Project: Flink > Issue Type: New Feature > Components: Runtime / REST, Runtime / Web Frontend >Reporter: David Morávek >Assignee: Alexander Fedulov >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > > For better insight into a running job, it would be useful to have the ability > to render a CPU flame graph for a particular job vertex. > Flink already has a stack-trace sampling mechanism in place, so it should be > straightforward to implement. > This should be done by implementing a new endpoint in the REST API, which would > sample the stack trace the same way as the current BackPressureTracker does, only > with a different sampling rate and length of sampling. > [Here|https://www.youtube.com/watch?v=GUNDehj9z9o] is a little demo of the > feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307383#comment-17307383 ] Arvid Heise commented on FLINK-21936: - As a first step, we might want to provide users an explicit way to express the guarantees that they expect of pointwise connections. Only if the user wants to retain ordering do we have to disable UC for that exchange. I'm assuming that the vast majority of pointwise connections do not require the guarantees. > Disable checkpointing of inflight data in pointwise connections for unaligned > checkpoints > - > > Key: FLINK-21936 > URL: https://issues.apache.org/jira/browse/FLINK-21936 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > > We currently do not have any hard guarantees on pointwise connections > regarding data consistency. However, since data was structured implicitly in > the same way as any preceding source or keyby, some users relied on this > behavior to divide compute-intensive tasks into smaller chunks while relying > on ordering guarantees. > As long as the parallelism does not change, unaligned checkpoints (UC) > retain these properties. With the implementation of rescaling of UC > (FLINK-19801), that has changed. For most exchanges, there is a meaningful > way to reassign state from one channel to another (even in random order). For > some exchanges, the mapping is ambiguous and requires post-filtering. > However, for point-wise connections, it's impossible while retaining these > properties. > Consider {{source -> keyby -> task1 -> forward -> task2}}. Now if we want to > rescale from parallelism p = 1 to p = 2, suddenly the records inside the > keyby channels need to be divided into two channels according to the > keygroups. 
That is easily possible by using the keygroup ranges of the > operators and a way to determine the key(group) of the record (independent of > the actual approach). For the forward channel, we completely lack the key > context. No record in the forward channel has any keygroup assigned; it's > also not possible to calculate it as there is no guarantee that the key is > still present. > The root cause for this limitation is the conceptual mismatch between what we > provide and what some users assume we provide (or we assume that the users > assume). For example, it's impossible to use (keyed) state in task2 right > now, because there is no key context, but we still want to guarantee > ordering with respect to that key context. > For 1.13, the easiest solution is to disable channel state in pointwise > connections. For any non-trivial application with at least one shuffle, the > number of pointwise channels (linear in p) is quickly dwarfed by all-to-all > connections (quadratic in p). I'll add some alternative ideas to the > discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
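The key-group arithmetic referred to above can be sketched as follows. Flink derives the key group from a murmur hash of the key; plain `hashCode` stands in here, so this is an approximation of the scheme, not production code. The point: a record in a keyed channel can be re-routed after rescaling because its key group is computable, while a record in a forward channel carries no key to compute it from.

```java
// Hedged sketch of key-group assignment (hashCode stands in for murmur).
public class KeyGroupMath {
    /** Which key group a key falls into, out of maxParallelism groups. */
    static int keyGroup(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Which subtask owns a key group at the given parallelism. */
    static int operatorIndex(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }
}
```

Rescaling from p = 1 to p = 2 just re-evaluates `operatorIndex` per key group; without a key group, no such mapping exists.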
[jira] [Created] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints
Arvid Heise created FLINK-21936: --- Summary: Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints Key: FLINK-21936 URL: https://issues.apache.org/jira/browse/FLINK-21936 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.13.0 Reporter: Arvid Heise Assignee: Arvid Heise We currently do not have any hard guarantees on pointwise connections regarding data consistency. However, since data was structured implicitly in the same way as any preceding source or keyby, some users relied on this behavior to divide compute-intensive tasks into smaller chunks while relying on ordering guarantees. As long as the parallelism does not change, unaligned checkpoints (UC) retain these properties. With the implementation of rescaling of UC (FLINK-19801), that has changed. For most exchanges, there is a meaningful way to reassign state from one channel to another (even in random order). For some exchanges, the mapping is ambiguous and requires post-filtering. However, for point-wise connections, it's impossible while retaining these properties. Consider {{source -> keyby -> task1 -> forward -> task2}}. Now if we want to rescale from parallelism p = 1 to p = 2, suddenly the records inside the keyby channels need to be divided into two channels according to the keygroups. That is easily possible by using the keygroup ranges of the operators and a way to determine the key(group) of the record (independent of the actual approach). For the forward channel, we completely lack the key context. No record in the forward channel has any keygroup assigned; it's also not possible to calculate it as there is no guarantee that the key is still present. The root cause for this limitation is the conceptual mismatch between what we provide and what some users assume we provide (or we assume that the users assume). 
For example, it's impossible to use (keyed) state in task2 right now, because there is no key context, but we still want to guarantee ordering with respect to that key context. For 1.13, the easiest solution is to disable channel state in pointwise connections. For any non-trivial application with at least one shuffle, the number of pointwise channels (linear in p) is quickly dwarfed by all-to-all connections (quadratic in p). I'll add some alternative ideas to the discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307386#comment-17307386 ] Arvid Heise commented on FLINK-21936: - Alternatively or additionally, we might want to add the keyed context to forward channels. Then users could also use state in these tasks. For that, we would need to encode the keygroup in the data stream. Note that we would also need to find a way to encode splits at least for UC recovery (maybe the source coordinator can assign unique numbers to splits?). > Disable checkpointing of inflight data in pointwise connections for unaligned > checkpoints > - > > Key: FLINK-21936 > URL: https://issues.apache.org/jira/browse/FLINK-21936 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > > We currently do not have any hard guarantees on pointwise connections > regarding data consistency. However, since data was structured implicitly in > the same way as any preceding source or keyby, some users relied on this > behavior to divide compute-intensive tasks into smaller chunks while relying > on ordering guarantees. > As long as the parallelism does not change, unaligned checkpoints (UC) > retain these properties. With the implementation of rescaling of UC > (FLINK-19801), that has changed. For most exchanges, there is a meaningful > way to reassign state from one channel to another (even in random order). For > some exchanges, the mapping is ambiguous and requires post-filtering. > However, for point-wise connections, it's impossible while retaining these > properties. > Consider {{source -> keyby -> task1 -> forward -> task2}}. Now if we want to > rescale from parallelism p = 1 to p = 2, suddenly the records inside the > keyby channels need to be divided into two channels according to the > keygroups. 
That is easily possible by using the keygroup ranges of the > operators and a way to determine the key(group) of the record (independent of > the actual approach). For the forward channel, we completely lack the key > context. No record in the forward channel has any keygroup assigned; it's > also not possible to calculate it as there is no guarantee that the key is > still present. > The root cause for this limitation is the conceptual mismatch between what we > provide and what some users assume we provide (or we assume that the users > assume). For example, it's impossible to use (keyed) state in task2 right > now, because there is no key context, but we still want to guarantee > ordering with respect to that key context. > For 1.13, the easiest solution is to disable channel state in pointwise > connections. For any non-trivial application with at least one shuffle, the > number of pointwise channels (linear in p) is quickly dwarfed by all-to-all > connections (quadratic in p). I'll add some alternative ideas to the > discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints
[ https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307388#comment-17307388 ] Arvid Heise commented on FLINK-21936: - A completely different approach would be possible with dynamic rescaling (epoch-based). We would drain the recovered data (with old parallelism) and then rewire from source to sink. However, that feels like Flink 3.0. > Disable checkpointing of inflight data in pointwise connections for unaligned > checkpoints > - > > Key: FLINK-21936 > URL: https://issues.apache.org/jira/browse/FLINK-21936 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > > We currently do not have any hard guarantees on pointwise connections > regarding data consistency. However, since data was structured implicitly in > the same way as any preceding source or keyby, some users relied on this > behavior to divide compute-intensive tasks into smaller chunks while relying > on ordering guarantees. > As long as the parallelism does not change, unaligned checkpoints (UC) > retain these properties. With the implementation of rescaling of UC > (FLINK-19801), that has changed. For most exchanges, there is a meaningful > way to reassign state from one channel to another (even in random order). For > some exchanges, the mapping is ambiguous and requires post-filtering. > However, for point-wise connections, it's impossible while retaining these > properties. > Consider {{source -> keyby -> task1 -> forward -> task2}}. Now if we want to > rescale from parallelism p = 1 to p = 2, suddenly the records inside the > keyby channels need to be divided into two channels according to the > keygroups. That is easily possible by using the keygroup ranges of the > operators and a way to determine the key(group) of the record (independent of > the actual approach). 
For the forward channel, we completely lack the key > context. No record in the forward channel has any keygroup assigned; it's > also not possible to calculate it as there is no guarantee that the key is > still present. > The root cause for this limitation is the conceptual mismatch between what we > provide and what some users assume we provide (or we assume that the users > assume). For example, it's impossible to use (keyed) state in task2 right > now, because there is no key context, but we still want to guarantee > ordering with respect to that key context. > For 1.13, the easiest solution is to disable channel state in pointwise > connections. For any non-trivial application with at least one shuffle, the > number of pointwise channels (linear in p) is quickly dwarfed by all-to-all > connections (quadratic in p). I'll add some alternative ideas to the > discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21992) Investigate potential buffer leak in unaligned checkpoint
Arvid Heise created FLINK-21992: --- Summary: Investigate potential buffer leak in unaligned checkpoint Key: FLINK-21992 URL: https://issues.apache.org/jira/browse/FLINK-21992 Project: Flink Issue Type: Bug Affects Versions: 1.13.0 Reporter: Arvid Heise A user on the mailing list reported that his job gets stuck with unaligned checkpoint enabled. http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html We received two similar reports in the past, but the users didn't follow up, so it was not as easy to diagnose as this time, where the initial report already contains many relevant data points. Besides a buffer leak, there could also be an issue with priority notification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"
[ https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299809#comment-17299809 ] Arvid Heise commented on FLINK-21535: - A quick assessment of the 3 cases: the tests are just running too long, and some test implementations that track all records are running OOM. The root cause, however, is that the tests take >10 min when they should finish in <<1 min. I'll investigate further. > UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap > space" > - > > Key: FLINK-21535 > URL: https://issues.apache.org/jira/browse/FLINK-21535 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56 > {code} > 2021-02-27T02:11:41.5659201Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-02-27T02:11:41.5659947Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-02-27T02:11:41.5660794Z at > org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137) > 2021-02-27T02:11:41.5661618Z at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) > 2021-02-27T02:11:41.5662356Z at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) > 2021-02-27T02:11:41.5663104Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2021-02-27T02:11:41.5664016Z at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > 2021-02-27T02:11:41.5664817Z at > org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237) > 2021-02-27T02:11:41.5665638Z at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > 2021-02-27T02:11:41.5666405Z at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > 2021-02-27T02:11:41.5667609Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2021-02-27T02:11:41.5668358Z at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > 2021-02-27T02:11:41.5669218Z at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066) > 2021-02-27T02:11:41.5669928Z at > akka.dispatch.OnComplete.internal(Future.scala:264) > 2021-02-27T02:11:41.5670540Z at > akka.dispatch.OnComplete.internal(Future.scala:261) > 2021-02-27T02:11:41.5671268Z at > akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) > 2021-02-27T02:11:41.5671881Z at > akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) > 2021-02-27T02:11:41.5672512Z at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > 2021-02-27T02:11:41.5673219Z at > 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) > 2021-02-27T02:11:41.5674085Z at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > 2021-02-27T02:11:41.5674794Z at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > 2021-02-27T02:11:41.5675466Z at > akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572) > 2021-02-27T02:11:41.5676181Z at > akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22) > 2021-02-27T02:11:41.5676977Z at > akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21) > 2021-02-27T02:11:41.5677717Z at > scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436) > 2021-02-27T02:11:41.5678409Z at > scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435) > 2021-02-27T02:11:41.5679071Z at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > 2021-02-27T02:11:41.5679776Z at > akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) > 2021-02-27T02:11:41.5680576Z at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91) > 2021-02-27T02:11:41.5681383Z at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > 2021-02-27T02:11:41.5682167Z at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > 2021-02-27T02:11:41.5683040Z at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > 2021-02-27T02:11:41.5683759Z at >
[jira] [Resolved] (FLINK-19801) Add support for virtual channels
[ https://issues.apache.org/jira/browse/FLINK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-19801. - Resolution: Fixed > Add support for virtual channels > > > Key: FLINK-19801 > URL: https://issues.apache.org/jira/browse/FLINK-19801 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing >Affects Versions: 1.12.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > > During rescaling of unaligned checkpoints, if state from multiple former > channels is read on the input or output side to recover a specific channel, then > these buffers are multiplexed on the output side and demultiplexed on the input side > to guarantee a consistent recovery of spanning records: > Assume two channels C1, C2 connect operators A and B and both have one buffer > in the output and in the input part of the channel respectively, where a > record spans. Assume that the buffers are named O1 for the output buffer of C1 > and I2 for the input buffer of C2, etc. Then after rescaling both channels become > one channel C. Then, the buffers may be restored as I1, I2, O1, O2. > Channels use the mapping of FLINK-19533 to infer the need for virtual > channels and distribute the needed resources. Virtual channels are removed on > the EndOfChannelRecovery epoch marker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
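The multiplexing idea can be illustrated with a toy demultiplexer (all names invented, not Flink's API): buffers recovered from the former channels C1 and C2 are tagged with their origin, so the receiving side can restore each origin's buffers in their original order and reassemble spanning records consistently.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged illustration of virtual-channel demultiplexing.
public class VirtualChannelDemux {
    static class Tagged {
        final int origin;    // former channel the buffer belonged to
        final String buffer; // payload stand-in
        Tagged(int origin, String buffer) { this.origin = origin; this.buffer = buffer; }
    }

    /** Recover the buffers of one former channel, preserving their order. */
    static List<String> demux(List<Tagged> merged, int origin) {
        List<String> out = new ArrayList<>();
        for (Tagged t : merged) {
            if (t.origin == origin) {
                out.add(t.buffer);
            }
        }
        return out;
    }
}
```

Even if the merged channel restores the buffers as I1, I2, O1, O2, demultiplexing by origin yields C1's buffers (I1, O1) and C2's buffers (I2, O2) in order.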
[jira] [Commented] (FLINK-19801) Add support for virtual channels
[ https://issues.apache.org/jira/browse/FLINK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299805#comment-17299805 ] Arvid Heise commented on FLINK-19801: - Merged as 8bfc905bf5e9a7e523f0f083c948cbb32ac260fd..b4c57c056ecc54bb1a8d04e6d4222639036dccfa into master. > Add support for virtual channels > > > Key: FLINK-19801 > URL: https://issues.apache.org/jira/browse/FLINK-19801 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing >Affects Versions: 1.12.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > > During rescaling of unaligned checkpoints, if state from multiple former > channels is read on the input or output side to recover a specific channel, then > these buffers are multiplexed on the output side and demultiplexed on the input side > to guarantee a consistent recovery of spanning records: > Assume two channels C1, C2 connect operators A and B and both have one buffer > in the output and in the input part of the channel respectively, where a > record spans. Assume that the buffers are named O1 for the output buffer of C1 > and I2 for the input buffer of C2, etc. Then after rescaling both channels become > one channel C. Then, the buffers may be restored as I1, I2, O1, O2. > Channels use the mapping of FLINK-19533 to infer the need for virtual > channels and distribute the needed resources. Virtual channels are removed on > the EndOfChannelRecovery epoch marker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"
[ https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299809#comment-17299809 ] Arvid Heise edited comment on FLINK-21535 at 3/11/21, 6:54 PM: --- A quick assessment of the 3 cases: the tests are just running too long, and some test implementations that track all records are running OOM. The root cause, however, is that the tests take >10 min when they should finish in <<1 min. I'll investigate further. Quite possibly, FLINK-21689 is a duplicate. was (Author: aheise): A quick assessment of the 3 cases: the test is just running too long and some test implementations that track all records are running OOM. The root cause however is rather that the test take >10 min when they should finish <<1min. I'll investigate further. > UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap > space" > - > > Key: FLINK-21535 > URL: https://issues.apache.org/jira/browse/FLINK-21535 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56 > {code} > 2021-02-27T02:11:41.5659201Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-02-27T02:11:41.5659947Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-02-27T02:11:41.5660794Z at > org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137) > 2021-02-27T02:11:41.5661618Z at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) > 2021-02-27T02:11:41.5662356Z at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) > 2021-02-27T02:11:41.5663104Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2021-02-27T02:11:41.5664016Z at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > 2021-02-27T02:11:41.5664817Z at > org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237) > 2021-02-27T02:11:41.5665638Z at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > 2021-02-27T02:11:41.5666405Z at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > 2021-02-27T02:11:41.5667609Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2021-02-27T02:11:41.5668358Z at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > 2021-02-27T02:11:41.5669218Z at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066) > 2021-02-27T02:11:41.5669928Z at > akka.dispatch.OnComplete.internal(Future.scala:264) > 2021-02-27T02:11:41.5670540Z at > akka.dispatch.OnComplete.internal(Future.scala:261) > 2021-02-27T02:11:41.5671268Z at > akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) > 2021-02-27T02:11:41.5671881Z at > akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) > 2021-02-27T02:11:41.5672512Z at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > 2021-02-27T02:11:41.5673219Z at > 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) > 2021-02-27T02:11:41.5674085Z at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > 2021-02-27T02:11:41.5674794Z at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > 2021-02-27T02:11:41.5675466Z at > akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572) > 2021-02-27T02:11:41.5676181Z at > akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22) > 2021-02-27T02:11:41.5676977Z at > akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21) > 2021-02-27T02:11:41.5677717Z at > scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436) > 2021-02-27T02:11:41.5678409Z at > scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435) > 2021-02-27T02:11:41.5679071Z at > scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > 2021-02-27T02:11:41.5679776Z at > akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) > 2021-02-27T02:11:41.5680576Z at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91) > 2021-02-27T02:11:41.5681383Z at >
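The comment above attributes the OOM to test implementations that track every record. As a hedged illustration only (not the fix that was actually merged for FLINK-21535), a test harness can verify a long-running stream in O(1) memory by keeping aggregates instead of retaining the records themselves; the class and method names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: verify a stream without retaining every record in memory. */
public class BoundedRecordTracker {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong checksum = new AtomicLong();

    /** Records one element in O(1) memory instead of appending to an ever-growing list. */
    public void add(long value) {
        count.incrementAndGet();
        checksum.addAndGet(value); // order-insensitive aggregate, survives reordering
    }

    public long count() {
        return count.get();
    }

    public long checksum() {
        return checksum.get();
    }

    public static void main(String[] args) {
        BoundedRecordTracker t = new BoundedRecordTracker();
        for (long i = 0; i < 1_000_000; i++) {
            t.add(i);
        }
        System.out.println(t.count() + " records, checksum " + t.checksum());
    }
}
```

The trade-off is that an aggregate can only detect corruption or loss, not report which record was affected, which is usually acceptable for a stability test.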
[jira] [Created] (FLINK-21797) Performance regression on 03/11/21
Arvid Heise created FLINK-21797: --- Summary: Performance regression on 03/11/21 Key: FLINK-21797 URL: https://issues.apache.org/jira/browse/FLINK-21797 Project: Flink Issue Type: Improvement Components: Runtime / Network Affects Versions: 1.13.0 Reporter: Arvid Heise Assignee: Arvid Heise http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3,5=tupleKeyBy=2=200=off=on=on -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20130) Add ZStandard format to inputs
[ https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304783#comment-17304783 ] Arvid Heise commented on FLINK-20130: - Remerged as a407322a452b8a371d0ce25e8f5f8418556371ef into master. > Add ZStandard format to inputs > -- > > Key: FLINK-20130 > URL: https://issues.apache.org/jira/browse/FLINK-20130 > Project: Flink > Issue Type: Improvement > Components: API / Core >Reporter: João Boto >Assignee: João Boto >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > > Allow Flink to read files compressed in ZStandard (.zst) -- This message was sent by Atlassian Jira (v8.3.4#803005)
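FLINK-20130's description is one line: let Flink read files compressed in ZStandard (.zst). Decompression support of this kind is typically selected by file extension. The following is a simplified, hypothetical sketch of extension-based codec dispatch; the names and structure are illustrative and do not reflect Flink's actual API:

```java
import java.util.Map;

/** Hypothetical sketch of extension-based decompression dispatch. */
public class CodecByExtension {
    private static final Map<String, String> CODECS = Map.of(
            ".gz", "gzip",
            ".xz", "xz",
            ".zst", "zstd"); // the extension FLINK-20130 adds support for

    /** Returns the codec name for a path, or "none" for uncompressed files. */
    public static String codecFor(String path) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(codecFor("events.zst")); // prints "zstd"
    }
}
```

With dispatch like this in place, adding a new format is a one-entry change plus the codec implementation itself.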
[jira] [Resolved] (FLINK-20130) Add ZStandard format to inputs
[ https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-20130. - Resolution: Fixed > Add ZStandard format to inputs > -- > > Key: FLINK-20130 > URL: https://issues.apache.org/jira/browse/FLINK-20130 > Project: Flink > Issue Type: Improvement > Components: API / Core >Reporter: João Boto >Assignee: João Boto >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > > Allow Flink to read files compressed in ZStandard (.zst) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-21689) UnalignedCheckpointITCase does not terminate
[ https://issues.apache.org/jira/browse/FLINK-21689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise closed FLINK-21689. --- Resolution: Duplicate > UnalignedCheckpointITCase does not terminate > > > Key: FLINK-21689 > URL: https://issues.apache.org/jira/browse/FLINK-21689 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing, Tests >Affects Versions: 1.13.0 >Reporter: Chesnay Schepler >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > > So far we assumed that the UC tests fail because of FLINK-21400, but even > with that in place they still do not pass. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-21540) finegrained_resource_management tests hang on azure
[ https://issues.apache.org/jira/browse/FLINK-21540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise closed FLINK-21540. --- Resolution: Duplicate > finegrained_resource_management tests hang on azure > > > Key: FLINK-21540 > URL: https://issues.apache.org/jira/browse/FLINK-21540 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13905=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"
[ https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300147#comment-17300147 ] Arvid Heise commented on FLINK-21535: - I verified that FLINK-21540, FLINK-21599, and FLINK-21689 are duplicates and closed them as such. It's also "only" a test issue. Fix for test coming today. > UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap > space" > - > > Key: FLINK-21535 > URL: https://issues.apache.org/jira/browse/FLINK-21535 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"
[ https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301851#comment-17301851 ] Arvid Heise commented on FLINK-21535: - Merged as c177d15323d5025f0cf737b98bb051efbc08a149 into master. > UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap > space" > - > > Key: FLINK-21535 > URL: https://issues.apache.org/jira/browse/FLINK-21535 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
[jira] [Commented] (FLINK-20130) Add ZStandard format to inputs
[ https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301877#comment-17301877 ] Arvid Heise commented on FLINK-20130: - Merged into master as 889b3845217141b295eb2b60c3dd8a2c245b429a. > Add ZStandard format to inputs > -- > > Key: FLINK-20130 > URL: https://issues.apache.org/jira/browse/FLINK-20130 > Project: Flink > Issue Type: Improvement > Components: API / Core >Reporter: João Boto >Assignee: João Boto >Priority: Major > Labels: pull-request-available > > Allow Flink to read files compressed in ZStandard (.zst) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-20130) Add ZStandard format to inputs
[ https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise reassigned FLINK-20130: --- Assignee: João Boto > Add ZStandard format to inputs > -- > > Key: FLINK-20130 > URL: https://issues.apache.org/jira/browse/FLINK-20130 > Project: Flink > Issue Type: Improvement > Components: API / Core >Reporter: João Boto >Assignee: João Boto >Priority: Major > Labels: pull-request-available > > Allow Flink to read files compressed in ZStandard (.zst) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-20130) Add ZStandard format to inputs
[ https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-20130. - Fix Version/s: 1.13.0 Resolution: Fixed > Add ZStandard format to inputs > -- > > Key: FLINK-20130 > URL: https://issues.apache.org/jira/browse/FLINK-20130 > Project: Flink > Issue Type: Improvement > Components: API / Core >Reporter: João Boto >Assignee: João Boto >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > > Allow Flink to read files compressed in ZStandard (.zst) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"
[ https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-21535. - Fix Version/s: 1.12.3 1.13.0 Resolution: Fixed > UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap > space" > - > > Key: FLINK-21535 > URL: https://issues.apache.org/jira/browse/FLINK-21535 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.13.0, 1.12.3 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"
[ https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302747#comment-17302747 ] Arvid Heise commented on FLINK-21535: - Merged as 755a8d8214554c64e9db0271a827485208185b8d into 1.12. > UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap > space" > - > > Key: FLINK-21535 > URL: https://issues.apache.org/jira/browse/FLINK-21535 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
[jira] [Commented] (FLINK-21511) Flink connector elasticsearch 6.x has a bug about BulkProcessor hangs for threads deadlocked
[ https://issues.apache.org/jira/browse/FLINK-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292715#comment-17292715 ] Arvid Heise commented on FLINK-21511: - Thank you [~zhangmeng0426] for bringing this up and doing the investigation. I assigned the ticket to you. > Flink connector elasticsearch 6.x has a bug about BulkProcessor hangs for > threads deadlocked > - > > Key: FLINK-21511 > URL: https://issues.apache.org/jira/browse/FLINK-21511 > Project: Flink > Issue Type: Bug > Components: Connectors / ElasticSearch >Affects Versions: 1.10.3, 1.11.3, 1.12.1 >Reporter: zhangmeng >Assignee: zhangmeng >Priority: Major > Labels: pull-request-available > > We use Flink 1.10 with the Flink Elasticsearch connector 6.x to write to > Elasticsearch. A total of 50 tasks had been running for a week; more than 30 > of them stopped writing data. Investigation found a deadlock bug in the > current Elasticsearch version, which was fixed in a later version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-21511) Flink connector elasticsearch 6.x has a bug about BulkProcessor hangs for threads deadlocked
[ https://issues.apache.org/jira/browse/FLINK-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise reassigned FLINK-21511: --- Assignee: zhangmeng > Flink connector elasticsearch 6.x has a bug about BulkProcessor hangs for > threads deadlocked > - > > Key: FLINK-21511 > URL: https://issues.apache.org/jira/browse/FLINK-21511 > Project: Flink > Issue Type: Bug > Components: Connectors / ElasticSearch >Affects Versions: 1.10.3, 1.11.3, 1.12.1 >Reporter: zhangmeng >Assignee: zhangmeng >Priority: Major > Labels: pull-request-available > > We use Flink 1.10 with the Flink Elasticsearch connector 6.x to write to > Elasticsearch. A total of 50 tasks had been running for a week; more than 30 > of them stopped writing data. Investigation found a deadlock bug in the > current Elasticsearch version, which was fixed in a later version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17510) StreamingKafkaITCase. testKafka timeouts on downloading Kafka
[ https://issues.apache.org/jira/browse/FLINK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291119#comment-17291119 ] Arvid Heise commented on FLINK-17510: - https://dev.azure.com/arvidheise0209/arvidheise/_build/results?buildId=947=logs=9401bf33-03c4-5a24-83fe-e51d75db73ef=72901ab2-7cd0-57be-82b1-bca51de20fba > StreamingKafkaITCase. testKafka timeouts on downloading Kafka > - > > Key: FLINK-17510 > URL: https://issues.apache.org/jira/browse/FLINK-17510 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines, Connectors / Kafka, Tests >Affects Versions: 1.11.3, 1.12.1 >Reporter: Robert Metzger >Priority: Major > Labels: test-stability > > CI: > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=585=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5 > {code} > 2020-05-05T00:06:49.7268716Z [INFO] > --- > 2020-05-05T00:06:49.7268938Z [INFO] T E S T S > 2020-05-05T00:06:49.7269282Z [INFO] > --- > 2020-05-05T00:06:50.5336315Z [INFO] Running > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-05-05T00:11:26.8603439Z [ERROR] Tests run: 3, Failures: 0, Errors: 2, > Skipped: 0, Time elapsed: 276.323 s <<< FAILURE! - in > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-05-05T00:11:26.8604882Z [ERROR] testKafka[1: > kafka-version:0.11.0.2](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) > Time elapsed: 120.024 s <<< ERROR! > 2020-05-05T00:11:26.8605942Z java.io.IOException: Process ([wget, -q, -P, > /tmp/junit2815750531595874769/downloads/1290570732, > https://archive.apache.org/dist/kafka/0.11.0.2/kafka_2.11-0.11.0.2.tgz]) > exceeded timeout (12) or number of retries (3). 
> 2020-05-05T00:11:26.8606732Z at > org.apache.flink.tests.util.AutoClosableProcess$AutoClosableProcessBuilder.runBlockingWithRetry(AutoClosableProcess.java:132) > 2020-05-05T00:11:26.8607321Z at > org.apache.flink.tests.util.cache.AbstractDownloadCache.getOrDownload(AbstractDownloadCache.java:127) > 2020-05-05T00:11:26.8607826Z at > org.apache.flink.tests.util.cache.LolCache.getOrDownload(LolCache.java:31) > 2020-05-05T00:11:26.8608343Z at > org.apache.flink.tests.util.kafka.LocalStandaloneKafkaResource.setupKafkaDist(LocalStandaloneKafkaResource.java:98) > 2020-05-05T00:11:26.8608892Z at > org.apache.flink.tests.util.kafka.LocalStandaloneKafkaResource.before(LocalStandaloneKafkaResource.java:92) > 2020-05-05T00:11:26.8609602Z at > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:46) > 2020-05-05T00:11:26.8610026Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-05-05T00:11:26.8610553Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-05-05T00:11:26.8610958Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-05-05T00:11:26.8611388Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-05-05T00:11:26.8612214Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-05-05T00:11:26.8612706Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-05-05T00:11:26.8613109Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-05-05T00:11:26.8613551Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-05-05T00:11:26.8614019Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-05-05T00:11:26.8614442Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-05-05T00:11:26.8614869Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-05-05T00:11:26.8615251Z at > 
org.junit.runners.Suite.runChild(Suite.java:128) > 2020-05-05T00:11:26.8615654Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2020-05-05T00:11:26.8616060Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-05-05T00:11:26.8616465Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-05-05T00:11:26.8616893Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-05-05T00:11:26.8617893Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-05-05T00:11:26.8618490Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-05-05T00:11:26.8619056Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-05-05T00:11:26.8619589Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2020-05-05T00:11:26.8620073Z at > org.junit.runners.Suite.runChild(Suite.java:27) >
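The failure above comes from a download helper (`AutoClosableProcess.runBlockingWithRetry`) that gives up once a per-attempt timeout or a fixed retry budget is exhausted. A minimal sketch of that bounded-retry pattern follows; the class and method names are hypothetical and make no claim to match the real test utility:

```java
import java.util.function.Supplier;

/** Sketch of a bounded-retry helper, loosely modeled on the utility in the stack trace. */
public class Retry {
    /** Runs the action up to maxAttempts times; rethrows the last failure if all attempts fail. */
    public static <T> T withRetry(int maxAttempts, Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get(); // e.g. a wget download that may time out
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        String result = withRetry(3, () -> {
            if (++attempts[0] < 2) {
                throw new RuntimeException("simulated timeout");
            }
            return "downloaded";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

A real download cache would also add a per-attempt timeout and backoff between attempts; the fixed budget here is just the part the error message reports.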
[jira] [Commented] (FLINK-21452) FLIP-27 sources cannot reliably downscale
[ https://issues.apache.org/jira/browse/FLINK-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291153#comment-17291153 ] Arvid Heise commented on FLINK-21452: - Merged as fb99ce2e22ca84dece1f7a431a92a4cecb6a71f2^ in 1.12 and as 81cfe465c9e4a17e563e1b4c02cd60a63b984de5^ in master. > FLIP-27 sources cannot reliably downscale > - > > Key: FLINK-21452 > URL: https://issues.apache.org/jira/browse/FLINK-21452 > Project: Flink > Issue Type: Bug > Components: Connectors / Common >Affects Versions: 1.12.1, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.2, 1.13.0 > > > Sources currently store their registered readers into the snapshot. However, > when downscaling, we have unmatched readers that cause us to violate a couple of > invariants. > The solution is to not store registered readers - they are re-registered > anyway on restart. > To keep it backward compatible, the best option is to always store an empty > set of readers while writing the snapshot and discard any recovered readers > from the snapshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-21452) FLIP-27 sources cannot reliably downscale
[ https://issues.apache.org/jira/browse/FLINK-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-21452. - Resolution: Fixed > FLIP-27 sources cannot reliably downscale > - > > Key: FLINK-21452 > URL: https://issues.apache.org/jira/browse/FLINK-21452 > Project: Flink > Issue Type: Bug > Components: Connectors / Common >Affects Versions: 1.12.1, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.2, 1.13.0 > > > Sources currently store their registered readers into the snapshot. However, > when downscaling, we have unmatched readers that cause us to violate a couple of > invariants. > The solution is to not store registered readers - they are re-registered > anyway on restart. > To keep it backward compatible, the best option is to always store an empty > set of readers while writing the snapshot and discard any recovered readers > from the snapshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-21452) FLIP-27 sources cannot reliably downscale
[ https://issues.apache.org/jira/browse/FLINK-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-21452: Description: Sources currently store their registered readers into the snapshot. However, when downscaling, there are unmatched readers that violate a couple of invariants. The solution is to not store registered readers - they are re-registered anyway on restart. To keep it backward compatible, the best option is to always store an empty set of readers while writing the snapshot and discard any recovered readers from the snapshot. was: Sources currently store their registered readers into the snapshot. However, when downscaling, we have unmatched readers that we violate a couple of invariants. The solution is to not store registered readers - they are re-registered anyways on restart. To keep it backward compatible, the best option is to always store an empty set of readers while writing the snapshot and discard any recovered readers from the snapshot. > FLIP-27 sources cannot reliably downscale > - > > Key: FLINK-21452 > URL: https://issues.apache.org/jira/browse/FLINK-21452 > Project: Flink > Issue Type: Bug > Components: Connectors / Common >Affects Versions: 1.12.1, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.2, 1.13.0 > > > Sources currently store their registered readers into the snapshot. However, > when downscaling, there are unmatched readers that violate a couple of > invariants. > The solution is to not store registered readers - they are re-registered > anyway on restart. > To keep it backward compatible, the best option is to always store an empty > set of readers while writing the snapshot and discard any recovered readers > from the snapshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
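The fix described above can be sketched in isolation. This is a minimal illustration of the idea (persist splits, never persist reader registrations, ignore any readers an old snapshot still carries), not Flink's actual enumerator-checkpoint code; `EnumeratorState` and its fields are hypothetical names:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical enumerator state: pending splits must survive a restore,
// reader registrations must not (readers re-register on restart anyway).
final class EnumeratorState {
    final Set<String> pendingSplits = new HashSet<>();
    final Set<Integer> registeredReaders = new HashSet<>();

    // Snapshot: always write an EMPTY reader set, so a restore with lower
    // parallelism never sees reader IDs above the new parallelism.
    EnumeratorState snapshot() {
        EnumeratorState copy = new EnumeratorState();
        copy.pendingSplits.addAll(pendingSplits);
        // intentionally do NOT copy registeredReaders
        return copy;
    }

    // Restore: discard any readers that an old-format snapshot still
    // carries, which keeps the change backward compatible.
    static EnumeratorState restore(Set<String> splits, Set<Integer> recoveredReaders) {
        EnumeratorState state = new EnumeratorState();
        state.pendingSplits.addAll(splits);
        // recoveredReaders is deliberately ignored
        return state;
    }

    Set<Integer> readers() {
        return Collections.unmodifiableSet(registeredReaders);
    }
}
```

With this shape, a job that snapshots at parallelism 3 and restores at parallelism 2 never replays reader IDs 0..2 into an enumerator that only has readers 0..1.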
[jira] [Commented] (FLINK-17510) StreamingKafkaITCase. testKafka timeouts on downloading Kafka
[ https://issues.apache.org/jira/browse/FLINK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291126#comment-17291126 ] Arvid Heise commented on FLINK-17510: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13773=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=ff888d9b-cd34-53cc-d90f-3e446d355529 > StreamingKafkaITCase. testKafka timeouts on downloading Kafka > - > > Key: FLINK-17510 > URL: https://issues.apache.org/jira/browse/FLINK-17510 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines, Connectors / Kafka, Tests >Affects Versions: 1.11.3, 1.12.1 >Reporter: Robert Metzger >Priority: Major > Labels: test-stability > > CI: > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=585=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5 > {code} > 2020-05-05T00:06:49.7268716Z [INFO] > --- > 2020-05-05T00:06:49.7268938Z [INFO] T E S T S > 2020-05-05T00:06:49.7269282Z [INFO] > --- > 2020-05-05T00:06:50.5336315Z [INFO] Running > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-05-05T00:11:26.8603439Z [ERROR] Tests run: 3, Failures: 0, Errors: 2, > Skipped: 0, Time elapsed: 276.323 s <<< FAILURE! - in > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-05-05T00:11:26.8604882Z [ERROR] testKafka[1: > kafka-version:0.11.0.2](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) > Time elapsed: 120.024 s <<< ERROR! > 2020-05-05T00:11:26.8605942Z java.io.IOException: Process ([wget, -q, -P, > /tmp/junit2815750531595874769/downloads/1290570732, > https://archive.apache.org/dist/kafka/0.11.0.2/kafka_2.11-0.11.0.2.tgz]) > exceeded timeout (12) or number of retries (3). 
> 2020-05-05T00:11:26.8606732Z at > org.apache.flink.tests.util.AutoClosableProcess$AutoClosableProcessBuilder.runBlockingWithRetry(AutoClosableProcess.java:132) > 2020-05-05T00:11:26.8607321Z at > org.apache.flink.tests.util.cache.AbstractDownloadCache.getOrDownload(AbstractDownloadCache.java:127) > 2020-05-05T00:11:26.8607826Z at > org.apache.flink.tests.util.cache.LolCache.getOrDownload(LolCache.java:31) > 2020-05-05T00:11:26.8608343Z at > org.apache.flink.tests.util.kafka.LocalStandaloneKafkaResource.setupKafkaDist(LocalStandaloneKafkaResource.java:98) > 2020-05-05T00:11:26.8608892Z at > org.apache.flink.tests.util.kafka.LocalStandaloneKafkaResource.before(LocalStandaloneKafkaResource.java:92) > 2020-05-05T00:11:26.8609602Z at > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:46) > 2020-05-05T00:11:26.8610026Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-05-05T00:11:26.8610553Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-05-05T00:11:26.8610958Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-05-05T00:11:26.8611388Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-05-05T00:11:26.8612214Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-05-05T00:11:26.8612706Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-05-05T00:11:26.8613109Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-05-05T00:11:26.8613551Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-05-05T00:11:26.8614019Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-05-05T00:11:26.8614442Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-05-05T00:11:26.8614869Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-05-05T00:11:26.8615251Z at > 
org.junit.runners.Suite.runChild(Suite.java:128) > 2020-05-05T00:11:26.8615654Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2020-05-05T00:11:26.8616060Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-05-05T00:11:26.8616465Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-05-05T00:11:26.8616893Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-05-05T00:11:26.8617893Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-05-05T00:11:26.8618490Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-05-05T00:11:26.8619056Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-05-05T00:11:26.8619589Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2020-05-05T00:11:26.8620073Z at > org.junit.runners.Suite.runChild(Suite.java:27) >
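The helper at the top of the trace, `runBlockingWithRetry`, runs a blocking action and fails once both the per-attempt timeout and the retry budget are exhausted. The pattern looks roughly like this (a sketch with made-up names, not Flink's `AutoClosableProcess` implementation):

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class RetryRunner {
    // Run the action up to maxRetries times, giving each attempt timeoutMs
    // milliseconds; throw IOException once every attempt has failed.
    static <T> T runBlockingWithRetry(Callable<T> action, int maxRetries, long timeoutMs)
            throws IOException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                Future<T> future = executor.submit(action);
                try {
                    return future.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException | InterruptedException e) {
                    future.cancel(true); // stop the hung attempt before retrying
                    last = e;
                }
            }
            throw new IOException("Action exceeded timeout (" + timeoutMs
                    + ") or number of retries (" + maxRetries + ").", last);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In the failure above, every `wget` attempt against archive.apache.org hit the timeout, so the loop exhausted its retries and surfaced the `IOException` seen in the log.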
[jira] [Commented] (FLINK-17510) StreamingKafkaITCase. testKafka timeouts on downloading Kafka
[ https://issues.apache.org/jira/browse/FLINK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291114#comment-17291114 ] Arvid Heise commented on FLINK-17510: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13755=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=05b74a19-4ee4-5036-c46f-ada307df6cf0 > StreamingKafkaITCase. testKafka timeouts on downloading Kafka > - > > Key: FLINK-17510 > URL: https://issues.apache.org/jira/browse/FLINK-17510 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines, Connectors / Kafka, Tests >Affects Versions: 1.11.3, 1.12.1 >Reporter: Robert Metzger >Priority: Major > Labels: test-stability > > CI: > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=585=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5 > {code} > 2020-05-05T00:06:49.7268716Z [INFO] > --- > 2020-05-05T00:06:49.7268938Z [INFO] T E S T S > 2020-05-05T00:06:49.7269282Z [INFO] > --- > 2020-05-05T00:06:50.5336315Z [INFO] Running > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-05-05T00:11:26.8603439Z [ERROR] Tests run: 3, Failures: 0, Errors: 2, > Skipped: 0, Time elapsed: 276.323 s <<< FAILURE! - in > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-05-05T00:11:26.8604882Z [ERROR] testKafka[1: > kafka-version:0.11.0.2](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) > Time elapsed: 120.024 s <<< ERROR! > 2020-05-05T00:11:26.8605942Z java.io.IOException: Process ([wget, -q, -P, > /tmp/junit2815750531595874769/downloads/1290570732, > https://archive.apache.org/dist/kafka/0.11.0.2/kafka_2.11-0.11.0.2.tgz]) > exceeded timeout (12) or number of retries (3). 
[jira] [Commented] (FLINK-21490) UnalignedCheckpointITCase fails on azure
[ https://issues.apache.org/jira/browse/FLINK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291140#comment-17291140 ] Arvid Heise commented on FLINK-21490: - Merged as 29abccd4cb7d4905fa168f8d7b68a113e9640fca^ in master and as 0c1b20d2119463d4571d17de607aebfff1b4b17f^ in 1.12. > UnalignedCheckpointITCase fails on azure > > > Key: FLINK-21490 > URL: https://issues.apache.org/jira/browse/FLINK-21490 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.1, 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.12.2, 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13682=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2 > {code} > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > at > org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137) > at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > at > org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > at > 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066) > at akka.dispatch.OnComplete.internal(Future.scala:264) > at akka.dispatch.OnComplete.internal(Future.scala:261) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572) > at > akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22) > at > akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21) > at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436) > at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > at > akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > at > akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by > FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=5, > backoffTimeMS=100) > at > org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:130) > at >
[jira] [Resolved] (FLINK-21490) UnalignedCheckpointITCase fails on azure
[ https://issues.apache.org/jira/browse/FLINK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-21490. - Resolution: Fixed > UnalignedCheckpointITCase fails on azure > > > Key: FLINK-21490 > URL: https://issues.apache.org/jira/browse/FLINK-21490 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.1, 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.12.2, 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13682=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2 > {code} > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. > at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > at > org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137) > at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > at > org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) > at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066) > at akka.dispatch.OnComplete.internal(Future.scala:264) > at 
[jira] [Created] (FLINK-21452) FLIP-27 sources cannot reliably downscale
Arvid Heise created FLINK-21452: --- Summary: FLIP-27 sources cannot reliably downscale Key: FLINK-21452 URL: https://issues.apache.org/jira/browse/FLINK-21452 Project: Flink Issue Type: Improvement Components: Connectors / Common Affects Versions: 1.12.1, 1.13.0 Reporter: Arvid Heise Assignee: Arvid Heise Sources currently store their registered readers into the snapshot. However, when downscaling, we have unmatched readers that cause us to violate a couple of invariants. The solution is to not store registered readers - they are re-registered anyway on restart. To keep it backward compatible, the best option is to always store an empty set of readers while writing the snapshot and discard any recovered readers from the snapshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21490) UnalignedCheckpointITCase fails on azure
[ https://issues.apache.org/jira/browse/FLINK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290742#comment-17290742 ] Arvid Heise commented on FLINK-21490: - The error is probably test-only: For some reason the test does not terminate after 10 successful checkpoints (to be investigated). {noformat} 12:21:43,173 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 3165 (type=CHECKPOINT) @ 1614169303172 for job 78d8cb678ee2304d517c9e42bff43aea. {noformat} I suspect that we overflow {{MAX_INT}} in {{value}}, and then {{checkHeader}} fails as it uses the upper 4 bytes of the long. We have already hardened that part to give a meaningful exception in the {{UCRescaleITCase}}, but it might be a good idea to extract that to this ticket as that test will only go into master. So for now I'd harden the test. There is also a related issue with unions that I initially suspected. > UnalignedCheckpointITCase fails on azure > > > Key: FLINK-21490 > URL: https://issues.apache.org/jira/browse/FLINK-21490 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Major > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13682=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2 > {code} > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
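The suspected overflow is easy to reproduce in isolation. When a marker lives in the upper 4 bytes of a long and the payload in the lower 4, a counter that grows past the unsigned 32-bit range carries into the marker, and the header check starts failing. This is a toy reconstruction of that failure mode; `HEADER` and the method names are made up and do not reflect Flink's actual buffer format:

```java
final class HeaderPacking {
    // Hypothetical marker stored in the upper 4 bytes of the long.
    static final long HEADER = 0x5AFEC0DEL << 32;

    // Correct packing masks the payload to 32 bits before combining,
    // so the header bits can never be touched.
    static long pack(long value) {
        return HEADER | (value & 0xFFFF_FFFFL);
    }

    // Buggy packing adds the raw counter; once it exceeds 0xFFFFFFFF,
    // the carry corrupts the header bits.
    static long packBuggy(long value) {
        return HEADER + value;
    }

    // Header check only looks at the upper 4 bytes.
    static boolean checkHeader(long packed) {
        return (packed & 0xFFFF_FFFF_0000_0000L) == HEADER;
    }
}
```

A counter below `Integer.MAX_VALUE` passes both variants, which is why such a bug only surfaces in long-running tests that push the counter past the 32-bit boundary.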
[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316515#comment-17316515 ] Arvid Heise commented on FLINK-20816: - [~yunta]'s analysis is spot on; there is not much to add, as the interesting debug statements are not yet in the code. Interestingly, a line like {noformat} 21:04:11,714 [ DeclineSink (1/1)#0] DEBUG org.apache.flink.runtime.state.SnapshotStrategyRunner[] - StuckAsyncSnapshotStrategy (FsCheckpointStorageLocation {fileSystem=org.apache.flink.core.fs.SafetyNetWrapperFileSystem@3f0e55b0, checkpointDirectory=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/chk-2, sharedStateDirectory=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/shared, taskOwnedStateDirectory=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/taskowned, metadataFilePath=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/chk-2/_metadata, reference=(default), fileStateSizeThreshold=20480, writeBufferSize=4096}, synchronous part) in thread Thread[DeclineSink (1/1)#0,5,Flink Task Threads] took 0 ms. {noformat} is missing from the failed log, which indicates that we never successfully execute {{SubtaskCheckpointCoordinatorImpl#buildOperatorSnapshotFutures}}. The unaligned-checkpoint code before Yun's fragment executes normally, so I don't immediately see a connection to unaligned checkpoints. 
> NotifyCheckpointAbortedITCase failed due to timeout > --- > > Key: FLINK-20816 > URL: https://issues.apache.org/jira/browse/FLINK-20816 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Matthias >Assignee: Arvid Heise >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > Attachments: flink-20816-failure.log, flink-20816-success.log > > > [This > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245] > failed because {{NotifyCheckpointAbortedITCase}} failed due to a > timeout. > {code} > 2020-12-29T21:48:40.9430511Z [INFO] Running > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087961Z [ERROR] > testNotifyCheckpointAborted[unalignedCheckpointEnabled > =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase) > Time elapsed: 104.044 s <<< ERROR! 
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: > test timed out after 10 milliseconds > 2020-12-29T21:50:28.0088972Z at java.lang.Object.wait(Native Method) > 2020-12-29T21:50:28.0089267Z at java.lang.Object.wait(Object.java:502) > 2020-12-29T21:50:28.0089633Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61) > 2020-12-29T21:50:28.0090458Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200) > 2020-12-29T21:50:28.0091313Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183) > 2020-12-29T21:50:28.0091819Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-12-29T21:50:28.0092199Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-12-29T21:50:28.0092675Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-12-29T21:50:28.0093095Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-12-29T21:50:28.0093495Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-12-29T21:50:28.0093980Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-12-29T21:50:28.009Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-12-29T21:50:28.0094917Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-12-29T21:50:28.0095663Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > 2020-12-29T21:50:28.0096221Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > 2020-12-29T21:50:28.0096675Z at >
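The stack shows the test blocked in `OneShotLatch.await`: the latch is supposed to be triggered when every operator reports the aborted checkpoint, and because that never happens, the test sits in `Object.wait` until the JUnit timeout kills it. The waiting primitive behaves roughly like this simplified stand-in (not the real `org.apache.flink.core.testutils.OneShotLatch`, which has a richer API):

```java
import java.util.concurrent.TimeoutException;

// Simplified one-shot latch: await() blocks until trigger() has been
// called once. If the expected notification never arrives, await() blocks
// until its own (or the test framework's) timeout fires -- exactly the
// failure mode in the log above.
final class SimpleOneShotLatch {
    private boolean triggered;

    public synchronized void trigger() {
        triggered = true;
        notifyAll();
    }

    public synchronized void await(long timeoutMs)
            throws InterruptedException, TimeoutException {
        final long deadline = System.currentTimeMillis() + timeoutMs;
        while (!triggered) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                throw new TimeoutException("latch not triggered within " + timeoutMs + " ms");
            }
            wait(remaining);
        }
    }
}
```

The `while` loop around `wait` guards against spurious wakeups; a plain `if` would let the latch return without ever having been triggered.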
[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316519#comment-17316519 ] Arvid Heise commented on FLINK-20816: - Indeed https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15528=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9680 shows that the same occurs for aligned checkpoints. > NotifyCheckpointAbortedITCase failed due to timeout > --- > > Key: FLINK-20816 > URL: https://issues.apache.org/jira/browse/FLINK-20816 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Matthias >Assignee: Arvid Heise >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > Attachments: flink-20816-failure.log, flink-20816-success.log > > > [This > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245] > failed because {{NotifyCheckpointAbortedITCase}} timed out. > {code} > 2020-12-29T21:48:40.9430511Z [INFO] Running > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087961Z [ERROR] > testNotifyCheckpointAborted[unalignedCheckpointEnabled > =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase) > Time elapsed: 104.044 s <<< ERROR! 
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: > test timed out after 10 milliseconds > 2020-12-29T21:50:28.0088972Z at java.lang.Object.wait(Native Method) > 2020-12-29T21:50:28.0089267Z at java.lang.Object.wait(Object.java:502) > 2020-12-29T21:50:28.0089633Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61) > 2020-12-29T21:50:28.0090458Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200) > 2020-12-29T21:50:28.0091313Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183) > 2020-12-29T21:50:28.0091819Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-12-29T21:50:28.0092199Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-12-29T21:50:28.0092675Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-12-29T21:50:28.0093095Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-12-29T21:50:28.0093495Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-12-29T21:50:28.0093980Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-12-29T21:50:28.009Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-12-29T21:50:28.0094917Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-12-29T21:50:28.0095663Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > 2020-12-29T21:50:28.0096221Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > 2020-12-29T21:50:28.0096675Z at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > 2020-12-29T21:50:28.0097022Z 
at java.lang.Thread.run(Thread.java:748) > {code} > The branch contained changes from FLINK-20594 and FLINK-20595. These issues > remove code that is no longer used and should only have affected unit > tests. [The previous > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results] > containing all the changes except for > [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279] > passed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure
[ https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317784#comment-17317784 ] Arvid Heise commented on FLINK-22173: - Browsing through the log, we first see some issues during cancellation (4x): {noformat} 23:15:14,497 [ failing-map (7/7)#0] WARN org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] - failing-map (7/7)#0 (34b929a926bea2eaa9eb1eccee63b4cb): Error during release of channel resources: org.apache.flink.shaded.netty4.io.netty.util.IllegalReferenceCountException: refCnt: 0. java.io.IOException: org.apache.flink.shaded.netty4.io.netty.util.IllegalReferenceCountException: refCnt: 0 {noformat} Then there seems to be a deadlock: {noformat} 23:15:42,548 [Flink Netty Client (0) Thread 3] TRACE org.apache.flink.runtime.io.network.logger.NetworkActionsLogger [] - [global0 (1/7)#1 (b1735bf22fee2c8bc8b3199426d089d7)] RemoteInputChannel#onBuffer Buffer{size=651, hash=-770186644}, seq 392, ChannelStatePersister(lastSeenBarrier=10 (COMPLETED)} @ InputChannelInfo{gateIdx=0, inputChannelIdx=5} 23:25:16,104 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Checkpoint 10 of job b3f28f25c25a01c99593dcf74948687e expired before completing. {noformat} Leading to: {noformat} org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. 
{noformat} > UnalignedCheckpointRescaleITCase fails on azure > --- > > Key: FLINK-22173 > URL: https://issues.apache.org/jira/browse/FLINK-22173 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232=logs=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865=9628 > {code} > 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase > 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no > scale union from 7 to > 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase) > Time elapsed: 607.467 s <<< ERROR! > 2021-04-08T23:25:56.3133586Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-08T23:25:56.3134070Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-08T23:25:56.3134643Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-08T23:25:56.3135577Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368) > 2021-04-08T23:25:56.3138843Z at > sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) > 2021-04-08T23:25:56.3139402Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-08T23:25:56.3139880Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-08T23:25:56.3140328Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-08T23:25:56.3140844Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-08T23:25:56.3141768Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-08T23:25:56.3142272Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-08T23:25:56.3142706Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-08T23:25:56.3143142Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-08T23:25:56.3143608Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-08T23:25:56.3144039Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-08T23:25:56.3144434Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-08T23:25:56.3145027Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-08T23:25:56.3145484Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-08T23:25:56.3145981Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-08T23:25:56.3146421Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-08T23:25:56.3146843Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) >
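As context for the "Exceeded checkpoint tolerable failure threshold" exception above: this threshold is a configurable limit on how many consecutive checkpoint failures the job tolerates before failing. A hedged sketch for anyone reproducing this outside the test (the value 3 is an arbitrary example; the programmatic equivalent is {{CheckpointConfig#setTolerableCheckpointFailureNumber}}):

{noformat}
# flink-conf.yaml: tolerate up to 3 failed checkpoints before failing the job
execution.checkpointing.tolerable-failed-checkpoints: 3
{noformat}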
[jira] [Commented] (FLINK-21873) CoordinatedSourceRescaleITCase.testUpscaling fails on AZP
[ https://issues.apache.org/jira/browse/FLINK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316684#comment-17316684 ] Arvid Heise commented on FLINK-21873: - Merged into master as 9953206599910983425dceea7a48164370fa605b and 1.12 as 69cd36d9aa6e4aeb2ad827020d125712307ab585. > CoordinatedSourceRescaleITCase.testUpscaling fails on AZP > - > > Key: FLINK-21873 > URL: https://issues.apache.org/jira/browse/FLINK-21873 > Project: Flink > Issue Type: Bug > Components: Connectors / Common >Affects Versions: 1.13.0 >Reporter: Till Rohrmann >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.14.0 > > > The test {{CoordinatedSourceRescaleITCase.testUpscaling}} fails on AZP with > {code} > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > at akka.actor.Actor$class.aroundReceive(Actor.scala:517) > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) > at akka.actor.ActorCell.invoke(ActorCell.scala:561) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) > at akka.dispatch.Mailbox.run(Mailbox.scala:225) > at akka.dispatch.Mailbox.exec(Mailbox.scala:235) > ... 
4 more > Caused by: java.lang.Exception: successfully restored checkpoint > at > org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:139) > at > org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:126) > at > org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38) > at > org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71) > at > org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46) > at > org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26) > at > org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:161) > at > org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110) > at > org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101) > at > org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:95) > at > org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:275) > at > org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:408) > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:190) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:624) > at > 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:588) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:760) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > at java.lang.Thread.run(Thread.java:748) > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14997=logs=fc5181b0-e452-5c8f-68de-1097947f6483=62110053-334f-5295-a0ab-80dd7e2babbf=22049 > cc [~AHeise] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-21873) CoordinatedSourceRescaleITCase.testUpscaling fails on AZP
[ https://issues.apache.org/jira/browse/FLINK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-21873. - Fix Version/s: (was: 1.14.0) 1.13.0 Resolution: Fixed -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317839#comment-17317839 ] Arvid Heise edited comment on FLINK-20816 at 4/9/21, 10:00 AM: --- With some more echo debugging, it's most likely caused by {noformat} OperatorSnapshotFutures snapshotInProgress = checkpointStreamOperator( op, checkpointMetaData, checkpointOptions, storage, isRunning); {noformat} hanging in the sync phase. was (Author: aheise): With some more echo debugging, it's most likely caused by {noformat} CheckpointStreamFactory storage = checkpointStorage.resolveCheckpointStorageLocation( checkpointId, checkpointOptions.getTargetLocation()); {noformat} hanging in the sync phase.
[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317847#comment-17317847 ] Arvid Heise commented on FLINK-20816: - That is actually by design of the test {noformat} if (context.getCheckpointId() == DECLINE_CHECKPOINT_ID) { DeclineSink.waitLatch.await(); } {noformat} {{DeclineSink}} is not supposed to complete the checkpoint until the abort of the first checkpoint has been verified. However, there is no log statement indicating that an abort call happened at all. In the attached success.log we have {noformat} 21:04:11,624 [Source: NormalSource (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 1 for task Source: NormalSource (1/1)#0 21:04:11,624 [ DeclineSink (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 1 for task DeclineSink (1/1)#0 21:04:11,625 [ NormalMap (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 1 for task NormalMap (1/1)#0 21:04:11,739 [Source: NormalSource (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 2 for task Source: NormalSource (1/1)#0 21:04:11,739 [ NormalMap (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 2 for task NormalMap (1/1)#0 21:04:11,739 [ DeclineSink (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 2 for task DeclineSink (1/1)#0 {noformat} while in failure.log, I can only find {noformat} 21:04:19,260 [Source: NormalSource (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 1 for task Source: NormalSource (1/1)#0 21:04:19,268 [ NormalMap (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 1 for task NormalMap (1/1)#0 21:05:58,297 [Source: NormalSource (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 2 for task Source: NormalSource (1/1)#0 21:05:58,297 [ NormalMap (1/1)#0] DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of aborted checkpoint 2 for task NormalMap (1/1)#0 {noformat} It might be some race condition, where the {{SubtaskCheckpointCoordinatorImpl}} does not know that the {{DeclineSink}} is already running. {noformat} 21:04:19,037 [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - DeclineSink (1/1) (7457bf515844f409738c9929fffc54f7) switched from DEPLOYING to RUNNING. {noformat}
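The {{DeclineSink.waitLatch}} above is an {{org.apache.flink.core.testutils.OneShotLatch}}: {{await()}} blocks until {{trigger()}} has fired once, after which it returns immediately. A minimal, self-contained analogue of that pattern (a hypothetical simplification, not Flink's actual class):

```java
// Minimal analogue of Flink's OneShotLatch test utility (hypothetical
// simplification, not the real implementation): await() blocks until
// trigger() has been called once; after that it returns immediately, forever.
public class OneShotLatchSketch {
    private final Object lock = new Object();
    private boolean triggered;

    /** Fires the latch; all current and future await() calls return. */
    public void trigger() {
        synchronized (lock) {
            triggered = true;
            lock.notifyAll();
        }
    }

    /** Blocks until trigger() has been called at least once. */
    public void await() throws InterruptedException {
        synchronized (lock) {
            while (!triggered) {
                lock.wait();
            }
        }
    }
}
```

The test hangs exactly when nothing ever calls {{trigger()}}, which matches the missing "Notification of aborted checkpoint" log lines for the {{DeclineSink}} in the failure log.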
[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317839#comment-17317839 ] Arvid Heise commented on FLINK-20816: - With some more echo debugging, it's most likely caused by {noformat} CheckpointStreamFactory storage = checkpointStorage.resolveCheckpointStorageLocation( checkpointId, checkpointOptions.getTargetLocation()); {noformat} hanging in the sync phase.
[jira] [Commented] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
[ https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316922#comment-17316922 ] Arvid Heise commented on FLINK-22081: - Merged into master as 2d3559e66db, into 1.12 as a9b34a3db23, and into 1.11 as 3bd44e083c8. > Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin > --- > > Key: FLINK-22081 > URL: https://issues.apache.org/jira/browse/FLINK-22081 > Project: Flink > Issue Type: Bug > Components: FileSystems >Reporter: Chen Qin >Assignee: Chen Qin >Priority: Minor > Labels: pull-request-available > Fix For: 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.11.0, 1.11.1, 1.11.2, > 1.11.3, 1.11.4, 1.12.0, 1.12.1, 1.12.2, 1.13.0, 1.12.3 > > Attachments: image (13).png > > > Using Flink 1.11.2 > I added the flink-s3-fs-hadoop jar in the plugins dir but I am seeing > checkpoint paths like > {{s3://my_app/__ENTROPY__/app_name-staging/flink/checkpoints/e10f47968ae74934bd833108d2272419/chk-3071}} > which means the entropy injection key is not being resolved. After some > debugging I found that in the > [EntropyInjector|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/EntropyInjector.java#L97] > we check whether the given file system is a {{SafetyNetWrapperFileSystem}} > (and, if so, unwrap its delegate), but we don't check for > {{[ClassLoaderFixingFileSystem|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/PluginFileSystemFactory.java#L65]}} > directly in the getEntropyFs method, which is the type the file system has when the S3 > file system dependencies are added as a plugin. > > Repro steps: > Flink 1.11.2 with flink-s3-fs-hadoop as a plugin and entropy injection turned on with > key _entropy_ > observe the checkpoint dir with the entropy marker not removed: 
> s3a://xxx/dev/checkpoints/_entropy_/xenon/event-stream-splitter/jobid/chk-5/ > compare to the marker being removed when running Flink 1.9.1 > s3a://xxx/dev/checkpoints/xenon/event-stream-splitter/jobid/chk-5/ > Add some logging to getEntropyFs; observe that it returns null because the passed-in > parameter is not {{SafetyNetWrapperFileSystem}} but > {{ClassLoaderFixingFileSystem}} > Apply the patch, build a release, and run the same job; the issue is resolved, as the attachment > shows > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
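The shape of the fix described above — unwrap every known delegating filesystem wrapper before testing for entropy support, instead of only the safety-net wrapper — can be sketched with simplified stand-in types. The class and field names below are illustrative, not Flink's exact API (the real types are SafetyNetWrapperFileSystem and PluginFileSystemFactory's ClassLoaderFixingFileSystem):

```java
// Simplified stand-ins for the wrapper hierarchy in the report above.
interface Fs {}

class EntropyFs implements Fs {}            // supports entropy injection

class SafetyNetWrapperFs implements Fs {    // wrapper used by the safety net
    final Fs delegate;
    SafetyNetWrapperFs(Fs delegate) { this.delegate = delegate; }
}

class ClassLoaderFixingFs implements Fs {   // wrapper used for plugin-loaded filesystems
    final Fs delegate;
    ClassLoaderFixingFs(Fs delegate) { this.delegate = delegate; }
}

public class EntropySketch {
    // Unwrap ALL known delegating wrappers, not just the safety net.
    // The reported bug: only SafetyNetWrapperFs was unwrapped, so an S3
    // filesystem loaded as a plugin (wrapped in ClassLoaderFixingFs) was
    // never recognized and getEntropyFs returned null.
    static EntropyFs getEntropyFs(Fs fs) {
        while (true) {
            if (fs instanceof SafetyNetWrapperFs) {
                fs = ((SafetyNetWrapperFs) fs).delegate;
            } else if (fs instanceof ClassLoaderFixingFs) {
                fs = ((ClassLoaderFixingFs) fs).delegate;
            } else {
                break;
            }
        }
        return fs instanceof EntropyFs ? (EntropyFs) fs : null;
    }

    public static void main(String[] args) {
        Fs plugin = new ClassLoaderFixingFs(new EntropyFs());
        System.out.println(getEntropyFs(plugin) != null); // true with the fix
    }
}
```

With only the SafetyNetWrapperFs branch, the plugin case falls through and entropy markers like {{_entropy_}} stay in the checkpoint path.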
[jira] [Updated] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
[ https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-22081: Fix Version/s: (was: 1.10.4) (was: 1.12.2) (was: 1.12.1) (was: 1.11.3) (was: 1.10.3) (was: 1.11.2) (was: 1.11.1) (was: 1.12.0) (was: 1.10.2) (was: 1.10.1) (was: 1.11.0) > Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin > --- > > Key: FLINK-22081 > URL: https://issues.apache.org/jira/browse/FLINK-22081 > Project: Flink > Issue Type: Bug > Components: FileSystems >Reporter: Chen Qin >Assignee: Chen Qin >Priority: Minor > Labels: pull-request-available > Fix For: 1.11.4, 1.13.0, 1.12.3 > > Attachments: image (13).png > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
[ https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-22081: Affects Version/s: 1.13.0 1.10.3 1.11.3 1.12.2 > Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin > --- > > Key: FLINK-22081 > URL: https://issues.apache.org/jira/browse/FLINK-22081 > Project: Flink > Issue Type: Bug > Components: FileSystems >Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0 >Reporter: Chen Qin >Assignee: Chen Qin >Priority: Minor > Labels: pull-request-available > Fix For: 1.11.4, 1.13.0, 1.12.3 > > Attachments: image (13).png > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
[ https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-22081: Priority: Major (was: Minor) > Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin > --- > > Key: FLINK-22081 > URL: https://issues.apache.org/jira/browse/FLINK-22081 > Project: Flink > Issue Type: Bug > Components: FileSystems >Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0 >Reporter: Chen Qin >Assignee: Chen Qin >Priority: Major > Labels: pull-request-available > Fix For: 1.11.4, 1.13.0, 1.12.3 > > Attachments: image (13).png > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
[ https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316924#comment-17316924 ] Arvid Heise commented on FLINK-22081: - Merging into 1.10 is quite an effort and the version is officially not maintained anymore. > Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin > --- > > Key: FLINK-22081 > URL: https://issues.apache.org/jira/browse/FLINK-22081 > Project: Flink > Issue Type: Bug > Components: FileSystems >Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0 >Reporter: Chen Qin >Assignee: Chen Qin >Priority: Major > Labels: pull-request-available > Fix For: 1.11.4, 1.13.0, 1.12.3 > > Attachments: image (13).png > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
[ https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-22081. - Resolution: Fixed > Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin > --- > > Key: FLINK-22081 > URL: https://issues.apache.org/jira/browse/FLINK-22081 > Project: Flink > Issue Type: Bug > Components: FileSystems >Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0 >Reporter: Chen Qin >Assignee: Chen Qin >Priority: Major > Labels: pull-request-available > Fix For: 1.11.4, 1.13.0, 1.12.3 > > Attachments: image (13).png > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22190) no guarantee on Flink exactly_once sink to Kafka
[ https://issues.apache.org/jira/browse/FLINK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319910#comment-17319910 ] Arvid Heise commented on FLINK-22190: - 1. You get the "/ by zero" ArithmeticException because you divide by 0 in user code: {{Random.nextInt(5)}} can return 0. That's something that you need to fix on your end. 2. Could you provide example output to show the duplicates? Where does the fail-over happen? Note that exactly once does not mean deduplication of records or parts thereof. Exactly once ensures that there are no duplicates caused by fail-over/restarts. > no guarantee on Flink exactly_once sink to Kafka > - > > Key: FLINK-22190 > URL: https://issues.apache.org/jira/browse/FLINK-22190 > Project: Flink > Issue Type: Bug > Components: API / DataStream >Affects Versions: 1.12.2 > Environment: *flink: 1.12.2* > *kafka: 2.7.0* >Reporter: Spongebob >Priority: Major > > When I tried to test Flink's exactly_once sink to Kafka, I > found it does not run as expected. Here's the pipeline of the flink > applications: raw data (flink app0) -> kafka topic1 -> flink app1 -> kafka > topic2 -> flink app2; flink tasks may hit a "/ by zero" ArithmeticException at random. 
Below > shows the codes: > {code:java} > //代码占位符 > raw data, flink app0: > class SimpleSource1 extends SourceFunction[String] { > var switch = true > val students: Array[String] = Array("Tom", "Jerry", "Gory") > override def run(sourceContext: SourceFunction.SourceContext[String]): Unit > = { > var i = 0 > while (switch) { > sourceContext.collect(s"${students(Random.nextInt(students.length))},$i") > i += 1 > Thread.sleep(5000) > } > } > override def cancel(): Unit = switch = false > } > val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment > val dataStream = streamEnv.addSource(new SimpleSource1) > dataStream.addSink(new FlinkKafkaProducer[String]("xfy:9092", > "single-partition-topic-2", new SimpleStringSchema())) > streamEnv.execute("sink kafka") > > flink-app1: > val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment > streamEnv.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE) > val prop = new Properties() > prop.setProperty("bootstrap.servers", "xfy:9092") > prop.setProperty("group.id", "test") > val dataStream = streamEnv.addSource(new FlinkKafkaConsumer[String]( > "single-partition-topic-2", > new SimpleStringSchema, > prop > )) > val resultStream = dataStream.map(x => { > val data = x.split(",") > (data(0), data(1), data(1).toInt / Random.nextInt(5)).toString() > } > ) > resultStream.print().setParallelism(1) > val propProducer = new Properties() > propProducer.setProperty("bootstrap.servers", "xfy:9092") > propProducer.setProperty("transaction.timeout.ms", s"${1000 * 60 * 5}") > resultStream.addSink(new FlinkKafkaProducer[String]( > "single-partition-topic", > new MyKafkaSerializationSchema("single-partition-topic"), > propProducer, > Semantic.EXACTLY_ONCE)) > streamEnv.execute("sink kafka") > > flink-app2: > val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment > val prop = new Properties() > prop.setProperty("bootstrap.servers", "xfy:9092") > prop.setProperty("group.id", "test") > 
prop.setProperty("isolation_level", "read_committed") > val dataStream = streamEnv.addSource(new FlinkKafkaConsumer[String]( > "single-partition-topic", > new SimpleStringSchema, > prop > )) > dataStream.print().setParallelism(1) > streamEnv.execute("consumer kafka"){code} > > flink app1 will print some duplicate numbers; I expected flink > app2 to deduplicate them, but it does not. -- This message was sent by Atlassian Jira (v8.3.4#803005)
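One detail worth double-checking in the repro above: flink-app2 sets the property spelled {{isolation_level}}, while Kafka's consumer configuration key is {{isolation.level}} (dotted). With the misspelled key the consumer silently keeps the default {{read_uncommitted}}, so records from aborted or not-yet-committed transactions of the EXACTLY_ONCE producer remain visible downstream and look like duplicates. A sketch of the consumer-side properties (broker address taken from the reporter's example):

```java
import java.util.Properties;

public class ReadCommittedProps {
    public static Properties consumerProps() {
        Properties prop = new Properties();
        prop.setProperty("bootstrap.servers", "xfy:9092"); // reporter's placeholder broker
        prop.setProperty("group.id", "test");
        // Kafka's property name is "isolation.level" (dotted). With the default
        // "read_uncommitted", records from aborted/unfinished transactions are
        // visible to the consumer and can look like duplicates after a fail-over.
        prop.setProperty("isolation.level", "read_committed");
        return prop;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("isolation.level"));
    }
}
```

With {{read_committed}} set correctly, the downstream consumer only sees records from committed transactions, which is the guarantee the EXACTLY_ONCE sink provides.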
[jira] [Updated] (FLINK-21992) Fix availability notification in UnionInputGate
[ https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-21992: Summary: Fix availability notification in UnionInputGate (was: Investigate potential buffer leak in unaligned checkpoint) > Fix availability notification in UnionInputGate > --- > > Key: FLINK-21992 > URL: https://issues.apache.org/jira/browse/FLINK-21992 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Arvid Heise >Assignee: Piotr Nowojski >Priority: Blocker > > A user on mailing list reported that his job gets stuck with unaligned > checkpoint enabled. > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html > We received two similar reports in the past, but the users didn't follow up, > so it was not as easy to diagnose as this time where the initial report > already contains many relevant data points. > Beside a buffer leak, there could also be an issue with priority notification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21992) Investigate potential buffer leak in unaligned checkpoint
[ https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319986#comment-17319986 ] Arvid Heise commented on FLINK-21992: - It turns out that there is an issue with notification. We managed to reliably reproduce it with: * Unaligned checkpoints with * Unions going into * Two-input tasks. The root cause is a bug in {{UnionInputGate}} introduced in FLINK-19026. The available notification of {{UnionInputGate}} is simply reset too early, leading to stuck tasks. The bug can probably also be triggered with single-input tasks, but there are certain factors that mask the bug: if you drain a union gate entirely without looking at availability after the first buffer, the bug would not be visible. Since we hot-loop at plenty of places until running out of data, it might be that just the combination of the three things actually makes it visible. > Investigate potential buffer leak in unaligned checkpoint > - > > Key: FLINK-21992 > URL: https://issues.apache.org/jira/browse/FLINK-21992 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Arvid Heise >Assignee: Piotr Nowojski >Priority: Blocker > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
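The general shape of a "reset too early" availability bug can be sketched outside Flink (the GateSketch class below is illustrative, not UnionInputGate's actual code): an input gate hands out an availability future that producers complete when data arrives; if the consumer swaps in a fresh future without checking, under the same lock, whether data is still queued, a notification that raced in between completes the old future while the new one stays incomplete forever, so the task blocks on a future nobody will complete. The safe pattern only resets after confirming emptiness:

```java
import java.util.ArrayDeque;
import java.util.concurrent.CompletableFuture;

public class GateSketch {
    private final ArrayDeque<Integer> queue = new ArrayDeque<>();
    private CompletableFuture<Void> available = new CompletableFuture<>();

    synchronized void push(int x) {
        queue.add(x);
        available.complete(null);   // wake up the consumer
    }

    synchronized Integer poll() {
        Integer x = queue.poll();
        if (queue.isEmpty()) {
            // Reset availability only after confirming, under the same lock,
            // that the queue is drained. Resetting "too early" (before the
            // emptiness check, or outside the lock) can lose a concurrent
            // push notification -- the stuck-task symptom described above.
            available = new CompletableFuture<>();
        }
        return x;
    }

    synchronized CompletableFuture<Void> getAvailableFuture() {
        return available;
    }

    public static void main(String[] args) {
        GateSketch gate = new GateSketch();
        gate.push(1);
        System.out.println(gate.poll());                        // 1
        System.out.println(gate.getAvailableFuture().isDone()); // false: gate drained, future reset
        gate.push(2);
        System.out.println(gate.getAvailableFuture().isDone()); // true: new data completed the new future
    }
}
```

This also matches the comment's observation about hot-looping: a consumer that drains the gate entirely without ever re-reading the availability future between buffers never observes the prematurely reset future, which is why the bug needed the union/two-input combination to surface.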
[jira] [Commented] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure
[ https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319992#comment-17319992 ] Arvid Heise commented on FLINK-22173: - I'm expecting some connection to FLINK-21992. > UnalignedCheckpointRescaleITCase fails on azure > --- > > Key: FLINK-22173 > URL: https://issues.apache.org/jira/browse/FLINK-22173 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232=logs=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865=9628 > {code} > 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase > 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no > scale union from 7 to > 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase) > Time elapsed: 607.467 s <<< ERROR! > 2021-04-08T23:25:56.3133586Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-08T23:25:56.3134070Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-08T23:25:56.3134643Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-08T23:25:56.3135577Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368) > 2021-04-08T23:25:56.3138843Z at > sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) > 2021-04-08T23:25:56.3139402Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-08T23:25:56.3139880Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-08T23:25:56.3140328Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-08T23:25:56.3140844Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-08T23:25:56.3141768Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-08T23:25:56.3142272Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-08T23:25:56.3142706Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-08T23:25:56.3143142Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-08T23:25:56.3143608Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-08T23:25:56.3144039Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-08T23:25:56.3144434Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-08T23:25:56.3145027Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-08T23:25:56.3145484Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-08T23:25:56.3145981Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-08T23:25:56.3146421Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-08T23:25:56.3146843Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-08T23:25:56.3147274Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-08T23:25:56.3147692Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-08T23:25:56.3148116Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-08T23:25:56.3148543Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-08T23:25:56.3148930Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-08T23:25:56.3149298Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-08T23:25:56.3149663Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-08T23:25:56.3150075Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-08T23:25:56.3150488Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-08T23:25:56.3151148Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-08T23:25:56.3151691Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-08T23:25:56.3152115Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-08T23:25:56.3152534Z
[jira] [Assigned] (FLINK-21992) Fix availability notification in UnionInputGate
[ https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise reassigned FLINK-21992: --- Assignee: Arvid Heise (was: Piotr Nowojski) > Fix availability notification in UnionInputGate > --- > > Key: FLINK-21992 > URL: https://issues.apache.org/jira/browse/FLINK-21992 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Blocker > Labels: pull-request-available > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321282#comment-17321282 ] Arvid Heise edited comment on FLINK-22259 at 4/15/21, 7:25 AM: --- -I guess the test assumes that the enumerator never fails (it has transient state). The test should also persist the transient state.- The enumerator state is correct {noformat} snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, numCompletedCheckpoints=11} {noformat} It's just that the sync event is never reaching the reader caused by FLINK-18071 (or FLINK-21996). was (Author: aheise): -I guess the test assumes that the enumerator never fails (it has transient state). The test should also persist the transient state.- The enumerator state is correct {noformat} snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, numCompletedCheckpoints=11} {noformat} It's just that the sync event is never reaching the reader caused by FLINK-21996 . > UnalignedCheckpointITCase fails with "Value too large for header, this > indicates that the test is running too long" > --- > > Key: FLINK-22259 > URL: https://issues.apache.org/jira/browse/FLINK-22259 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: test-stability > Fix For: 1.13.0 > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672] > > {code:java} > 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p > = 1, timeout = > 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time > elapsed: 1,420.285 s <<< ERROR! > 2021-04-13T07:37:31.9395135Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-13T07:37:31.9395717Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-13T07:37:31.9396274Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-13T07:37:31.9396866Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274) > 2021-04-13T07:37:31.9397318Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2021-04-13T07:37:31.9397723Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2021-04-13T07:37:31.9398312Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-13T07:37:31.9398724Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-13T07:37:31.9401916Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-13T07:37:31.9402764Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-13T07:37:31.9403756Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-13T07:37:31.9404222Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-13T07:37:31.9404624Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-13T07:37:31.9405008Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9405449Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-13T07:37:31.9405855Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-13T07:37:31.9406362Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-13T07:37:31.9406774Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-13T07:37:31.9407512Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-13T07:37:31.9408202Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-13T07:37:31.9408655Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9409083Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9409521Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9410114Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9410775Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9411398Z at >
[jira] [Comment Edited] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321282#comment-17321282 ] Arvid Heise edited comment on FLINK-22259 at 4/15/21, 7:08 AM: --- -I guess the test assumes that the enumerator never fails (it has transient state). The test should also persist the transient state.- The enumerator state is correct {noformat} snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, numCompletedCheckpoints=11} {noformat} It's just that the sync event is never reaching the reader caused by FLINK-21996 . was (Author: aheise): I guess the test assumes that the enumerator never fails (it has transient state). The test should also persist the transient state. > UnalignedCheckpointITCase fails with "Value too large for header, this > indicates that the test is running too long" > --- > > Key: FLINK-22259 > URL: https://issues.apache.org/jira/browse/FLINK-22259 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: test-stability > Fix For: 1.13.0 > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672] > > {code:java} > 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p > = 1, timeout = > 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time > elapsed: 1,420.285 s <<< ERROR! > 2021-04-13T07:37:31.9395135Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-13T07:37:31.9395717Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-13T07:37:31.9396274Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-13T07:37:31.9396866Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274) > 2021-04-13T07:37:31.9397318Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2021-04-13T07:37:31.9397723Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2021-04-13T07:37:31.9398312Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-13T07:37:31.9398724Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-13T07:37:31.9401916Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-13T07:37:31.9402764Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-13T07:37:31.9403756Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-13T07:37:31.9404222Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-13T07:37:31.9404624Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-13T07:37:31.9405008Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9405449Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-13T07:37:31.9405855Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-13T07:37:31.9406362Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-13T07:37:31.9406774Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-13T07:37:31.9407512Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-13T07:37:31.9408202Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-13T07:37:31.9408655Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9409083Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9409521Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9410114Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9410775Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9411398Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-13T07:37:31.9411914Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-13T07:37:31.9412292Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-13T07:37:31.9412670Z at >
[jira] [Resolved] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure
[ https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-22173. - Assignee: Arvid Heise Resolution: Cannot Reproduce > UnalignedCheckpointRescaleITCase fails on azure > --- > > Key: FLINK-22173 > URL: https://issues.apache.org/jira/browse/FLINK-22173 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232=logs=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865=9628 > {code} > 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase > 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no > scale union from 7 to > 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase) > Time elapsed: 607.467 s <<< ERROR! > 2021-04-08T23:25:56.3133586Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-08T23:25:56.3134070Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-08T23:25:56.3134643Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-08T23:25:56.3135577Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368) > 2021-04-08T23:25:56.3138843Z at > sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) > 2021-04-08T23:25:56.3139402Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-08T23:25:56.3139880Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-08T23:25:56.3140328Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-08T23:25:56.3140844Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-08T23:25:56.3141768Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-08T23:25:56.3142272Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-08T23:25:56.3142706Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-08T23:25:56.3143142Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-08T23:25:56.3143608Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-08T23:25:56.3144039Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-08T23:25:56.3144434Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-08T23:25:56.3145027Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-08T23:25:56.3145484Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-08T23:25:56.3145981Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-08T23:25:56.3146421Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-08T23:25:56.3146843Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-08T23:25:56.3147274Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-08T23:25:56.3147692Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-08T23:25:56.3148116Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-08T23:25:56.3148543Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-08T23:25:56.3148930Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-08T23:25:56.3149298Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-08T23:25:56.3149663Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-08T23:25:56.3150075Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-08T23:25:56.3150488Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-08T23:25:56.3151148Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-08T23:25:56.3151691Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-08T23:25:56.3152115Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) >
[jira] [Commented] (FLINK-22019) UnalignedCheckpointRescaleITCase hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322118#comment-17322118 ] Arvid Heise commented on FLINK-22019: - Let's wait for FLINK-21346 for more investigation. > UnalignedCheckpointRescaleITCase hangs on azure > --- > > Key: FLINK-22019 > URL: https://issues.apache.org/jira/browse/FLINK-22019 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15658=logs=a57e0635-3fad-5b08-57c7-a4142d7d6fa9=5360d54c-8d94-5d85-304e-a89267eb785a=9347 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322030#comment-17322030 ] Arvid Heise commented on FLINK-22259: - I'd close this issue in the hope that the upcoming fixes for FLINK-18071 and FLINK-21996 will solve it automatically. Please reopen if there are failures in the next week. > UnalignedCheckpointITCase fails with "Value too large for header, this > indicates that the test is running too long" > --- > > Key: FLINK-22259 > URL: https://issues.apache.org/jira/browse/FLINK-22259 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: test-stability > Fix For: 1.13.0 > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672] > > {code:java} > 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p > = 1, timeout = > 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time > elapsed: 1,420.285 s <<< ERROR! > 2021-04-13T07:37:31.9395135Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-13T07:37:31.9395717Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-13T07:37:31.9396274Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-13T07:37:31.9396866Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274) > 2021-04-13T07:37:31.9397318Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2021-04-13T07:37:31.9397723Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2021-04-13T07:37:31.9398312Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-13T07:37:31.9398724Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-13T07:37:31.9401916Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-13T07:37:31.9402764Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-13T07:37:31.9403756Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-13T07:37:31.9404222Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-13T07:37:31.9404624Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-13T07:37:31.9405008Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9405449Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-13T07:37:31.9405855Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-13T07:37:31.9406362Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-13T07:37:31.9406774Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-13T07:37:31.9407512Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-13T07:37:31.9408202Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-13T07:37:31.9408655Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9409083Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9409521Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9410114Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9410775Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9411398Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-13T07:37:31.9411914Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-13T07:37:31.9412292Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-13T07:37:31.9412670Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9413097Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9413538Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9413964Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9414405Z at >
[jira] [Closed] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise closed FLINK-22259. --- Resolution: Duplicate > UnalignedCheckpointITCase fails with "Value too large for header, this > indicates that the test is running too long" > --- > > Key: FLINK-22259 > URL: https://issues.apache.org/jira/browse/FLINK-22259 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: test-stability > Fix For: 1.13.0 > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672] > > {code:java} > 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p > = 1, timeout = > 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time > elapsed: 1,420.285 s <<< ERROR! > 2021-04-13T07:37:31.9395135Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-13T07:37:31.9395717Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-13T07:37:31.9396274Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-13T07:37:31.9396866Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274) > 2021-04-13T07:37:31.9397318Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2021-04-13T07:37:31.9397723Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2021-04-13T07:37:31.9398312Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-13T07:37:31.9398724Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-13T07:37:31.9401916Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-13T07:37:31.9402764Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-13T07:37:31.9403756Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-13T07:37:31.9404222Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-13T07:37:31.9404624Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-13T07:37:31.9405008Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9405449Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-13T07:37:31.9405855Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-13T07:37:31.9406362Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-13T07:37:31.9406774Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-13T07:37:31.9407512Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-13T07:37:31.9408202Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-13T07:37:31.9408655Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9409083Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9409521Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9410114Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9410775Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9411398Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-13T07:37:31.9411914Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-13T07:37:31.9412292Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-13T07:37:31.9412670Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9413097Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9413538Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9413964Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9414405Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9414834Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9415263Z at >
[jira] [Commented] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure
[ https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322038#comment-17322038 ] Arvid Heise commented on FLINK-22173: - Given the frequency and the lack of logs, I'm closing it. There is a high chance that this is also caused by either FLINK-21992 or FLINK-18071. Let's reopen it if it still reappears after these fixes, and hopefully after FLINK-21346 gives us the needed logs. > UnalignedCheckpointRescaleITCase fails on azure > --- > > Key: FLINK-22173 > URL: https://issues.apache.org/jira/browse/FLINK-22173 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232=logs=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865=9628 > {code} > 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase > 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no > scale union from 7 to > 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase) > Time elapsed: 607.467 s <<< ERROR! > 2021-04-08T23:25:56.3133586Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-08T23:25:56.3134070Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-08T23:25:56.3134643Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-08T23:25:56.3135577Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368) > 2021-04-08T23:25:56.3138843Z at > sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) > 2021-04-08T23:25:56.3139402Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-08T23:25:56.3139880Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-08T23:25:56.3140328Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-08T23:25:56.3140844Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-08T23:25:56.3141768Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-08T23:25:56.3142272Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-08T23:25:56.3142706Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-08T23:25:56.3143142Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-08T23:25:56.3143608Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-08T23:25:56.3144039Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-08T23:25:56.3144434Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-08T23:25:56.3145027Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-08T23:25:56.3145484Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-08T23:25:56.3145981Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-08T23:25:56.3146421Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-08T23:25:56.3146843Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-08T23:25:56.3147274Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-08T23:25:56.3147692Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-08T23:25:56.3148116Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-08T23:25:56.3148543Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-08T23:25:56.3148930Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-08T23:25:56.3149298Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-08T23:25:56.3149663Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-08T23:25:56.3150075Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-08T23:25:56.3150488Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-08T23:25:56.3151148Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) >
[jira] [Created] (FLINK-22290) Add new unaligned checkpoint options to Python API.
Arvid Heise created FLINK-22290: --- Summary: Add new unaligned checkpoint options to Python API. Key: FLINK-22290 URL: https://issues.apache.org/jira/browse/FLINK-22290 Project: Flink Issue Type: Improvement Components: API / Python, Runtime / Checkpointing Reporter: Arvid Heise Assignee: Arvid Heise Fix For: 1.13.0 There is currently no Python equivalent of {noformat} CheckpointConfig#setAlignmentTimeout CheckpointConfig#setForceUnalignedCheckpoints {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
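For illustration, the two missing options could look roughly like the sketch below. This is a purely hypothetical stand-in class: the class name, method names, and internals are invented here to mirror the Java setters CheckpointConfig#setAlignmentTimeout and CheckpointConfig#setForceUnalignedCheckpoints, and are not taken from PyFlink.

```python
from datetime import timedelta


class CheckpointConfigSketch:
    """Hypothetical stand-in for a Python checkpoint config (not PyFlink API)."""

    def __init__(self):
        self._alignment_timeout = timedelta(0)
        self._force_unaligned = False

    def set_alignment_timeout(self, timeout: timedelta) -> None:
        # In the Java API, a checkpoint that stays aligned longer than this
        # timeout is switched over to an unaligned checkpoint.
        self._alignment_timeout = timeout

    def set_force_unaligned_checkpoints(self, forced: bool) -> None:
        # In the Java API, this lifts the sanity checks that normally forbid
        # unaligned checkpoints for certain job topologies.
        self._force_unaligned = forced


config = CheckpointConfigSketch()
config.set_alignment_timeout(timedelta(seconds=30))
config.set_force_unaligned_checkpoints(True)
```

The sketch only shows the shape of the API surface the issue asks for; the eventual PyFlink naming may differ.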
[jira] [Updated] (FLINK-22290) Add new unaligned checkpoint options to Python API.
[ https://issues.apache.org/jira/browse/FLINK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-22290: Affects Version/s: 1.13.0 > Add new unaligned checkpoint options to Python API. > --- > > Key: FLINK-22290 > URL: https://issues.apache.org/jira/browse/FLINK-22290 > Project: Flink > Issue Type: Improvement > Components: API / Python, Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > Fix For: 1.13.0 > > > There is currently no Python equivalent of > {noformat} > CheckpointConfig#setAlignmentTimeout > CheckpointConfig#setForceUnalignedCheckpoints > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318485#comment-17318485 ] Arvid Heise commented on FLINK-20816: - Downgraded to Major as it's a test-only issue. > NotifyCheckpointAbortedITCase failed due to timeout > --- > > Key: FLINK-20816 > URL: https://issues.apache.org/jira/browse/FLINK-20816 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Matthias >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.13.0 > > Attachments: flink-20816-failure.log, flink-20816-success.log > > > [This > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245] > failed caused by a failing of {{NotifyCheckpointAbortedITCase}} due to a > timeout. > {code} > 2020-12-29T21:48:40.9430511Z [INFO] Running > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087961Z [ERROR] > testNotifyCheckpointAborted[unalignedCheckpointEnabled > =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase) > Time elapsed: 104.044 s <<< ERROR! 
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: > test timed out after 10 milliseconds > 2020-12-29T21:50:28.0088972Z at java.lang.Object.wait(Native Method) > 2020-12-29T21:50:28.0089267Z at java.lang.Object.wait(Object.java:502) > 2020-12-29T21:50:28.0089633Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61) > 2020-12-29T21:50:28.0090458Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200) > 2020-12-29T21:50:28.0091313Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183) > 2020-12-29T21:50:28.0091819Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-12-29T21:50:28.0092199Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-12-29T21:50:28.0092675Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-12-29T21:50:28.0093095Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-12-29T21:50:28.0093495Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-12-29T21:50:28.0093980Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-12-29T21:50:28.009Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-12-29T21:50:28.0094917Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-12-29T21:50:28.0095663Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > 2020-12-29T21:50:28.0096221Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > 2020-12-29T21:50:28.0096675Z at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > 2020-12-29T21:50:28.0097022Z 
at java.lang.Thread.run(Thread.java:748) > {code} > The branch contained changes from FLINK-20594 and FLINK-20595. These issues > remove code that is not used anymore and should have affected only unit > tests. [The previous > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results] > containing all the changes except for > [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279] > passed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-20816: Priority: Major (was: Critical) > NotifyCheckpointAbortedITCase failed due to timeout > --- > > Key: FLINK-20816 > URL: https://issues.apache.org/jira/browse/FLINK-20816 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Matthias >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.13.0 > > Attachments: flink-20816-failure.log, flink-20816-success.log > > > [This > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245] > failed caused by a failing of {{NotifyCheckpointAbortedITCase}} due to a > timeout. > {code} > 2020-12-29T21:48:40.9430511Z [INFO] Running > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087961Z [ERROR] > testNotifyCheckpointAborted[unalignedCheckpointEnabled > =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase) > Time elapsed: 104.044 s <<< ERROR! 
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: > test timed out after 10 milliseconds > 2020-12-29T21:50:28.0088972Z at java.lang.Object.wait(Native Method) > 2020-12-29T21:50:28.0089267Z at java.lang.Object.wait(Object.java:502) > 2020-12-29T21:50:28.0089633Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61) > 2020-12-29T21:50:28.0090458Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200) > 2020-12-29T21:50:28.0091313Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183) > 2020-12-29T21:50:28.0091819Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-12-29T21:50:28.0092199Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-12-29T21:50:28.0092675Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-12-29T21:50:28.0093095Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-12-29T21:50:28.0093495Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-12-29T21:50:28.0093980Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-12-29T21:50:28.009Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-12-29T21:50:28.0094917Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-12-29T21:50:28.0095663Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > 2020-12-29T21:50:28.0096221Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > 2020-12-29T21:50:28.0096675Z at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > 2020-12-29T21:50:28.0097022Z 
at java.lang.Thread.run(Thread.java:748) > {code} > The branch contained changes from FLINK-20594 and FLINK-20595. These issues > remove code that is not used anymore and should have affected only unit > tests. [The previous > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results] > containing all the changes except for > [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279] > passed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-19801) Add support for virtual channels
[ https://issues.apache.org/jira/browse/FLINK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-19801: Release Note: (was: While recovering from unaligned checkpoints, users can now change the parallelism of the job. This change allows users to quickly upscale the job under backpressure.) > Add support for virtual channels > > > Key: FLINK-19801 > URL: https://issues.apache.org/jira/browse/FLINK-19801 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing >Affects Versions: 1.12.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > > During rescaling of unaligned checkpoints, if state from multiple former > channels are read on input or output side to recover a specific channel, then > these buffers are multiplexed on output side and demultiplexed on input side > to guarantee a consistent recovery of spanning records: > Assume two channels C1, C2 connect operator A and B and both have one buffer > in the output and in the input part of the channel respectively, where a > record spans. Assume that the buffers are named O1 for output buffer of C1 > and I2 for input buffer of C2 etc. Then after rescaling both channels become > one channel C. Then, the buffers may be restored as I1, I2, O1, O2. > Channels use the mapping of FLINK-19533 to infer the need for virtual > channels and distribute the needed resources. Virtual channels are removed on > the EndOfChannelRecovery epoch marker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
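The demultiplexing idea from the FLINK-19801 description (buffers I1, I2, O1, O2 from former channels C1 and C2 merged into one channel C) can be sketched with a small, self-contained example. This is not Flink code: the channel names follow the issue text, while the delimiter-based record format and all function names are illustrative assumptions.

```python
def demultiplex(buffers):
    """Group recovered (channel_id, chunk) buffers into per-channel byte streams.

    Keeping chunks separated by their former channel is what allows a record
    that spans two buffers of the same channel to be reassembled consistently.
    """
    streams = {}
    for channel_id, chunk in buffers:
        streams.setdefault(channel_id, bytearray()).extend(chunk)
    return streams


def reassemble(stream, delimiter=b";"):
    """Split one per-channel stream into its complete, delimiter-terminated records."""
    return [r for r in bytes(stream).split(delimiter) if r]


# After rescaling, C1 and C2 become one channel C; the recovered state
# interleaves buffers from both former channels, including a record that
# spans two C1 buffers:
recovered = [
    ("C1", b"rec"),     # first half of a record spanning two C1 buffers
    ("C2", b"other;"),  # a complete record from C2
    ("C1", b"ord1;"),   # second half of the spanning C1 record
]
streams = demultiplex(recovered)
print(reassemble(streams["C1"]))  # [b'record1']
print(reassemble(streams["C2"]))  # [b'other']
```

Without the per-channel grouping, the interleaved chunks would concatenate into `b"recother;ord1;"` and the spanning record would be corrupted, which is exactly the inconsistency the virtual-channel multiplexing guards against.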
[jira] [Resolved] (FLINK-17979) Support rescaling for Unaligned Checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-17979. - Resolution: Fixed Solved with FLINK-19533, FLINK-19801, and FLINK-21945. > Support rescaling for Unaligned Checkpoints > --- > > Key: FLINK-17979 > URL: https://issues.apache.org/jira/browse/FLINK-17979 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing >Reporter: Roman Khachatryan >Priority: Major > Fix For: 1.13.0 > > > This is one of the limitations of Unaligned Checkpoints MVP. > (see [Unaligned checkpoints: recovery & > rescaling|https://docs.google.com/document/d/1T2WB163uf8xt6Eu2JS0Jyy2XZyF4YpnzGiHlo6twrks/edit?usp=sharing] > for possible options) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17979) Support rescaling for Unaligned Checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-17979: Release Note: While recovering from unaligned checkpoints, users can now change the parallelism of the job. This change allows users to quickly upscale the job under backpressure. > Support rescaling for Unaligned Checkpoints > --- > > Key: FLINK-17979 > URL: https://issues.apache.org/jira/browse/FLINK-17979 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing >Reporter: Roman Khachatryan >Priority: Major > Fix For: 1.13.0 > > > This is one of the limitations of Unaligned Checkpoints MVP. > (see [Unaligned checkpoints: recovery & > rescaling|https://docs.google.com/document/d/1T2WB163uf8xt6Eu2JS0Jyy2XZyF4YpnzGiHlo6twrks/edit?usp=sharing] > for possible options) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-20816. - Resolution: Fixed > NotifyCheckpointAbortedITCase failed due to timeout > --- > > Key: FLINK-20816 > URL: https://issues.apache.org/jira/browse/FLINK-20816 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Matthias >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.13.0 > > Attachments: flink-20816-failure.log, flink-20816-success.log > > > [This > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245] > failed due to a timeout in > {{NotifyCheckpointAbortedITCase}}. > {code} > 2020-12-29T21:48:40.9430511Z [INFO] Running > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087961Z [ERROR] > testNotifyCheckpointAborted[unalignedCheckpointEnabled > =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase) > Time elapsed: 104.044 s <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: > test timed out after 10 milliseconds > 2020-12-29T21:50:28.0088972Z at java.lang.Object.wait(Native Method) > 2020-12-29T21:50:28.0089267Z at java.lang.Object.wait(Object.java:502) > 2020-12-29T21:50:28.0089633Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61) > 2020-12-29T21:50:28.0090458Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200) > 2020-12-29T21:50:28.0091313Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183) > 2020-12-29T21:50:28.0091819Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-12-29T21:50:28.0092199Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-12-29T21:50:28.0092675Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-12-29T21:50:28.0093095Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-12-29T21:50:28.0093495Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-12-29T21:50:28.0093980Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-12-29T21:50:28.009Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-12-29T21:50:28.0094917Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-12-29T21:50:28.0095663Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > 2020-12-29T21:50:28.0096221Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > 2020-12-29T21:50:28.0096675Z at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > 2020-12-29T21:50:28.0097022Z 
at java.lang.Thread.run(Thread.java:748) > {code} > The branch contained changes from FLINK-20594 and FLINK-20595. These issues > remove code that is no longer used and should have affected only unit > tests. [The previous > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results] > containing all the changes except for > [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279] > passed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319664#comment-17319664 ] Arvid Heise commented on FLINK-20816: - Merged into master as fad4874f9866de7d3c2f5fb3a473f4df744c8159 and into 1.12 as a1ee66d9ef9a14414b9c0fee9288a94685740471. > NotifyCheckpointAbortedITCase failed due to timeout > --- > > Key: FLINK-20816 > URL: https://issues.apache.org/jira/browse/FLINK-20816 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Matthias >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available, test-stability > Fix For: 1.13.0 > > Attachments: flink-20816-failure.log, flink-20816-success.log > > > [This > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245] > failed due to a timeout in > {{NotifyCheckpointAbortedITCase}}. > {code} > 2020-12-29T21:48:40.9430511Z [INFO] Running > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase > 2020-12-29T21:50:28.0087961Z [ERROR] > testNotifyCheckpointAborted[unalignedCheckpointEnabled > =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase) > Time elapsed: 104.044 s <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: > test timed out after 10 milliseconds > 2020-12-29T21:50:28.0088972Z at java.lang.Object.wait(Native Method) > 2020-12-29T21:50:28.0089267Z at java.lang.Object.wait(Object.java:502) > 2020-12-29T21:50:28.0089633Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61) > 2020-12-29T21:50:28.0090458Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200) > 2020-12-29T21:50:28.0091313Z at > org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183) > 2020-12-29T21:50:28.0091819Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-12-29T21:50:28.0092199Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-12-29T21:50:28.0092675Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-12-29T21:50:28.0093095Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-12-29T21:50:28.0093495Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-12-29T21:50:28.0093980Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-12-29T21:50:28.009Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-12-29T21:50:28.0094917Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-12-29T21:50:28.0095663Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > 2020-12-29T21:50:28.0096221Z at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > 2020-12-29T21:50:28.0096675Z at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > 2020-12-29T21:50:28.0097022Z 
at java.lang.Thread.run(Thread.java:748) > {code} > The branch contained changes from FLINK-20594 and FLINK-20595. These issues > remove code that is no longer used and should have affected only unit > tests. [The previous > build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results] > containing all the changes except for > [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279] > passed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320466#comment-17320466 ] Arvid Heise commented on FLINK-22259: - Seems to be test-only: {noformat} 07:36:19,541 [SourceCoordinator-Source: source] INFO org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase [] - snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, numCompletedCheckpoints=14187} {noformat} We have enough restarts and completed checkpoints but for some reason the test is not finishing. > UnalignedCheckpointITCase fails with "Value too large for header, this > indicates that the test is running too long" > --- > > Key: FLINK-22259 > URL: https://issues.apache.org/jira/browse/FLINK-22259 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Major > Labels: test-stability > Fix For: 1.13.0 > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672] > > {code:java} > 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p > = 1, timeout = > 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time > elapsed: 1,420.285 s <<< ERROR! > 2021-04-13T07:37:31.9395135Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-13T07:37:31.9395717Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-13T07:37:31.9396274Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-13T07:37:31.9396866Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274) > 2021-04-13T07:37:31.9397318Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2021-04-13T07:37:31.9397723Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2021-04-13T07:37:31.9398312Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-13T07:37:31.9398724Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-13T07:37:31.9401916Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-13T07:37:31.9402764Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-13T07:37:31.9403756Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-13T07:37:31.9404222Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-13T07:37:31.9404624Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-13T07:37:31.9405008Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9405449Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-13T07:37:31.9405855Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-13T07:37:31.9406362Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-13T07:37:31.9406774Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-13T07:37:31.9407512Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-13T07:37:31.9408202Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-13T07:37:31.9408655Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9409083Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9409521Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9410114Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9410775Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9411398Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-13T07:37:31.9411914Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-13T07:37:31.9412292Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-13T07:37:31.9412670Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9413097Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9413538Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) >
[jira] [Commented] (FLINK-21992) Fix availability notification in UnionInputGate
[ https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321096#comment-17321096 ] Arvid Heise commented on FLINK-21992: - Merged into master as 7c3abe11a28d54a585985ef908f36d8cf5857e14. > Fix availability notification in UnionInputGate > --- > > Key: FLINK-21992 > URL: https://issues.apache.org/jira/browse/FLINK-21992 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Blocker > Labels: pull-request-available > Fix For: 1.13.0, 1.12.3 > > > A user on the mailing list reported that their job gets stuck with unaligned > checkpoints enabled. > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html > We received two similar reports in the past, but the users didn't follow up, > so those were not as easy to diagnose as this time, where the initial report > already contained many relevant data points. > Besides a buffer leak, there could also be an issue with priority notification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-21992) Fix availability notification in UnionInputGate
[ https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321279#comment-17321279 ] Arvid Heise commented on FLINK-21992: - Merged into 1.12 as e2cbfad6a3cdafe3d568bb43a1d048f9533b29ec. > Fix availability notification in UnionInputGate > --- > > Key: FLINK-21992 > URL: https://issues.apache.org/jira/browse/FLINK-21992 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Blocker > Labels: pull-request-available > Fix For: 1.13.0, 1.12.3 > > > A user on the mailing list reported that their job gets stuck with unaligned > checkpoints enabled. > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html > We received two similar reports in the past, but the users didn't follow up, > so those were not as easy to diagnose as this time, where the initial report > already contained many relevant data points. > Besides a buffer leak, there could also be an issue with priority notification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-21992) Fix availability notification in UnionInputGate
[ https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-21992. - Resolution: Fixed > Fix availability notification in UnionInputGate > --- > > Key: FLINK-21992 > URL: https://issues.apache.org/jira/browse/FLINK-21992 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.12.2, 1.13.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Blocker > Labels: pull-request-available > Fix For: 1.13.0, 1.12.3 > > > A user on the mailing list reported that their job gets stuck with unaligned > checkpoints enabled. > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html > We received two similar reports in the past, but the users didn't follow up, > so those were not as easy to diagnose as this time, where the initial report > already contained many relevant data points. > Besides a buffer leak, there could also be an issue with priority notification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321282#comment-17321282 ] Arvid Heise commented on FLINK-22259: - I guess the test assumes that the enumerator never fails (it has transient state). The test should also persist the transient state. > UnalignedCheckpointITCase fails with "Value too large for header, this > indicates that the test is running too long" > --- > > Key: FLINK-22259 > URL: https://issues.apache.org/jira/browse/FLINK-22259 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: test-stability > Fix For: 1.13.0 > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672] > > {code:java} > 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p > = 1, timeout = > 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time > elapsed: 1,420.285 s <<< ERROR! > 2021-04-13T07:37:31.9395135Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-13T07:37:31.9395717Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-13T07:37:31.9396274Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-13T07:37:31.9396866Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274) > 2021-04-13T07:37:31.9397318Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2021-04-13T07:37:31.9397723Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2021-04-13T07:37:31.9398312Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-13T07:37:31.9398724Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-13T07:37:31.9401916Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-13T07:37:31.9402764Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-13T07:37:31.9403756Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-13T07:37:31.9404222Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-13T07:37:31.9404624Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-13T07:37:31.9405008Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9405449Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-13T07:37:31.9405855Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-13T07:37:31.9406362Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-13T07:37:31.9406774Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-13T07:37:31.9407512Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-13T07:37:31.9408202Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-13T07:37:31.9408655Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9409083Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9409521Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9410114Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9410775Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9411398Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-13T07:37:31.9411914Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-13T07:37:31.9412292Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-13T07:37:31.9412670Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9413097Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9413538Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9413964Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9414405Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) >
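The diagnosis in the comment above (the enumerator's transient state, e.g. its restart counter, is lost on coordinator failover, so the "finish after N restarts" condition is never reached) can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual UnalignedCheckpointTestBase code: the fix is to make the counter part of the state returned from the enumerator's snapshot so recovery continues from the checkpointed value instead of resetting to 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the suggested fix: persist "transient" counters such
// as numRestarts in the enumerator's checkpointed state. If the counter were
// kept only in a field, a coordinator restart would reset it to 0 and the
// source's exit condition would never fire.
public class EnumeratorStateSketch {

    // State that survives checkpoints: unassigned splits AND the restart counter.
    record EnumeratorState(List<Integer> unassignedSplits, int numRestarts) {}

    static EnumeratorState snapshot(List<Integer> unassignedSplits, int numRestarts) {
        // Copy the splits and include the counter in the snapshot.
        return new EnumeratorState(new ArrayList<>(unassignedSplits), numRestarts);
    }

    static int restoredRestarts(EnumeratorState state) {
        // On recovery, continue from the checkpointed value; the recovery
        // itself counts as one more restart.
        return state.numRestarts() + 1;
    }

    public static void main(String[] args) {
        // Four restarts checkpointed; after the fifth failover the source
        // observes numRestarts == 5 and can finish.
        EnumeratorState s = snapshot(List.of(), 4);
        System.out.println(restoredRestarts(s));
    }
}
```
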
[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321208#comment-17321208 ] Arvid Heise commented on FLINK-22259: - Source is not finishing as {{numRestarts}} is out of sync (at 0 but should be at 5). > UnalignedCheckpointITCase fails with "Value too large for header, this > indicates that the test is running too long" > --- > > Key: FLINK-22259 > URL: https://issues.apache.org/jira/browse/FLINK-22259 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Assignee: Arvid Heise >Priority: Major > Labels: test-stability > Fix For: 1.13.0 > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672] > > {code:java} > 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p > = 1, timeout = > 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time > elapsed: 1,420.285 s <<< ERROR! > 2021-04-13T07:37:31.9395135Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2021-04-13T07:37:31.9395717Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) > 2021-04-13T07:37:31.9396274Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168) > 2021-04-13T07:37:31.9396866Z at > org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274) > 2021-04-13T07:37:31.9397318Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2021-04-13T07:37:31.9397723Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2021-04-13T07:37:31.9398312Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2021-04-13T07:37:31.9398724Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2021-04-13T07:37:31.9401916Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2021-04-13T07:37:31.9402764Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2021-04-13T07:37:31.9403756Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2021-04-13T07:37:31.9404222Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2021-04-13T07:37:31.9404624Z at > org.junit.rules.Verifier$1.evaluate(Verifier.java:35) > 2021-04-13T07:37:31.9405008Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2021-04-13T07:37:31.9405449Z at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > 2021-04-13T07:37:31.9405855Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2021-04-13T07:37:31.9406362Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2021-04-13T07:37:31.9406774Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2021-04-13T07:37:31.9407512Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2021-04-13T07:37:31.9408202Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2021-04-13T07:37:31.9408655Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9409083Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9409521Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9410114Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9410775Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9411398Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2021-04-13T07:37:31.9411914Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2021-04-13T07:37:31.9412292Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2021-04-13T07:37:31.9412670Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2021-04-13T07:37:31.9413097Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2021-04-13T07:37:31.9413538Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2021-04-13T07:37:31.9413964Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2021-04-13T07:37:31.9414405Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2021-04-13T07:37:31.9414834Z at >
[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321280#comment-17321280 ] Arvid Heise commented on FLINK-22259: - It seems as if the {{SyncEvent}} is not sent from the coordinator although it should. I also found this exception, which I have not seen before {noformat} 07:12:41,454 [Checkpoint Timer] WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to trigger checkpoint for job 320293149b12a901fb4f3750349040db.) org.apache.flink.runtime.checkpoint.CheckpointException: Coordinator state not acknowledged successfully: DISCARDED Failure reason: Trigger checkpoint failure. at org.apache.flink.runtime.checkpoint.OperatorCoordinatorCheckpoints.acknowledgeAllCoordinators(OperatorCoordinatorCheckpoints.java:125) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.checkpoint.OperatorCoordinatorCheckpoints.lambda$triggerAndAcknowledgeAllCoordinatorCheckpoints$1(OperatorCoordinatorCheckpoints.java:86) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) ~[?:1.8.0_282] at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646) ~[?:1.8.0_282] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_282] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_282] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_282] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_282]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
	at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:532) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1920) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1907) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1782) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1765) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingAndQueuedCheckpoints(CheckpointCoordinator.java:1965) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1748) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1434) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1048) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1020) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:569) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:269) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:250) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
	at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:233)
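For context, the {{'from' must be <= 'to'}} precondition failure reported for FLINK-22053 arises when a number sequence is divided into more splits than it has elements, so the last splits end up empty with from > to. The following is a minimal, hypothetical sketch of the failure mode and an obvious guard — it is not Flink's actual NumberSequenceSource code; all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {

    static final class Split {
        final long from;
        final long to;

        Split(long from, long to) {
            // Same precondition that fails in NumberSequenceSource.NumberSequenceSplit.
            if (from > to) {
                throw new IllegalArgumentException("'from' must be <= 'to'");
            }
            this.from = from;
            this.to = to;
        }
    }

    // Splits the closed range [from, to] into at most `parallelism` chunks.
    // Without the Math.min guard, parallelism > element count would produce
    // empty chunks whose from > to, triggering the exception above.
    static List<Split> split(long from, long to, int parallelism) {
        long count = to - from + 1;
        int numSplits = (int) Math.min(parallelism, count); // the guard
        List<Split> splits = new ArrayList<>();
        long base = count / numSplits;
        long remainder = count % numSplits;
        long start = from;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < remainder ? 1 : 0);
            splits.add(new Split(start, start + size - 1));
            start += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // 3 elements at parallelism 8: only 3 non-empty splits are created.
        System.out.println(split(0, 2, 8).size()); // prints 3
    }
}
```

With the guard, idle readers simply receive no split instead of an invalid one.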
[jira] [Assigned] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"
[ https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise reassigned FLINK-22259:
-----------------------------------

    Assignee: Arvid Heise

> UnalignedCheckpointITCase fails with "Value too large for header, this
> indicates that the test is running too long"
> ----------------------------------------------------------------------
>
>                 Key: FLINK-22259
>                 URL: https://issues.apache.org/jira/browse/FLINK-22259
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Arvid Heise
>            Priority: Major
>              Labels: test-stability
>             Fix For: 1.13.0
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672]
>
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p = 1, timeout = 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time elapsed: 1,420.285 s <<< ERROR!
> 2021-04-13T07:37:31.9395135Z org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z 	at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z 	at org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z 	at org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z 	at java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z 	at org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z 	at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z 	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z 	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z 	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z 	at org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z 	at org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9413097Z 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9413538Z 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9413964Z 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9414405Z 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9414834Z 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9415263Z 	at
[jira] [Comment Edited] (FLINK-22368) UnalignedCheckpointITCase hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325641#comment-17325641 ]

Arvid Heise edited comment on FLINK-22368 at 4/20/21, 9:30 AM:
---------------------------------------------------------------

The test doesn't finish as checkpointing gets stuck in the last execution attempt (5):
{noformat}
23:02:26,104 [flink-akka.actor.default-dispatcher-4] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Flink Streaming Job (5d70bcb288d90589845e39c2953b27c3) switched from state RESTARTING to RUNNING.
23:02:26,118 [flink-akka.actor.default-dispatcher-4] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: source (2/20) (2d3357d530b11041d123bde87da7584b) switched from INITIALIZING to RUNNING.
... (in total all 100 tasks are RUNNING)
23:02:26,347 [flink-akka.actor.default-dispatcher-2] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - failing-map (10/20) (23870b8b94e5ea774ca3da72a7ca7251) switched from INITIALIZING to RUNNING.
...
23:02:27,165 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint Failure reason: Not all required tasks are currently running.
23:02:28,165 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint Failure reason: Not all required tasks are currently running.
... (in total 10k failed to trigger...)
01:55:56,165 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint Failure reason: Not all required tasks are currently running.
{noformat}

> UnalignedCheckpointITCase hangs on azure
> ----------------------------------------
>
>                 Key: FLINK-22368
>                 URL: https://issues.apache.org/jira/browse/FLINK-22368
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.13.0
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
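The repeated "Failed to trigger checkpoint ... Not all required tasks are currently running" message stems from a trigger-time precondition along the lines of the following sketch. This is a simplified illustration with assumed names, not the actual CheckpointCoordinator code:

```java
import java.util.List;

public class TriggerCheck {

    enum TaskState { INITIALIZING, RUNNING, FINISHED, CANCELED }

    // A checkpoint can only be triggered while every task that must receive
    // the trigger message is still RUNNING. A single FINISHED source is
    // enough to abort every subsequent attempt until the job restarts,
    // which is why the job in the log above can never checkpoint again.
    static boolean canTriggerCheckpoint(List<TaskState> tasksToTrigger) {
        return tasksToTrigger.stream().allMatch(s -> s == TaskState.RUNNING);
    }

    public static void main(String[] args) {
        List<TaskState> tasks = List.of(TaskState.RUNNING, TaskState.FINISHED);
        System.out.println(canTriggerCheckpoint(tasks)); // prints false
    }
}
```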
[jira] [Updated] (FLINK-22368) UnalignedCheckpointITCase hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-22368: Component/s: Runtime / Checkpointing > UnalignedCheckpointITCase hangs on azure > > > Key: FLINK-22368 > URL: https://issues.apache.org/jira/browse/FLINK-22368 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22368) UnalignedCheckpointITCase hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325647#comment-17325647 ]

Arvid Heise commented on FLINK-22368:
--------------------------------------

Okay, it makes sense that it doesn't checkpoint, as the source has already finished:
{noformat}
23:02:26,166 [Source: source (3/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (3/20)#5 (f45d32db48b407a34edc6dc048c5e0c2) switched from RUNNING to FINISHED.
{noformat}
I'm investigating further.

> UnalignedCheckpointITCase hangs on azure
> ----------------------------------------
>
>                 Key: FLINK-22368
>                 URL: https://issues.apache.org/jira/browse/FLINK-22368
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.13.0
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (FLINK-22368) UnalignedCheckpointITCase hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise updated FLINK-22368: Component/s: (was: Runtime / Checkpointing) Runtime / Task > UnalignedCheckpointITCase hangs on azure > > > Key: FLINK-22368 > URL: https://issues.apache.org/jira/browse/FLINK-22368 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.13.0 >Reporter: Dawid Wysakowicz >Priority: Critical > Labels: test-stability > Fix For: 1.13.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-22368) UnalignedCheckpointITCase hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325658#comment-17325658 ]

Arvid Heise commented on FLINK-22368:
--------------------------------------

All source tasks are finished, but the job is not finishing for some reason:
{noformat}
23:02:26,166 [Source: source (3/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (3/20)#5 (f45d32db48b407a34edc6dc048c5e0c2) switched from RUNNING to FINISHED.
23:02:26,167 [Source: source (8/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (8/20)#5 (53d62fcb80e6b2e5ee7657033a555d6f) switched from RUNNING to FINISHED.
23:02:26,166 [Source: source (7/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (7/20)#5 (1147339526bf7fadcd47ef579c4c4130) switched from RUNNING to FINISHED.
23:02:26,168 [Source: source (2/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (2/20)#5 (2d3357d530b11041d123bde87da7584b) switched from RUNNING to FINISHED.
23:02:26,170 [Source: source (9/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (9/20)#5 (426fbb3a5b561a61affed4c40b0a8f8a) switched from RUNNING to FINISHED.
23:02:26,171 [Source: source (1/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (1/20)#5 (a5585ddf277047a6d2b67d8d3cf2cd0e) switched from RUNNING to FINISHED.
23:02:26,213 [Source: source (5/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (5/20)#5 (9ad44b9747f266816750e98063306fc4) switched from RUNNING to FINISHED.
23:02:26,216 [Source: source (20/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (20/20)#5 (d92ff5256f0c3e1e2a43c7413f6ce71f) switched from RUNNING to FINISHED.
23:02:26,215 [Source: source (13/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (13/20)#5 (327096e9fc8562fdc0ea12e031f98749) switched from RUNNING to FINISHED.
23:02:26,215 [Source: source (10/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (10/20)#5 (456858f8cd5fca51c09519f29b21641e) switched from RUNNING to FINISHED.
23:02:26,223 [Source: source (18/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (18/20)#5 (8a75c4257e592a008bca8f0a29c5e856) switched from RUNNING to FINISHED.
23:02:26,223 [Source: source (16/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (16/20)#5 (c17c62b90c66f7bd11debe913916fd89) switched from RUNNING to FINISHED.
23:02:26,225 [Source: source (11/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (11/20)#5 (dd463d97b55ee56bcb6e2853760e3daf) switched from RUNNING to FINISHED.
23:02:26,239 [Source: source (19/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (19/20)#5 (202827bf688aa5c206bd3b900ad3beb2) switched from RUNNING to FINISHED.
23:02:26,240 [Source: source (4/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (4/20)#5 (475c75d73150c1d20eaaa185f96c81e1) switched from RUNNING to FINISHED.
23:02:26,245 [Source: source (17/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (17/20)#5 (afb774c210e06bd3e62d42e749c22417) switched from RUNNING to FINISHED.
23:02:26,245 [Source: source (6/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (6/20)#5 (776d580fcfd5ddf38643822311884d70) switched from RUNNING to FINISHED.
23:02:26,246 [Source: source (14/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (14/20)#5 (7eccee9beab3b7e6b5977d5f726e9c9e) switched from RUNNING to FINISHED.
23:02:26,245 [Source: source (15/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (15/20)#5 (45df84d24e611dfd7d972795548a7f33) switched from RUNNING to FINISHED.
23:02:26,252 [Source: source (12/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Source: source (12/20)#5 (65cad528fa7a6e0fe6822c4677d8797a) switched from RUNNING to FINISHED.
{noformat}
{{failing-map (8/20)#5}} and (naturally) none of the sinks are finishing. I found
{noformat}
23:02:26,422 [failing-map (8/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - failing-map (8/20)#5 (77221af17305e76caafacbf7bc696af7) switched from RUNNING to CANCELED.
23:02:26,425 [failing-map (8/20)#5] INFO  org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for failing-map (8/20)#5
[jira] [Resolved] (FLINK-22290) Add new unaligned checkpoint options to Python API.
[ https://issues.apache.org/jira/browse/FLINK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise resolved FLINK-22290.
---------------------------------
    Resolution: Fixed

> Add new unaligned checkpoint options to Python API.
> ---------------------------------------------------
>
>                 Key: FLINK-22290
>                 URL: https://issues.apache.org/jira/browse/FLINK-22290
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Python, Runtime / Checkpointing
>    Affects Versions: 1.13.0
>            Reporter: Arvid Heise
>            Assignee: Arvid Heise
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
> There is currently no Python equivalent of
> {noformat}
> CheckpointConfig#setAlignmentTimeout
> CheckpointConfig#setForceUnalignedCheckpoints
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-22290) Add new unaligned checkpoint options to Python API.
[ https://issues.apache.org/jira/browse/FLINK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323936#comment-17323936 ]

Arvid Heise commented on FLINK-22290:
--------------------------------------

Merged as 52b400f717dc9a5c903a9b834d02b0cbf609897b..de069df949866bf332ba006b244428082f886f55 into master.

> Add new unaligned checkpoint options to Python API.
> ---------------------------------------------------
>
>                 Key: FLINK-22290
>                 URL: https://issues.apache.org/jira/browse/FLINK-22290
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Python, Runtime / Checkpointing
>    Affects Versions: 1.13.0
>            Reporter: Arvid Heise
>            Assignee: Arvid Heise
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
> There is currently no Python equivalent of
> {noformat}
> CheckpointConfig#setAlignmentTimeout
> CheckpointConfig#setForceUnalignedCheckpoints
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout
[ https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318134#comment-17318134 ]

Arvid Heise commented on FLINK-20816:
--------------------------------------

I found the root cause: the test implicitly assumes that the abort of chk-1 happens before the sync phase of chk-2. If the abort arrives late, the mail that handles it is never executed, because the waitLatch blocks the mailbox thread. It's easy to reproduce by adding some sleep into {{CheckpointCoordinator#sendAbortedMessages}}:
{noformat}
private void sendAbortedMessages(
        List tasksToAbort, long checkpointId, long timeStamp) {
    // send notification of aborted checkpoints asynchronously.
    executor.execute(
            () -> {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
{noformat}

> NotifyCheckpointAbortedITCase failed due to timeout
> ---------------------------------------------------
>
>                 Key: FLINK-20816
>                 URL: https://issues.apache.org/jira/browse/FLINK-20816
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Matthias
>            Assignee: Arvid Heise
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.13.0
>
>         Attachments: flink-20816-failure.log, flink-20816-success.log
>
> [This build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245] failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] testNotifyCheckpointAborted[unalignedCheckpointEnabled=true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase) Time elapsed: 104.044 s <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z 	at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z 	at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z 	at org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z 	at org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z 	at org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z 	at java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z 	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z 	at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues remove code that is not used anymore and should only have affected unit tests. [The previous build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results] containing all the changes except for [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279] passed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
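The race described in the comment above can be reproduced outside Flink with a plain single-threaded executor standing in for the mailbox. This is a sketch with assumed names, not Flink code: one queued task blocks the only thread on a latch, so the abort-notification task queued behind it can never run, and the wait for the notification times out — exactly the test's failure mode:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BlockedMailboxDemo {

    // Returns whether the "abort notification" ever ran. With the mailbox
    // thread blocked on waitLatch it cannot, so the await below times out.
    static boolean runScenario(long timeoutMillis) {
        ExecutorService mailbox = Executors.newSingleThreadExecutor();
        CountDownLatch waitLatch = new CountDownLatch(1);
        CountDownLatch abortNotified = new CountDownLatch(1);
        try {
            // Mail 1: models the test blocking the mailbox thread on waitLatch.
            mailbox.execute(() -> {
                try {
                    waitLatch.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Mail 2: the late "checkpoint aborted" notification, queued behind mail 1.
            mailbox.execute(abortNotified::countDown);
            // The test thread waits for the notification -- and times out.
            return abortNotified.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            waitLatch.countDown(); // unblock so the executor can shut down cleanly
            mailbox.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("notified=" + runScenario(200)); // prints notified=false
    }
}
```

Delaying the abort (the `Thread.sleep` in the comment above) simply widens the window in which mail 1 gets enqueued first.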
[jira] [Updated] (FLINK-22136) Devise application for unaligned checkpoint test on cluster
[ https://issues.apache.org/jira/browse/FLINK-22136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise updated FLINK-22136:
--------------------------------
    Description:
To test unaligned checkpoints, we should use a few different applications that use different features:
 * Mixing forward/rescale channels with keyBy or other shuffle operations
 * Unions
 * 2 or n-ary operators
 * Associated state ((keyed) process function)
 * Correctness verifications

The sinks should not be mocked but rather should be able to induce a fair amount of backpressure into the system. Quite possibly, it would be a good idea to have a way to add more backpressure to the sink by running the respective system on the cluster and being able to add/remove parallel instances.

Things to check in the application:
 * In-flight data is restored to the correct key groups -> can be checked with keyed state in a process function
 * Correctness: completeness (no lost records) + no duplicates
 * Ordering of data for keyed exchanges (we guarantee that records with the same key retain their order across keyed operators)
 * (To detect errors early, we can also use magic headers)

  was:
To test unaligned checkpoints, we should use a few different applications that use different features:
 * Mixing forward/rescale channels with keyBy or other shuffle operations
 * Unions
 * 2 or n-ary operators
 * Associated state ((keyed) process function)
 * Correctness verifications

The sinks should not be mocked but rather should be able to induce a fair amount of backpressure into the system. Quite possibly, it would be a good idea to have a way to add more backpressure to the sink by running the respective system on the cluster and being able to add/remove parallel instances.
> Device application for unaligned checkpoint test on cluster > --- > > Key: FLINK-22136 > URL: https://issues.apache.org/jira/browse/FLINK-22136 > Project: Flink > Issue Type: Sub-task >Reporter: Arvid Heise >Priority: Major > > To test unaligned checkpoints, we should use a few different applications > that use different features: > * Mixing forward/rescale channels with keyby or other shuffle operations > * Unions > * 2 or n-ary operators > * Associated state ((keyed) process function) > * Correctness verifications > The sinks should not be mocked but rather should be able to induce a fair > amount of backpressure into the system. Quite possibly, it would be a good > idea to have a way to add more backpressure to the sink by running the > respective system on the cluster and be able to add/remove parallel instances. > Things to check in the application > * Inflight data is restored to the correct keygroups -> can be checked with > keyed state in a process function > * Correctness: Completeness (no lost records) + no duplicates > * Orderness of data for keyed exchanges (we guarantee that records with the > same key retain orderness across keyed operators) > * (To detect errors early, we can also use magic headers) -- This message was sent by Atlassian Jira (v8.3.4#803005)
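The correctness checks listed in the description above (completeness, no duplicates, per-key ordering) can be sketched as a per-key sequence validator: each source emits a strictly increasing sequence number per key, and the sink verifies that records arrive exactly once and in order within each key. This is hypothetical illustration code, not the test's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class PerKeyOrderChecker {

    // Highest sequence number seen so far, per key.
    private final Map<Integer, Long> lastSeqPerKey = new HashMap<>();

    /** Accepts a record iff it carries the next expected sequence number for its key. */
    public boolean accept(int key, long seq) {
        long expected = lastSeqPerKey.getOrDefault(key, -1L) + 1;
        if (seq != expected) {
            // seq < expected: duplicate or reordered record;
            // seq > expected: a gap, i.e. a lost record.
            return false;
        }
        lastSeqPerKey.put(key, seq);
        return true;
    }

    public static void main(String[] args) {
        PerKeyOrderChecker checker = new PerKeyOrderChecker();
        System.out.println(checker.accept(1, 0)); // prints true
        System.out.println(checker.accept(1, 2)); // prints false (seq 1 is missing)
    }
}
```

Running such a checker in the sink would surface both lost in-flight records and records restored to the wrong key group after recovery.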