[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22053:

Description: 
If more splits than 

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
~[scala-library-2.11.12.jar:?]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
~[scala-library-2.11.12.jar:?]
... 12 more
{noformat}

To reproduce


{noformat}
@Test
public void testRecoveryWithFinishedSplit() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    env.fromSequence(0, 100);
    env.execute();
}
{noformat}




[jira] [Assigned] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22053:
---

Assignee: (was: Arvid Heise)

> NumberSequenceSource causes fatal exception when less splits than parallelism.
> --
>
> Key: FLINK-22053
> URL: https://issues.apache.org/jira/browse/FLINK-22053
> Project: Flink
> Issue Type: Bug
> Components: API / Core
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Arvid Heise
> Priority: Major
>
> If more splits than 
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>   at 
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
> ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
> ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
> ~[scala-library-2.11.12.jar:?]
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
> ~[scala-library-2.11.12.jar:?]
>   ... 12 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22053:

Description: 
If more splits than 

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
~[scala-library-2.11.12.jar:?]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
~[scala-library-2.11.12.jar:?]
... 12 more
{noformat}


  was:
If a checkpoint happens after the only split is processed, the split is 
checkpointed with (from > to). Upon recovery this split causes an exception in 
the coordinator and a subsequent fatal exception.



[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22053:

Description: 
If more splits than 

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
~[scala-library-2.11.12.jar:?]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
~[scala-library-2.11.12.jar:?]
... 12 more
{noformat}

To reproduce


{noformat}
@Test
public void testRecoveryWithFinishedSplit() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(12);
    env.fromSequence(0, 10);
    env.execute();
}
{noformat}




[jira] [Updated] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22053:

Description: 
If more splits than 

{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
~[scala-library-2.11.12.jar:?]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
~[scala-library-2.11.12.jar:?]
... 12 more
{noformat}

To reproduce


{noformat}
@Test
public void testLessSplitsThanParallelism() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Parallelism (12) exceeds the number of sequence elements (11 values: 0..10).
    env.setParallelism(12);
    env.fromSequence(0, 10);
    env.execute();
}
{noformat}
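
The numbers in the reproduction can be checked with a small standalone sketch. It is not Flink's implementation: `Range`, `naiveBounds`, `countEmpty`, and `nonEmptyRanges` are hypothetical names, and the even-split arithmetic is only an assumption about how a range might be divided. It illustrates that dividing the 11 values 0..10 into 12 parts leaves one part with from > to, which a check like the one in the stack trace would reject.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRangeSketch {

    /** Inclusive range with the same kind of check that appears in the stack trace. */
    public static final class Range {
        public final long from;
        public final long to;

        public Range(long from, long to) {
            if (from > to) {
                throw new IllegalArgumentException("'from' must be <= 'to'");
            }
            this.from = from;
            this.to = to;
        }
    }

    /** Hypothetical helper: bounds of numSplits contiguous parts of [from, to]. */
    public static long[][] naiveBounds(long from, long to, int numSplits) {
        long count = to - from + 1;          // number of values in the range
        long base = count / numSplits;       // minimum size of each part
        long remainder = count % numSplits;  // the first 'remainder' parts get one extra value
        long[][] bounds = new long[numSplits][2];
        long next = from;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < remainder ? 1 : 0);
            bounds[i][0] = next;
            bounds[i][1] = next + size - 1;  // size == 0 yields to == from - 1
            next += size;
        }
        return bounds;
    }

    /** Counts the parts whose lower bound exceeds the upper bound. */
    public static int countEmpty(long[][] bounds) {
        int empty = 0;
        for (long[] b : bounds) {
            if (b[0] > b[1]) {
                empty++;
            }
        }
        return empty;
    }

    /** Builds ranges but skips empty parts, so the constructor check never fires. */
    public static List<Range> nonEmptyRanges(long from, long to, int numSplits) {
        List<Range> ranges = new ArrayList<>();
        for (long[] b : naiveBounds(from, to, numSplits)) {
            if (b[0] <= b[1]) {
                ranges.add(new Range(b[0], b[1]));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        long[][] bounds = naiveBounds(0, 10, 12);
        System.out.println("empty parts: " + countEmpty(bounds));
        try {
            for (long[] b : bounds) {
                new Range(b[0], b[1]);
            }
        } catch (IllegalArgumentException e) {
            System.out.println("naive construction failed: " + e.getMessage());
        }
        System.out.println("non-empty ranges: " + nonEmptyRanges(0, 10, 12).size());
    }
}
```

Skipping empty parts, as `nonEmptyRanges` does, is one possible guard. Whether the actual fix for this issue works that way is not stated in this thread.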




[jira] [Resolved] (FLINK-21945) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints

2021-03-30 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-21945.
-
Fix Version/s: 1.13.0
   Resolution: Fixed

Merged into master as 
89fde844d05125c49a9bbb9d0676cd1babb3b222..af5719993dd8d5164e03171810bbd709523d0927.

> Disable checkpointing of inflight data in pointwise connections for unaligned 
> checkpoints
> -
>
> Key: FLINK-21945
> URL: https://issues.apache.org/jira/browse/FLINK-21945
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Arvid Heise
> Assignee: Arvid Heise
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
>






[jira] [Created] (FLINK-22053) Recovery of a completed split in NumberSequenceSource causes fatal exception.

2021-03-30 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-22053:
---

Summary: Recovery of a completed split in NumberSequenceSource causes fatal exception.
Key: FLINK-22053
URL: https://issues.apache.org/jira/browse/FLINK-22053
Project: Flink
Issue Type: Bug
Components: API / Core
Affects Versions: 1.12.2, 1.13.0
Reporter: Arvid Heise


If a checkpoint happens after the only split is processed, the split is 
checkpointed with (from > to). Upon recovery this split causes an exception in 
the coordinator and a subsequent fatal exception.


{noformat}
Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
at 
org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
 ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
~[scala-library-2.11.12.jar:?]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
~[akka-actor_2.11-2.5.21.jar:2.5.21]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
~[scala-library-2.11.12.jar:?]
... 12 more
{noformat}






[jira] [Commented] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-04-01 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312892#comment-17312892
 ] 

Arvid Heise commented on FLINK-22053:
-

Merged into master as a93251b4599e7c7d77ec6f0796825a93224eb010, merged into 
1.12 as 945684114092f590e6cb90b78ce7fe4ccc7ada6c.

> NumberSequenceSource causes fatal exception when less splits than parallelism.
> --
>
> Key: FLINK-22053
> URL: https://issues.apache.org/jira/browse/FLINK-22053
> Project: Flink
> Issue Type: Bug
> Components: API / Core
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Arvid Heise
> Assignee: Arvid Heise
> Priority: Major
> Labels: pull-request-available
>
> If more splits than 
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>   at 
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
> ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111)
>  ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) 
> ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180)
>  ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) 
> ~[scala-library-2.11.12.jar:?]
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) 
> ~[scala-library-2.11.12.jar:?]
>   ... 12 more
> {noformat}
> To reproduce
> {noformat}
> @Test
> public void testLessSplitsThanParallelism() throws Exception {
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     env.setParallelism(12);
>     env.fromSequence(0, 10);
>     env.execute();
> }
> {noformat}
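
The reproduction above asks for 12 parallel readers over only 11 numbers (0 to 10), and the stack trace shows a split being constructed with 'from' > 'to'. The following is a minimal sketch of one way to avoid that, assuming the number of splits is capped at the number of elements. `SequenceSplitter`, `Range`, and `split` are hypothetical names for illustration, not the code that was merged for this ticket.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not Flink's NumberSequenceSource. */
final class SequenceSplitter {

    /** Inclusive range [from, to]; mirrors the precondition seen in the stack trace. */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            if (from > to) {
                throw new IllegalArgumentException("'from' must be <= 'to'");
            }
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Splits [from, to] into at most maxSplits contiguous, non-empty ranges whose
     * sizes differ by at most one. Assumes the range size fits into a long.
     */
    static List<Range> split(long from, long to, int maxSplits) {
        if (from > to || maxSplits < 1) {
            throw new IllegalArgumentException("need from <= to and maxSplits >= 1");
        }
        long total = to - from + 1;
        // Never create more splits than there are elements.
        int numSplits = (int) Math.min(total, (long) maxSplits);
        long base = total / numSplits;
        long remainder = total % numSplits;

        List<Range> result = new ArrayList<>(numSplits);
        long start = from;
        for (int i = 0; i < numSplits; i++) {
            // The first 'remainder' splits take one extra element.
            long size = base + (i < remainder ? 1 : 0);
            result.add(new Range(start, start + size - 1));
            start += size;
        }
        return result;
    }
}
```

With this sketch, `SequenceSplitter.split(0, 10, 12)` produces 11 single-element ranges, so one of the 12 readers would simply receive no split.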



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-22053) NumberSequenceSource causes fatal exception when less splits than parallelism.

2021-04-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-22053.
-
Fix Version/s: 1.12.3
   1.13.0
   Resolution: Fixed

> NumberSequenceSource causes fatal exception when less splits than parallelism.
> --
>
> Key: FLINK-22053
> URL: https://issues.apache.org/jira/browse/FLINK-22053
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0, 1.12.3
>
>
> If more splits than 
> {noformat}
> Caused by: java.lang.IllegalArgumentException: 'from' must be <= 'to'
>   at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.api.connector.source.lib.NumberSequenceSource$NumberSequenceSplit.<init>(NumberSequenceSource.java:148) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.api.connector.source.lib.NumberSequenceSource.createEnumerator(NumberSequenceSource.java:111) ~[flink-core-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:126) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$DeferrableCoordinator.applyCall(RecreateOnResetOperatorCoordinator.java:296) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator.start(RecreateOnResetOperatorCoordinator.java:71) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:182) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:85) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:501) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:955) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:873) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:383) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:605) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:180) ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) ~[scala-library-2.11.12.jar:?]
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) ~[akka-actor_2.11-2.5.21.jar:2.5.21]
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) ~[scala-library-2.11.12.jar:?]
>   ... 12 more
> {noformat}
> To reproduce
> {noformat}
> @Test
> public void testLessSplitsThanParallelism() throws Exception {
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     env.setParallelism(12);
>     env.fromSequence(0, 10);
>     env.execute();
> }
> {noformat}





[jira] [Assigned] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin

2021-04-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22081:
---

Assignee: Chen Qin  (was: Prem Santosh)

> Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
> ---
>
> Key: FLINK-22081
> URL: https://issues.apache.org/jira/browse/FLINK-22081
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Reporter: Chen Qin
>Assignee: Chen Qin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.11.0, 1.11.1, 1.11.2, 
> 1.11.3, 1.11.4, 1.12.0, 1.12.1, 1.12.2, 1.13.0, 1.12.3
>
> Attachments: image (13).png
>
>
> Using Flink 1.11.2.
> I added the flink-s3-fs-hadoop jar to the plugins dir, but I am seeing 
> checkpoint paths like 
> {{s3://my_app/__ENTROPY__/app_name-staging/flink/checkpoints/e10f47968ae74934bd833108d2272419/chk-3071}},
>  which means the entropy injection key is not being resolved. After some 
> debugging I found that in the 
> [EntropyInjector|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/EntropyInjector.java#L97]
>  we check if the given file system is of type {{ClassLoaderFixingFileSystem}} 
> and, if so, we check if the file system is of type 
> {{SafetyNetWrapperFileSystem}} as well as its delegate, but we don't check for 
> {{[ClassLoaderFixingFileSystem|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/PluginFileSystemFactory.java#L65]}}
>  directly in the getEntropyFs method, which would be the type if the S3 file 
> system dependencies are added as a plugin.
>  
> Repro steps: 
> Run Flink 1.11.2 with flink-s3-fs-hadoop as a plugin and turn on entropy 
> injection with the key _entropy_.
> Observe that the checkpoint dir still contains the entropy marker:
> s3a://xxx/dev/checkpoints/_entropy_/xenon/event-stream-splitter/jobid/chk-5/  
> Compare with Flink 1.9.1, where the marker is removed:
> s3a://xxx/dev/checkpoints/xenon/event-stream-splitter/jobid/chk-5/  
> Add some logging to getEntropyFs and observe that it returns null because the 
> passed-in parameter is not a {{SafetyNetWrapperFileSystem}} but a 
> {{ClassLoaderFixingFileSystem}}.
> Apply the patch, build a release, and run the same job; the issue is resolved, 
> as the attachment shows.
>  
>  
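
The lookup problem described above can be illustrated with a small sketch that walks the whole chain of wrapping file systems instead of checking for one specific wrapper type. The interfaces `Fs`, `WrappingFs`, and `EntropyAwareFs` and the class `EntropyLookup` are hypothetical stand-ins for illustration only; they are not Flink's classes and this is not the patch attached to the ticket.

```java
import java.util.Optional;

/** Hypothetical stand-in for a file system; not Flink's FileSystem. */
interface Fs {}

/** Hypothetical wrapper (e.g. a safety net or a class-loader-fixing wrapper). */
interface WrappingFs extends Fs {
    Fs getDelegate();
}

/** Hypothetical file system that knows its entropy injection key. */
interface EntropyAwareFs extends Fs {
    String getEntropyKey();
}

final class EntropyLookup {

    /** Unwraps delegates, whatever their wrapper type, until an entropy-aware file system is found. */
    static Optional<EntropyAwareFs> findEntropyFs(Fs fs) {
        Fs current = fs;
        while (current != null) {
            if (current instanceof EntropyAwareFs) {
                return Optional.of((EntropyAwareFs) current);
            }
            if (current instanceof WrappingFs) {
                current = ((WrappingFs) current).getDelegate();
            } else {
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    /** Drops the entropy key segment from a path, as in the 1.9.1 path quoted above. */
    static String removeEntropyMarker(String path, String entropyKey) {
        return path.replace("/" + entropyKey + "/", "/");
    }
}
```

Because the loop does not care which wrapper it sees first, the order in which wrappers are stacked no longer decides whether the entropy key is found.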





[jira] [Commented] (FLINK-13550) Support for CPU FlameGraphs in new web UI

2021-04-01 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313104#comment-17313104
 ] 

Arvid Heise commented on FLINK-13550:
-

Merged into master for 1.13 as 
e9385051cd2ac7110f02b361ac503d9153441f9f..12a99e8fe84c28fb250028b4fde4025ec9dc00c9.

Thanks [~dmvk] and [~afedulov] for your contributions!

> Support for CPU FlameGraphs in new web UI
> -
>
> Key: FLINK-13550
> URL: https://issues.apache.org/jira/browse/FLINK-13550
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / REST, Runtime / Web Frontend
>Reporter: David Morávek
>Assignee: Alexander Fedulov
>Priority: Major
>  Labels: pull-request-available
>
> For better insight into a running job, it would be useful to have the ability 
> to render a CPU flame graph for a particular job vertex.
> Flink already has a stack-trace sampling mechanism in place, so it should be 
> straightforward to implement.
> This should be done by implementing a new endpoint in the REST API, which would 
> sample the stack trace the same way as the current BackPressureTracker does, only 
> with a different sampling rate and length of sampling.
> [Here|https://www.youtube.com/watch?v=GUNDehj9z9o] is a little demo of the 
> feature.
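
To illustrate how sampled stack traces can feed a flame graph, here is a minimal sketch that merges samples into a tree in which each node counts how many samples passed through that frame. `FlameGraphNode`, `addSample`, and `fromSamples` are hypothetical names, and the sketch assumes each sample is already ordered from the outermost frame to the innermost one; it is not the data model used by the merged feature.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical flame graph tree built from stack-trace samples. */
final class FlameGraphNode {
    final String frame;
    int samples;
    final Map<String, FlameGraphNode> children = new LinkedHashMap<>();

    FlameGraphNode(String frame) {
        this.frame = frame;
    }

    /** Adds one sampled stack, ordered from the outermost frame to the innermost. */
    void addSample(List<String> stackRootFirst) {
        FlameGraphNode node = this;
        node.samples++;
        for (String f : stackRootFirst) {
            // Reuse the child for this frame if an earlier sample already created it.
            node = node.children.computeIfAbsent(f, FlameGraphNode::new);
            node.samples++;
        }
    }

    /** Builds the tree for a batch of samples under a synthetic root node. */
    static FlameGraphNode fromSamples(List<List<String>> samples) {
        FlameGraphNode root = new FlameGraphNode("root");
        for (List<String> sample : samples) {
            root.addSample(sample);
        }
        return root;
    }
}
```

The width of a frame in the rendered graph would then be proportional to its `samples` count. Note that `Thread.getStackTrace()` returns the innermost frame first, so real samples would need to be reversed before being added.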





[jira] [Closed] (FLINK-13552) Render vertex FlameGraph in web UI

2021-04-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise closed FLINK-13552.
---
Release Note: Directly resolved in parent ticket.
  Resolution: Done

> Render vertex FlameGraph in web UI
> --
>
> Key: FLINK-13552
> URL: https://issues.apache.org/jira/browse/FLINK-13552
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Reporter: David Morávek
>Assignee: David Morávek
>Priority: Major
>
> Add a new FlameGraph tab to the "vertex detail" page that actively polls the 
> flame graph endpoint and renders the result using the d3 library.





[jira] [Closed] (FLINK-13551) Add vertex FlameGraph REST endpoint

2021-04-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise closed FLINK-13551.
---
Release Note: Directly resolved in parent ticket.
  Resolution: Fixed

> Add vertex FlameGraph REST endpoint
> ---
>
> Key: FLINK-13551
> URL: https://issues.apache.org/jira/browse/FLINK-13551
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / REST
>Reporter: David Morávek
>Priority: Major
>
> Add a new endpoint that returns data for flame graph rendering.





[jira] [Resolved] (FLINK-13550) Support for CPU FlameGraphs in new web UI

2021-04-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-13550.
-
Fix Version/s: 1.13.0
 Release Note: Flink now offers Flamegraphs for each node in the job graph. 
Please enable this experimental feature by setting the respective configuration 
flag rest.flamegraph.enabled.
   Resolution: Fixed

> Support for CPU FlameGraphs in new web UI
> -
>
> Key: FLINK-13550
> URL: https://issues.apache.org/jira/browse/FLINK-13550
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / REST, Runtime / Web Frontend
>Reporter: David Morávek
>Assignee: Alexander Fedulov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0





[jira] [Commented] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints

2021-03-23 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307383#comment-17307383
 ] 

Arvid Heise commented on FLINK-21936:
-

As a first step, we might want to provide users with an explicit way to express 
the guarantees that they expect of pointwise connections. Only if a user wants 
to retain ordering do we have to disable UC for that exchange. I'm assuming that 
the vast majority of pointwise connections do not require the guarantees.

> Disable checkpointing of inflight data in pointwise connections for unaligned 
> checkpoints
> -
>
> Key: FLINK-21936
> URL: https://issues.apache.org/jira/browse/FLINK-21936
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>
> We currently do not have any hard guarantees on pointwise connections 
> regarding data consistency. However, since data was structured implicitly in 
> the same way as any preceding source or keyby, some users relied on this 
> behavior to divide compute-intensive tasks into smaller chunks while relying 
> on ordering guarantees.
> As long as the parallelism does not change, unaligned checkpoints (UC) 
> retain these properties. With the implementation of rescaling of UC 
> (FLINK-19801), that has changed. For most exchanges, there is a meaningful 
> way to reassign state from one channel to another (even in random order). For 
> some exchanges, the mapping is ambiguous and requires post-filtering. 
> However, for pointwise connections, it's impossible to do so while retaining 
> these properties.
> Consider {{source -> keyby -> task1 -> forward -> task2}}. Now, if we want to 
> rescale from parallelism p = 1 to p = 2, suddenly the records inside the 
> keyby channels need to be divided into two channels according to the 
> keygroups. That is easily possible by using the keygroup ranges of the 
> operators and a way to determine the key(group) of the record (independent of 
> the actual approach). For the forward channel, we completely lack the key 
> context. No record in the forward channel has any keygroup assigned; it's 
> also not possible to calculate it as there is no guarantee that the key is 
> still present.
> The root cause for this limitation is the conceptual mismatch between what we 
> provide and what some users assume we provide (or we assume that the users 
> assume). For example, it's impossible to use (keyed) state in task2 right 
> now, because there is no key context, but we still want to guarantee 
> ordering with respect to that key context.
> For 1.13, the easiest solution is to disable channel state in pointwise 
> connections. For any non-trivial application with at least one shuffle, the 
> number of pointwise channels (linear in p) is quickly dwarfed by all-to-all 
> connections (quadratic in p). I'll add some alternative ideas to the 
> discussion.





[jira] [Created] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints

2021-03-23 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-21936:
---

 Summary: Disable checkpointing of inflight data in pointwise 
connections for unaligned checkpoints
 Key: FLINK-21936
 URL: https://issues.apache.org/jira/browse/FLINK-21936
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.13.0
Reporter: Arvid Heise
Assignee: Arvid Heise


We currently do not have any hard guarantees on pointwise connections regarding 
data consistency. However, since data was structured implicitly in the same way 
as any preceding source or keyby, some users relied on this behavior to divide 
compute-intensive tasks into smaller chunks while relying on ordering 
guarantees.

As long as the parallelism does not change, unaligned checkpoints (UC) retain 
these properties. With the implementation of rescaling of UC (FLINK-19801), 
that has changed. For most exchanges, there is a meaningful way to reassign 
state from one channel to another (even in random order). For some exchanges, 
the mapping is ambiguous and requires post-filtering. However, for pointwise 
connections, it's impossible to do so while retaining these properties.

Consider {{source -> keyby -> task1 -> forward -> task2}}. Now, if we want to 
rescale from parallelism p = 1 to p = 2, suddenly the records inside the keyby 
channels need to be divided into two channels according to the keygroups. That 
is easily possible by using the keygroup ranges of the operators and a way to 
determine the key(group) of the record (independent of the actual approach). 
For the forward channel, we completely lack the key context. No record in the 
forward channel has any keygroup assigned; it's also not possible to calculate 
it as there is no guarantee that the key is still present.

The root cause for this limitation is the conceptual mismatch between what we 
provide and what some users assume we provide (or we assume that the users 
assume). For example, it's impossible to use (keyed) state in task2 right now, 
because there is no key context, but we still want to guarantee ordering with 
respect to that key context.

For 1.13, the easiest solution is to disable channel state in pointwise 
connections. For any non-trivial application with at least one shuffle, the 
number of pointwise channels (linear in p) is quickly dwarfed by all-to-all 
connections (quadratic in p). I'll add some alternative ideas to the discussion.
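
The rescaling example in the description can be made concrete with a small sketch. It assumes that key groups are assigned to subtasks in contiguous ranges and uses a deliberately simplified hash; `KeyGroups` and its methods are hypothetical names for illustration, not Flink's implementation.

```java
import java.util.OptionalInt;

/** Hypothetical sketch of key-group based routing; simplified for illustration. */
final class KeyGroups {

    /** Simplified key-group function: non-negative hash modulo maxParallelism. */
    static int keyGroupOf(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Subtask owning a key group when groups are split into contiguous ranges. */
    static int subtaskOf(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    /**
     * Target subtask for a record recovered from a channel. A record from a
     * keyby channel carries a key; a record from a forward channel may not
     * (modeled here as null), in which case no target can be derived.
     */
    static OptionalInt targetSubtask(Object keyOrNull, int maxParallelism, int parallelism) {
        if (keyOrNull == null) {
            return OptionalInt.empty();
        }
        int keyGroup = keyGroupOf(keyOrNull, maxParallelism);
        return OptionalInt.of(subtaskOf(keyGroup, maxParallelism, parallelism));
    }
}
```

With a maximum parallelism of 128, every key group maps to subtask 0 at p = 1, while at p = 2 the groups 0 to 63 map to subtask 0 and 64 to 127 map to subtask 1. A record without a key yields an empty result, which is the situation of the forward channel described above.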





[jira] [Commented] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints

2021-03-23 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307386#comment-17307386
 ] 

Arvid Heise commented on FLINK-21936:
-

Alternatively or additionally, we might want to add the keyed context to 
forward channels. Then users could also use state in these tasks. For that, we 
would need to encode the keygroup in the data stream. 
Note that we would also need to find a way to encode splits at least for UC 
recovery (maybe the source coordinator can assign unique numbers to splits?).

> Disable checkpointing of inflight data in pointwise connections for unaligned 
> checkpoints
> -
>
> Key: FLINK-21936
> URL: https://issues.apache.org/jira/browse/FLINK-21936
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major





[jira] [Commented] (FLINK-21936) Disable checkpointing of inflight data in pointwise connections for unaligned checkpoints

2021-03-23 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307388#comment-17307388
 ] 

Arvid Heise commented on FLINK-21936:
-

A completely different approach would be possible with dynamic rescaling 
(epoch-based). We would drain the recovered data (with old parallelism) and 
then rewire from source to sink. However, that feels like Flink 3.0.

> Disable checkpointing of inflight data in pointwise connections for unaligned 
> checkpoints
> -
>
> Key: FLINK-21936
> URL: https://issues.apache.org/jira/browse/FLINK-21936
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major





[jira] [Created] (FLINK-21992) Investigate potential buffer leak in unaligned checkpoint

2021-03-26 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-21992:
---

 Summary: Investigate potential buffer leak in unaligned checkpoint
 Key: FLINK-21992
 URL: https://issues.apache.org/jira/browse/FLINK-21992
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.13.0
Reporter: Arvid Heise


A user on the mailing list reported that their job gets stuck with unaligned 
checkpoints enabled.
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html

We received two similar reports in the past, but the users didn't follow up, so 
those were not as easy to diagnose as this one, where the initial report already 
contains many relevant data points. 

Besides a buffer leak, there could also be an issue with priority notification.





[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"

2021-03-11 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299809#comment-17299809
 ] 

Arvid Heise commented on FLINK-21535:
-

A quick assessment of the 3 cases: the tests are just running too long, and some 
test implementations that track all records are running out of memory. The root 
cause, however, is rather that the tests take >10 min when they should finish in 
<<1 min. I'll investigate further.

> UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap 
> space"
> -
>
> Key: FLINK-21535
> URL: https://issues.apache.org/jira/browse/FLINK-21535
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
> {code}
> 2021-02-27T02:11:41.5659201Z org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-02-27T02:11:41.5659947Z  at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-02-27T02:11:41.5660794Z  at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
> 2021-02-27T02:11:41.5661618Z  at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2021-02-27T02:11:41.5662356Z  at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2021-02-27T02:11:41.5663104Z  at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5664016Z  at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5664817Z  at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
> 2021-02-27T02:11:41.5665638Z  at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2021-02-27T02:11:41.5666405Z  at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2021-02-27T02:11:41.5667609Z  at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5668358Z  at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5669218Z  at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
> 2021-02-27T02:11:41.5669928Z  at akka.dispatch.OnComplete.internal(Future.scala:264)
> 2021-02-27T02:11:41.5670540Z  at akka.dispatch.OnComplete.internal(Future.scala:261)
> 2021-02-27T02:11:41.5671268Z  at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> 2021-02-27T02:11:41.5671881Z  at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> 2021-02-27T02:11:41.5672512Z  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5673219Z  at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
> 2021-02-27T02:11:41.5674085Z  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 2021-02-27T02:11:41.5674794Z  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 2021-02-27T02:11:41.5675466Z  at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
> 2021-02-27T02:11:41.5676181Z  at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
> 2021-02-27T02:11:41.5676977Z  at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
> 2021-02-27T02:11:41.5677717Z  at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> 2021-02-27T02:11:41.5678409Z  at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> 2021-02-27T02:11:41.5679071Z  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5679776Z  at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 2021-02-27T02:11:41.5680576Z  at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5681383Z  at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5682167Z  at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5683040Z  at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> 2021-02-27T02:11:41.5683759Z  at 
> 

[jira] [Resolved] (FLINK-19801) Add support for virtual channels

2021-03-11 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-19801.
-
Resolution: Fixed

> Add support for virtual channels
> 
>
> Key: FLINK-19801
> URL: https://issues.apache.org/jira/browse/FLINK-19801
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> During rescaling of unaligned checkpoints, if state from multiple former 
> channels is read on the input or output side to recover a specific channel, 
> then these buffers are multiplexed on the output side and demultiplexed on the 
> input side to guarantee a consistent recovery of spanning records.
> Assume two channels C1 and C2 connect operators A and B, and each has one 
> buffer in the output part and one in the input part of the channel, across 
> which a record spans. Assume that the buffers are named O1 for the output 
> buffer of C1, I2 for the input buffer of C2, etc. After rescaling, both 
> channels become one channel C, and the buffers may be restored as I1, I2, O1, O2.
> Channels use the mapping of FLINK-19533 to infer the need for virtual 
> channels and distribute the needed resources. Virtual channels are removed on 
> the EndOfChannelRecovery epoch marker.
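
The demultiplexing idea above can be sketched as follows, assuming that every restored buffer is tagged with the id of the former channel it belonged to and that record fragments are modeled as strings. `VirtualChannelDemultiplexer` and its methods are hypothetical names for illustration and do not reflect the classes merged for this ticket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical demultiplexer that keeps one virtual channel per former channel. */
final class VirtualChannelDemultiplexer {
    private final Map<Integer, StringBuilder> partialRecords = new HashMap<>();
    private final List<String> completed = new ArrayList<>();

    /**
     * Consumes one restored buffer.
     *
     * @param formerChannel id of the channel the buffer belonged to before rescaling
     * @param fragment the part of a record contained in this buffer
     * @param endsRecord whether this fragment completes the record
     */
    void onBuffer(int formerChannel, String fragment, boolean endsRecord) {
        // Fragments are only ever appended to the partial record of their own former channel.
        StringBuilder partial =
                partialRecords.computeIfAbsent(formerChannel, c -> new StringBuilder());
        partial.append(fragment);
        if (endsRecord) {
            completed.add(partial.toString());
            partialRecords.remove(formerChannel);
        }
    }

    /** Records that have been fully reassembled so far, in completion order. */
    List<String> completedRecords() {
        return completed;
    }

    /** True once no virtual channel holds a partial record anymore. */
    boolean isDrained() {
        return partialRecords.isEmpty();
    }
}
```

If the buffers arrive in the order I1, I2, O1, O2, the fragments of the two spanning records interleave on the merged channel, yet each record is reassembled from the fragments of its own former channel only.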





[jira] [Commented] (FLINK-19801) Add support for virtual channels

2021-03-11 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299805#comment-17299805
 ] 

Arvid Heise commented on FLINK-19801:
-

Merged as 
8bfc905bf5e9a7e523f0f083c948cbb32ac260fd..b4c57c056ecc54bb1a8d04e6d4222639036dccfa
 into master.

> Add support for virtual channels
> 
>
> Key: FLINK-19801
> URL: https://issues.apache.org/jira/browse/FLINK-19801
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> During rescaling of unaligned checkpoints, if state from multiple former 
> channels is read on the input or output side to recover a specific channel, 
> then these buffers are multiplexed on the output side and demultiplexed on 
> the input side to guarantee a consistent recovery of spanning records.
> Assume two channels C1 and C2 connect operators A and B, and each has one 
> buffer in the output part and one in the input part of the channel, across 
> which a record spans. Name the buffers O1 for the output buffer of C1, I2 
> for the input buffer of C2, and so on. After rescaling, both channels become 
> one channel C, and the buffers may be restored as I1, I2, O1, O2.
> Channels use the mapping of FLINK-19533 to infer the need for virtual 
> channels and distribute the needed resources. Virtual channels are removed on 
> the EndOfChannelRecovery epoch marker.
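
The multiplexing idea in the description above can be illustrated with a small, 
self-contained sketch. It is not Flink's implementation: the class 
VirtualChannelSketch, the TaggedBuffer type, and both method names are 
hypothetical, and buffers are modelled as plain strings. It only shows why 
tagging each buffer with its former channel lets the input side reassemble a 
record that spans several buffers after two channels are merged into one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these names exist in Flink. */
final class VirtualChannelSketch {

    /** A buffer tagged with the id of the former channel it was recovered from. */
    static final class TaggedBuffer {
        final int formerChannel;
        final String payload;

        TaggedBuffer(int formerChannel, String payload) {
            this.formerChannel = formerChannel;
            this.payload = payload;
        }
    }

    /** Output side: interleaves the buffers of all former channels onto one channel. */
    static List<TaggedBuffer> multiplex(Map<Integer, List<String>> buffersByFormerChannel) {
        int longest = 0;
        for (List<String> buffers : buffersByFormerChannel.values()) {
            longest = Math.max(longest, buffers.size());
        }
        List<TaggedBuffer> merged = new ArrayList<>();
        for (int i = 0; i < longest; i++) {
            for (Map.Entry<Integer, List<String>> e : buffersByFormerChannel.entrySet()) {
                if (i < e.getValue().size()) {
                    merged.add(new TaggedBuffer(e.getKey(), e.getValue().get(i)));
                }
            }
        }
        return merged;
    }

    /** Input side: one virtual channel per former channel reassembles its own record. */
    static Map<Integer, String> demultiplex(List<TaggedBuffer> channel) {
        Map<Integer, StringBuilder> virtualChannels = new LinkedHashMap<>();
        for (TaggedBuffer buffer : channel) {
            virtualChannels
                    .computeIfAbsent(buffer.formerChannel, id -> new StringBuilder())
                    .append(buffer.payload);
        }
        Map<Integer, String> records = new LinkedHashMap<>();
        virtualChannels.forEach((id, parts) -> records.put(id, parts.toString()));
        return records;
    }

    public static void main(String[] args) {
        // Each former channel holds one record that spans two buffers.
        Map<Integer, List<String>> state = new LinkedHashMap<>();
        state.put(1, Arrays.asList("rec", "ord-A"));
        state.put(2, Arrays.asList("rec", "ord-B"));
        System.out.println(demultiplex(multiplex(state))); // {1=record-A, 2=record-B}
    }
}
```

Without the tag, the interleaved buffers on the merged channel would be glued 
together in arrival order and the two spanning records would corrupt each 
other. Keeping a separate reassembly buffer per former channel is the 
simplified analogue of a virtual channel in this sketch.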





[jira] [Comment Edited] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"

2021-03-11 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299809#comment-17299809
 ] 

Arvid Heise edited comment on FLINK-21535 at 3/11/21, 6:54 PM:
---

A quick assessment of the 3 cases: the tests are simply running too long, and 
some test implementations that track all records run out of memory. The root 
cause, however, is that the tests take >10 min when they should finish in 
<<1 min. I'll investigate further.

It is quite possible that FLINK-21689 is a duplicate.


was (Author: aheise):
A quick assessment of the 3 cases: the tests are simply running too long, and 
some test implementations that track all records run out of memory. The root 
cause, however, is that the tests take >10 min when they should finish in 
<<1 min. I'll investigate further.

> UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap 
> space"
> -
>
> Key: FLINK-21535
> URL: https://issues.apache.org/jira/browse/FLINK-21535
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
> {code}
> 2021-02-27T02:11:41.5659201Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-02-27T02:11:41.5659947Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-02-27T02:11:41.5660794Z  at 
> org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
> 2021-02-27T02:11:41.5661618Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2021-02-27T02:11:41.5662356Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2021-02-27T02:11:41.5663104Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5664016Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5664817Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
> 2021-02-27T02:11:41.5665638Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2021-02-27T02:11:41.5666405Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2021-02-27T02:11:41.5667609Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5668358Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5669218Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
> 2021-02-27T02:11:41.5669928Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:264)
> 2021-02-27T02:11:41.5670540Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:261)
> 2021-02-27T02:11:41.5671268Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> 2021-02-27T02:11:41.5671881Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> 2021-02-27T02:11:41.5672512Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5673219Z  at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
> 2021-02-27T02:11:41.5674085Z  at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 2021-02-27T02:11:41.5674794Z  at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 2021-02-27T02:11:41.5675466Z  at 
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
> 2021-02-27T02:11:41.5676181Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
> 2021-02-27T02:11:41.5676977Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
> 2021-02-27T02:11:41.5677717Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> 2021-02-27T02:11:41.5678409Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> 2021-02-27T02:11:41.5679071Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5679776Z  at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 2021-02-27T02:11:41.5680576Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5681383Z  at 
> 

[jira] [Created] (FLINK-21797) Performance regression on 03/11/21

2021-03-15 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-21797:
---

 Summary: Performance regression on 03/11/21
 Key: FLINK-21797
 URL: https://issues.apache.org/jira/browse/FLINK-21797
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.13.0
Reporter: Arvid Heise
Assignee: Arvid Heise


http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3,5=tupleKeyBy=2=200=off=on=on





[jira] [Commented] (FLINK-20130) Add ZStandard format to inputs

2021-03-19 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304783#comment-17304783
 ] 

Arvid Heise commented on FLINK-20130:
-

Remerged as a407322a452b8a371d0ce25e8f5f8418556371ef into master.

> Add ZStandard format to inputs
> --
>
> Key: FLINK-20130
> URL: https://issues.apache.org/jira/browse/FLINK-20130
> Project: Flink
> Issue Type: Improvement
> Components: API / Core
> Reporter: João Boto
> Assignee: João Boto
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
>
> Allow Flink to read files compressed in ZStandard (.zst)





[jira] [Resolved] (FLINK-20130) Add ZStandard format to inputs

2021-03-19 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-20130.
-
Resolution: Fixed

> Add ZStandard format to inputs
> --
>
> Key: FLINK-20130
> URL: https://issues.apache.org/jira/browse/FLINK-20130
> Project: Flink
> Issue Type: Improvement
> Components: API / Core
> Reporter: João Boto
> Assignee: João Boto
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
>
> Allow Flink to read files compressed in ZStandard (.zst)





[jira] [Closed] (FLINK-21689) UnalignedCheckpointITCase does not terminate

2021-03-12 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise closed FLINK-21689.
---
Resolution: Duplicate

> UnalignedCheckpointITCase does not terminate
> 
>
> Key: FLINK-21689
> URL: https://issues.apache.org/jira/browse/FLINK-21689
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.13.0
> Reporter: Chesnay Schepler
> Priority: Critical
> Labels: test-stability
> Fix For: 1.13.0
>
>
> So far we assumed that the UC tests fail because of FLINK-21400, but even 
> with that in place they still do not pass.





[jira] [Closed] (FLINK-21540) finegrained_resource_management tests hang on azure

2021-03-12 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise closed FLINK-21540.
---
Resolution: Duplicate

>  finegrained_resource_management tests hang on azure
> 
>
> Key: FLINK-21540
> URL: https://issues.apache.org/jira/browse/FLINK-21540
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13905=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc





[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"

2021-03-12 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300147#comment-17300147
 ] 

Arvid Heise commented on FLINK-21535:
-

I verified that FLINK-21540, FLINK-21599, and FLINK-21689 are duplicates and 
closed them as such. It's also "only" a test issue. A fix for the test is 
coming today.

> UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap 
> space"
> -
>
> Key: FLINK-21535
> URL: https://issues.apache.org/jira/browse/FLINK-21535
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
> Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
> {code}
> 2021-02-27T02:11:41.5659201Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-02-27T02:11:41.5659947Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-02-27T02:11:41.5660794Z  at 
> org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
> 2021-02-27T02:11:41.5661618Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2021-02-27T02:11:41.5662356Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2021-02-27T02:11:41.5663104Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5664016Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5664817Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
> 2021-02-27T02:11:41.5665638Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2021-02-27T02:11:41.5666405Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2021-02-27T02:11:41.5667609Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5668358Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5669218Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
> 2021-02-27T02:11:41.5669928Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:264)
> 2021-02-27T02:11:41.5670540Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:261)
> 2021-02-27T02:11:41.5671268Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> 2021-02-27T02:11:41.5671881Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> 2021-02-27T02:11:41.5672512Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5673219Z  at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
> 2021-02-27T02:11:41.5674085Z  at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 2021-02-27T02:11:41.5674794Z  at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 2021-02-27T02:11:41.5675466Z  at 
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
> 2021-02-27T02:11:41.5676181Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
> 2021-02-27T02:11:41.5676977Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
> 2021-02-27T02:11:41.5677717Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> 2021-02-27T02:11:41.5678409Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> 2021-02-27T02:11:41.5679071Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5679776Z  at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 2021-02-27T02:11:41.5680576Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5681383Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5682167Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5683040Z  at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> 2021-02-27T02:11:41.5683759Z  at 
> 

[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"

2021-03-15 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301851#comment-17301851
 ] 

Arvid Heise commented on FLINK-21535:
-

Merged as c177d15323d5025f0cf737b98bb051efbc08a149 into master.

> UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap 
> space"
> -
>
> Key: FLINK-21535
> URL: https://issues.apache.org/jira/browse/FLINK-21535
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
> Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
> {code}
> 2021-02-27T02:11:41.5659201Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-02-27T02:11:41.5659947Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-02-27T02:11:41.5660794Z  at 
> org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
> 2021-02-27T02:11:41.5661618Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2021-02-27T02:11:41.5662356Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2021-02-27T02:11:41.5663104Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5664016Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5664817Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
> 2021-02-27T02:11:41.5665638Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2021-02-27T02:11:41.5666405Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2021-02-27T02:11:41.5667609Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5668358Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5669218Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
> 2021-02-27T02:11:41.5669928Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:264)
> 2021-02-27T02:11:41.5670540Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:261)
> 2021-02-27T02:11:41.5671268Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> 2021-02-27T02:11:41.5671881Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> 2021-02-27T02:11:41.5672512Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5673219Z  at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
> 2021-02-27T02:11:41.5674085Z  at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 2021-02-27T02:11:41.5674794Z  at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 2021-02-27T02:11:41.5675466Z  at 
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
> 2021-02-27T02:11:41.5676181Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
> 2021-02-27T02:11:41.5676977Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
> 2021-02-27T02:11:41.5677717Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> 2021-02-27T02:11:41.5678409Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> 2021-02-27T02:11:41.5679071Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5679776Z  at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 2021-02-27T02:11:41.5680576Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5681383Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5682167Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5683040Z  at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> 2021-02-27T02:11:41.5683759Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
> 2021-02-27T02:11:41.5684493Z  at 
> 

[jira] [Commented] (FLINK-20130) Add ZStandard format to inputs

2021-03-15 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301877#comment-17301877
 ] 

Arvid Heise commented on FLINK-20130:
-

Merged into master as 889b3845217141b295eb2b60c3dd8a2c245b429a.

> Add ZStandard format to inputs
> --
>
> Key: FLINK-20130
> URL: https://issues.apache.org/jira/browse/FLINK-20130
> Project: Flink
> Issue Type: Improvement
> Components: API / Core
> Reporter: João Boto
> Assignee: João Boto
> Priority: Major
> Labels: pull-request-available
>
> Allow Flink to read files compressed in ZStandard (.zst)





[jira] [Assigned] (FLINK-20130) Add ZStandard format to inputs

2021-03-15 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-20130:
---

Assignee: João Boto

> Add ZStandard format to inputs
> --
>
> Key: FLINK-20130
> URL: https://issues.apache.org/jira/browse/FLINK-20130
> Project: Flink
> Issue Type: Improvement
> Components: API / Core
> Reporter: João Boto
> Assignee: João Boto
> Priority: Major
> Labels: pull-request-available
>
> Allow Flink to read files compressed in ZStandard (.zst)





[jira] [Resolved] (FLINK-20130) Add ZStandard format to inputs

2021-03-15 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-20130.
-
Fix Version/s: 1.13.0
Resolution: Fixed

> Add ZStandard format to inputs
> --
>
> Key: FLINK-20130
> URL: https://issues.apache.org/jira/browse/FLINK-20130
> Project: Flink
> Issue Type: Improvement
> Components: API / Core
> Reporter: João Boto
> Assignee: João Boto
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
>
> Allow Flink to read files compressed in ZStandard (.zst)





[jira] [Resolved] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"

2021-03-16 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-21535.
-
Fix Version/s: 1.12.3, 1.13.0
Resolution: Fixed

> UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap 
> space"
> -
>
> Key: FLINK-21535
> URL: https://issues.apache.org/jira/browse/FLINK-21535
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
> Labels: pull-request-available, test-stability
> Fix For: 1.13.0, 1.12.3
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56

[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"

2021-03-16 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302747#comment-17302747
 ] 

Arvid Heise commented on FLINK-21535:
-

Merged as 755a8d8214554c64e9db0271a827485208185b8d into 1.12.

> UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap 
> space"
> -
>
> Key: FLINK-21535
> URL: https://issues.apache.org/jira/browse/FLINK-21535
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
> Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56

[jira] [Commented] (FLINK-21511) Flink connector elasticsearch 6.x has a bug about BulkProcessor hangs for threads deadlocked

2021-03-01 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292715#comment-17292715
 ] 

Arvid Heise commented on FLINK-21511:
-

Thank you [~zhangmeng0426] for bringing this up and doing the investigation. I 
assigned the ticket to you.

> Flink connector elasticsearch 6.x has a  bug about BulkProcessor hangs for 
> threads deadlocked
> -
>
> Key: FLINK-21511
> URL: https://issues.apache.org/jira/browse/FLINK-21511
> Project: Flink
> Issue Type: Bug
> Components: Connectors / ElasticSearch
> Affects Versions: 1.10.3, 1.11.3, 1.12.1
> Reporter: zhangmeng
> Assignee: zhangmeng
> Priority: Major
> Labels: pull-request-available
>
> We use Flink 1.10 with the Flink Elasticsearch connector 6.x to write to 
> Elasticsearch. A total of 50 tasks had been running for a week when more 
> than 30 of them stopped writing data. Investigation found a deadlock bug in 
> the current version of Elasticsearch, which is fixed in a later version.





[jira] [Assigned] (FLINK-21511) Flink connector elasticsearch 6.x has a bug about BulkProcessor hangs for threads deadlocked

2021-03-01 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-21511:
---

Assignee: zhangmeng



[jira] [Commented] (FLINK-17510) StreamingKafkaITCase. testKafka timeouts on downloading Kafka

2021-02-25 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291119#comment-17291119
 ] 

Arvid Heise commented on FLINK-17510:
-

https://dev.azure.com/arvidheise0209/arvidheise/_build/results?buildId=947&view=logs&j=9401bf33-03c4-5a24-83fe-e51d75db73ef&t=72901ab2-7cd0-57be-82b1-bca51de20fba

> StreamingKafkaITCase. testKafka timeouts on downloading Kafka
> -
>
> Key: FLINK-17510
> URL: https://issues.apache.org/jira/browse/FLINK-17510
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines, Connectors / Kafka, Tests
>Affects Versions: 1.11.3, 1.12.1
>Reporter: Robert Metzger
>Priority: Major
>  Labels: test-stability
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=585&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5
> {code}
> 2020-05-05T00:06:49.7268716Z [INFO] 
> ---
> 2020-05-05T00:06:49.7268938Z [INFO]  T E S T S
> 2020-05-05T00:06:49.7269282Z [INFO] 
> ---
> 2020-05-05T00:06:50.5336315Z [INFO] Running 
> org.apache.flink.tests.util.kafka.StreamingKafkaITCase
> 2020-05-05T00:11:26.8603439Z [ERROR] Tests run: 3, Failures: 0, Errors: 2, 
> Skipped: 0, Time elapsed: 276.323 s <<< FAILURE! - in 
> org.apache.flink.tests.util.kafka.StreamingKafkaITCase
> 2020-05-05T00:11:26.8604882Z [ERROR] testKafka[1: 
> kafka-version:0.11.0.2](org.apache.flink.tests.util.kafka.StreamingKafkaITCase)
>   Time elapsed: 120.024 s  <<< ERROR!
> 2020-05-05T00:11:26.8605942Z java.io.IOException: Process ([wget, -q, -P, 
> /tmp/junit2815750531595874769/downloads/1290570732, 
> https://archive.apache.org/dist/kafka/0.11.0.2/kafka_2.11-0.11.0.2.tgz]) 
> exceeded timeout (12) or number of retries (3).
> 2020-05-05T00:11:26.8606732Z  at 
> org.apache.flink.tests.util.AutoClosableProcess$AutoClosableProcessBuilder.runBlockingWithRetry(AutoClosableProcess.java:132)
> 2020-05-05T00:11:26.8607321Z  at 
> org.apache.flink.tests.util.cache.AbstractDownloadCache.getOrDownload(AbstractDownloadCache.java:127)
> 2020-05-05T00:11:26.8607826Z  at 
> org.apache.flink.tests.util.cache.LolCache.getOrDownload(LolCache.java:31)
> 2020-05-05T00:11:26.8608343Z  at 
> org.apache.flink.tests.util.kafka.LocalStandaloneKafkaResource.setupKafkaDist(LocalStandaloneKafkaResource.java:98)
> 2020-05-05T00:11:26.8608892Z  at 
> org.apache.flink.tests.util.kafka.LocalStandaloneKafkaResource.before(LocalStandaloneKafkaResource.java:92)
> 2020-05-05T00:11:26.8609602Z  at 
> org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:46)
> 2020-05-05T00:11:26.8610026Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-05-05T00:11:26.8610553Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-05-05T00:11:26.8610958Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-05-05T00:11:26.8611388Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-05-05T00:11:26.8612214Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-05-05T00:11:26.8612706Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-05-05T00:11:26.8613109Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-05-05T00:11:26.8613551Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-05-05T00:11:26.8614019Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-05-05T00:11:26.8614442Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-05-05T00:11:26.8614869Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-05-05T00:11:26.8615251Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2020-05-05T00:11:26.8615654Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2020-05-05T00:11:26.8616060Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-05-05T00:11:26.8616465Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-05-05T00:11:26.8616893Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-05-05T00:11:26.8617893Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-05-05T00:11:26.8618490Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-05-05T00:11:26.8619056Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-05-05T00:11:26.8619589Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2020-05-05T00:11:26.8620073Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 

[jira] [Commented] (FLINK-21452) FLIP-27 sources cannot reliably downscale

2021-02-25 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291153#comment-17291153
 ] 

Arvid Heise commented on FLINK-21452:
-

Merged as fb99ce2e22ca84dece1f7a431a92a4cecb6a71f2^ in 1.12 and as 
81cfe465c9e4a17e563e1b4c02cd60a63b984de5^ in master.

> FLIP-27 sources cannot reliably downscale
> -
>
> Key: FLINK-21452
> URL: https://issues.apache.org/jira/browse/FLINK-21452
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.12.1, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.2, 1.13.0
>
>
> Sources currently store their registered readers in the snapshot. However, 
> when downscaling, there are unmatched readers, so we violate a couple of 
> invariants.
> The solution is to not store registered readers - they are re-registered 
> anyways on restart.
> To keep it backward compatible, the best option is to always store an empty 
> set of readers while writing the snapshot and discard any recovered readers 
> from the snapshot.





[jira] [Resolved] (FLINK-21452) FLIP-27 sources cannot reliably downscale

2021-02-25 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-21452.
-
Resolution: Fixed



[jira] [Updated] (FLINK-21452) FLIP-27 sources cannot reliably downscale

2021-02-25 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21452:

Description: 
Sources currently store their registered readers in the snapshot. However, 
when downscaling, there are unmatched readers, so we violate a couple of 
invariants.

The solution is to not store registered readers - they are re-registered 
anyways on restart.

To keep it backward compatible, the best option is to always store an empty set 
of readers while writing the snapshot and discard any recovered readers from 
the snapshot.

  was:
Sources currently store their registered readers into the snapshot. However, 
when downscaling, we have unmatched readers that we violate a couple of 
invariants.

The solution is to not store registered readers - they are re-registered 
anyways on restart.

To keep it backward compatible, the best option is to always store an empty set 
of readers while writing the snapshot and discard any recovered readers from 
the snapshot.


> FLIP-27 sources cannot reliably downscale
> -
>
> Key: FLINK-21452
> URL: https://issues.apache.org/jira/browse/FLINK-21452
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.12.1, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.2, 1.13.0
>
>
> Sources currently store their registered readers in the snapshot. However, 
> when downscaling, there are unmatched readers, so we violate a couple of 
> invariants.
> The solution is to not store registered readers - they are re-registered 
> anyways on restart.
> To keep it backward compatible, the best option is to always store an empty 
> set of readers while writing the snapshot and discard any recovered readers 
> from the snapshot.





[jira] [Commented] (FLINK-17510) StreamingKafkaITCase. testKafka timeouts on downloading Kafka

2021-02-25 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291126#comment-17291126
 ] 

Arvid Heise commented on FLINK-17510:
-

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13773&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529


[jira] [Commented] (FLINK-17510) StreamingKafkaITCase. testKafka timeouts on downloading Kafka

2021-02-25 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291114#comment-17291114
 ] 

Arvid Heise commented on FLINK-17510:
-

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13755&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0


[jira] [Commented] (FLINK-21490) UnalignedCheckpointITCase fails on azure

2021-02-25 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291140#comment-17291140
 ] 

Arvid Heise commented on FLINK-21490:
-

Merged as 29abccd4cb7d4905fa168f8d7b68a113e9640fca^ in master and as 
0c1b20d2119463d4571d17de607aebfff1b4b17f^ in 1.12.

> UnalignedCheckpointITCase fails on azure
> 
>
> Key: FLINK-21490
> URL: https://issues.apache.org/jira/browse/FLINK-21490
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.1, 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.12.2, 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13682&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2
> {code}
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
>   at 
> org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>   at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
>   at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>   at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>   at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
>   at akka.dispatch.OnComplete.internal(Future.scala:264)
>   at akka.dispatch.OnComplete.internal(Future.scala:261)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>   at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
>   at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
>   at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
>   at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
>   at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>   at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by 
> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=5, 
> backoffTimeMS=100)
>   at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:130)
>   at 
> 

[jira] [Resolved] (FLINK-21490) UnalignedCheckpointITCase fails on azure

2021-02-25 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-21490.
-
Resolution: Fixed

> UnalignedCheckpointITCase fails on azure
> 
>
> Key: FLINK-21490
> URL: https://issues.apache.org/jira/browse/FLINK-21490
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.1, 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.12.2, 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13682&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2
> {code}
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
>   at 
> org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>   at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
>   at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>   at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>   at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
>   at akka.dispatch.OnComplete.internal(Future.scala:264)
>   at akka.dispatch.OnComplete.internal(Future.scala:261)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>   at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
>   at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
>   at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
>   at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
>   at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>   at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by 
> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=5, 
> backoffTimeMS=100)
>   at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:130)
>   at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:81)
>   at 
> 

[jira] [Created] (FLINK-21452) FLIP-27 sources cannot reliably downscale

2021-02-23 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-21452:
---

 Summary: FLIP-27 sources cannot reliably downscale
 Key: FLINK-21452
 URL: https://issues.apache.org/jira/browse/FLINK-21452
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / Common
Affects Versions: 1.12.1, 1.13.0
Reporter: Arvid Heise
Assignee: Arvid Heise


Sources currently store their registered readers in the snapshot. However, 
when downscaling, there are unmatched readers, so we violate a couple of 
invariants.

The solution is to not store registered readers - they are re-registered 
anyways on restart.

To keep it backward compatible, the best option is to always store an empty set 
of readers while writing the snapshot and discard any recovered readers from 
the snapshot.
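
A minimal sketch of that approach, not the actual Flink serializer: the class 
name, the byte layout and the writeLegacy helper are assumptions made for 
illustration only. New snapshots keep the reader section but always write it 
empty, and restore skips whatever readers an older snapshot still contains.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical layout: [readerCount][readerIds...][splitCount][splits...]
public class ReaderlessSnapshotSketch {

    // New writer: keeps the reader section for compatibility, but always empty.
    public static byte[] write(List<String> splits) {
        return writeLegacy(new ArrayList<Integer>(), splits);
    }

    // Old-style writer that persisted the registered reader ids.
    public static byte[] writeLegacy(List<Integer> readerIds, List<String> splits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(readerIds.size());
            for (int id : readerIds) {
                out.writeInt(id);
            }
            out.writeInt(splits.size());
            for (String split : splits) {
                out.writeUTF(split);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restore: read past any recovered readers and keep only the splits.
    public static List<String> read(byte[] snapshot) {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(snapshot))) {
            int readerCount = in.readInt();
            for (int i = 0; i < readerCount; i++) {
                in.readInt(); // discarded; readers re-register on restart
            }
            int splitCount = in.readInt();
            List<String> splits = new ArrayList<>(splitCount);
            for (int i = 0; i < splitCount; i++) {
                splits.add(in.readUTF());
            }
            return splits;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("split-0", "split-1");
        byte[] legacy = writeLegacy(Arrays.asList(0, 1, 2), splits);
        // Both the legacy and the new snapshot restore to the same splits.
        System.out.println(read(legacy).equals(read(write(splits))));
    }
}
```

Keeping the (empty) reader count in the stream means the byte layout does not 
change, so snapshots written before and after the fix can be read by the same 
code path.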





[jira] [Commented] (FLINK-21490) UnalignedCheckpointITCase fails on azure

2021-02-24 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290742#comment-17290742
 ] 

Arvid Heise commented on FLINK-21490:
-

The error is probably test-only:
For some reason the test does not terminate after 10 successful checkpoints (to 
be investigated).

{noformat}
12:21:43,173 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
checkpoint 3165 (type=CHECKPOINT) @ 1614169303172 for job 
78d8cb678ee2304d517c9e42bff43aea.
{noformat}

I suspect that we overflow {{MAX_INT}} in {{value}}, and then {{checkHeader}} 
fails as it uses the upper 4 bytes of the long. We have already hardened that 
part to give a meaningful exception in the {{UCRescaleITCase}}, but it might be 
a good idea to extract that to this ticket as that test will only go into 
master.

So for now I'd harden the test. There is also a related issue with unions that 
I initially suspected.
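
To illustrate the suspected failure mode, here is a hypothetical sketch. It is 
not the actual test code: the HEADER constant and the encode methods are made 
up, and only the idea of a checkHeader on the upper 4 bytes comes from the 
comment above. It assumes an int counter is packed into the lower 4 bytes of a 
long; once the counter wraps past Integer.MAX_VALUE it becomes negative, and 
sign extension overwrites the header.

```java
public class HeaderOverflowSketch {
    // Hypothetical marker stored in the upper 4 bytes of every record.
    public static final int HEADER = 0x1234ABCD;

    // Buggy: a negative int is sign-extended to a long whose upper 32 bits
    // are all ones, so the OR destroys the header.
    public static long encodeBuggy(int value) {
        return ((long) HEADER << 32) | value;
    }

    // Hardened: reject values that do not fit, with a meaningful message.
    public static long encodeChecked(long value) {
        if (value < 0 || value > Integer.MAX_VALUE) {
            throw new IllegalStateException(
                    "Value " + value + " does not fit into the lower 4 bytes");
        }
        return ((long) HEADER << 32) | value;
    }

    // Compares the upper 4 bytes of the long with the expected header.
    public static boolean checkHeader(long encoded) {
        return (int) (encoded >>> 32) == HEADER;
    }

    public static void main(String[] args) {
        int value = Integer.MAX_VALUE;
        System.out.println(checkHeader(encodeBuggy(value))); // header intact
        value++; // wraps around to Integer.MIN_VALUE
        System.out.println(checkHeader(encodeBuggy(value))); // header corrupted
    }
}
```

With the checked variant the test would fail at the point where the value 
leaves the valid range, instead of failing later with a confusing header 
mismatch.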

> UnalignedCheckpointITCase fails on azure
> 
>
> Key: FLINK-21490
> URL: https://issues.apache.org/jira/browse/FLINK-21490
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13682&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2
> {code}
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
>   at 
> org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>   at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
>   at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>   at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>   at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
>   at akka.dispatch.OnComplete.internal(Future.scala:264)
>   at akka.dispatch.OnComplete.internal(Future.scala:261)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>   at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
>   at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
>   at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
>   at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
>   at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>   at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>   at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> 

[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316515#comment-17316515
 ] 

Arvid Heise commented on FLINK-20816:
-

[~yunta]'s analysis is spot on; there is not much to add as the interesting 
debug statements are currently not in.

Interestingly, a line like 

{noformat}
21:04:11,714 [ DeclineSink (1/1)#0] DEBUG 
org.apache.flink.runtime.state.SnapshotStrategyRunner[] - 
StuckAsyncSnapshotStrategy (FsCheckpointStorageLocation 
{fileSystem=org.apache.flink.core.fs.SafetyNetWrapperFileSystem@3f0e55b0, 
checkpointDirectory=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/chk-2,
 
sharedStateDirectory=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/shared,
 
taskOwnedStateDirectory=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/taskowned,
 
metadataFilePath=file:/tmp/junit7287967740618809656/junit2918432964059421469/663881dcecc7cc89be722ae89e3384ab/chk-2/_metadata,
 reference=(default), fileStateSizeThreshold=20480, writeBufferSize=4096}, 
synchronous part) in thread Thread[DeclineSink (1/1)#0,5,Flink Task Threads] 
took 0 ms.
{noformat}

is missing from the failed log, which indicates that we never successfully 
execute {{SubtaskCheckpointCoordinatorImpl#buildOperatorSnapshotFutures}}. The 
unaligned checkpoint part that precedes Yun's fragment is executed normally, so I 
don't immediately see a connection to unaligned checkpoints.


> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> 

[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316519#comment-17316519
 ] 

Arvid Heise commented on FLINK-20816:
-

Indeed 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15528=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9680
 shows that the same occurs for aligned checkpoints.

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is not used anymore and should only have affected unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Commented] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure

2021-04-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317784#comment-17317784
 ] 

Arvid Heise commented on FLINK-22173:
-

Browsing through the log, we first have some issues on cancellation

4x
{noformat}
23:15:14,497 [ failing-map (7/7)#0] WARN  
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] - 
failing-map (7/7)#0 (34b929a926bea2eaa9eb1eccee63b4cb): Error during release of 
channel resources: 
org.apache.flink.shaded.netty4.io.netty.util.IllegalReferenceCountException: 
refCnt: 0.

java.io.IOException: 
org.apache.flink.shaded.netty4.io.netty.util.IllegalReferenceCountException: 
refCnt: 0
{noformat}


Then, there is seemingly a deadlock
{noformat}
23:15:42,548 [Flink Netty Client (0) Thread 3] TRACE 
org.apache.flink.runtime.io.network.logger.NetworkActionsLogger [] - [global0 
(1/7)#1 (b1735bf22fee2c8bc8b3199426d089d7)] RemoteInputChannel#onBuffer 
Buffer{size=651, hash=-770186644}, seq 392, 
ChannelStatePersister(lastSeenBarrier=10 (COMPLETED)} @ 
InputChannelInfo{gateIdx=0, inputChannelIdx=5}
23:25:16,104 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Checkpoint 10 
of job b3f28f25c25a01c99593dcf74948687e expired before completing.
{noformat}

Leading to
{noformat}
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable 
failure threshold.
{noformat}


> UnalignedCheckpointRescaleITCase fails on azure
> ---
>
> Key: FLINK-22173
> URL: https://issues.apache.org/jira/browse/FLINK-22173
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232=logs=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865=9628
> {code}
> 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase
> 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no 
> scale union from 7 to 
> 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase)  
> Time elapsed: 607.467 s  <<< ERROR!
> 2021-04-08T23:25:56.3133586Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-08T23:25:56.3134070Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-08T23:25:56.3134643Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-08T23:25:56.3135577Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368)
> 2021-04-08T23:25:56.3138843Z  at 
> sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> 2021-04-08T23:25:56.3139402Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-08T23:25:56.3139880Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-08T23:25:56.3140328Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-08T23:25:56.3140844Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-08T23:25:56.3141768Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-08T23:25:56.3142272Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-08T23:25:56.3142706Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-08T23:25:56.3143142Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-08T23:25:56.3143608Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-08T23:25:56.3144039Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-08T23:25:56.3144434Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-08T23:25:56.3145027Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-08T23:25:56.3145484Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-08T23:25:56.3145981Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-08T23:25:56.3146421Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-08T23:25:56.3146843Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 

[jira] [Commented] (FLINK-21873) CoordinatedSourceRescaleITCase.testUpscaling fails on AZP

2021-04-07 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316684#comment-17316684
 ] 

Arvid Heise commented on FLINK-21873:
-

Merged into master as 9953206599910983425dceea7a48164370fa605b and 1.12 as 
69cd36d9aa6e4aeb2ad827020d125712307ab585.

> CoordinatedSourceRescaleITCase.testUpscaling fails on AZP
> -
>
> Key: FLINK-21873
> URL: https://issues.apache.org/jira/browse/FLINK-21873
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.13.0
>Reporter: Till Rohrmann
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.14.0
>
>
> The test {{CoordinatedSourceRescaleITCase.testUpscaling}} fails on AZP with
> {code}
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>   at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>   ... 4 more
> Caused by: java.lang.Exception: successfully restored checkpoint
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:139)
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:126)
>   at 
> org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:161)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
>   at 
> org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:95)
>   at 
> org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:275)
>   at 
> org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
>   at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:408)
>   at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:190)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:624)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:588)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:760)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14997=logs=fc5181b0-e452-5c8f-68de-1097947f6483=62110053-334f-5295-a0ab-80dd7e2babbf=22049
> cc [~AHeise]





[jira] [Resolved] (FLINK-21873) CoordinatedSourceRescaleITCase.testUpscaling fails on AZP

2021-04-07 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-21873.
-
Fix Version/s: (was: 1.14.0)
   1.13.0
   Resolution: Fixed

> CoordinatedSourceRescaleITCase.testUpscaling fails on AZP
> -
>
> Key: FLINK-21873
> URL: https://issues.apache.org/jira/browse/FLINK-21873
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.13.0
>Reporter: Till Rohrmann
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0
>
>
> The test {{CoordinatedSourceRescaleITCase.testUpscaling}} fails on AZP with
> {code}
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>   at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>   ... 4 more
> Caused by: java.lang.Exception: successfully restored checkpoint
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:139)
>   at 
> org.apache.flink.connector.base.source.reader.CoordinatedSourceRescaleITCase$FailingMapFunction.map(CoordinatedSourceRescaleITCase.java:126)
>   at 
> org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
>   at 
> org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:161)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
>   at 
> org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
>   at 
> org.apache.flink.api.connector.source.lib.util.IteratorSourceReader.pollNext(IteratorSourceReader.java:95)
>   at 
> org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:275)
>   at 
> org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
>   at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:408)
>   at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:190)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:624)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:588)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:760)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14997=logs=fc5181b0-e452-5c8f-68de-1097947f6483=62110053-334f-5295-a0ab-80dd7e2babbf=22049
> cc [~AHeise]





[jira] [Comment Edited] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317839#comment-17317839
 ] 

Arvid Heise edited comment on FLINK-20816 at 4/9/21, 10:00 AM:
---

With some more echo debugging, it's most likely caused by

{noformat}
OperatorSnapshotFutures snapshotInProgress =
        checkpointStreamOperator(
                op, checkpointMetaData, checkpointOptions, storage, isRunning);
{noformat}

hanging in the sync phase.


was (Author: aheise):
With some more echo debugging, it's most likely caused by

{noformat}
CheckpointStreamFactory storage =
        checkpointStorage.resolveCheckpointStorageLocation(
                checkpointId, checkpointOptions.getTargetLocation());
{noformat}

hanging in the sync phase.

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is not used anymore and should only have affected unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317847#comment-17317847
 ] 

Arvid Heise commented on FLINK-20816:
-

That is actually by design of the test:
{noformat}
if (context.getCheckpointId() == DECLINE_CHECKPOINT_ID) {
    DeclineSink.waitLatch.await();
}
{noformat}
DeclineSink is not supposed to complete it until the abortion of the first 
checkpoint is verified.
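
The intended blocking behaviour can be reproduced outside Flink with plain JDK latches. The following is only a stand-in sketch: {{BlockingSnapshotSketch}}, {{enteredSnapshot}} and {{snapshotFinished}} are hypothetical names, the value of {{DECLINE_CHECKPOINT_ID}} is assumed for illustration, and {{CountDownLatch}} replaces the {{OneShotLatch}} used by the test.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the sink's snapshot logic; not the test's code.
final class BlockingSnapshotSketch {

    // Assumed value, for illustration only.
    static final long DECLINE_CHECKPOINT_ID = 2;

    // Counted down as soon as the snapshot call has started.
    final CountDownLatch enteredSnapshot = new CountDownLatch(1);
    // Released by the verifying side once the abortion has been observed.
    final CountDownLatch waitLatch = new CountDownLatch(1);
    // Set only after the snapshot call was allowed to proceed.
    final AtomicBoolean snapshotFinished = new AtomicBoolean(false);

    void snapshotState(long checkpointId) throws InterruptedException {
        enteredSnapshot.countDown();
        if (checkpointId == DECLINE_CHECKPOINT_ID) {
            // Blocks the sync phase until the verifying side releases the latch.
            waitLatch.await();
        }
        snapshotFinished.set(true);
    }
}
```

If the verifying side never observes the abortion, {{waitLatch}} is never released and the snapshot call blocks until the test times out, which matches the hang seen here.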

However, there is no log statement that indicates that an abortion call is 
happening at all.
In the attached success.log we have

{noformat}
21:04:11,624 [Source: NormalSource (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 1 for task Source: NormalSource (1/1)#0
21:04:11,624 [ DeclineSink (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 1 for task DeclineSink (1/1)#0
21:04:11,625 [   NormalMap (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 1 for task NormalMap (1/1)#0
21:04:11,739 [Source: NormalSource (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 2 for task Source: NormalSource (1/1)#0
21:04:11,739 [   NormalMap (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 2 for task NormalMap (1/1)#0
21:04:11,739 [ DeclineSink (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 2 for task DeclineSink (1/1)#0
{noformat}

while in failure.log, I can only find

{noformat}
21:04:19,260 [Source: NormalSource (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 1 for task Source: NormalSource (1/1)#0
21:04:19,268 [   NormalMap (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 1 for task NormalMap (1/1)#0
21:05:58,297 [Source: NormalSource (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 2 for task Source: NormalSource (1/1)#0
21:05:58,297 [   NormalMap (1/1)#0] DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Notification of aborted checkpoint 2 for task NormalMap (1/1)#0
{noformat}

It might be some race condition, where the {{SubtaskCheckpointCoordinatorImpl}} 
does not know that the {{DeclineSink}} is already running.

{noformat}
21:04:19,037 [flink-akka.actor.default-dispatcher-2] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - DeclineSink 
(1/1) (7457bf515844f409738c9929fffc54f7) switched from DEPLOYING to RUNNING.
{noformat}



> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> 

[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317839#comment-17317839
 ] 

Arvid Heise commented on FLINK-20816:
-

With some more echo debugging, it's most likely caused by

{noformat}
CheckpointStreamFactory storage =
checkpointStorage.resolveCheckpointStorageLocation(
checkpointId, checkpointOptions.getTargetLocation());
{noformat}

hanging in the sync phase.

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is not used anymore and should only have affected unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin

2021-04-08 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316922#comment-17316922
 ] 

Arvid Heise commented on FLINK-22081:
-

Merged into master as 2d3559e66db, into 1.12 as a9b34a3db23, and into 1.11 as 
3bd44e083c8.

> Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
> ---
>
> Key: FLINK-22081
> URL: https://issues.apache.org/jira/browse/FLINK-22081
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Reporter: Chen Qin
>Assignee: Chen Qin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.11.0, 1.11.1, 1.11.2, 
> 1.11.3, 1.11.4, 1.12.0, 1.12.1, 1.12.2, 1.13.0, 1.12.3
>
> Attachments: image (13).png
>
>
> Using flink 1.11.2
> I added the flink-s3-fs-hadoop jar in plugins dir but I am seeing the 
> checkpoints paths like 
> {{s3://my_app/__ENTROPY__/app_name-staging/flink/checkpoints/e10f47968ae74934bd833108d2272419/chk-3071}}
>  which means the entropy injection key is not being resolved. After some 
> debugging I found that in the 
> [EntropyInjector|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/EntropyInjector.java#L97]
>  we check if the given fileSystem is of type {{ClassLoaderFixingFileSystem}} 
> and if so we check if the filesystem is of type 
> {{SafetyNetWrapperFileSystem}} as well as its delegate, but don't check for 
> {{[ClassLoaderFixingFileSystem|https://github.com/apache/flink/blob/release-1.10.0/flink-core/src/main/java/org/apache/flink/core/fs/PluginFileSystemFactory.java#L65]}}
>  directly in the getEntropyFs method, which would be the type if S3 file system 
> dependencies are added as a plugin.
>  
> Repro steps: 
> Run Flink 1.11.2 with flink-s3-fs-hadoop as a plugin and turn on the entropy 
> injection key _entropy_.
> Observe that the checkpoint dir still contains the entropy marker:
> s3a://xxx/dev/checkpoints/_entropy_/xenon/event-stream-splitter/jobid/chk-5/  
> Compare to Flink 1.9.1, where the marker is removed:
> s3a://xxx/dev/checkpoints/xenon/event-stream-splitter/jobid/chk-5/  
> Add some logging to getEntropyFs and observe that it returns null because the 
> passed-in parameter is not {{SafetyNetWrapperFileSystem}} but 
> {{ClassLoaderFixingFileSystem}}.
> Apply the patch, build a release and run the same job; the issue is resolved, 
> as the attachment shows.
>  
>  
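
The unwrapping problem described in the quoted report can be illustrated with a toy model. This is a sketch under stated assumptions, not Flink's {{EntropyInjector}} and not the merged patch: all types below ({{EntropyAwareFs}}, {{SafetyNetWrapper}}, {{PluginWrapper}}) are hypothetical stand-ins for the classes named in the report. A lookup that only peels off one kind of wrapper returns null when the file system arrives wrapped in the other kind.

```java
public class EntropyUnwrap {

    interface FileSystem {}

    /** Toy file system that supports entropy injection. */
    static class EntropyAwareFs implements FileSystem {}

    /** Toy stand-in for a safety-net wrapper around another file system. */
    static class SafetyNetWrapper implements FileSystem {
        final FileSystem delegate;
        SafetyNetWrapper(FileSystem delegate) { this.delegate = delegate; }
    }

    /** Toy stand-in for the wrapper that a plugin-loaded file system gets. */
    static class PluginWrapper implements FileSystem {
        final FileSystem inner;
        PluginWrapper(FileSystem inner) { this.inner = inner; }
    }

    /** Only looks through the safety-net wrapper (the reported behaviour). */
    static EntropyAwareFs resolveSafetyNetOnly(FileSystem fs) {
        if (fs instanceof SafetyNetWrapper) {
            fs = ((SafetyNetWrapper) fs).delegate;
        }
        return fs instanceof EntropyAwareFs ? (EntropyAwareFs) fs : null;
    }

    /** Peels off every known wrapper before checking for entropy support. */
    static EntropyAwareFs resolveAllWrappers(FileSystem fs) {
        while (true) {
            if (fs instanceof SafetyNetWrapper) {
                fs = ((SafetyNetWrapper) fs).delegate;
            } else if (fs instanceof PluginWrapper) {
                fs = ((PluginWrapper) fs).inner;
            } else {
                break;
            }
        }
        return fs instanceof EntropyAwareFs ? (EntropyAwareFs) fs : null;
    }

    public static void main(String[] args) {
        FileSystem viaPlugin = new PluginWrapper(new EntropyAwareFs());
        System.out.println("safety-net only: " + (resolveSafetyNetOnly(viaPlugin) != null));
        System.out.println("all wrappers:    " + (resolveAllWrappers(viaPlugin) != null));
    }
}
```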





[jira] [Updated] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin

2021-04-08 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22081:

Fix Version/s: (was: 1.10.4)
   (was: 1.12.2)
   (was: 1.12.1)
   (was: 1.11.3)
   (was: 1.10.3)
   (was: 1.11.2)
   (was: 1.11.1)
   (was: 1.12.0)
   (was: 1.10.2)
   (was: 1.10.1)
   (was: 1.11.0)

> Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
> ---
>
> Key: FLINK-22081
> URL: https://issues.apache.org/jira/browse/FLINK-22081
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Reporter: Chen Qin
>Assignee: Chen Qin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.3
>
> Attachments: image (13).png
>
>





[jira] [Updated] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin

2021-04-08 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22081:

Affects Version/s: 1.13.0
   1.10.3
   1.11.3
   1.12.2

> Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
> ---
>
> Key: FLINK-22081
> URL: https://issues.apache.org/jira/browse/FLINK-22081
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0
>Reporter: Chen Qin
>Assignee: Chen Qin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.3
>
> Attachments: image (13).png
>
>





[jira] [Updated] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin

2021-04-08 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22081:

Priority: Major  (was: Minor)

> Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
> ---
>
> Key: FLINK-22081
> URL: https://issues.apache.org/jira/browse/FLINK-22081
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0
>Reporter: Chen Qin
>Assignee: Chen Qin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.3
>
> Attachments: image (13).png
>
>





[jira] [Commented] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin

2021-04-08 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316924#comment-17316924
 ] 

Arvid Heise commented on FLINK-22081:
-

Merging into 1.10 is quite an effort and the version is officially not 
maintained anymore. 

> Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
> ---
>
> Key: FLINK-22081
> URL: https://issues.apache.org/jira/browse/FLINK-22081
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0
>Reporter: Chen Qin
>Assignee: Chen Qin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.3
>
> Attachments: image (13).png
>
>





[jira] [Resolved] (FLINK-22081) Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin

2021-04-08 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-22081.
-
Resolution: Fixed

> Entropy key not resolved if flink-s3-fs-hadoop is added as a plugin
> ---
>
> Key: FLINK-22081
> URL: https://issues.apache.org/jira/browse/FLINK-22081
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0
>Reporter: Chen Qin
>Assignee: Chen Qin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.3
>
> Attachments: image (13).png
>
>





[jira] [Commented] (FLINK-22190) no guarantee on Flink exactly_once sink to Kafka

2021-04-12 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319910#comment-17319910
 ] 

Arvid Heise commented on FLINK-22190:
-

1. You get the {{/ by zero}} {{ArithmeticException}} because you are dividing by 0 in user code 
({{/ Random.nextInt(5)}}). That's something that you need to fix on your end.
2. Could you provide example output to show the duplicates? Where does the 
fail-over happen?

Note that exactly once does not mean deduplication of records or parts thereof. 
Exactly once ensures that there are no duplicates caused by fail-over/restarts.
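
To make point 1 concrete, here is a minimal sketch in plain Java with no Flink dependencies. The class and method names ({{NonZeroDivisor}}, {{nonZeroDivisor}}, {{countZeros}}) are hypothetical. {{nextInt(5)}} returns values from 0 to 4 (the Scala {{Random.nextInt}} used in the report has the same range), so a divisor of 0 is expected regularly. Shifting the range by one is one possible fix, assuming a divisor between 1 and 5 is acceptable for the test job.

```java
import java.util.Random;

public class NonZeroDivisor {

    /** Hypothetical helper: returns a value in [1, bound], so never 0. */
    public static int nonZeroDivisor(Random random, int bound) {
        return random.nextInt(bound) + 1;
    }

    /** Counts how often random.nextInt(bound) yields 0 in the given number of draws. */
    public static int countZeros(Random random, int bound, int draws) {
        int zeros = 0;
        for (int i = 0; i < draws; i++) {
            if (random.nextInt(bound) == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Each zero counted here would have thrown "/ by zero" in the reported map function.
        System.out.println("zeros in 1000 draws of nextInt(5): " + countZeros(random, 5, 1000));
        // Integer division with the shifted divisor cannot throw ArithmeticException.
        int value = 42;
        System.out.println("safe division: " + value / nonZeroDivisor(random, 5));
    }
}
```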

> no guarantee on Flink exactly_once sink to Kafka 
> -
>
> Key: FLINK-22190
> URL: https://issues.apache.org/jira/browse/FLINK-22190
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.12.2
> Environment: *flink: 1.12.2*
> *kafka: 2.7.0*
>Reporter: Spongebob
>Priority: Major
>
> When I tried to test Flink's exactly_once sink to Kafka, I found that it 
> does not behave as expected. Here's the pipeline of the Flink 
> applications: raw data (flink app0) -> kafka topic1 -> flink app1 -> kafka 
> topic2 -> flink app2. Flink tasks may randomly hit a / by zero exception. 
> The code is shown below:
> {code:java}
> // code placeholder
> raw data, flink app0:
> class SimpleSource1 extends SourceFunction[String] {
>  var switch = true
>  val students: Array[String] = Array("Tom", "Jerry", "Gory")
>  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit 
> = {
>  var i = 0
>  while (switch) {
>  sourceContext.collect(s"${students(Random.nextInt(students.length))},$i")
>  i += 1
>  Thread.sleep(5000)
>  }
>  }
>  override def cancel(): Unit = switch = false
> }
> val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
> val dataStream = streamEnv.addSource(new SimpleSource1)
> dataStream.addSink(new FlinkKafkaProducer[String]("xfy:9092", 
> "single-partition-topic-2", new SimpleStringSchema()))
> streamEnv.execute("sink kafka")
>  
> flink-app1:
> val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
> streamEnv.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE)
> val prop = new Properties()
> prop.setProperty("bootstrap.servers", "xfy:9092")
> prop.setProperty("group.id", "test")
> val dataStream = streamEnv.addSource(new FlinkKafkaConsumer[String](
>  "single-partition-topic-2",
>  new SimpleStringSchema,
>  prop
> ))
> val resultStream = dataStream.map(x => {
>  val data = x.split(",")
>  (data(0), data(1), data(1).toInt / Random.nextInt(5)).toString()
> }
> )
> resultStream.print().setParallelism(1)
> val propProducer = new Properties()
> propProducer.setProperty("bootstrap.servers", "xfy:9092")
> propProducer.setProperty("transaction.timeout.ms", s"${1000 * 60 * 5}")
> resultStream.addSink(new FlinkKafkaProducer[String](
>  "single-partition-topic",
>  new MyKafkaSerializationSchema("single-partition-topic"),
>  propProducer,
>  Semantic.EXACTLY_ONCE))
> streamEnv.execute("sink kafka")
>  
> flink-app2:
> val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
> val prop = new Properties()
> prop.setProperty("bootstrap.servers", "xfy:9092")
> prop.setProperty("group.id", "test")
> prop.setProperty("isolation_level", "read_committed")
> val dataStream = streamEnv.addSource(new FlinkKafkaConsumer[String](
>  "single-partition-topic",
>  new SimpleStringSchema,
>  prop
> ))
> dataStream.print().setParallelism(1)
> streamEnv.execute("consumer kafka"){code}
>  
> flink app1 prints some duplicate numbers; I expected flink app2 to 
> deduplicate them, but it does not.





[jira] [Updated] (FLINK-21992) Fix availability notification in UnionInputGate

2021-04-13 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-21992:

Summary: Fix availability notification in UnionInputGate  (was: Investigate 
potential buffer leak in unaligned checkpoint)

> Fix availability notification in UnionInputGate
> ---
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Piotr Nowojski
>Priority: Blocker
>
> A user on the mailing list reported that his job gets stuck with unaligned 
> checkpoints enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up, 
> so they were not as easy to diagnose as this one, where the initial report 
> already contains many relevant data points. 
> Besides a buffer leak, there could also be an issue with priority notification.





[jira] [Commented] (FLINK-21992) Investigate potential buffer leak in unaligned checkpoint

2021-04-13 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319986#comment-17319986
 ] 

Arvid Heise commented on FLINK-21992:
-

It turns out that there is an issue with notification. We managed to reliably 
reproduce it with:
* Unaligned checkpoints with
* Unions going into
* Two input tasks.

The root cause is a bug in {{UnionInputGate}} introduced in FLINK-19026. The 
availability notification of {{UnionInputGate}} is simply reset too early, leading 
to stuck tasks.

The bug can probably also be triggered with single input tasks, but there are 
certain factors that mask the bug: if you drain a union gate entirely 
without looking at availability after the first buffer, the bug would not be 
visible. Since we hot-loop in plenty of places until running out of data, it 
might be that only the combination of the three things actually makes it 
visible.
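
As a rough illustration of "reset too early", here is a toy model in plain Java. It is not the {{UnionInputGate}} code and not the actual fix: {{ToyGate}} and its {{resetTooEarly}} switch are hypothetical, and a {{CompletableFuture}} stands in for the availability notification. The buggy variant resets availability after every poll, even though a buffer is still queued, so a consumer that checks availability after the first buffer would wait although data remains. A consumer that simply polls until it gets null never notices.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

public class ToyGate {

    private final Queue<String> buffers = new ArrayDeque<>();
    private final boolean resetTooEarly;
    // Completed while the gate claims to have data, incomplete otherwise.
    private CompletableFuture<Void> available = new CompletableFuture<>();

    public ToyGate(boolean resetTooEarly) {
        this.resetTooEarly = resetTooEarly;
    }

    public void add(String buffer) {
        buffers.add(buffer);
        available.complete(null); // no-op if already completed
    }

    public CompletableFuture<Void> getAvailableFuture() {
        return available;
    }

    public String poll() {
        String next = buffers.poll();
        if (resetTooEarly) {
            // Buggy variant: availability is reset although data may still be queued.
            available = new CompletableFuture<>();
        } else if (buffers.isEmpty()) {
            // Sketch of the intended behaviour: reset only once the gate is drained.
            available = new CompletableFuture<>();
        }
        return next;
    }

    public static void main(String[] args) {
        for (boolean buggy : new boolean[] {true, false}) {
            ToyGate gate = new ToyGate(buggy);
            gate.add("a");
            gate.add("b");
            gate.poll(); // "b" is still queued after this call
            System.out.println("resetTooEarly=" + buggy
                    + " -> available after first poll: " + gate.getAvailableFuture().isDone());
        }
    }
}
```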

> Investigate potential buffer leak in unaligned checkpoint
> -
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Piotr Nowojski
>Priority: Blocker
>





[jira] [Commented] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure

2021-04-13 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319992#comment-17319992
 ] 

Arvid Heise commented on FLINK-22173:
-

I'm expecting some connection to FLINK-21992.

> UnalignedCheckpointRescaleITCase fails on azure
> ---
>
> Key: FLINK-22173
> URL: https://issues.apache.org/jira/browse/FLINK-22173
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232=logs=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865=9628
> {code}
> 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase
> 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no 
> scale union from 7 to 
> 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase)  
> Time elapsed: 607.467 s  <<< ERROR!
> 2021-04-08T23:25:56.3133586Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-08T23:25:56.3134070Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-08T23:25:56.3134643Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-08T23:25:56.3135577Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368)
> 2021-04-08T23:25:56.3138843Z  at 
> sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> 2021-04-08T23:25:56.3139402Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-08T23:25:56.3139880Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-08T23:25:56.3140328Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-08T23:25:56.3140844Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-08T23:25:56.3141768Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-08T23:25:56.3142272Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-08T23:25:56.3142706Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-08T23:25:56.3143142Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-08T23:25:56.3143608Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-08T23:25:56.3144039Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-08T23:25:56.3144434Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-08T23:25:56.3145027Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-08T23:25:56.3145484Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-08T23:25:56.3145981Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-08T23:25:56.3146421Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-08T23:25:56.3146843Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-08T23:25:56.3147274Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-08T23:25:56.3147692Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-08T23:25:56.3148116Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-08T23:25:56.3148543Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-08T23:25:56.3148930Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-08T23:25:56.3149298Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-08T23:25:56.3149663Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-08T23:25:56.3150075Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-08T23:25:56.3150488Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-08T23:25:56.3151148Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-08T23:25:56.3151691Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-08T23:25:56.3152115Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-08T23:25:56.3152534Z  

[jira] [Assigned] (FLINK-21992) Fix availability notification in UnionInputGate

2021-04-13 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-21992:
---

Assignee: Arvid Heise  (was: Piotr Nowojski)

> Fix availability notification in UnionInputGate
> ---
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Arvid Heise
> Assignee: Arvid Heise
> Priority: Blocker
>  Labels: pull-request-available
>
> A user on the mailing list reported that his job gets stuck with unaligned
> checkpoints enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up,
> so they were not as easy to diagnose as this one, where the initial report
> already contains many relevant data points.
> Besides a buffer leak, there could also be an issue with priority notification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-15 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321282#comment-17321282
 ] 

Arvid Heise edited comment on FLINK-22259 at 4/15/21, 7:25 AM:
---

-I guess the test assumes that the enumerator never fails (it has transient 
state). The test should also persist the transient state.-

The enumerator state is correct
{noformat}
snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, 
numCompletedCheckpoints=11}
{noformat}

It's just that the sync event never reaches the reader, which is caused by
FLINK-18071 (or FLINK-21996).


was (Author: aheise):
-I guess the test assumes that the enumerator never fails (it has transient 
state). The test should also persist the transient state.-

The enumerator state is correct
{noformat}
snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, 
numCompletedCheckpoints=11}
{noformat}

It's just that the sync event never reaches the reader, which is caused by
FLINK-21996.

> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> 

[jira] [Comment Edited] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-15 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321282#comment-17321282
 ] 

Arvid Heise edited comment on FLINK-22259 at 4/15/21, 7:08 AM:
---

-I guess the test assumes that the enumerator never fails (it has transient 
state). The test should also persist the transient state.-

The enumerator state is correct
{noformat}
snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, 
numCompletedCheckpoints=11}
{noformat}

It's just that the sync event never reaches the reader, which is caused by
FLINK-21996.


was (Author: aheise):
I guess the test assumes that the enumerator never fails (it has transient 
state). The test should also persist the transient state.

> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z  at 
> 

[jira] [Resolved] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure

2021-04-15 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-22173.
-
  Assignee: Arvid Heise
Resolution: Cannot Reproduce

> UnalignedCheckpointRescaleITCase fails on azure
> ---
>
> Key: FLINK-22173
> URL: https://issues.apache.org/jira/browse/FLINK-22173
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232&view=logs&j=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c&t=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865&l=9628
> {code}
> 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase
> 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no 
> scale union from 7 to 
> 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase)  
> Time elapsed: 607.467 s  <<< ERROR!
> 2021-04-08T23:25:56.3133586Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-08T23:25:56.3134070Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-08T23:25:56.3134643Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-08T23:25:56.3135577Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368)
> 2021-04-08T23:25:56.3138843Z  at 
> sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> 2021-04-08T23:25:56.3139402Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-08T23:25:56.3139880Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-08T23:25:56.3140328Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-08T23:25:56.3140844Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-08T23:25:56.3141768Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-08T23:25:56.3142272Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-08T23:25:56.3142706Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-08T23:25:56.3143142Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-08T23:25:56.3143608Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-08T23:25:56.3144039Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-08T23:25:56.3144434Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-08T23:25:56.3145027Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-08T23:25:56.3145484Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-08T23:25:56.3145981Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-08T23:25:56.3146421Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-08T23:25:56.3146843Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-08T23:25:56.3147274Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-08T23:25:56.3147692Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-08T23:25:56.3148116Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-08T23:25:56.3148543Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-08T23:25:56.3148930Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-08T23:25:56.3149298Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-08T23:25:56.3149663Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-08T23:25:56.3150075Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-08T23:25:56.3150488Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-08T23:25:56.3151148Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-08T23:25:56.3151691Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-08T23:25:56.3152115Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 

[jira] [Commented] (FLINK-22019) UnalignedCheckpointRescaleITCase hangs on azure

2021-04-15 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322118#comment-17322118
 ] 

Arvid Heise commented on FLINK-22019:
-

Let's wait for FLINK-21346 for more investigation.

> UnalignedCheckpointRescaleITCase hangs on azure
> ---
>
> Key: FLINK-22019
> URL: https://issues.apache.org/jira/browse/FLINK-22019
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Priority: Major
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15658&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=5360d54c-8d94-5d85-304e-a89267eb785a&l=9347





[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-15 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322030#comment-17322030
 ] 

Arvid Heise commented on FLINK-22259:
-

I'd close this issue in the hope that the upcoming fixes for FLINK-18071 and
FLINK-21996 will resolve it automatically. Please reopen if there are
failures in the next week.

> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9413097Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9413538Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9413964Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9414405Z  at 
> 

[jira] [Closed] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-15 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise closed FLINK-22259.
---
Resolution: Duplicate

> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Arvid Heise
> Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9413097Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9413538Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9413964Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9414405Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9414834Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9415263Z  at 
> 

[jira] [Commented] (FLINK-22173) UnalignedCheckpointRescaleITCase fails on azure

2021-04-15 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322038#comment-17322038
 ] 

Arvid Heise commented on FLINK-22173:
-

Given the frequency and the lack of logs, I'm closing it. There is a high
chance that this is also caused by either FLINK-21992 or FLINK-18071. Let's
reopen it if it still reappears after these fixes, hopefully once
FLINK-21346 gives us the needed logs.

> UnalignedCheckpointRescaleITCase fails on azure
> ---
>
> Key: FLINK-22173
> URL: https://issues.apache.org/jira/browse/FLINK-22173
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16232&view=logs&j=d8d26c26-7ec2-5ed2-772e-7a1a1eb8317c&t=be5fb08e-1ad7-563c-4f1a-a97ad4ce4865&l=9628
> {code}
> 2021-04-08T23:25:56.3131361Z [ERROR] Tests run: 31, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 839.623 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase
> 2021-04-08T23:25:56.3132784Z [ERROR] shouldRescaleUnalignedCheckpoint[no 
> scale union from 7 to 
> 7](org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase)  
> Time elapsed: 607.467 s  <<< ERROR!
> 2021-04-08T23:25:56.3133586Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-08T23:25:56.3134070Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-08T23:25:56.3134643Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-08T23:25:56.3135577Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointRescaleITCase.shouldRescaleUnalignedCheckpoint(UnalignedCheckpointRescaleITCase.java:368)
> 2021-04-08T23:25:56.3138843Z  at 
> sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> 2021-04-08T23:25:56.3139402Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-08T23:25:56.3139880Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-08T23:25:56.3140328Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-08T23:25:56.3140844Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-08T23:25:56.3141768Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-08T23:25:56.3142272Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-08T23:25:56.3142706Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-08T23:25:56.3143142Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-08T23:25:56.3143608Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-08T23:25:56.3144039Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-08T23:25:56.3144434Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-08T23:25:56.3145027Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-08T23:25:56.3145484Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-08T23:25:56.3145981Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-08T23:25:56.3146421Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-08T23:25:56.3146843Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-08T23:25:56.3147274Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-08T23:25:56.3147692Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-08T23:25:56.3148116Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-08T23:25:56.3148543Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-08T23:25:56.3148930Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-08T23:25:56.3149298Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-08T23:25:56.3149663Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-08T23:25:56.3150075Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-08T23:25:56.3150488Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-08T23:25:56.3151148Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 

[jira] [Created] (FLINK-22290) Add new unaligned checkpoint options to Python API.

2021-04-15 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-22290:
---

 Summary: Add new unaligned checkpoint options to Python API.
 Key: FLINK-22290
 URL: https://issues.apache.org/jira/browse/FLINK-22290
 Project: Flink
  Issue Type: Improvement
  Components: API / Python, Runtime / Checkpointing
Reporter: Arvid Heise
Assignee: Arvid Heise
 Fix For: 1.13.0


There is currently no Python equivalent of

{noformat}
CheckpointConfig#setAlignmentTimeout
CheckpointConfig#setForceUnalignedCheckpoints
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-22290) Add new unaligned checkpoint options to Python API.

2021-04-15 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22290:

Affects Version/s: 1.13.0

> Add new unaligned checkpoint options to Python API.
> ---
>
> Key: FLINK-22290
> URL: https://issues.apache.org/jira/browse/FLINK-22290
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python, Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
> Fix For: 1.13.0
>
>
> There is currently no Python equivalent of
> {noformat}
> CheckpointConfig#setAlignmentTimeout
> CheckpointConfig#setForceUnalignedCheckpoints
> {noformat}





[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-10 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318485#comment-17318485
 ] 

Arvid Heise commented on FLINK-20816:
-

Downgraded to Major as it's a test-only issue.

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is no longer used and should have affected only unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Updated] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-10 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-20816:

Priority: Major  (was: Critical)

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is no longer used and should have affected only unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Updated] (FLINK-19801) Add support for virtual channels

2021-04-12 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-19801:

Release Note:   (was: While recovering from unaligned checkpoints, users 
can now change the parallelism of the job. This change allows users to quickly 
upscale the job under backpressure.)

> Add support for virtual channels
> 
>
> Key: FLINK-19801
> URL: https://issues.apache.org/jira/browse/FLINK-19801
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> During rescaling of unaligned checkpoints, if state from multiple former 
> channels is read on the input or output side to recover a specific channel, 
> these buffers are multiplexed on the output side and demultiplexed on the 
> input side to guarantee a consistent recovery of spanning records:
> Assume two channels C1 and C2 connect operators A and B, and each has one 
> buffer in the output part and one in the input part of the channel, with a 
> record spanning them. Name the buffers O1 for the output buffer of C1, I2 for 
> the input buffer of C2, and so on. After rescaling, both channels become one 
> channel C, and the buffers may be restored as I1, I2, O1, O2.
> Channels use the mapping of FLINK-19533 to infer the need for virtual 
> channels and distribute the needed resources. Virtual channels are removed on 
> the EndOfChannelRecovery epoch marker.
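
The demultiplexing step described above can be sketched roughly as follows. This is a minimal illustration, not Flink's implementation: the names `VirtualChannelSketch`, `TaggedBuffer`, and `Demultiplexer` are hypothetical, buffers are modeled as strings, and a newline stands in for the record boundary. The point is only that each former channel gets its own accumulator, so a record spanning I1 and O1 is not corrupted by I2 arriving in between.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these types exist in Flink under these names. */
final class VirtualChannelSketch {

    /** A recovered buffer tagged with the former channel it belonged to. */
    static final class TaggedBuffer {
        final String formerChannel;
        final String data;

        TaggedBuffer(String formerChannel, String data) {
            this.formerChannel = formerChannel;
            this.data = data;
        }
    }

    /** Keeps one partial-record accumulator ("virtual channel") per former channel. */
    static final class Demultiplexer {
        private final Map<String, StringBuilder> virtualChannels = new HashMap<>();

        /** Appends the buffer to its virtual channel and returns the records it completed. */
        List<String> onBuffer(TaggedBuffer buffer) {
            StringBuilder partial =
                    virtualChannels.computeIfAbsent(buffer.formerChannel, k -> new StringBuilder());
            partial.append(buffer.data);
            List<String> completed = new ArrayList<>();
            int end;
            // Assumption: '\n' terminates a record; this is not Flink's actual record framing.
            while ((end = partial.indexOf("\n")) >= 0) {
                completed.add(partial.substring(0, end));
                partial.delete(0, end + 1);
            }
            return completed;
        }

        /** Drops all virtual channels once recovery has finished. */
        void onEndOfChannelRecovery() {
            virtualChannels.clear();
        }
    }

    /** Feeds the buffers in restore order and collects the reassembled records. */
    static List<String> recover(List<TaggedBuffer> restored) {
        Demultiplexer demux = new Demultiplexer();
        List<String> records = new ArrayList<>();
        for (TaggedBuffer buffer : restored) {
            records.addAll(demux.onBuffer(buffer));
        }
        demux.onEndOfChannelRecovery();
        return records;
    }
}
```

Feeding the buffers in the order I1, I2, O1, O2 yields one intact record per former channel, whereas naive concatenation on the merged channel C would splice the head of the C1 record onto the head of the C2 record.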





[jira] [Resolved] (FLINK-17979) Support rescaling for Unaligned Checkpoints

2021-04-12 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-17979.
-
Resolution: Fixed

Solved with FLINK-19533, FLINK-19801, and FLINK-21945.

> Support rescaling for Unaligned Checkpoints
> ---
>
> Key: FLINK-17979
> URL: https://issues.apache.org/jira/browse/FLINK-17979
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Roman Khachatryan
>Priority: Major
> Fix For: 1.13.0
>
>
> This is one of the limitations of Unaligned Checkpoints MVP.
> (see [Unaligned checkpoints: recovery & 
> rescaling|https://docs.google.com/document/d/1T2WB163uf8xt6Eu2JS0Jyy2XZyF4YpnzGiHlo6twrks/edit?usp=sharing]
>  for possible options)





[jira] [Updated] (FLINK-17979) Support rescaling for Unaligned Checkpoints

2021-04-12 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-17979:

Release Note: While recovering from unaligned checkpoints, users can now 
change the parallelism of the job. This change allows users to quickly upscale 
the job under backpressure.

> Support rescaling for Unaligned Checkpoints
> ---
>
> Key: FLINK-17979
> URL: https://issues.apache.org/jira/browse/FLINK-17979
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Roman Khachatryan
>Priority: Major
> Fix For: 1.13.0
>
>
> This is one of the limitations of Unaligned Checkpoints MVP.
> (see [Unaligned checkpoints: recovery & 
> rescaling|https://docs.google.com/document/d/1T2WB163uf8xt6Eu2JS0Jyy2XZyF4YpnzGiHlo6twrks/edit?usp=sharing]
>  for possible options)





[jira] [Resolved] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-12 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-20816.
-
Resolution: Fixed

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is no longer used and should have affected only unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-12 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319664#comment-17319664
 ] 

Arvid Heise commented on FLINK-20816:
-

Merged into master as fad4874f9866de7d3c2f5fb3a473f4df744c8159 and into 1.12 as 
a1ee66d9ef9a14414b9c0fee9288a94685740471.

> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is no longer used and should have affected only unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-13 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320466#comment-17320466
 ] 

Arvid Heise commented on FLINK-22259:
-

Seems to be test-only:

{noformat}
07:36:19,541 [SourceCoordinator-Source: source] INFO  
org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase [] - 
snapshotState EnumeratorState{unassignedSplits=[], numRestarts=5, 
numCompletedCheckpoints=14187}
{noformat}

We have enough restarts and completed checkpoints but for some reason the test 
is not finishing.


> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9413097Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9413538Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 

[jira] [Commented] (FLINK-21992) Fix availability notification in UnionInputGate

2021-04-14 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321096#comment-17321096
 ] 

Arvid Heise commented on FLINK-21992:
-

Merged into master as 7c3abe11a28d54a585985ef908f36d8cf5857e14.

> Fix availability notification in UnionInputGate
> ---
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.13.0, 1.12.3
>
>
> A user on the mailing list reported that their job gets stuck with unaligned 
> checkpoints enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up, 
> so they were harder to diagnose than this one, where the initial report 
> already contains many relevant data points. 
> Besides a buffer leak, there could also be an issue with priority notification.





[jira] [Commented] (FLINK-21992) Fix availability notification in UnionInputGate

2021-04-14 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321279#comment-17321279
 ] 

Arvid Heise commented on FLINK-21992:
-

Merged into 1.12 as e2cbfad6a3cdafe3d568bb43a1d048f9533b29ec.

> Fix availability notification in UnionInputGate
> ---
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.13.0, 1.12.3
>
>
> A user on the mailing list reported that their job gets stuck with unaligned 
> checkpoints enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up, 
> so they were harder to diagnose than this one, where the initial report 
> already contains many relevant data points. 
> Besides a buffer leak, there could also be an issue with priority notification.





[jira] [Resolved] (FLINK-21992) Fix availability notification in UnionInputGate

2021-04-14 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-21992.
-
Resolution: Fixed

> Fix availability notification in UnionInputGate
> ---
>
> Key: FLINK-21992
> URL: https://issues.apache.org/jira/browse/FLINK-21992
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.13.0, 1.12.3
>
>
> A user on the mailing list reported that their job gets stuck with unaligned 
> checkpoints enabled.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Source-Operators-Stuck-in-the-requestBufferBuilderBlocking-tt42530.html
> We received two similar reports in the past, but the users didn't follow up, 
> so they were harder to diagnose than this one, where the initial report 
> already contains many relevant data points. 
> Besides a buffer leak, there could also be an issue with priority notification.





[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-14 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321282#comment-17321282
 ] 

Arvid Heise commented on FLINK-22259:
-

I guess the test assumes that the enumerator never fails (it has transient 
state). The test should also persist the transient state.
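
A minimal sketch of what persisting that state could look like. The class {{RestartCountingEnumerator}} and its methods are hypothetical stand-ins for illustration; this is neither the actual test code nor Flink's enumerator API. The idea is only that the counter travels with the checkpoint instead of living in a field that is lost when the enumerator is recreated.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for the test enumerator; not Flink's actual API. */
final class RestartCountingEnumerator {

    // State that would be lost on failure if it were kept transient.
    private int numRestarts;

    RestartCountingEnumerator(int numRestarts) {
        this.numRestarts = numRestarts;
    }

    void onRestart() {
        numRestarts++;
    }

    int getNumRestarts() {
        return numRestarts;
    }

    /** Serializes the counter so that it becomes part of the checkpoint. */
    byte[] snapshotState() {
        return ByteBuffer.allocate(Integer.BYTES).putInt(numRestarts).array();
    }

    /** Rebuilds the enumerator from checkpointed bytes after a failure. */
    static RestartCountingEnumerator restore(byte[] checkpoint) {
        return new RestartCountingEnumerator(ByteBuffer.wrap(checkpoint).getInt());
    }

    public static void main(String[] args) {
        RestartCountingEnumerator enumerator = new RestartCountingEnumerator(0);
        for (int i = 0; i < 5; i++) {
            enumerator.onRestart();
        }
        // Simulate an enumerator failure: only the snapshot survives.
        RestartCountingEnumerator recovered = restore(enumerator.snapshotState());
        System.out.println(recovered.getNumRestarts()); // prints 5
    }
}
```

With a scheme like this, a recreated enumerator would resume at the checkpointed count rather than at 0.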

> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9413097Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9413538Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9413964Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9414405Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 

[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-14 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321208#comment-17321208
 ] 

Arvid Heise commented on FLINK-22259:
-

The source is not finishing because {{numRestarts}} is out of sync (it is at 0 
but should be at 5).

> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9413097Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9413538Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9413964Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9414405Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9414834Z  at 
> 

[jira] [Commented] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-14 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321280#comment-17321280
 ] 

Arvid Heise commented on FLINK-22259:
-

It seems as if the {{SyncEvent}} is not sent from the coordinator although it 
should be.

I also found this exception, which I have not seen before:

{noformat}
07:12:41,454 [Checkpoint Timer] WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 320293149b12a901fb4f3750349040db.)
org.apache.flink.runtime.checkpoint.CheckpointException: Coordinator state not 
acknowledged successfully: DISCARDED Failure reason: Trigger checkpoint failure.
at 
org.apache.flink.runtime.checkpoint.OperatorCoordinatorCheckpoints.acknowledgeAllCoordinators(OperatorCoordinatorCheckpoints.java:125)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.OperatorCoordinatorCheckpoints.lambda$triggerAndAcknowledgeAllCoordinatorCheckpoints$1(OperatorCoordinatorCheckpoints.java:86)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) 
~[?:1.8.0_282]
at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
 ~[?:1.8.0_282]
at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
 [?:1.8.0_282]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[?:1.8.0_282]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
[?:1.8.0_282]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 [?:1.8.0_282]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 [?:1.8.0_282]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_282]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_282]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint 
Coordinator is suspending.
at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:532)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1920)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1907)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1782)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1765)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingAndQueuedCheckpoints(CheckpointCoordinator.java:1965)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1748)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1434)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1048)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1020)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:569)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:269)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:250)
 ~[flink-runtime_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:233)
 

[jira] [Assigned] (FLINK-22259) UnalignedCheckpointITCase fails with "Value too large for header, this indicates that the test is running too long"

2021-04-14 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22259:
---

Assignee: Arvid Heise

> UnalignedCheckpointITCase fails with "Value too large for header, this 
> indicates that the test is running too long"
> ---
>
> Key: FLINK-22259
> URL: https://issues.apache.org/jira/browse/FLINK-22259
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Major
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16419=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2=9672]
>  
> {code:java}
> 2021-04-13T07:37:31.9388503Z [ERROR] execute[pipeline with remote channels, p 
> = 1, timeout = 
> 0](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 1,420.285 s  <<< ERROR!
> 2021-04-13T07:37:31.9395135Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-04-13T07:37:31.9395717Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-04-13T07:37:31.9396274Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:168)
> 2021-04-13T07:37:31.9396866Z  at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:274)
> 2021-04-13T07:37:31.9397318Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2021-04-13T07:37:31.9397723Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2021-04-13T07:37:31.9398312Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2021-04-13T07:37:31.9398724Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2021-04-13T07:37:31.9401916Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2021-04-13T07:37:31.9402764Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2021-04-13T07:37:31.9403756Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2021-04-13T07:37:31.9404222Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2021-04-13T07:37:31.9404624Z  at 
> org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
> 2021-04-13T07:37:31.9405008Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9405449Z  at 
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> 2021-04-13T07:37:31.9405855Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2021-04-13T07:37:31.9406362Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2021-04-13T07:37:31.9406774Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2021-04-13T07:37:31.9407512Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2021-04-13T07:37:31.9408202Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2021-04-13T07:37:31.9408655Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9409083Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9409521Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9410114Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9410775Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9411398Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2021-04-13T07:37:31.9411914Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2021-04-13T07:37:31.9412292Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2021-04-13T07:37:31.9412670Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2021-04-13T07:37:31.9413097Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2021-04-13T07:37:31.9413538Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2021-04-13T07:37:31.9413964Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2021-04-13T07:37:31.9414405Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2021-04-13T07:37:31.9414834Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2021-04-13T07:37:31.9415263Z  at 
> 

[jira] [Comment Edited] (FLINK-22368) UnalignedCheckpointITCase hangs on azure

2021-04-20 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325641#comment-17325641
 ] 

Arvid Heise edited comment on FLINK-22368 at 4/20/21, 9:30 AM:
---

The test doesn't finish as checkpointing gets stuck in the last execution 
attempt (5):


{noformat}
23:02:26,104 [flink-akka.actor.default-dispatcher-4] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Job Flink 
Streaming Job (5d70bcb288d90589845e39c2953b27c3) switched from state RESTARTING 
to RUNNING.
23:02:26,118 [flink-akka.actor.default-dispatcher-4] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
source (2/20) (2d3357d530b11041d123bde87da7584b) switched from INITIALIZING to 
RUNNING.
... (in total all 100 tasks are RUNNING)
23:02:26,347 [flink-akka.actor.default-dispatcher-2] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - failing-map 
(10/20) (23870b8b94e5ea774ca3da72a7ca7251) switched from INITIALIZING to 
RUNNING.
...
23:02:27,165 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of 
job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
23:02:28,165 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of 
job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
... (in total 10k failed to trigger...)
01:55:56,165 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of 
job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
{noformat}





> UnalignedCheckpointITCase hangs on azure
> 
>
> Key: FLINK-22368
> URL: https://issues.apache.org/jira/browse/FLINK-22368
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144





[jira] [Commented] (FLINK-22368) UnalignedCheckpointITCase hangs on azure

2021-04-20 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325641#comment-17325641
 ] 

Arvid Heise commented on FLINK-22368:
-

The test doesn't finish as checkpointing gets stuck in the last execution 
attempt (5):


{noformat}
23:02:26,104 [flink-akka.actor.default-dispatcher-4] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Job Flink 
Streaming Job (5d70bcb288d90589845e39c2953b27c3) switched from state RESTARTING 
to RUNNING.
23:02:26,118 [flink-akka.actor.default-dispatcher-4] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
source (2/20) (2d3357d530b11041d123bde87da7584b) switched from INITIALIZING to 
RUNNING.
... (in total 100 tasks are RUNNING)
23:02:26,347 [flink-akka.actor.default-dispatcher-2] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - failing-map 
(10/20) (23870b8b94e5ea774ca3da72a7ca7251) switched from INITIALIZING to 
RUNNING.
...
23:02:27,165 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of 
job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
23:02:28,165 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of 
job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
... (in total 10k failed to trigger...)
01:55:56,165 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Failed to 
trigger checkpoint for job 5d70bcb288d90589845e39c2953b27c3 since some tasks of 
job 5d70bcb288d90589845e39c2953b27c3 has been finished, abort the checkpoint 
Failure reason: Not all required tasks are currently running.
{noformat}


> UnalignedCheckpointITCase hangs on azure
> 
>
> Key: FLINK-22368
> URL: https://issues.apache.org/jira/browse/FLINK-22368
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144





[jira] [Updated] (FLINK-22368) UnalignedCheckpointITCase hangs on azure

2021-04-20 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22368:

Component/s: Runtime / Checkpointing

> UnalignedCheckpointITCase hangs on azure
> 
>
> Key: FLINK-22368
> URL: https://issues.apache.org/jira/browse/FLINK-22368
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144





[jira] [Commented] (FLINK-22368) UnalignedCheckpointITCase hangs on azure

2021-04-20 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325647#comment-17325647
 ] 

Arvid Heise commented on FLINK-22368:
-

Okay, it makes sense that it doesn't checkpoint, as a source task has already 
finished:

{noformat}
23:02:26,166 [Source: source (3/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (3/20)#5 (f45d32db48b407a34edc6dc048c5e0c2) switched from RUNNING to 
FINISHED.
{noformat}

I'm investigating further.
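
For illustration, the precondition implied by the "Not all required tasks are currently running" message can be modeled roughly as below. This is a simplified sketch under my own assumptions, with a hypothetical class {{CheckpointTriggerCheck}}; it is not the code of the {{CheckpointCoordinator}}.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the trigger precondition; not Flink's actual code. */
final class CheckpointTriggerCheck {

    /** Task states as they appear in the log excerpts. */
    enum TaskState { INITIALIZING, RUNNING, FINISHED }

    /** Returns true only if every task is currently RUNNING. */
    static boolean canTriggerCheckpoint(List<TaskState> tasks) {
        for (TaskState state : tasks) {
            if (state != TaskState.RUNNING) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One source subtask has finished while the other tasks keep running.
        List<TaskState> attempt =
                Arrays.asList(TaskState.FINISHED, TaskState.RUNNING, TaskState.RUNNING);
        System.out.println(canTriggerCheckpoint(attempt)); // prints false
    }
}
```

Under this model a single FINISHED task is enough to abort every subsequent trigger attempt, which matches the repeated failures in the log.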

> UnalignedCheckpointITCase hangs on azure
> 
>
> Key: FLINK-22368
> URL: https://issues.apache.org/jira/browse/FLINK-22368
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144





[jira] [Updated] (FLINK-22368) UnalignedCheckpointITCase hangs on azure

2021-04-20 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22368:

Component/s: (was: Runtime / Checkpointing)
 Runtime / Task

> UnalignedCheckpointITCase hangs on azure
> 
>
> Key: FLINK-22368
> URL: https://issues.apache.org/jira/browse/FLINK-22368
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818=logs=b0a398c0-685b-599c-eb57-c8c2a771138e=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc=10144





[jira] [Commented] (FLINK-22368) UnalignedCheckpointITCase hangs on azure

2021-04-20 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325658#comment-17325658
 ] 

Arvid Heise commented on FLINK-22368:
-

All source tasks are finished, but the job is not finishing for some reason:

{noformat}
23:02:26,166 [Source: source (3/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (3/20)#5 (f45d32db48b407a34edc6dc048c5e0c2) switched from RUNNING to 
FINISHED.
23:02:26,167 [Source: source (8/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (8/20)#5 (53d62fcb80e6b2e5ee7657033a555d6f) switched from RUNNING to 
FINISHED.
23:02:26,166 [Source: source (7/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (7/20)#5 (1147339526bf7fadcd47ef579c4c4130) switched from RUNNING to 
FINISHED.
23:02:26,168 [Source: source (2/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (2/20)#5 (2d3357d530b11041d123bde87da7584b) switched from RUNNING to 
FINISHED.
23:02:26,170 [Source: source (9/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (9/20)#5 (426fbb3a5b561a61affed4c40b0a8f8a) switched from RUNNING to 
FINISHED.
23:02:26,171 [Source: source (1/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (1/20)#5 (a5585ddf277047a6d2b67d8d3cf2cd0e) switched from RUNNING to 
FINISHED.
23:02:26,213 [Source: source (5/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (5/20)#5 (9ad44b9747f266816750e98063306fc4) switched from RUNNING to 
FINISHED.
23:02:26,216 [Source: source (20/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (20/20)#5 (d92ff5256f0c3e1e2a43c7413f6ce71f) switched from RUNNING to 
FINISHED.
23:02:26,215 [Source: source (13/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (13/20)#5 (327096e9fc8562fdc0ea12e031f98749) switched from RUNNING to 
FINISHED.
23:02:26,215 [Source: source (10/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (10/20)#5 (456858f8cd5fca51c09519f29b21641e) switched from RUNNING to 
FINISHED.
23:02:26,223 [Source: source (18/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (18/20)#5 (8a75c4257e592a008bca8f0a29c5e856) switched from RUNNING to 
FINISHED.
23:02:26,223 [Source: source (16/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (16/20)#5 (c17c62b90c66f7bd11debe913916fd89) switched from RUNNING to 
FINISHED.
23:02:26,225 [Source: source (11/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (11/20)#5 (dd463d97b55ee56bcb6e2853760e3daf) switched from RUNNING to 
FINISHED.
23:02:26,239 [Source: source (19/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (19/20)#5 (202827bf688aa5c206bd3b900ad3beb2) switched from RUNNING to 
FINISHED.
23:02:26,240 [Source: source (4/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (4/20)#5 (475c75d73150c1d20eaaa185f96c81e1) switched from RUNNING to 
FINISHED.
23:02:26,245 [Source: source (17/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (17/20)#5 (afb774c210e06bd3e62d42e749c22417) switched from RUNNING to 
FINISHED.
23:02:26,245 [Source: source (6/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (6/20)#5 (776d580fcfd5ddf38643822311884d70) switched from RUNNING to 
FINISHED.
23:02:26,246 [Source: source (14/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (14/20)#5 (7eccee9beab3b7e6b5977d5f726e9c9e) switched from RUNNING to 
FINISHED.
23:02:26,245 [Source: source (15/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (15/20)#5 (45df84d24e611dfd7d972795548a7f33) switched from RUNNING to 
FINISHED.
23:02:26,252 [Source: source (12/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Source: 
source (12/20)#5 (65cad528fa7a6e0fe6822c4677d8797a) switched from RUNNING to 
FINISHED.
{noformat}

{{failing-map (8/20)#5}} is not finishing and, naturally, neither are any of the sinks.

I found
{noformat}
23:02:26,422 [failing-map (8/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - failing-map 
(8/20)#5 (77221af17305e76caafacbf7bc696af7) switched from RUNNING to CANCELED.
23:02:26,425 [failing-map (8/20)#5] INFO  
org.apache.flink.runtime.taskmanager.Task[] - Freeing task 
resources for failing-map (8/20)#5 

[jira] [Resolved] (FLINK-22290) Add new unaligned checkpoint options to Python API.

2021-04-16 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-22290.
-
Resolution: Fixed

> Add new unaligned checkpoint options to Python API.
> ---
>
> Key: FLINK-22290
> URL: https://issues.apache.org/jira/browse/FLINK-22290
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python, Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> There is currently no Python equivalent of
> {noformat}
> CheckpointConfig#setAlignmentTimeout
> CheckpointConfig#setForceUnalignedCheckpoints
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22290) Add new unaligned checkpoint options to Python API.

2021-04-16 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323936#comment-17323936
 ] 

Arvid Heise commented on FLINK-22290:
-

Merged as 
52b400f717dc9a5c903a9b834d02b0cbf609897b..de069df949866bf332ba006b244428082f886f55
 into master.

> Add new unaligned checkpoint options to Python API.
> ---
>
> Key: FLINK-22290
> URL: https://issues.apache.org/jira/browse/FLINK-22290
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python, Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> There is currently no Python equivalent of
> {noformat}
> CheckpointConfig#setAlignmentTimeout
> CheckpointConfig#setForceUnalignedCheckpoints
> {noformat}





[jira] [Commented] (FLINK-20816) NotifyCheckpointAbortedITCase failed due to timeout

2021-04-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318134#comment-17318134
 ] 

Arvid Heise commented on FLINK-20816:
-

I found the root cause: the test implicitly assumes that the abort of chk-1 
happens before the sync phase of chk-2. If the abort arrives late, the mail that 
handles it is never executed because the waitLatch blocks the mailbox thread.

It's easy to reproduce by adding a sleep to 

{noformat}
private void CheckpointCoordinator#sendAbortedMessages(
        List tasksToAbort, long checkpointId, long timeStamp) {
    // send notification of aborted checkpoints asynchronously.
    executor.execute(
            () -> {
                // Added for reproduction: delay the abort notification.
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
{noformat}
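
The blocking behaviour described above can be modelled outside of Flink. The following is only a sketch under stated assumptions, not the test's or Flink's actual code: a single-threaded executor stands in for the mailbox thread, and the names {{MailboxBlockSketch}}, {{abortMailRuns}}, {{abortMail}} and {{syncPhase}} are hypothetical. It also assumes that the latch is released only once the abort has been observed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model, not Flink code: a single-threaded executor stands in
// for the task's mailbox thread.
class MailboxBlockSketch {

    /** Returns true if the abort mail was executed within the wait time. */
    static boolean abortMailRuns(boolean abortArrivesBeforeSyncPhase) {
        ExecutorService mailbox = Executors.newSingleThreadExecutor();
        CountDownLatch waitLatch = new CountDownLatch(1);    // blocks the sync phase
        CountDownLatch abortHandled = new CountDownLatch(1); // set by the abort mail

        Runnable abortMail = abortHandled::countDown;
        Runnable syncPhase = () -> {
            try {
                waitLatch.await(); // occupies the only mailbox thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        if (abortArrivesBeforeSyncPhase) {
            mailbox.execute(abortMail);
            mailbox.execute(syncPhase);
        } else {
            mailbox.execute(syncPhase);
            mailbox.execute(abortMail); // queued behind the blocked sync phase
        }

        boolean handled = false;
        try {
            handled = abortHandled.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (handled) {
            // Assumption: the latch is released only after the abort was seen.
            waitLatch.countDown();
        }
        mailbox.shutdownNow(); // interrupt a still-blocked sync phase, drop queued mails
        return handled;
    }
}
```

With {{abortMailRuns(true)}} the abort is handled before the sync phase blocks; with {{abortMailRuns(false)}} the abort mail stays queued behind the blocked sync phase, which corresponds to the timeout seen in the test.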


> NotifyCheckpointAbortedITCase failed due to timeout
> ---
>
> Key: FLINK-20816
> URL: https://issues.apache.org/jira/browse/FLINK-20816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.2, 1.13.0
>Reporter: Matthias
>Assignee: Arvid Heise
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.13.0
>
> Attachments: flink-20816-failure.log, flink-20816-success.log
>
>
> [This 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=152=logs=0a15d512-44ac-5ba5-97ab-13a5d066c22c=634cd701-c189-5dff-24cb-606ed884db87=4245]
>  failed because {{NotifyCheckpointAbortedITCase}} timed out.
> {code}
> 2020-12-29T21:48:40.9430511Z [INFO] Running 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087043Z [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 107.062 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase
> 2020-12-29T21:50:28.0087961Z [ERROR] 
> testNotifyCheckpointAborted[unalignedCheckpointEnabled 
> =true](org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase)  
> Time elapsed: 104.044 s  <<< ERROR!
> 2020-12-29T21:50:28.0088619Z org.junit.runners.model.TestTimedOutException: 
> test timed out after 10 milliseconds
> 2020-12-29T21:50:28.0088972Z  at java.lang.Object.wait(Native Method)
> 2020-12-29T21:50:28.0089267Z  at java.lang.Object.wait(Object.java:502)
> 2020-12-29T21:50:28.0089633Z  at 
> org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> 2020-12-29T21:50:28.0090458Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.verifyAllOperatorsNotifyAborted(NotifyCheckpointAbortedITCase.java:200)
> 2020-12-29T21:50:28.0091313Z  at 
> org.apache.flink.test.checkpointing.NotifyCheckpointAbortedITCase.testNotifyCheckpointAborted(NotifyCheckpointAbortedITCase.java:183)
> 2020-12-29T21:50:28.0091819Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-12-29T21:50:28.0092199Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-12-29T21:50:28.0092675Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-12-29T21:50:28.0093095Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-12-29T21:50:28.0093495Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-12-29T21:50:28.0093980Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-12-29T21:50:28.009Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-12-29T21:50:28.0094917Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-12-29T21:50:28.0095663Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> 2020-12-29T21:50:28.0096221Z  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> 2020-12-29T21:50:28.0096675Z  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-12-29T21:50:28.0097022Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> The branch contained changes from FLINK-20594 and FLINK-20595. These issues 
> remove code that is no longer used and should only have affected unit 
> tests. [The previous 
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=151=results]
>  containing all the changes except for 
> [9c57c37|https://github.com/XComp/flink/commit/9c57c37c50733a1f592a4fc5e492b22be80d8279]
>  passed.





[jira] [Updated] (FLINK-22136) Devise application for unaligned checkpoint test on cluster

2021-04-08 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-22136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise updated FLINK-22136:

Description: 
To test unaligned checkpoints, we should use a few different applications that 
use different features:

* Mixing forward/rescale channels with keyby or other shuffle operations
* Unions
* 2 or n-ary operators
* Associated state ((keyed) process function)
* Correctness verifications

The sinks should not be mocked but rather should be able to induce a fair 
amount of backpressure into the system. Quite possibly, it would be a good idea 
to have a way to add more backpressure to the sink by running the respective 
system on the cluster and be able to add/remove parallel instances.

Things to check in the application
* Inflight data is restored to the correct keygroups -> can be checked with 
keyed state in a process function
* Correctness: Completeness (no lost records) + no duplicates
* Ordering of data for keyed exchanges (we guarantee that records with the 
same key retain their order across keyed operators)
* (To detect errors early, we can also use magic headers)
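
The completeness, duplicate and ordering checks listed above could be sketched roughly as follows. This is an illustration only and assumes that every record carries a per-key sequence number starting at 0; the names {{SequenceVerifierSketch}}, {{Verdict}} and {{check}} are hypothetical, and a plain map stands in for the keyed state that a process function would use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical verifier: tracks the last sequence number seen for each key.
class SequenceVerifierSketch {

    enum Verdict { OK, DUPLICATE_OR_REORDERED, GAP }

    // Stand-in for keyed state; key -> highest sequence number seen so far.
    private final Map<String, Long> lastSeqPerKey = new HashMap<>();

    Verdict check(String key, long seq) {
        long last = lastSeqPerKey.getOrDefault(key, -1L);
        if (seq <= last) {
            // Seen before or arrived after a later record of the same key.
            return Verdict.DUPLICATE_OR_REORDERED;
        }
        lastSeqPerKey.put(key, seq);
        // A jump of more than one means records are missing or still to come.
        return seq == last + 1 ? Verdict.OK : Verdict.GAP;
    }
}
```

A GAP alone does not prove a lost record, since the missing record may still arrive out of order; in that case it would then be reported as DUPLICATE_OR_REORDERED.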


  was:
To test unaligned checkpoints, we should use a few different applications that 
use different features:

* Mixing forward/rescale channels with keyby or other shuffle operations
* Unions
* 2 or n-ary operators
* Associated state ((keyed) process function)
* Correctness verifications

The sinks should not be mocked but rather should be able to induce a fair 
amount of backpressure into the system. Quite possibly, it would be a good idea 
to have a way to add more backpressure to the sink by running the respective 
system on the cluster and be able to add/remove parallel instances.




> Devise application for unaligned checkpoint test on cluster
> ---
>
> Key: FLINK-22136
> URL: https://issues.apache.org/jira/browse/FLINK-22136
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Arvid Heise
>Priority: Major
>
> To test unaligned checkpoints, we should use a few different applications 
> that use different features:
> * Mixing forward/rescale channels with keyby or other shuffle operations
> * Unions
> * 2 or n-ary operators
> * Associated state ((keyed) process function)
> * Correctness verifications
> The sinks should not be mocked but rather should be able to induce a fair 
> amount of backpressure into the system. Quite possibly, it would be a good 
> idea to have a way to add more backpressure to the sink by running the 
> respective system on the cluster and be able to add/remove parallel instances.
> Things to check in the application
> * Inflight data is restored to the correct keygroups -> can be checked with 
> keyed state in a process function
> * Correctness: Completeness (no lost records) + no duplicates
> * Ordering of data for keyed exchanges (we guarantee that records with the 
> same key retain their order across keyed operators)
> * (To detect errors early, we can also use magic headers)




