[ https://issues.apache.org/jira/browse/APEXCORE-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970439#comment-15970439 ]
ASF GitHub Bot commented on APEXCORE-703:
-----------------------------------------

GitHub user vrozov opened a pull request:

    https://github.com/apache/apex-core/pull/516

    APEXCORE-703 Window processing timeout for finished/undeployed container.

    During an operator shutdown, mark it as INACTIVE to exclude it from the blocked operators check.

    @tweise Please review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vrozov/apex-core APEXCORE-703

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-core/pull/516.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #516

----
commit 0ebc23ee0e40f5259098f538a1b9cea4aeba9794
Author: Vlad Rozov <v.ro...@datatorrent.com>
Date:   2017-04-16T16:34:09Z

    APEXCORE-703 Window processing timeout for finished/undeployed container. During an operator shutdown mark it as INACTIVE to exclude it from the blocked operators check.

----


> Window processing timeout for finished/undeployed container
> -----------------------------------------------------------
>
>                 Key: APEXCORE-703
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-703
>             Project: Apache Apex Core
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Daniel Halperin
>            Assignee: Vlad Rozov
>
> Using Apex 3.5.0 with Apache Beam, I have a 10-container pipeline. The first
> container, id #1, finishes and gets undeployed at 12:41:10 PM.
> Then, 60s later (at 12:42:10 PM), Apex decides that container is blocked
> because no data has been received for 60s, declares failure, and restarts it.
> This would seem to be a bug -- shouldn't finished and undeployed operators be
> deregistered from the timeout logic that is detecting stuck operators?
> Log below
> {code}
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Undeploy request: [1]
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer undeploy
> INFO: Undeploy complete.
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198930012, last window id change time 1492198869957, window processing timeout millis 60000
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateCheckpoints
> INFO: Blocked operator PTOperator[id=1,name=TextIO.Read/Read] container PTContainer[id=1(container-6),state=ACTIVE] time 60055ms
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Received shutdown request
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StramLocalCluster run
> INFO: Container container-6 restart.
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager scheduleContainerRestart
> INFO: Initiating recovery for container-6@localhost
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198931015, last window id change time 1492198869957, window processing timeout millis 60000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
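
For readers following the thread, the behavior in the log and the change described in the pull request can be illustrated with a small standalone sketch. This is not the code from PR #516 or from StreamingContainerManager: the class `BlockedOperatorCheck`, the `OperatorInfo` type, the two-value `State` enum, and the `isBlocked` method are all hypothetical names invented for illustration. It only assumes that the check compares the time since the last window id change against the timeout (as the logged numbers suggest) and that an operator marked INACTIVE is skipped.

```java
public class BlockedOperatorCheck {

    // Hypothetical, simplified operator states (illustration only).
    enum State { ACTIVE, INACTIVE }

    // Hypothetical stand-in for the per-operator bookkeeping kept by the master.
    static final class OperatorInfo {
        final int id;
        final State state;
        final long lastWindowIdChangeMillis;

        OperatorInfo(int id, State state, long lastWindowIdChangeMillis) {
            this.id = id;
            this.state = state;
            this.lastWindowIdChangeMillis = lastWindowIdChangeMillis;
        }
    }

    // Returns true when the operator has made no window progress for longer
    // than the timeout. Operators marked INACTIVE (for example, during
    // shutdown) are excluded, which is the idea described in the pull request.
    static boolean isBlocked(OperatorInfo op, long nowMillis, long timeoutMillis) {
        if (op.state == State.INACTIVE) {
            return false;
        }
        return nowMillis - op.lastWindowIdChangeMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        // Values taken from the first WARNING line in the quoted log.
        long now = 1492198930012L;
        long lastChange = 1492198869957L;
        long timeout = 60000L;

        OperatorInfo stillActive = new OperatorInfo(1, State.ACTIVE, lastChange);
        OperatorInfo shutDown = new OperatorInfo(1, State.INACTIVE, lastChange);

        System.out.println(now - lastChange);                      // 60055
        System.out.println(isBlocked(stillActive, now, timeout));  // true
        System.out.println(isBlocked(shutDown, now, timeout));     // false
    }
}
```

With the logged timestamps, the elapsed time is 60055 ms, which matches the "time 60055ms" reported in the log and exceeds the 60000 ms timeout. In this sketch the operator is reported as blocked only while it is still considered ACTIVE.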