[
https://issues.apache.org/jira/browse/APEXCORE-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970439#comment-15970439
]
ASF GitHub Bot commented on APEXCORE-703:
-----------------------------------------
GitHub user vrozov opened a pull request:
https://github.com/apache/apex-core/pull/516
APEXCORE-703 Window processing timeout for finished/undeployed container.
During an operator shutdown, mark it as INACTIVE to exclude it from the
blocked operators check.
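
A minimal sketch of the idea described above, not the actual patch: the class, enum, and field names below (`BlockedCheckSketch`, `State`, `Op`, `findBlocked`) are hypothetical stand-ins, and only the INACTIVE state name and the timestamps come from this message. It assumes the blocked-operator check compares the time since the last window id change against the window processing timeout, and shows how skipping INACTIVE operators keeps a finished operator out of the result.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockedCheckSketch {
    // Simplified, hypothetical operator states; not the real Apex enum.
    enum State { ACTIVE, INACTIVE }

    // Hypothetical stand-in for the per-operator bookkeeping kept by the manager.
    static final class Op {
        final int id;
        State state = State.ACTIVE;
        final long lastWindowIdChangeMillis;

        Op(int id, long lastWindowIdChangeMillis) {
            this.id = id;
            this.lastWindowIdChangeMillis = lastWindowIdChangeMillis;
        }
    }

    // Returns ids of operators whose window id has not changed within the timeout.
    // Operators marked INACTIVE (for example during shutdown) are skipped.
    static List<Integer> findBlocked(List<Op> ops, long nowMillis, long timeoutMillis) {
        List<Integer> blocked = new ArrayList<>();
        for (Op op : ops) {
            if (op.state == State.INACTIVE) {
                continue; // finished/undeployed operators are not expected to advance
            }
            if (nowMillis - op.lastWindowIdChangeMillis > timeoutMillis) {
                blocked.add(op.id);
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        // Timestamps taken from the log quoted in the issue.
        long now = 1492198930012L;
        long timeout = 60000L;
        List<Op> ops = new ArrayList<>();
        Op finished = new Op(1, 1492198869957L);
        ops.add(finished);

        System.out.println(findBlocked(ops, now, timeout)); // prints [1]

        // Assumed fix: mark the operator INACTIVE when it shuts down.
        finished.state = State.INACTIVE;
        System.out.println(findBlocked(ops, now, timeout)); // prints []
    }
}
```

In this sketch the state change is the only thing that differs between the two calls, which mirrors the intent stated in the pull request description.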
@tweise Please review.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vrozov/apex-core APEXCORE-703
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/apex-core/pull/516.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #516
----
commit 0ebc23ee0e40f5259098f538a1b9cea4aeba9794
Author: Vlad Rozov <[email protected]>
Date: 2017-04-16T16:34:09Z
APEXCORE-703 Window processing timeout for finished/undeployed container.
During an operator shutdown mark it as INACTIVE to exclude it from the
blocked operators check.
----
> Window processing timeout for finished/undeployed container
> -----------------------------------------------------------
>
> Key: APEXCORE-703
> URL: https://issues.apache.org/jira/browse/APEXCORE-703
> Project: Apache Apex Core
> Issue Type: Bug
> Affects Versions: 3.5.0
> Reporter: Daniel Halperin
> Assignee: Vlad Rozov
>
> Using Apex 3.5.0 with Apache Beam, I have a 10-container pipeline. The first
> container, id #1, finishes and gets undeployed at 12:41:10 PM.
> Then, 60s later (at 12:42:10 PM), Apex decides that container is blocked
> because no data has been received for 60s, declares failure, and restarts it.
> This would seem to be a bug -- shouldn't finished and undeployed operators be
> deregistered from the timeout logic that is detecting stuck operators?
> Log below
> {code}
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer
> processHeartbeatResponse
> INFO: Undeploy request: [1]
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer
> undeploy
> INFO: Undeploy complete.
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager
> updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked
> committed window ffffffffffffffff, recovery window ffffffffffffffff, current
> time 1492198930012, last window id change time 1492198869957, window
> processing timeout millis 60000
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager
> updateCheckpoints
> INFO: Blocked operator PTOperator[id=1,name=TextIO.Read/Read] container
> PTContainer[id=1(container-6),state=ACTIVE] time 60055ms
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.engine.StreamingContainer
> processHeartbeatResponse
> INFO: Received shutdown request
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StramLocalCluster run
> INFO: Container container-6 restart.
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager
> scheduleContainerRestart
> INFO: Initiating recovery for container-6@localhost
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager
> updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked
> committed window ffffffffffffffff, recovery window ffffffffffffffff, current
> time 1492198931015, last window id change time 1492198869957, window
> processing timeout millis 60000
> {code}
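
The timing in the log above can be checked with a small worked example. This is only an illustrative sketch: `ElapsedSketch` and its methods are hypothetical names, and the comparison is assumed to be a plain "elapsed greater than timeout" test. The timestamps and the 60000 ms timeout are copied from the WARNING lines.

```java
public class ElapsedSketch {
    // Milliseconds since the operator's window id last changed.
    static long elapsedMillis(long currentTimeMillis, long lastWindowIdChangeMillis) {
        return currentTimeMillis - lastWindowIdChangeMillis;
    }

    // Assumed rule: the operator counts as blocked once elapsed time exceeds the timeout.
    static boolean exceedsTimeout(long elapsedMillis, long timeoutMillis) {
        return elapsedMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long lastChange = 1492198869957L; // "last window id change time"
        long timeout = 60000L;            // "window processing timeout millis"

        long first = elapsedMillis(1492198930012L, lastChange);
        System.out.println(first);                           // prints 60055
        System.out.println(exceedsTimeout(first, timeout));  // prints true

        long second = elapsedMillis(1492198931015L, lastChange);
        System.out.println(second);                          // prints 61058
        System.out.println(exceedsTimeout(second, timeout)); // prints true
    }
}
```

The first result matches the "time 60055ms" reported in the INFO line, and the last window id change falls at about the time of the undeploy, so the operator is flagged only because it stopped advancing after it finished.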