[ https://issues.apache.org/jira/browse/APEXCORE-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970240#comment-15970240 ]

Vlad Rozov commented on APEXCORE-703:
-------------------------------------

Why not have a state indicating that an operator is undeployed but cannot be
removed from the plan until the checkpoint barrier is reached? Can't INACTIVE
play that role? If not, could we introduce a new state? shutdownOperators could
be used to exclude an operator from being marked as blocked. However, the
operator state is already used so that only ACTIVE operators are marked as
blocked.
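
A minimal sketch of the idea in Java. All names here are hypothetical illustrations, not Apex's actual `PTOperator`/`StreamingContainerManager` API: `OpState`, `Op`, `BlockedDetector`, and the extra state `UNDEPLOYED_AWAITING_CHECKPOINT`. The sketch models only the point above. If the blocked check already considers ACTIVE operators only, then an operator moved to a non-ACTIVE state on undeploy (INACTIVE or a new state) is never flagged, even while it stays in the plan until the checkpoint barrier.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical operator states; Apex's real operator state enum may differ. */
enum OpState { ACTIVE, INACTIVE, UNDEPLOYED_AWAITING_CHECKPOINT }

/** Hypothetical stand-in for a physical operator tracked by the master. */
final class Op {
  final int id;
  final OpState state;
  final long lastWindowIdChangeMillis;

  Op(int id, OpState state, long lastWindowIdChangeMillis) {
    this.id = id;
    this.state = state;
    this.lastWindowIdChangeMillis = lastWindowIdChangeMillis;
  }
}

final class BlockedDetector {
  /** Returns the ids of operators that would be marked blocked. */
  static List<Integer> findBlocked(List<Op> ops, long nowMillis, long timeoutMillis) {
    List<Integer> blocked = new ArrayList<>();
    for (Op op : ops) {
      // Only ACTIVE operators are candidates. An undeployed operator that is
      // kept in the plan until the checkpoint barrier is skipped here.
      if (op.state != OpState.ACTIVE) {
        continue;
      }
      if (nowMillis - op.lastWindowIdChangeMillis > timeoutMillis) {
        blocked.add(op.id);
      }
    }
    return blocked;
  }

  public static void main(String[] args) {
    long lastChange = 1492198869957L; // timestamps taken from the log below
    long now = 1492198930012L;
    List<Op> ops = List.of(
        new Op(1, OpState.UNDEPLOYED_AWAITING_CHECKPOINT, lastChange),
        new Op(2, OpState.ACTIVE, lastChange));
    System.out.println(findBlocked(ops, now, 60_000L)); // prints [2]
  }
}
```

With the log's timestamps, both operators have gone 60055 ms without a window change. Only the ACTIVE one (id 2) is reported; the undeployed one is skipped by its state alone.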

> Window processing timeout for finished/undeployed container
> -----------------------------------------------------------
>
>                 Key: APEXCORE-703
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-703
>             Project: Apache Apex Core
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Daniel Halperin
>
> Using Apex 3.5.0 with Apache Beam, I have a 10-container pipeline. The first 
> container, id #1, finishes and gets undeployed at 12:41:10 PM.
> Then, 60s later (at 12:42:10 PM), Apex decides that container is blocked 
> because no data has been received for 60s, declares failure, and restarts it.
> This seems to be a bug: shouldn't finished and undeployed operators be
> deregistered from the timeout logic that detects stuck operators?
> Log below:
> {code}
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Undeploy request: [1]
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer undeploy
> INFO: Undeploy complete.
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198930012, last window id change time 1492198869957, window processing timeout millis 60000
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager updateCheckpoints
> INFO: Blocked operator PTOperator[id=1,name=TextIO.Read/Read] container PTContainer[id=1(container-6),state=ACTIVE] time 60055ms
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.engine.StreamingContainer processHeartbeatResponse
> INFO: Received shutdown request
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StramLocalCluster run
> INFO: Container container-6 restart.
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager scheduleContainerRestart
> INFO: Initiating recovery for container-6@localhost
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked committed window ffffffffffffffff, recovery window ffffffffffffffff, current time 1492198931015, last window id change time 1492198869957, window processing timeout millis 60000
> {code}
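
The numbers in the WARNING lines can be checked with a small Java sketch. The class and method names are hypothetical. Two points are assumptions. First, the sketch uses a strict `>` comparison; Apex may use `>=`, which gives the same verdict for these values. Second, the reporter's clock is taken as UTC-7, which is inferred from the log timestamps. The sketch shows that the "last window id change time" falls at 12:41:09.957, just before the 12:41:10 undeploy. The elapsed time then keeps growing with no new windows until it exceeds the 60000 ms timeout. This matches the "time 60055ms" in the INFO line.

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneOffset;

final class LogTimeoutArithmetic {
  // Values copied from the WARNING lines in the log above.
  static final long LAST_WINDOW_ID_CHANGE = 1492198869957L;
  static final long TIMEOUT_MILLIS = 60000L;

  /** Milliseconds since the operator's window id last changed. */
  static long elapsedMillis(long currentTime) {
    return currentTime - LAST_WINDOW_ID_CHANGE;
  }

  // Assumption: strict ">" comparison against the timeout.
  static boolean exceedsTimeout(long currentTime) {
    return elapsedMillis(currentTime) > TIMEOUT_MILLIS;
  }

  // Assumption: the reporter's local clock was UTC-7.
  static LocalTime localTime(long epochMillis) {
    return Instant.ofEpochMilli(epochMillis).atOffset(ZoneOffset.ofHours(-7)).toLocalTime();
  }

  public static void main(String[] args) {
    System.out.println(localTime(LAST_WINDOW_ID_CHANGE)); // 12:41:09.957
    System.out.println(elapsedMillis(1492198930012L));    // 60055
    System.out.println(exceedsTimeout(1492198930012L));   // true
    System.out.println(elapsedMillis(1492198931015L));    // 61058
  }
}
```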



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
