[ 
https://issues.apache.org/jira/browse/APEXCORE-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970454#comment-15970454
 ] 

Vlad Rozov commented on APEXCORE-703:
-------------------------------------

I believe that the second test case is already covered, for example in 
AtMostOnceTest.testLinearInputOperatorRecovery. Let me know if you think that a 
separate unit test in StreamingContainerManagerTest is required. I opened a PR 
to make sure that we agree on the proposed fix, and I will add 2 or 3 additional 
unit tests: one that simulates the bug and another in 
StreamingContainerManagerTest.testOperatorShutdown.

> Window processing timeout for finished/undeployed container
> -----------------------------------------------------------
>
>                 Key: APEXCORE-703
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-703
>             Project: Apache Apex Core
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Daniel Halperin
>            Assignee: Vlad Rozov
>             Fix For: 3.6.0
>
>
> Using Apex 3.5.0 with Apache Beam, I have a 10-container pipeline. The first 
> container, id #1, finishes and gets undeployed at 12:41:10 PM.
> Then, 60s later (at 12:42:10 PM), Apex decides that container is blocked 
> because no data has been received for 60s, declares failure, and restarts it.
> This would seem to be a bug -- shouldn't finished and undeployed operators be 
> deregistered from the timeout logic that is detecting stuck operators?
> Log below
> {code}
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer 
> processHeartbeatResponse
> INFO: Undeploy request: [1]
> Apr 14, 2017 12:41:10 PM com.datatorrent.stram.engine.StreamingContainer 
> undeploy
> INFO: Undeploy complete.
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager 
> updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked 
> committed window ffffffffffffffff, recovery window ffffffffffffffff, current 
> time 1492198930012, last window id change time 1492198869957, window 
> processing timeout millis 60000
> Apr 14, 2017 12:42:10 PM com.datatorrent.stram.StreamingContainerManager 
> updateCheckpoints
> INFO: Blocked operator PTOperator[id=1,name=TextIO.Read/Read] container 
> PTContainer[id=1(container-6),state=ACTIVE] time 60055ms
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.engine.StreamingContainer 
> processHeartbeatResponse
> INFO: Received shutdown request
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StramLocalCluster run
> INFO: Container container-6 restart.
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager 
> scheduleContainerRestart
> INFO: Initiating recovery for container-6@localhost
> Apr 14, 2017 12:42:11 PM com.datatorrent.stram.StreamingContainerManager 
> updateRecoveryCheckpoints
> WARNING: Marking operator PTOperator[id=1,name=TextIO.Read/Read] blocked 
> committed window ffffffffffffffff, recovery window ffffffffffffffff, current 
> time 1492198931015, last window id change time 1492198869957, window 
> processing timeout millis 60000
> {code}
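
For illustration only, the timeout arithmetic in the log above can be reproduced with a minimal Java sketch. It is not the Apex implementation and not the fix proposed in the PR: the class BlockedOperatorCheck, the nested OperatorStatus type, and the isBlocked method are hypothetical names, and skipping undeployed operators is an assumption taken from the reporter's question, not confirmed behavior. Only the three numbers are taken from the log.

```java
// Hypothetical sketch; none of these names exist in Apache Apex.
class BlockedOperatorCheck {

    // Minimal stand-in for per-operator bookkeeping (not the Apex PTOperator class).
    static final class OperatorStatus {
        final int id;
        final boolean deployed;
        final long lastWindowIdChangeMillis;

        OperatorStatus(int id, boolean deployed, long lastWindowIdChangeMillis) {
            this.id = id;
            this.deployed = deployed;
            this.lastWindowIdChangeMillis = lastWindowIdChangeMillis;
        }
    }

    // An operator counts as blocked only if it is still deployed (assumption)
    // and its window id has not changed for longer than the timeout.
    static boolean isBlocked(OperatorStatus op, long nowMillis, long timeoutMillis) {
        if (!op.deployed) {
            return false; // assumed behavior: finished/undeployed operators are skipped
        }
        return nowMillis - op.lastWindowIdChangeMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long now = 1492198930012L;        // "current time" from the log
        long lastChange = 1492198869957L; // "last window id change time" from the log
        long timeout = 60000L;            // "window processing timeout millis" from the log

        // Elapsed time since the last window id change: 60055 ms, matching the log.
        System.out.println("elapsed ms: " + (now - lastChange));

        OperatorStatus stillDeployed = new OperatorStatus(1, true, lastChange);
        OperatorStatus undeployed = new OperatorStatus(1, false, lastChange);

        System.out.println("deployed operator blocked: " + isBlocked(stillDeployed, now, timeout));
        System.out.println("undeployed operator blocked: " + isBlocked(undeployed, now, timeout));
    }
}
```

With the logged values, the elapsed time is 60055 ms, which exceeds the 60000 ms timeout, so a check that ignores deployment state flags the operator. The deployed flag in the sketch shows one way such an operator could be excluded.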



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
