[ 
https://issues.apache.org/jira/browse/UIMA-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lou DeGenaro updated UIMA-2911:
-------------------------------

    Component/s: DUCC
    
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] id:44721 state:NotFound
> ---------------------------------------------------------------
>
>                 Key: UIMA-2911
>                 URL: https://issues.apache.org/jira/browse/UIMA-2911
>             Project: UIMA
>          Issue Type: Bug
>          Components: DUCC
>            Reporter: Lou DeGenaro
>            Assignee: Lou DeGenaro
>            Priority: Minor
>             Fix For: 1.0-Ducc
>
>
> Job monitor incorrectly flags job as having failed.
> 5/14/13 9:51:23 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1033 error:0 retry:0 procs:13
> 5/14/13 9:51:43 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1033 error:0 retry:0 procs:12
> 5/14/13 9:52:03 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1034 error:0 retry:0 procs:11
> 5/14/13 9:52:23 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1035 error:0 retry:0 procs:9
> 5/14/13 9:52:43 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1037 error:0 retry:0 procs:8
> 5/14/13 9:53:03 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1037 error:0 retry:0 procs:7
> 5/14/13 9:53:43 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1038 error:0 retry:0 procs:5
> 5/14/13 9:54:03 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1038 error:0 retry:0 procs:3
> 5/14/13 9:55:03 PM - 10 - INFO: [1.1.1] id:44721 state:Completing total:1039 done:1039 error:0 retry:0 procs:2
> 5/14/13 9:56:23 PM - 10 - INFO: [1.1.1] id:44721 state:Completing total:1039 done:1039 error:0 retry:0 procs:1
> 5/14/13 9:57:03 PM - 10 - INFO: [1.1.1] id:44721 state:Completed total:1039 done:1039 error:0 retry:0 procs:1
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] id:44721 state:NotFound
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] id:44721 rc:1
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] train-threshold-run - FAILED 1 (active=2)
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1] train-threshold - FAILED 1 (active=1)
> 5/14/13 9:59:22 PM - 10 - INFO: [1] do-complete-run-parallel - FAILED 1 (active=0)
> 5/14/13 9:59:22 PM - 10 - INFO: First task to fail was: train-threshold-run
> 5/14/13 9:59:22 PM - 10 - INFO:   -----------------------------    END    -----------------------------
> Without looking at the code, here's what I'll guess happened.
> 1. Job reached Completed state, but not all job processes were reported as 
> stopped, as evidenced by:
> 5/14/13 9:57:03 PM - 10 - INFO: [1.1.1] id:44721 state:Completed total:1039 done:1039 error:0 retry:0 procs:1
> 2. Job monitor continues waiting for procs:0 before exiting
> 3. Orchestrator's next publication does not include job 44721 at all.  This 
> would be completely normal, since the job has been Completed for more than 
> (I think) 5 minutes and (probably) by now procs==0.
> 4. Job monitor is "surprised" that Orchestrator publication no longer 
> contains this job.
> I can imagine several ways to fix this:
> a) Once Job monitor sees Completed, it exits normally (regardless of procs 
> count)
> b) OR (Orchestrator) should make sure at least one publication goes out with 
> Completed and procs==0
> c) Job Monitor should interpret Job not included in OR publication as 
> successful completion (given that prior publication had total==done and state 
> was Completed)
> At present I vote for c) and will open a Jira against myself.
> Lou.
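
Option c) above can be sketched as a small decision function. This is only an illustration, not the DUCC job monitor code: the names JobStatus and resolve_missing_job are hypothetical, and the exit code 0 for success is an assumption (the log only shows rc:1 for the failure case). It shows the rule as stated in the report: a job missing from the Orchestrator publication counts as successful only if the last publication that did include it had state Completed and total==done.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobStatus:
    """Last status seen for a job (hypothetical type, fields taken from the log)."""
    state: str
    total: int
    done: int
    procs: int = 0


def resolve_missing_job(last_seen: Optional[JobStatus]) -> int:
    """Pick an exit code when the job is absent from the latest publication.

    Returns 0 (assumed success code) if the prior publication showed the job
    as Completed with total == done, regardless of procs; otherwise returns 1,
    matching the rc:1 seen in the log for the NotFound case.
    """
    if (
        last_seen is not None
        and last_seen.state == "Completed"
        and last_seen.total == last_seen.done
    ):
        return 0
    return 1


# The situation from the log: Completed, 1039 of 1039 done, one process still listed.
last = JobStatus(state="Completed", total=1039, done=1039, procs=1)
exit_code = resolve_missing_job(last)
```

With this rule, the sequence in the log (Completed with procs:1, then NotFound) would end normally, while a job that vanishes while still Running would still be reported as a failure.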
