[ https://issues.apache.org/jira/browse/UIMA-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658592#comment-13658592 ]
Lou DeGenaro commented on UIMA-2911:
------------------------------------

Upon further reflection, it seems that reaching state Completed is sufficient to make a final disposition in the Monitor. The fact that one or more procs may still exist is due to latency of (or a bug in) the DUCC framework, and is therefore irrelevant.

> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] id:44721 state:NotFound
> ---------------------------------------------------------------
>
>                 Key: UIMA-2911
>                 URL: https://issues.apache.org/jira/browse/UIMA-2911
>             Project: UIMA
>          Issue Type: Bug
>            Reporter: Lou DeGenaro
>            Assignee: Lou DeGenaro
>            Priority: Minor
>             Fix For: 1.0-Ducc
>
>
> Job monitor incorrectly flags job as having failed.
> 5/14/13 9:51:23 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1033 error:0 retry:0 procs:13
> 5/14/13 9:51:43 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1033 error:0 retry:0 procs:12
> 5/14/13 9:52:03 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1034 error:0 retry:0 procs:11
> 5/14/13 9:52:23 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1035 error:0 retry:0 procs:9
> 5/14/13 9:52:43 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1037 error:0 retry:0 procs:8
> 5/14/13 9:53:03 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1037 error:0 retry:0 procs:7
> 5/14/13 9:53:43 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1038 error:0 retry:0 procs:5
> 5/14/13 9:54:03 PM - 10 - INFO: [1.1.1] id:44721 state:Running total:1039 done:1038 error:0 retry:0 procs:3
> 5/14/13 9:55:03 PM - 10 - INFO: [1.1.1] id:44721 state:Completing total:1039 done:1039 error:0 retry:0 procs:2
> 5/14/13 9:56:23 PM - 10 - INFO: [1.1.1] id:44721 state:Completing total:1039 done:1039 error:0 retry:0 procs:1
> 5/14/13 9:57:03 PM - 10 - INFO: [1.1.1] id:44721 state:Completed total:1039 done:1039 error:0 retry:0 procs:1
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] id:44721 state:NotFound
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] id:44721 rc:1
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1.1] train-threshold-run - FAILED 1 (active=2)
> 5/14/13 9:59:22 PM - 10 - INFO: [1.1] train-threshold - FAILED 1 (active=1)
> 5/14/13 9:59:22 PM - 10 - INFO: [1] do-complete-run-parallel - FAILED 1 (active=0)
> 5/14/13 9:59:22 PM - 10 - INFO: First task to fail was: train-threshold-run
> 5/14/13 9:59:22 PM - 10 - INFO: ----------------------------- END -----------------------------
>
> Without looking at the code, here's what I'll guess happened.
> 1. Job reached Completed state, but not all job processes were reported as stopped, as is evidenced by:
>    5/14/13 9:57:03 PM - 10 - INFO: [1.1.1] id:44721 state:Completed total:1039 done:1039 error:0 retry:0 procs:1
> 2. Job monitor continues waiting for procs:0 before exiting.
> 3. Orchestrator's next publication does not include job 44721 at all. This would be completely normal, since the job has been Completed for more than (I think) 5 minutes and (probably) by now procs==0.
> 4. Job monitor is "surprised" that the Orchestrator publication no longer contains this job.
>
> I can imagine several ways to fix this:
> a) Once Job monitor sees Completed, it exits normally (regardless of procs count)
> b) OR should make sure at least one publication goes out with Completed and procs==0
> c) Job Monitor should interpret Job not included in OR publication as successful completion (given that the prior publication had total==done and state was Completed)
>
> At present I vote for c) and will open a Jira against myself.
>
> Lou.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
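
For illustration only, the disposition rule discussed in this comment could be sketched as below. This is not the DUCC Job Monitor code: the class, enum, and method names (JobMonitorSketch, Snapshot, decide) are hypothetical, and treating "error == 0 and done == total" as the success test is an assumption made for the example. The sketch treats Completed as final regardless of procs (option a), and, as a fallback, treats a job missing from the publication as successful only when the prior snapshot was Completed with total==done (option c).

```java
// Hypothetical sketch; none of these names come from the DUCC code base.
final class JobMonitorSketch {

    // States as they appear in the monitor log lines quoted above.
    enum State { RUNNING, COMPLETING, COMPLETED, NOT_FOUND }

    enum Disposition { KEEP_WAITING, SUCCEEDED, FAILED }

    // One status line, e.g. state:Completed total:1039 done:1039 error:0 procs:1
    static final class Snapshot {
        final State state;
        final int total;
        final int done;
        final int error;
        final int procs;

        Snapshot(State state, int total, int done, int error, int procs) {
            this.state = state;
            this.total = total;
            this.done = done;
            this.error = error;
            this.procs = procs;
        }

        // Assumption: "all work items done with no errors" means success.
        boolean allWorkDone() {
            return error == 0 && done == total;
        }
    }

    // previous may be null when no earlier publication was seen.
    static Disposition decide(Snapshot previous, Snapshot current) {
        switch (current.state) {
            case COMPLETED:
                // Option a): Completed is final; lingering procs are ignored.
                return current.allWorkDone() ? Disposition.SUCCEEDED : Disposition.FAILED;
            case NOT_FOUND:
                // Option c): a job that vanished after a clean Completed is a success.
                boolean cleanFinish = previous != null
                        && previous.state == State.COMPLETED
                        && previous.allWorkDone();
                return cleanFinish ? Disposition.SUCCEEDED : Disposition.FAILED;
            default:
                return Disposition.KEEP_WAITING;
        }
    }
}
```

Applied to the log above, the 9:57:03 snapshot (state:Completed, total:1039, done:1039, error:0, procs:1) would already yield a successful disposition under this sketch, so the later NotFound publication would never be consulted.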