[ https://issues.apache.org/jira/browse/UIMA-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Eckart de Castilho resolved UIMA-4434.
----------------------------------------------
    Resolution: Abandoned

DUCC has been retired.

> DUCC Orchestrator (OR) job:node blacklisting
> --------------------------------------------
>
>                 Key: UIMA-4434
>                 URL: https://issues.apache.org/jira/browse/UIMA-4434
>             Project: UIMA
>          Issue Type: Improvement
>          Components: DUCC
>            Reporter: Lou DeGenaro
>            Assignee: Lou DeGenaro
>            Priority: Major
>             Fix For: future-DUCC
>
>
> A submitted Job may have shares allocated on some nodes where the JP works
> and on some nodes where the JP fails.
> With respect to initialization, the OR should limit the number of
> initialization failures allowed on a node before that node is banished for
> the Job.
> The OR should communicate the blacklisted nodes for each Job to the RM, which
> should then not allocate any shares on those nodes for the corresponding Jobs.
> An example failure situation is as follows:
> 1. Node X does not have Filesystem F mounted
> 2. Job 1 is submitted and is allocated to Node X
> 3. Job 1's JP on Node X fails initialization (missing files!)
> 4. RM allocates the next JP for Job 1 to the same Node X, ad infinitum, until
> max init failures is reached
> 5. Job 1 is prevented from expanding because of a single "bad" Node
> If Node X had been blacklisted, then the RM would have allocated Node Y to
> Job 1 and expansion could have occurred.
> Other types of JP failure (process croak and work item failure/timeout)
> will not be considered for blacklisting at present.
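
The per-Job node blacklisting proposed in the description can be illustrated with a minimal sketch. This is not DUCC code: the class `JobNodeBlacklist`, the function `pick_node`, and the threshold parameter `max_init_failures_per_node` are hypothetical names chosen for illustration, and the default threshold of 2 is an assumption, since the issue does not state a value. The sketch only shows the idea of counting initialization failures per (Job, node) pair and excluding a node for that Job once the limit is reached.

```python
from collections import defaultdict


class JobNodeBlacklist:
    """Hypothetical OR-side bookkeeping of JP init failures per (job, node)."""

    def __init__(self, max_init_failures_per_node=2):
        # Assumed threshold; the issue does not specify one.
        self.max_init_failures_per_node = max_init_failures_per_node
        self._failures = defaultdict(int)    # (job_id, node) -> failure count
        self._blacklist = defaultdict(set)   # job_id -> nodes banished for that job

    def record_init_failure(self, job_id, node):
        """Count one init failure; return True if node is now blacklisted for job."""
        self._failures[(job_id, node)] += 1
        if self._failures[(job_id, node)] >= self.max_init_failures_per_node:
            self._blacklist[job_id].add(node)
        return node in self._blacklist[job_id]

    def blacklisted_nodes(self, job_id):
        """The set the OR would communicate to the RM for this job."""
        return frozenset(self._blacklist.get(job_id, ()))


def pick_node(job_id, candidate_nodes, blacklist):
    """Stand-in for RM allocation: first candidate not blacklisted for this job."""
    banned = blacklist.blacklisted_nodes(job_id)
    for node in candidate_nodes:
        if node not in banned:
            return node
    return None  # no eligible node for this job


# Replaying the example: Node X lacks Filesystem F, Node Y is healthy.
bl = JobNodeBlacklist(max_init_failures_per_node=2)
nodes = ["X", "Y"]
bl.record_init_failure("job1", "X")
bl.record_init_failure("job1", "X")
next_node = pick_node("job1", nodes, bl)  # Node X is skipped for job1 only
```

Because the blacklist is keyed by Job, a node that is unusable for one Job (for example, because of a missing mount that only that Job needs) remains available to other Jobs.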