[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

Mona Chitnis (JIRA) Thu, 25 Sep 2014 08:32:46 -0700

    [ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147856#comment-14147856
 ]


Mona Chitnis commented on OOZIE-1976:
-------------------------------------

Thanks [~rkanter] for the comments.
   * We are thinking of using a serialize/deserialize technique (protobuf is 
one option) to convert the object back and forth. I've created a class 
LogicalDependencySet for this object. It contains subclass objects, either 
LogicalDependencyAndSet or LogicalDependencyOrSet, and the leaf level is 
Dependency, which holds the lists of resolved and unresolved instances. We have 
yet to measure the cost of protobuf serde here.
   * Yes, nested combinations are possible, but we will limit nesting to a 
depth of 2. Both of your examples are depth 2, and those are the most common 
cases that we should satisfy in the first pass. An important point here is that 
OR can follow one of two 'strategies':
   ** 'Combined': for {{A || B}}, instances of A and B can be 
interleaved to give the final "combined" set of instances. This requires that 
the user considers both datasets equivalent and that they have the same 
frequency, initial instance, etc.
   ** 'Exclusive': for the same {{A || B}}, either A is used completely 
or B is used completely, with no interleaving.
   * Yes, a better API output would be to display which OR datasets' 
instances the action is waiting on.
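
To make the structure and the two OR strategies concrete, here is a minimal sketch in Python rather than the actual patch. Only the class names LogicalDependencySet, LogicalDependencyAndSet, LogicalDependencyOrSet and Dependency come from the comment above; the fields, the method names ({{depth}}, {{validate}}, {{is_satisfied}}), the {{strategy}} values, the way depth is counted (leaves at 0), and the use of nominal-time strings as instance keys are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

MAX_DEPTH = 2  # nesting limit from the comment; counting leaves as depth 0 is an assumption


@dataclass
class Dependency:
    """Leaf node: one dataset with its resolved and unresolved instances."""
    name: str
    resolved: List[str] = field(default_factory=list)
    unresolved: List[str] = field(default_factory=list)

    def depth(self) -> int:
        return 0

    def is_satisfied(self) -> bool:
        return not self.unresolved


@dataclass
class LogicalDependencySet:
    """Inner node holding leaves or further sets (hypothetical layout)."""
    children: List[object]

    def depth(self) -> int:
        return 1 + max(child.depth() for child in self.children)

    def validate(self) -> None:
        if self.depth() > MAX_DEPTH:
            raise ValueError(f"nesting depth {self.depth()} exceeds {MAX_DEPTH}")


@dataclass
class LogicalDependencyAndSet(LogicalDependencySet):
    def is_satisfied(self) -> bool:
        # AND: every child must be satisfied.
        return all(child.is_satisfied() for child in self.children)


@dataclass
class LogicalDependencyOrSet(LogicalDependencySet):
    strategy: str = "exclusive"  # hypothetical values: "exclusive" or "combined"

    def is_satisfied(self) -> bool:
        if self.strategy == "exclusive":
            # Exclusive: one child must be complete on its own.
            return any(child.is_satisfied() for child in self.children)
        # Combined (leaf children only in this sketch): every required
        # instance must be resolved in at least one of the datasets.
        required, available = set(), set()
        for leaf in self.children:
            required |= set(leaf.resolved) | set(leaf.unresolved)
            available |= set(leaf.resolved)
        return required <= available


# A and B each miss some instances, but together they cover t0..t2.
a = Dependency("A", resolved=["t0", "t2"], unresolved=["t1"])
b = Dependency("B", resolved=["t1"], unresolved=["t0", "t2"])
c = Dependency("C", resolved=["t0", "t1", "t2"])

combined = LogicalDependencyOrSet([a, b], strategy="combined")
exclusive = LogicalDependencyOrSet([a, b], strategy="exclusive")
both = LogicalDependencyAndSet([combined, c])  # (A || B) && C, depth 2
both.validate()
```

With this data, the 'combined' OR is satisfied because A and B interleave to cover every instance, while the 'exclusive' OR is not, since neither dataset is complete by itself. Wrapping {{both}} in one more set would push the depth to 3 and make {{validate}} raise.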

> Specifying coordinator input datasets in more logical ways
> ----------------------------------------------------------
>
>                 Key: OOZIE-1976
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1976
>             Project: Oozie
>          Issue Type: New Feature
>          Components: coordinator
>    Affects Versions: trunk
>            Reporter: Mona Chitnis
>            Assignee: Mona Chitnis
>             Fix For: trunk
>
>         Attachments: OOZIE-1976-WIP.patch, OOZIE-1976-rough-design-2.pdf, 
> OOZIE-1976-rough-design.pdf
>
>
> All dataset instances specified as input to coordinator, currently work on 
> AND logic i.e. ALL of them should be available for workflow to start. We 
> should enhance this to include more logical ways of specifying availability 
> criteria e.g.
>  * OR between instances
>  * minimum N out of K instances
>  * delta datasets (process data incrementally)
> Use-cases for this:
>  * Different datasets are BCP, and workflow can run with either, whichever 
> arrives earlier.
>  * Data is not guaranteed, and while $coord:latest allows skipping to 
> available ones, workflow will never trigger unless mentioned number of 
> instances are found.
>  * Workflow is like a ‘refining’ algorithm which should run after minimum 
> required datasets are ready, and should only process the delta for efficiency.
> This JIRA is to discuss the design and then review the implementation for 
> some or all of the above features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
