Andras Salamon created OOZIE-3387:
-------------------------------------

             Summary: Optimize coordinator data input dependency search
                 Key: OOZIE-3387
                 URL: https://issues.apache.org/jira/browse/OOZIE-3387
             Project: Oozie
          Issue Type: Improvement
            Reporter: Andras Salamon


During the data input dependency check, Oozie evaluates EL functions such as 
{{coord:latest}} in a non-optimal way, which may result in more HDFS URI 
checks than necessary.

1. If the {{dataset}} frequency is finer than the granularity of the 
{{uri-template}}, Oozie checks the same HDFS URI multiple times. For instance, 
in the following definition:
{noformat}
<dataset name="dataset1" frequency="${coord:minutes(1)}" initial-instance="2017-01-01T08:15Z" timezone="UTC">
    <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
    <done-flag>_SUCCESS</done-flag>
</dataset>
...
<data-in name="coordInput" dataset="dataset1">
    <instance>${coord:latest(0)}</instance>
</data-in>
{noformat}
Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It 
would be enough to check the file once and skip the other 1439 checks.
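
The idea of checking each distinct URI only once can be illustrated with a minimal Python sketch. This is not Oozie code: {{resolve}}, {{find_latest}}, the simplified template, and the {{exists}} callback (a stand-in for the HDFS done-flag check) are all hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta


def resolve(template, instant):
    """Fill the date placeholders of a uri-template for one dataset instance."""
    return (template
            .replace("${YEAR}", f"{instant.year:04d}")
            .replace("${MONTH}", f"{instant.month:02d}")
            .replace("${DAY}", f"{instant.day:02d}"))


def find_latest(template, start, frequency, max_instances, exists):
    """Walk dataset instances backwards from `start` and return the first
    resolved URI for which `exists` is true, or None if there is none.

    `exists` is a hypothetical callable standing in for the HDFS check.
    URIs that were already checked are remembered and never checked again.
    """
    checked = set()
    instant = start
    for _ in range(max_instances):
        uri = resolve(template, instant)
        if uri not in checked:
            checked.add(uri)
            if exists(uri):
                return uri
        instant -= frequency
    return None
```

With a one-minute frequency and a day-level template, the 1440 instances of a single day all resolve to the same URI, so {{exists}} is called once for that day instead of 1440 times.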

2. If the frequency is 1 day and the {{uri-template}} is defined in the 
following way:
{noformat}
<uri-template>${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}</uri-template>
{noformat}
Oozie will check the following directories one by one, even if some of the 
parent directories are missing:
{noformat}
2018/11/20
2018/11/19
2018/11/18
...
{noformat}
If there is no {{2018/11}} directory, then it is not necessary to check any of 
the {{2018/11/xx}} directories, so the number of HDFS URI checks could be 
reduced.
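
One possible way to skip candidates below a missing parent directory is sketched here in Python. This is only an illustration under stated assumptions, not Oozie's implementation: {{find_latest_pruned}} and the {{exists}} callback (standing in for an HDFS existence check) are hypothetical, and paths are plain relative strings rather than full URIs.

```python
import posixpath


def find_latest_pruned(candidates, exists):
    """Return the first existing path in `candidates` (ordered newest first),
    or None if no candidate exists.

    `exists` is a hypothetical callable standing in for an HDFS check.
    When a candidate is missing, its parent directories are probed. Missing
    parents are remembered so later candidates beneath them are skipped;
    existing parents are remembered so they are not probed again.
    """
    missing_dirs = set()
    existing_dirs = set()
    for path in candidates:
        # Skip anything that lives under a directory known to be missing.
        if any(path.startswith(d + "/") for d in missing_dirs):
            continue
        if exists(path):
            return path
        # Probe ancestors upwards until one exists or is already known.
        parent = posixpath.dirname(path)
        while (parent not in ("", "/")
               and parent not in existing_dirs
               and parent not in missing_dirs):
            if exists(parent):
                existing_dirs.add(parent)
                break
            missing_dirs.add(parent)
            parent = posixpath.dirname(parent)
    return None
```

The parent probes cost a few extra checks of their own, so this approach pays off mainly when whole months or years are absent; when every parent exists, each parent is still probed only once.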



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
