[ https://issues.apache.org/jira/browse/OOZIE-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andras Piros updated OOZIE-3387:
--------------------------------
    Description:
During the data input dependency check, Oozie evaluates EL functions like {{coord:latest()}} in a non-optimal way, which may result in more HDFS URI checks than necessary.

1. If the {{dataset}} frequency does not match the {{uri-template}}, Oozie checks the same HDFS URI multiple times. For instance, in the following definition:
{noformat}
<dataset name="dataset1" frequency="${coord:minutes(1)}" initial-instance="2017-01-01T08:15Z" timezone="UTC">
  <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
  <done-flag>_SUCCESS</done-flag>
</dataset>
...
<data-in name="coordInput" dataset="dataset1">
  <instance>${coord:latest(0)}</instance>
</data-in>
{noformat}
Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It would be enough to check the file only once and skip the other 1439 checks.

2. If the frequency is 1 day and the {{uri-template}} is defined in the following way:
{noformat}
<uri-template>${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}</uri-template>
{noformat}
Oozie will check the following directories one by one, even if some of the parent directories are missing:
{noformat}
2018/11/20
2018/11/19
2018/11/18
...
{noformat}
If there is no {{2018/11}} directory, then it is not necessary to check all the {{2018/11/xx}} directories. It would be possible to reduce the number of HDFS URI checks.

  was:
During the data input dependency check, Oozie evaluates EL functions like {{ coord:latest}} in a non-optimal way, which may result in more HDFS URI checks than necessary.

1. If the {{dataset}} frequency does not match the {{uri-template}}, Oozie checks the same HDFS URI multiple times. For instance, in the following definition:
{noformat}
<dataset name="dataset1" frequency="${coord:minutes(1)}" initial-instance="2017-01-01T08:15Z" timezone="UTC">
  <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
  <done-flag>_SUCCESS</done-flag>
</dataset>
...
<data-in name="coordInput" dataset="dataset1">
  <instance>${coord:latest(0)}</instance>
</data-in>
{noformat}
Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It would be enough to check the file only once and skip the other 1439 checks.

2. If the frequency is 1 day and the {{uri-template}} is defined in the following way:
{noformat}
<uri-template>${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}</uri-template>
{noformat}
Oozie will check the following directories one by one, even if some of the parent directories are missing:
{noformat}
2018/11/20
2018/11/19
2018/11/18
...
{noformat}
If there is no {{2018/11}} directory, then it is not necessary to check all the {{2018/11/xx}} directories. It would be possible to reduce the number of HDFS URI checks.


> Optimize coordinator data input dependency search
> -------------------------------------------------
>
>                 Key: OOZIE-3387
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3387
>             Project: Oozie
>          Issue Type: Improvement
>    Affects Versions: 5.1.0
>            Reporter: Andras Salamon
>            Priority: Major
>
> During the data input dependency check, Oozie evaluates EL functions like
> {{coord:latest()}} in a non-optimal way, which may result in more HDFS URI
> checks than necessary.
> 1. If the {{dataset}} frequency does not match the {{uri-template}}, Oozie
> checks the same HDFS URI multiple times. For instance, in the following
> definition:
> {noformat}
> <dataset name="dataset1" frequency="${coord:minutes(1)}"
>          initial-instance="2017-01-01T08:15Z" timezone="UTC">
>   <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
>   <done-flag>_SUCCESS</done-flag>
> </dataset>
> ...
> <data-in name="coordInput" dataset="dataset1">
>   <instance>${coord:latest(0)}</instance>
> </data-in>
> {noformat}
> Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It
> would be enough to check the file only once and skip the other 1439 checks.
> 2. If the frequency is 1 day and the {{uri-template}} is defined in the
> following way:
> {noformat}
> <uri-template>${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}</uri-template>
> {noformat}
> Oozie will check the following directories one by one, even if some of the
> parent directories are missing:
> {noformat}
> 2018/11/20
> 2018/11/19
> 2018/11/18
> ...
> {noformat}
> If there is no {{2018/11}} directory, then it is not necessary to check all
> the {{2018/11/xx}} directories. It would be possible to reduce the number of
> HDFS URI checks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
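
The two reductions described in the issue can be sketched roughly as follows. This is only an illustration of the idea, not Oozie code: it is written in Python for brevity, the names {{resolve}}, {{find_latest}} and {{exists}} are hypothetical, {{exists}} stands in for the real HDFS lookup (including the done-flag), and only the immediate parent directory is considered.

```python
from datetime import datetime, timedelta


def resolve(template, t):
    """Fill the date placeholders of a simplified uri-template."""
    return (template.replace("${YEAR}", f"{t.year:04d}")
            .replace("${MONTH}", f"{t.month:02d}")
            .replace("${DAY}", f"{t.day:02d}"))


def find_latest(template, start, initial, step, exists):
    """Walk dataset instances backwards from `start` down to `initial`.

    Returns the newest URI for which `exists(uri)` is true, or None.
    `exists` is a hypothetical stand-in for the HDFS existence check.
    """
    checked = set()      # URIs already seen (case 1 in the issue)
    parent_state = {}    # parent directory -> does it exist? (case 2)
    t = start
    while t >= initial:
        uri = resolve(template, t)
        t -= step
        if uri in checked:
            continue     # same URI as an earlier instance, skip it
        checked.add(uri)
        parent = uri.rsplit("/", 1)[0]
        if parent_state.get(parent) is False:
            continue     # parent is known to be missing, skip the child
        if exists(uri):
            return uri
        if parent not in parent_state:
            # Assumption: one extra lookup per parent pays off when it
            # lets us skip the remaining children of a missing directory.
            parent_state[parent] = exists(parent)
    return None
```

With a one-minute frequency and a daily template, the 1440 instances of a day resolve to a single URI, so only the first one reaches {{exists}}. With a daily frequency and a {{YEAR/MONTH/DAY}} template, a missing month directory is detected after the first missing day and the remaining days of that month are skipped.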