[jira] [Updated] (OOZIE-3387) Optimize coordinator data input dependency search

2018-11-20 Thread Andras Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Piros updated OOZIE-3387:

Description: 
During the data input dependency check, Oozie evaluates EL functions like 
{{coord:latest()}} in a non-optimal way, which may result in more HDFS URI 
checks than necessary.

1. If the {{dataset}} frequency does not match the {{uri-template}}, Oozie 
checks the same HDFS URI multiple times. For instance, in the following 
definition:
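{noformat}
<dataset ... initial-instance="2017-01-01T08:15Z" timezone="UTC">
  <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
  <done-flag>_SUCCESS</done-flag>
</dataset>
...
<data-in ...>
  <instance>${coord:latest(0)}</instance>
</data-in>
{noformat}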
Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It 
would be enough to check the file only once and skip the other 1439 checks.

2. If the frequency is 1 day and the {{uri-template}} is defined in the 
following way:
{noformat}
${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}
{noformat}
Oozie will check the following directories one by one, even if some of the 
parent directories are missing:
{noformat}
2018/11/20
2018/11/19
2018/11/18
...
{noformat}
If there is no {{2018/11}} directory, then it is not necessary to check any of 
the {{2018/11/xx}} directories. Skipping them would reduce the number of HDFS 
URI checks.
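
As a rough illustration of both ideas, here is a minimal Python sketch (not 
Oozie code). It assumes a hypothetical exists(path) probe standing in for the 
HDFS check, and paths given relative to the root directory; the names resolve, 
PrunedChecker and find_latest are made up for this example. Results are 
remembered per path, so a URI that resolves identically for many instances is 
probed once, and a path under a directory already known to be missing is not 
probed at all.

```python
from datetime import datetime, timedelta


def resolve(template, t):
    """Substitute the date variables used in the uri-template examples."""
    return (template.replace("${YEAR}", f"{t.year:04d}")
                    .replace("${MONTH}", f"{t.month:02d}")
                    .replace("${DAY}", f"{t.day:02d}"))


class PrunedChecker:
    """Remembers probe results and skips paths below a missing ancestor."""

    def __init__(self, exists):
        self._exists = exists  # hypothetical stand-in for an HDFS existence check
        self._known = {}       # path -> bool, results of earlier probes
        self.probes = 0        # number of real probes issued so far

    def _probe(self, path):
        if path not in self._known:
            self.probes += 1
            self._known[path] = self._exists(path)
        return self._known[path]

    def check(self, path):
        # Walk from the top-level directory down to the leaf and stop at the
        # first missing component; cached components cost no extra probe.
        parts = path.strip("/").split("/")
        for i in range(1, len(parts) + 1):
            if not self._probe("/".join(parts[:i])):
                return False
        return True


def find_latest(checker, template, start, stop, step, done_flag=None):
    """Return the newest existing path between start and stop, or None."""
    t = start
    while t >= stop:
        path = resolve(template, t)
        if done_flag:
            path = f"{path}/{done_flag}"
        if checker.check(path):
            return path
        t -= step  # move to the previous dataset instance
    return None
```

With a one-minute frequency and the daily template from point 1, all 1440 
instances of a day resolve to the same path, so only the first one triggers 
real probes. The top-down walk costs extra probes for the ancestors when the 
data is present, so a real implementation might prefer to probe the leaf first 
and walk up only after a miss.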

  was:
During the data input dependency check, Oozie evaluates EL functions like {{ 
coord:latest}} in a non-optimal way, which may result in more HDFS URI checks 
than necessary.

1. If the {{dataset}} frequency does not match the {{uri-template}}, Oozie 
checks the same HDFS URI multiple times. For instance, in the following 
definition:
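{noformat}
<dataset ... initial-instance="2017-01-01T08:15Z" timezone="UTC">
  <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
  <done-flag>_SUCCESS</done-flag>
</dataset>
...
<data-in ...>
  <instance>${coord:latest(0)}</instance>
</data-in>
{noformat}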
Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It 
would be enough to check the file only once and skip the other 1439 checks.

2. If the frequency is 1 day and the {{uri-template}} is defined in the 
following way:
{noformat}
${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}
{noformat}
Oozie will check the following directories one by one, even if some of the 
parent directories are missing:
{noformat}
2018/11/20
2018/11/19
2018/11/18
...
{noformat}
If there is no {{2018/11}} directory, then it is not necessary to check any of 
the {{2018/11/xx}} directories. Skipping them would reduce the number of HDFS 
URI checks.


> Optimize coordinator data input dependency search
> -------------------------------------------------
>
> Key: OOZIE-3387
> URL: https://issues.apache.org/jira/browse/OOZIE-3387
> Project: Oozie
> Issue Type: Improvement
> Affects Versions: 5.1.0
> Reporter: Andras Salamon
> Priority: Major
>
> During the data input dependency check, Oozie evaluates EL functions like 
> {{coord:latest()}} in a non-optimal way, which may result in more HDFS URI 
> checks than necessary.
> 1. If the {{dataset}} frequency does not match the {{uri-template}}, Oozie 
> checks the same HDFS URI multiple times. For instance, in the following 
> definition:
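> {noformat}
> <dataset ... initial-instance="2017-01-01T08:15Z" timezone="UTC">
>   <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
>   <done-flag>_SUCCESS</done-flag>
> </dataset>
> ...
> <data-in ...>
>   <instance>${coord:latest(0)}</instance>
> </data-in>
> {noformat}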
> Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It 
> would be enough to check the file only once and skip the other 1439 checks.
> 2. If the frequency is 1 day and the {{uri-template}} is defined in the 
> following way:
> {noformat}
> ${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}
> {noformat}
> Oozie will check the following directories one by one, even if some of the 
> parent directories are missing:
> {noformat}
> 2018/11/20
> 2018/11/19
> 2018/11/18
> ...
> {noformat}
> If there is no {{2018/11}} directory, then it is not necessary to check any 
> of the {{2018/11/xx}} directories. Skipping them would reduce the number of 
> HDFS URI checks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3387) Optimize coordinator data input dependency search

2018-11-20 Thread Andras Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Piros updated OOZIE-3387:

Affects Version/s: 5.1.0

> Optimize coordinator data input dependency search
> -------------------------------------------------
>
> Key: OOZIE-3387
> URL: https://issues.apache.org/jira/browse/OOZIE-3387
> Project: Oozie
> Issue Type: Improvement
> Affects Versions: 5.1.0
> Reporter: Andras Salamon
> Priority: Major


