[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-07-22 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: (was: OOZIE-3265-v3.patch)

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, rerun.patch
>
>
> Currently, once you re-run a workflow with the property
> "oozie.wf.rerun.failnodes" set to true, you can no longer re-run it with the
> "oozie.wf.rerun.skip.nodes" property specified, even if you set
> "oozie.wf.rerun.failnodes" to false.
> This limitation is not reasonable. Only one case should be disallowed:
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not
> null or empty.
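
To illustrate the rule described above, here is a minimal Java sketch of the proposed check. It is not taken from the attached patches: the class and method names are hypothetical, and only the two property keys come from the issue description.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not Oozie's actual implementation.
final class RerunPropertyCheck {
    static final String RERUN_FAIL_NODES = "oozie.wf.rerun.failnodes";
    static final String RERUN_SKIP_NODES = "oozie.wf.rerun.skip.nodes";

    /** Returns true when the combination of rerun properties is acceptable. */
    static boolean isValid(Properties conf) {
        boolean failNodes =
                Boolean.parseBoolean(conf.getProperty(RERUN_FAIL_NODES, "false"));
        String skipNodes = conf.getProperty(RERUN_SKIP_NODES);
        boolean hasSkipNodes = skipNodes != null && !skipNodes.trim().isEmpty();
        // Only "failnodes=true together with a non-empty skip list" is rejected.
        return !(failNodes && hasSkipNodes);
    }
}
```

Under this sketch, a rerun with failnodes set to false and a skip list is accepted regardless of how the workflow was re-run before.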



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-07-22 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: (was: OOZIE-3265-v3.patch)

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, 
> OOZIE-3265-v3.patch, rerun.patch
>
>
> Currently, once you re-run a workflow with the property
> "oozie.wf.rerun.failnodes" set to true, you can no longer re-run it with the
> "oozie.wf.rerun.skip.nodes" property specified, even if you set
> "oozie.wf.rerun.failnodes" to false.
> This limitation is not reasonable. Only one case should be disallowed:
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not
> null or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-07-22 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: OOZIE-3265-v3.patch

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, 
> OOZIE-3265-v3.patch, rerun.patch
>
>
> Currently, once you re-run a workflow with the property
> "oozie.wf.rerun.failnodes" set to true, you can no longer re-run it with the
> "oozie.wf.rerun.skip.nodes" property specified, even if you set
> "oozie.wf.rerun.failnodes" to false.
> This limitation is not reasonable. Only one case should be disallowed:
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not
> null or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-07-22 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: OOZIE-3265-v3.patch

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, 
> OOZIE-3265-v3.patch, rerun.patch
>
>
> Currently, once you re-run a workflow with the property
> "oozie.wf.rerun.failnodes" set to true, you can no longer re-run it with the
> "oozie.wf.rerun.skip.nodes" property specified, even if you set
> "oozie.wf.rerun.failnodes" to false.
> This limitation is not reasonable. Only one case should be disallowed:
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not
> null or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-07-22 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: (was: OOZIE-3265-v3.patch)

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, rerun.patch
>
>
> Currently, once you re-run a workflow with the property
> "oozie.wf.rerun.failnodes" set to true, you can no longer re-run it with the
> "oozie.wf.rerun.skip.nodes" property specified, even if you set
> "oozie.wf.rerun.failnodes" to false.
> This limitation is not reasonable. Only one case should be disallowed:
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not
> null or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-07-22 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: OOZIE-3265-v3.patch

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, 
> OOZIE-3265-v3.patch, rerun.patch
>
>
> Currently, once you re-run a workflow with the property
> "oozie.wf.rerun.failnodes" set to true, you can no longer re-run it with the
> "oozie.wf.rerun.skip.nodes" property specified, even if you set
> "oozie.wf.rerun.failnodes" to false.
> This limitation is not reasonable. Only one case should be disallowed:
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not
> null or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3266) Coord action rerun support RERUN_SKIP_NODES option

2018-07-20 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3266:
-
Attachment: (was: OOZIE-3266-v1.txt)

> Coord action rerun support RERUN_SKIP_NODES option
> --
>
> Key: OOZIE-3266
> URL: https://issues.apache.org/jira/browse/OOZIE-3266
> Project: Oozie
>  Issue Type: New Feature
>Affects Versions: 5.1.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: OOZIE-3266-v1.txt
>
>
> Currently, when you re-run a workflow job, you have 3 options:
>  # re-run all of its action nodes
>  # re-run failed nodes only
>  # re-run with specified nodes skipped
> If this workflow job was generated by a coord action, you can re-run the
> coord action with 2 options:
>  # re-run all of the workflow action nodes (generates a new workflow job id)
>  # re-run failed workflow action nodes only (workflow job id not changed)
> Now we want to add another option:
>  - re-run with specified workflow nodes skipped (workflow job id not changed)
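
The options listed above can be summarised in a small Java sketch. This is only an illustration of the description, not code from the attachment: the enum and its field are hypothetical names, and the third constant represents the option this issue proposes to add.

```java
// Hypothetical model of the coord action rerun options; not Oozie code.
final class CoordRerunSketch {
    enum Mode {
        ALL_NODES(true),      // re-run every workflow action node
        FAILED_NODES(false),  // re-run failed workflow action nodes only
        SKIP_NODES(false);    // proposed: re-run with specified nodes skipped

        /** Whether the rerun generates a new workflow job id. */
        final boolean newWorkflowJobId;

        Mode(boolean newWorkflowJobId) {
            this.newWorkflowJobId = newWorkflowJobId;
        }
    }
}
```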



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3266) Coord action rerun support RERUN_SKIP_NODES option

2018-07-20 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3266:
-
Attachment: OOZIE-3266-v1.txt

> Coord action rerun support RERUN_SKIP_NODES option
> --
>
> Key: OOZIE-3266
> URL: https://issues.apache.org/jira/browse/OOZIE-3266
> Project: Oozie
>  Issue Type: New Feature
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3266-v1.txt
>
>
> Currently, when you re-run a workflow job, you have 3 options:
>  # re-run all of its action nodes
>  # re-run failed nodes only
>  # re-run with specified nodes skipped
> If this workflow job was generated by a coord action, you can re-run the
> coord action with 2 options:
>  # re-run all of the workflow action nodes (generates a new workflow job id)
>  # re-run failed workflow action nodes only (workflow job id not changed)
> Now we want to add another option:
>  - re-run with specified workflow nodes skipped (workflow job id not changed)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-06-27 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525172#comment-16525172
 ] 

TIAN XING commented on OOZIE-3265:
--

[~andras.piros], [~gezapeti] hey, sorry for the late response. I just settled
down in a new city and will finish this patch soon.

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, rerun.patch
>
>
> Currently, once you re-run a workflow with the property
> "oozie.wf.rerun.failnodes" set to true, you can no longer re-run it with the
> "oozie.wf.rerun.skip.nodes" property specified, even if you set
> "oozie.wf.rerun.failnodes" to false.
> This limitation is not reasonable. Only one case should be disallowed:
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not
> null or empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-06-04 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: OOZIE-3156-v6.patch

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, OOZIE-3156-v4.patch, OOZIE-3156-v5.patch, 
> OOZIE-3156-v6.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.
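
The distinction described above can be sketched in Java as follows. This is a hedged illustration rather than the logic of {{SshActionExecutor}} or of any attached patch: the class, enum, and method names are hypothetical, and throwing an exception on exit status 255 is only one possible way to avoid reporting OK.

```java
// Hypothetical sketch; the names do not come from SshActionExecutor.
final class SshCheckSketch {
    enum Status { RUNNING, FINISHED }

    /** Exit status that ssh itself returns when an error occurred. */
    static final int SSH_ERROR = 255;

    /** Interprets the exit status of the remote "ps -p" check run over ssh. */
    static Status interpret(int exitStatus) {
        if (exitStatus == SSH_ERROR) {
            // The check never reached the host, so nothing is known about the pid.
            throw new IllegalStateException(
                    "ssh failed to connect; action status is unknown");
        }
        // ps -p exits with 0 while the pid exists and non-zero once it is gone.
        return exitStatus == 0 ? Status.RUNNING : Status.FINISHED;
    }
}
```

Treating 255 separately keeps a connection failure from being read as "the process has finished".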



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-06-04 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: OOZIE-3156-v5.patch

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, OOZIE-3156-v4.patch, OOZIE-3156-v5.patch, 
> ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-06-03 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: OOZIE-3156-v4.patch

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, OOZIE-3156-v4.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-06-03 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: OOZIE-3156-v3.patch

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-06-03 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: (was: OOZIE-3156-v3.patch)

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-06-03 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499655#comment-16499655
 ] 

TIAN XING commented on OOZIE-3156:
--

[~andras.piros], thanks for the clarification! The reason I didn't extract the
lines that create an ssh action bean into a separate method called
{{createSshWorkflowAction()}} is that this kind of action-bean-creation code
appears in nearly all test methods of {{TestSshActionExecutor}}. Shall I apply
the same extraction to all of them as well?

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-31 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496518#comment-16496518
 ] 

TIAN XING commented on OOZIE-3156:
--

Hey [~andras.piros], thanks for the review.

In {{TestSshActionExecutor#testSshCheckWithHostConnectFailure()}}, I copied the
code from {{TestSshActionExecutor#testJobStart}}, which gives us an example
that ends with OK status. In order to create an "SSH connection failure"
situation, I changed the action's {{TrackerUri}} from "{{@localhost}}" to
"{{dummy@dummyHost}}" during the action status check. An exception is expected
to be thrown, whereas before this patch the check method executed normally
and ended with OK status.

Do you have any better suggestions on how to design such a test case? Thanks!

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-30 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496138#comment-16496138
 ] 

TIAN XING commented on OOZIE-3156:
--

[~andras.piros]  Test case added.

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-30 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: OOZIE-3156-v3.patch

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> OOZIE-3156-v3.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-30 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: OOZIE-3156-v2.patch

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, 
> ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-30 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Attachment: OOZIE-3156-v1.patch

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch, ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over ssh and checks whether the pid of the process that
> the ssh action started still exists (by checking the return value of the
> command "{{ssh  ps -p }}") to determine whether the ssh action has completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load or the network is bad).
> In such cases, the return value of the command "{{ssh  ps -p }}" is 255
> (ssh exits with the exit status of the remote command, or with 255 if an
> error occurred).
> According to the current logic of the method {{getActionStatus()}} in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3267) Re-run coord action without checking input dependencies

2018-05-30 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3267:
-
Attachment: (was: OOZIE-3156-v1.patch)

> Re-run coord action without checking input dependencies
> ---
>
> Key: OOZIE-3267
> URL: https://issues.apache.org/jira/browse/OOZIE-3267
> Project: Oozie
>  Issue Type: New Feature
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
>
> A coord action will get stuck in WAITING status if some of its input
> dependencies are missing.
> However, there are cases where users want to force-start the coord action
> without checking input dependencies.
> For example, training a daily model needs 24 hourly input files to get the
> best accuracy (23 input files are also OK, but the accuracy may decrease).
> In some cases (e.g., one hourly input file is delayed), it is more important
> to get the model on time than to get high accuracy.
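
The requested behaviour can be sketched in Java as a simple decision. This is an illustration under assumptions, not Oozie code: the class, the method, and the skipDependencyCheck flag are hypothetical names, since the issue does not say how the option would be exposed.

```java
// Hypothetical sketch of the force-start decision; not Oozie code.
final class ForceStartSketch {
    /**
     * Decides whether a coord action may leave WAITING.
     * skipDependencyCheck stands for the proposed "re-run without checking
     * input dependencies" option.
     */
    static boolean canStart(int availableInputs, int requiredInputs,
                            boolean skipDependencyCheck) {
        return skipDependencyCheck || availableInputs >= requiredInputs;
    }
}
```

With 23 of 24 hourly inputs present, the action would stay in WAITING by default and start only when the flag is set.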



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3267) Re-run coord action without checking input dependencies

2018-05-30 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3267:
-
Attachment: OOZIE-3156-v1.patch

> Re-run coord action without checking input dependencies
> ---
>
> Key: OOZIE-3267
> URL: https://issues.apache.org/jira/browse/OOZIE-3267
> Project: Oozie
>  Issue Type: New Feature
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: OOZIE-3156-v1.patch
>
>
> A coord action will get stuck in WAITING status if some of its input
> dependencies are missing.
> However, there are cases where users want to force-start the coord action
> without checking input dependencies.
> For example, training a daily model needs 24 hourly input files to get the
> best accuracy (23 input files are also OK, but the accuracy may decrease).
> In some cases (e.g., one hourly input file is delayed), it is more important
> to get the model on time than to get high accuracy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-30 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494928#comment-16494928
 ] 

TIAN XING commented on OOZIE-3156:
--

[~andras.piros] Will work on it, thanks for the comments.

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie 
> connects to the host over ssh and checks whether the pid of the process that 
> the ssh action started is still there (by checking the exit status of the 
> command "{{ssh <host> ps -p <pid>}}") to determine whether the ssh action has 
> completed.
> However, we found cases where Oozie fails to connect to the host during the 
> action status check (e.g., the host is under heavy load, or the network is bad).
> In such cases, the exit status of the command "{{ssh <host> ps -p <pid>}}" 
> will be 255 (ssh exits with the exit status of the remote command, or with 
> 255 if an error occurred).
> According to the current logic of the {{getActionStatus()}} method in 
> {{SshActionExecutor}}, the action status will be determined as OK, which may 
> not be correct.





[jira] [Commented] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-05-30 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494770#comment-16494770
 ] 

TIAN XING commented on OOZIE-3265:
--

[~gezapeti] test case added

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, rerun.patch
>
>
> Currently when you re-run a workflow with property "oozie.wf.rerun.failnodes" 
>  set to true,
> you can no longer re-run it again with "oozie.wf.rerun.skip.nodes" property 
> specified, even if you set "oozie.wf.rerun.failnodes" to false.
> This kind of limitation is not reasonable. There is only one case where 
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not 
> null or empty, that should be disallowed.





[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-05-30 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: OOZIE-3265-v2.patch

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, OOZIE-3265-v2.patch, rerun.patch
>
>
> Currently when you re-run a workflow with property "oozie.wf.rerun.failnodes" 
>  set to true,
> you can no longer re-run it again with "oozie.wf.rerun.skip.nodes" property 
> specified, even if you set "oozie.wf.rerun.failnodes" to false.
> This kind of limitation is not reasonable. There is only one case where 
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not 
> null or empty, that should be disallowed.





[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-05-29 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Attachment: OOZIE-3265-v1.patch

> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: OOZIE-3265-v1.patch, rerun.patch
>
>
> Currently when you re-run a workflow with property "oozie.wf.rerun.failnodes" 
>  set to true,
> you can no longer re-run it again with "oozie.wf.rerun.skip.nodes" property 
> specified, even if you set "oozie.wf.rerun.failnodes" to false.
> This kind of limitation is not reasonable. There is only one case where 
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not 
> null or empty, that should be disallowed.





[jira] [Commented] (OOZIE-3267) Re-run coord action without checking input dependencies

2018-05-29 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493591#comment-16493591
 ] 

TIAN XING commented on OOZIE-3267:
--

[~gezapeti] Let's assume the coord job trains a prediction model every day at 
0:01, based on yesterday's 24 input files (one file per hour). 

If all 24 files are available, the prediction model reaches its best accuracy. 
If some input files are missing, we can still build the model, but with lower 
accuracy.

In most cases we want to wait for all input files. However, suppose it is 2am, 
yesterday's hour-09 input file is still missing, and we must deliver the 
prediction model to the production environment before 3am, even with lower 
accuracy. In that case we need to start the coord action right away without 
checking inputs.

The coordInput OR logic won't help. To use OR logic, we have to know in advance 
which input files are conditional. What's more, once the coord input logic is 
defined, the coord action will always start without waiting for certain inputs. 
In our example, it is the user who decides whether the coord action waits for 
certain inputs, which is more flexible.
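
As a rough illustration of the behaviour requested here (not part of any patch), the decision could be sketched as follows. The helper `next_coord_status` and its `force_start` flag are hypothetical names made up for this example; only the WAITING status comes from the issue text, and READY is assumed as the name of the state an action enters once it may start.

```python
from typing import Iterable, List, Tuple


def next_coord_status(required: Iterable[str],
                      available: Iterable[str],
                      force_start: bool = False) -> Tuple[str, List[str]]:
    """Return (status, missing_inputs) for one coord action (illustrative only)."""
    missing = sorted(set(required) - set(available))
    if missing and not force_start:
        # Default behaviour from the issue: stay in WAITING until every input exists.
        return "WAITING", missing
    # Either all inputs exist, or the user explicitly chose to skip the check.
    return "READY", missing


# Scenario from the comment: 24 hourly files, hour-09 has not arrived yet.
hours = [f"hour-{h:02d}" for h in range(24)]
arrived = [h for h in hours if h != "hour-09"]

print(next_coord_status(hours, arrived))                    # ('WAITING', ['hour-09'])
print(next_coord_status(hours, arrived, force_start=True))  # ('READY', ['hour-09'])
```

The point of the flag is that the choice is made per run by the user, instead of being baked into the coordinator's input logic.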

> Re-run coord action without checking input dependencies
> ---
>
> Key: OOZIE-3267
> URL: https://issues.apache.org/jira/browse/OOZIE-3267
> Project: Oozie
>  Issue Type: New Feature
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
>
> A coord action will get stuck in WAITING status if some of its input 
> dependencies are missing.
> However, there are cases where users want to force start the coord action 
> without checking input dependencies.
> For example, training a daily model needs 24 hourly input files for best 
> accuracy (23 input files also work, but accuracy may decrease).
> In some cases (e.g., one hourly input file is delayed), it is more important 
> to get the model on time than to get high accuracy.





[jira] [Created] (OOZIE-3267) Re-run coord action without checking input dependencies

2018-05-29 Thread TIAN XING (JIRA)
TIAN XING created OOZIE-3267:


 Summary: Re-run coord action without checking input dependencies
 Key: OOZIE-3267
 URL: https://issues.apache.org/jira/browse/OOZIE-3267
 Project: Oozie
  Issue Type: New Feature
Reporter: TIAN XING
Assignee: TIAN XING


A coord action will get stuck in WAITING status if some of its input 
dependencies are missing.

However, there are cases where users want to force start the coord action 
without checking input dependencies.

For example, training a daily model needs 24 hourly input files for best 
accuracy (23 input files also work, but accuracy may decrease).

In some cases (e.g., one hourly input file is delayed), it is more important to 
get the model on time than to get high accuracy.





[jira] [Created] (OOZIE-3266) Coord action rerun support RERUN_SKIP_NODES option

2018-05-29 Thread TIAN XING (JIRA)
TIAN XING created OOZIE-3266:


 Summary: Coord action rerun support RERUN_SKIP_NODES option
 Key: OOZIE-3266
 URL: https://issues.apache.org/jira/browse/OOZIE-3266
 Project: Oozie
  Issue Type: New Feature
Reporter: TIAN XING
Assignee: TIAN XING


Currently, when you re-run a workflow job, you have 3 options:
 # re-run all of its action nodes
 # re-run failed nodes only
 # re-run with specified nodes skipped

If the workflow job was generated by a coord action, you can re-run that coord 
action with 2 options:
 # re-run all of the workflow action nodes (generate a new workflow job id)
 # re-run failed workflow action nodes only (workflow job id not changed)

Now we want to add another option:
 - re-run with specified workflow nodes skipped (workflow job id not changed)





[jira] [Updated] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-05-29 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3265:
-
Description: 
Currently when you re-run a workflow with property "oozie.wf.rerun.failnodes"  
set to true,

you can no longer re-run it again with "oozie.wf.rerun.skip.nodes" property 
specified, even if you set "oozie.wf.rerun.failnodes" to false.

This kind of limitation is not reasonable. There is only one case where 
"oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not null 
or empty, that should be disallowed.

  was:
Currently when you re-run a workflow with property "oozie.wf.rerun.failnodes"  
set to true,

you can no longer re-run it again with "oozie.wf.rerun.skip.nodes" property 
specified, even if you set "oozie.wf.rerun.failnodes" to false.

This kind of limitation is not reasonable. There is only one case where 
"oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.failnodes" is not null 
or empty, that should be disallowed.


> properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear 
> together
> --
>
> Key: OOZIE-3265
> URL: https://issues.apache.org/jira/browse/OOZIE-3265
> Project: Oozie
>  Issue Type: Task
>Affects Versions: 5.0.0
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Minor
> Attachments: rerun.patch
>
>
> Currently when you re-run a workflow with property "oozie.wf.rerun.failnodes" 
>  set to true,
> you can no longer re-run it again with "oozie.wf.rerun.skip.nodes" property 
> specified, even if you set "oozie.wf.rerun.failnodes" to false.
> This kind of limitation is not reasonable. There is only one case where 
> "oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.skip.nodes" is not 
> null or empty, that should be disallowed.
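
The rule proposed in the description can be sketched as a single check. This is a minimal illustration, not the attached patch: the function name `is_valid_rerun_conf` is hypothetical, and treating a missing "oozie.wf.rerun.failnodes" as false is an assumption of this example. Only the two property names are taken from the issue.

```python
RERUN_FAIL_NODES = "oozie.wf.rerun.failnodes"
RERUN_SKIP_NODES = "oozie.wf.rerun.skip.nodes"


def is_valid_rerun_conf(conf: dict) -> bool:
    """Reject only the one combination the issue says should be disallowed."""
    # Assumption: an absent failnodes property counts as "false".
    fail_nodes = str(conf.get(RERUN_FAIL_NODES, "false")).strip().lower() == "true"
    # A skip list counts as "set" only if it is neither missing nor blank.
    skip_nodes = (conf.get(RERUN_SKIP_NODES) or "").strip()
    return not (fail_nodes and skip_nodes)


# Both properties present, but failnodes is false: allowed under the proposal.
print(is_valid_rerun_conf({RERUN_FAIL_NODES: "false", RERUN_SKIP_NODES: "a,b"}))  # True
# failnodes is true and a skip list is given: the single disallowed case.
print(is_valid_rerun_conf({RERUN_FAIL_NODES: "true", RERUN_SKIP_NODES: "a,b"}))   # False
```

The check looks only at the values supplied for the current re-run, so a previous re-run with failnodes set to true does not block a later one that uses a skip list.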





[jira] [Commented] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-29 Thread TIAN XING (JIRA)


[ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493440#comment-16493440
 ] 

TIAN XING commented on OOZIE-3156:
--

[~andras.piros] hey Andras, any news on this patch?

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie 
> connects to the host over ssh and checks whether the pid of the process that 
> the ssh action started is still there (by checking the exit status of the 
> command "{{ssh <host> ps -p <pid>}}") to determine whether the ssh action has 
> completed.
> However, we found cases where Oozie fails to connect to the host during the 
> action status check (e.g., the host is under heavy load, or the network is bad).
> In such cases, the exit status of the command "{{ssh <host> ps -p <pid>}}" 
> will be 255 (ssh exits with the exit status of the remote command, or with 
> 255 if an error occurred).
> According to the current logic of the {{getActionStatus()}} method in 
> {{SshActionExecutor}}, the action status will be determined as OK, which may 
> not be correct.





[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-05-29 Thread TIAN XING (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Description: 
When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie connects 
to the host over ssh and checks whether the pid of the process that the ssh 
action started is still there (by checking the exit status of the command 
"{{ssh <host> ps -p <pid>}}") to determine whether the ssh action has completed.

However, we found cases where Oozie fails to connect to the host during the 
action status check (e.g., the host is under heavy load, or the network is bad).

In such cases, the exit status of the command "{{ssh <host> ps -p <pid>}}" will 
be 255 (ssh exits with the exit status of the remote command, or with 255 if an 
error occurred).

According to the current logic of the {{getActionStatus()}} method in 
{{SshActionExecutor}}, the action status will be determined as OK, which may not 
be correct.

  was:
When {{check()}} method of {{SshActionExecutor}} gets invoked, oozie will ssh 
connect to the host and check whether action shell pid is still there (by 
checking the returned value of command {{ssh $hostIp ps -p $pid}} ) to 
determine whether the action is running or not.

However, there are cases where oozie fails to connect to the host during action 
status check (e.g., the host is under heavy load, or network is bad etc.).

In such cases, the return value of the command {{ssh $hostIp ps -p $pid}} will 
be 255 (ssh command exits with the exit status of the remote command or with 
255 if an error occurred.).

According to the current logic of method {{getActionStatus()}} in 
{{SshActionExecutor}}, the action status will be determined as OK which may not 
be correct. 


> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Reporter: TIAN XING
>Assignee: TIAN XING
>Priority: Major
> Attachments: ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie 
> connects to the host over ssh and checks whether the pid of the process that 
> the ssh action started is still there (by checking the exit status of the 
> command "{{ssh <host> ps -p <pid>}}") to determine whether the ssh action has 
> completed.
> However, we found cases where Oozie fails to connect to the host during the 
> action status check (e.g., the host is under heavy load, or the network is bad).
> In such cases, the exit status of the command "{{ssh <host> ps -p <pid>}}" 
> will be 255 (ssh exits with the exit status of the remote command, or with 
> 255 if an error occurred).
> According to the current logic of the {{getActionStatus()}} method in 
> {{SshActionExecutor}}, the action status will be determined as OK, which may 
> not be correct.





[jira] [Created] (OOZIE-3265) properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together

2018-05-29 Thread TIAN XING (JIRA)
TIAN XING created OOZIE-3265:


 Summary: properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should 
be able to appear together
 Key: OOZIE-3265
 URL: https://issues.apache.org/jira/browse/OOZIE-3265
 Project: Oozie
  Issue Type: Task
Reporter: TIAN XING
 Attachments: rerun.patch

Currently when you re-run a workflow with property "oozie.wf.rerun.failnodes"  
set to true,

you can no longer re-run it again with "oozie.wf.rerun.skip.nodes" property 
specified, even if you set "oozie.wf.rerun.failnodes" to false.

This kind of limitation is not reasonable. There is only one case where 
"oozie.wf.rerun.failnodes" is true and "oozie.wf.rerun.failnodes" is not null 
or empty, that should be disallowed.





[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-01-11 Thread TIAN XING (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Remaining Estimate: (was: 1h)
 Original Estimate: (was: 1h)
   Description: 
When {{check()}} method of {{SshActionExecutor}} gets invoked, oozie will ssh 
connect to the host and check whether action shell pid is still there (by 
checking the returned value of command {{ssh $hostIp ps -p $pid}} ) to 
determine whether the action is running or not.

However, there are cases where oozie fails to connect to the host during action 
status check (e.g., the host is under heavy load, or network is bad etc.).

In such cases, the return value of the command {{ssh $hostIp ps -p $pid}} will 
be 255 (ssh command exits with the exit status of the remote command or with 
255 if an error occurred.).

According to the current logic of method {{getActionStatus()}} in 
{{SshActionExecutor}}, the action status will be determined as OK which may not 
be correct. 

  was:
When {{check()}} method of {{SshActionExecutor}} gets invoked, oozie will ssh 
connect to the host and check whether action shell pid is still there (by 
checking the returned value of command `{{ssh $hostIp ps -p $pid}}` ) to 
determine whether the action is running or not.

However, there are cases where oozie fails to connect to the host during action 
status check (e.g., the host is under heavy load, or network is bad etc.).

In such cases, the return value of the command `{{ssh $hostIp ps -p $pid}}` 
will be 255 (ssh command exits with the exit status of the remote command or 
with 255 if an error occurred.).

According to the current logic of method {{getActionStatus()}} in 
{{SshActionExecutor}}, the action status will be determined as OK which may not 
be correct. 


> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 4.0.0, 4.1.0, 4.2.0, 4.3.0
>Reporter: TIAN XING
> Fix For: 4.3.0
>
> Attachments: ssh-check-bug.patch
>
>
> When {{check()}} method of {{SshActionExecutor}} gets invoked, oozie will ssh 
> connect to the host and check whether action shell pid is still there (by 
> checking the returned value of command {{ssh $hostIp ps -p $pid}} ) to 
> determine whether the action is running or not.
> However, there are cases where oozie fails to connect to the host during 
> action status check (e.g., the host is under heavy load, or network is bad 
> etc.).
> In such cases, the return value of the command {{ssh $hostIp ps -p $pid}} 
> will be 255 (ssh command exits with the exit status of the remote command or 
> with 255 if an error occurred.).
> According to the current logic of method {{getActionStatus()}} in 
> {{SshActionExecutor}}, the action status will be determined as OK which may 
> not be correct. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-01-10 Thread TIAN XING (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TIAN XING updated OOZIE-3156:
-
Remaining Estimate: 1h
 Original Estimate: 1h

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
>  Issue Type: Bug
>  Components: action
>Affects Versions: 4.0.0, 4.1.0, 4.2.0, 4.3.0
>Reporter: TIAN XING
> Fix For: 4.3.0
>
> Attachments: ssh-check-bug.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When {{check()}} method of {{SshActionExecutor}} gets invoked, oozie will ssh 
> connect to the host and check whether action shell pid is still there (by 
> checking the returned value of command `{{ssh $hostIp ps -p $pid}}` ) to 
> determine whether the action is running or not.
> However, there are cases where oozie fails to connect to the host during 
> action status check (e.g., the host is under heavy load, or network is bad 
> etc.).
> In such cases, the return value of the command `{{ssh $hostIp ps -p $pid}}` 
> will be 255 (ssh command exits with the exit status of the remote command or 
> with 255 if an error occurred.).
> According to the current logic of method {{getActionStatus()}} in 
> {{SshActionExecutor}}, the action status will be determined as OK which may 
> not be correct. 





[jira] [Created] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host

2018-01-10 Thread TIAN XING (JIRA)
TIAN XING created OOZIE-3156:


 Summary: SSH action status turns OK wrongly when failed to connect 
to host
 Key: OOZIE-3156
 URL: https://issues.apache.org/jira/browse/OOZIE-3156
 Project: Oozie
  Issue Type: Bug
  Components: action
Affects Versions: 4.3.0, 4.2.0, 4.1.0, 4.0.0
Reporter: TIAN XING
 Fix For: 4.3.0
 Attachments: ssh-check-bug.patch

When {{check()}} method of {{SshActionExecutor}} gets invoked, oozie will ssh 
connect to the host and check whether action shell pid is still there (by 
checking the returned value of command `{{ssh $hostIp ps -p $pid}}` ) to 
determine whether the action is running or not.

However, there are cases where oozie fails to connect to the host during action 
status check (e.g., the host is under heavy load, or network is bad etc.).

In such cases, the return value of the command `{{ssh $hostIp ps -p $pid}}` 
will be 255 (ssh command exits with the exit status of the remote command or 
with 255 if an error occurred.).

According to the current logic of method {{getActionStatus()}} in 
{{SshActionExecutor}}, the action status will be determined as OK which may not 
be correct. 
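
The distinction the report asks for can be sketched as below. This is an illustrative outline only and not the code in ssh-check-bug.patch: the names `ProbeResult`, `classify_ps_exit` and `probe_remote_pid` are hypothetical, and treating any other non-zero exit status as "process no longer running" is an assumption of this example. The one fact taken from the report is that ssh itself exits with 255 when an error such as a failed connection occurs.

```python
import subprocess
from enum import Enum


class ProbeResult(Enum):
    RUNNING = "running"          # ps found the pid on the remote host
    NOT_RUNNING = "not_running"  # ps ran remotely but did not find the pid
    UNREACHABLE = "unreachable"  # ssh itself failed, so nothing is known

SSH_ERROR_EXIT = 255  # ssh's own exit status when an error occurred


def classify_ps_exit(exit_code: int) -> ProbeResult:
    """Map the exit status of `ssh $hostIp ps -p $pid` to a probe result."""
    if exit_code == 0:
        return ProbeResult.RUNNING
    if exit_code == SSH_ERROR_EXIT:
        # Connection problem: do not conclude that the action has finished.
        return ProbeResult.UNREACHABLE
    return ProbeResult.NOT_RUNNING


def probe_remote_pid(host: str, pid: str) -> ProbeResult:
    """Run the check over ssh (defined for illustration, not invoked here)."""
    proc = subprocess.run(["ssh", host, "ps", "-p", pid],
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return classify_ps_exit(proc.returncode)


for code in (0, 1, 255):
    print(code, classify_ps_exit(code).value)
```

With a separate UNREACHABLE outcome, a caller can retry the check later instead of treating a connection failure as a completed action.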


