[
https://issues.apache.org/jira/browse/OOZIE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cecily Myles updated OOZIE-3721:
--------------------------------
Description:
When my cluster is under heavy load, subworkflows hang in the "RUNNING"
status. I first hit this error when working with Hive tables, but I also
managed to reproduce it by launching the standard pi-estimation job in many
subworkflows to simulate the load.
I launch an Oozie workflow with the following structure:
{code:java}
-- Oozie workflow
------> subworkflow_1
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
------> subworkflow_2
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n {code}
One of the forks has the status "RUNNING", but when you open that fork, its
status is "SUCCESS".
Parent workflow:
{code:java}
Job ID : 0061971-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path : hdfs://mycluster:8020/user/cecyl/subwf/job
Status : RUNNING
Run : 0
User : cecyl
Group : -
Created : 2024-01-25 15:55 GMT
Started : 2024-01-25 15:55 GMT
Last Modified : 2024-01-30 06:24 GMT
Ended : -
CoordAction ID: -

Actions
----------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status Err Code
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK         -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork     OK       -                                     OK         -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork7    OK       0067643-240125161152217-oozie-oozi-W  SUCCEEDED  -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork9    OK       0067640-240125161152217-oozie-oozi-W  SUCCEEDED  -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork10   RUNNING  0067641-240125161152217-oozie-oozi-W  RUNNING    -
----------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork5    OK       0067645-240125161152217-oozie-oozi-W  SUCCEEDED  -
----------------------------------------------------------------------------------------------------------------
{code}
Running subworkflow:
{code:java}
Job ID : 0067641-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path : hdfs://mycluster:8020/user/cecyl/subwf
Status : RUNNING
Run : 0
User : cecyl
Group : -
Created : 2024-01-26 04:20 GMT
Started : 2024-01-26 04:20 GMT
Last Modified : 2024-01-26 08:23 GMT
Ended : -
CoordAction ID: 0061971-240125161152217-oozie-oozi-W

Actions
----------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status Err Code
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK         -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork     OK       -                                     OK         -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork21   RUNNING  application_1706187939089_147514      RUNNING    -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork22   RUNNING  application_1706187939089_147519      RUNNING    -
----------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork18   RUNNING  application_1706187939089_147518      RUNNING    -
----------------------------------------------------------------------------------------------------------------
{code}
However, the YARN application that Oozie still reports as running has State "FINISHED" and Final-State "SUCCEEDED":
{code:java}
Application Report :
Application-Id : application_1706187939089_147514
Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
Application-Type : Oozie Launcher
User : cecyl
Queue : default
Application Priority : 0
Start-Time : 1706259786568
Finish-Time : 1706259853156
Progress : 100%
State : FINISHED
Final-State : SUCCEEDED {code}
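To make the mismatch explicit, here is a minimal Python sketch that cross-checks the two views shown above. It is only an illustration: the dictionaries are copied by hand from the listings in this ticket, and the names `oozie_actions`, `yarn_reports`, `to_utc` and `find_stale_actions` are hypothetical helpers, not part of Oozie or YARN. Only one application report was posted, so the other two applications are left out rather than guessed.

```python
from datetime import datetime, timezone

# Oozie's view of the sub-workflow actions (from the listing above):
# action name -> (Oozie status, external YARN application id)
oozie_actions = {
    "fork21": ("RUNNING", "application_1706187939089_147514"),
    "fork22": ("RUNNING", "application_1706187939089_147519"),
    "fork18": ("RUNNING", "application_1706187939089_147518"),
}

# YARN's view (from the application report above). Reports for the
# other two applications were not posted, so they are absent here.
yarn_reports = {
    "application_1706187939089_147514": {
        "state": "FINISHED",
        "final_state": "SUCCEEDED",
        "start_ms": 1706259786568,
        "finish_ms": 1706259853156,
    },
}

# YARN application states after which no further progress is possible.
TERMINAL_YARN_STATES = {"FINISHED", "FAILED", "KILLED"}


def to_utc(epoch_ms):
    """Convert YARN's epoch-millisecond timestamps to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def find_stale_actions(actions, reports):
    """Return actions Oozie shows as RUNNING whose YARN app already ended."""
    stale = []
    for name, (status, app_id) in actions.items():
        report = reports.get(app_id)
        if report is None:
            continue  # no YARN report available, nothing to compare
        if status == "RUNNING" and report["state"] in TERMINAL_YARN_STATES:
            stale.append((name, app_id, report["final_state"]))
    return stale


if __name__ == "__main__":
    for name, app_id, final_state in find_stale_actions(oozie_actions, yarn_reports):
        report = yarn_reports[app_id]
        duration_s = (report["finish_ms"] - report["start_ms"]) / 1000
        print(name, app_id, final_state,
              to_utc(report["start_ms"]).isoformat(),
              to_utc(report["finish_ms"]).isoformat(),
              duration_s)
```

With the values from this ticket, the sketch flags fork21: its launcher started at 09:03:06 UTC and finished at 09:04:13 UTC on 2024-01-26, about 66.6 seconds later, while Oozie still lists the action as RUNNING.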
The problem began to appear more often after tuning HA. The only workaround
is to reduce the load and restart the application, but that is not an
acceptable solution for me.
Nothing in the launcher or server logs indicates that something is going wrong.
Does anyone have an idea why this behavior occurs?
> Subsidiaries freeze in the status of "RUNNING" during a high load on the
> cluster
> --------------------------------------------------------------------------------
>
> Key: OOZIE-3721
> URL: https://issues.apache.org/jira/browse/OOZIE-3721
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: 5.2.0
> Reporter: Cecily Myles
> Priority: Blocker
>