Cecily Myles created OOZIE-3721:
-----------------------------------
Summary: Subworkflows hang in the "RUNNING" status under high load
on the cluster
Key: OOZIE-3721
URL: https://issues.apache.org/jira/browse/OOZIE-3721
Project: Oozie
Issue Type: Bug
Components: core
Affects Versions: 5.2.0
Reporter: Cecily Myles
When my cluster is under heavy load, subworkflows hang in the "RUNNING"
status. I first hit this error while working with Hive tables, but I was also
able to reproduce it by running the standard pi calculation in many
subworkflows to simulate the load.
I launch an Oozie workflow with the following structure:
{code:java}
-- Oozie workflow
------> subworkflow_1
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
------> subworkflow_2
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n {code}
One of the forks shows the "RUNNING" status, but if you open that fork, it
shows the "SUCCESS" status.
Parent workflow:
{code:java}
Job ID : 0061971-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path : hdfs://mycluster:8020/user/cecyl/subwf/job
Status : RUNNING
Run : 0
User : cecyl
Group : -
Created : 2024-01-25 15:55 GMT
Started : 2024-01-25 15:55 GMT
Last Modified : 2024-01-30 06:24 GMT
Ended : -
CoordAction ID: -
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork     OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork7    OK       0067643-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork9    OK       0067640-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork10   RUNNING  0067641-240125161152217-oozie-oozi-W  RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork5    OK       0067645-240125161152217-oozie-oozi-W  SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
{code}
Running subworkflow:
{code:java}
Job ID : 0067641-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path : hdfs://mycluster:8020/user/cecyl/subwf
Status : RUNNING
Run : 0
User : cecyl
Group : -
Created : 2024-01-26 04:20 GMT
Started : 2024-01-26 04:20 GMT
Last Modified : 2024-01-26 08:23 GMT
Ended : -
CoordAction ID: 0061971-240125161152217-oozie-oozi-W
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status   Ext ID                                Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@:start:  OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork     OK       -                                     OK          -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork21   RUNNING  application_1706187939089_147514      RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork22   RUNNING  application_1706187939089_147519      RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork18   RUNNING  application_1706187939089_147518      RUNNING     -
------------------------------------------------------------------------------------------------------------------------------------
{code}
However, the application that Oozie lists as running has State "FINISHED" and Final-State "SUCCEEDED" in YARN:
{code:java}
Application Report :
Application-Id : application_1706187939089_147514
Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
Application-Type : Oozie Launcher
User : cecyl
Queue : default
Application Priority : 0
Start-Time : 1706259786568
Finish-Time : 1706259853156
Progress : 100%
State : FINISHED
Final-State : SUCCEEDED {code}
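The mismatch above can be cross-checked mechanically. What follows is only a minimal sketch in Python, not part of Oozie: the function name `find_stale_actions` and the dictionary layout are hypothetical, and the values are copied by hand from the listings in this report. Only fork21 has an application report here, so fork22 and fork18 are skipped. The sketch flags actions that Oozie lists as RUNNING while YARN reports the application as FINISHED, and converts the epoch-millisecond timestamps to UTC.

```python
from datetime import datetime, timezone

# Action name -> (YARN application ID, status shown by Oozie).
# Values copied from the "Running subworkflow" listing in this report.
oozie_actions = {
    "fork21": ("application_1706187939089_147514", "RUNNING"),
    "fork22": ("application_1706187939089_147519", "RUNNING"),
    "fork18": ("application_1706187939089_147518", "RUNNING"),
}

# YARN application reports keyed by application ID.
# Only the report for fork21 is included in this issue.
yarn_reports = {
    "application_1706187939089_147514": {
        "state": "FINISHED",
        "final_state": "SUCCEEDED",
        "start_ms": 1706259786568,
        "finish_ms": 1706259853156,
    },
}


def find_stale_actions(actions, reports):
    """Return actions Oozie shows as RUNNING although YARN reports them FINISHED."""
    stale = []
    for name, (app_id, oozie_status) in actions.items():
        report = reports.get(app_id)
        if report is None:
            continue  # no YARN report available for this action
        if oozie_status == "RUNNING" and report["state"] == "FINISHED":
            # YARN timestamps are milliseconds since the Unix epoch.
            finished = datetime.fromtimestamp(report["finish_ms"] / 1000, tz=timezone.utc)
            runtime_s = (report["finish_ms"] - report["start_ms"]) / 1000
            stale.append((name, app_id, report["final_state"], finished, runtime_s))
    return stale


for name, app_id, final_state, finished, runtime_s in find_stale_actions(oozie_actions, yarn_reports):
    print(f"{name} ({app_id}): Oozie=RUNNING, YARN={final_state}, "
          f"finished {finished:%Y-%m-%d %H:%M:%S} UTC after {runtime_s:.1f} s")
```

In practice the two inputs would presumably come from `oozie job -info` and `yarn application -status`; parsing that output is left out of this sketch.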
The problem began to appear more often after configuring HA. The only
workaround is to reduce the load and restart the application, but that is not
an acceptable solution for me.
There are no signs in the launcher or server logs that anything is going wrong.
Does anyone have an idea what could cause this behavior?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)