Robert Kanter created OOZIE-1978:
------------------------------------
Summary: Forkjoin validation code is ridiculously slow in some
cases
Key: OOZIE-1978
URL: https://issues.apache.org/jira/browse/OOZIE-1978
Project: Oozie
Issue Type: Bug
Components: core
Affects Versions: 4.0.1, trunk
Reporter: Robert Kanter
Fix For: trunk
Attachments: workflow.xml
We've had a few users who have run into problems where submitting a workflow
appears to hang (in the case of a subworkflow, it's similar but stuck in PREP).
It turns out that if you wait long enough, it will actually go through and the
workflow will run normally. The problem is that the forkjoin validation code
is taking a really long time.
The attached example has a series of 20 forks where each fork has 6 actions
(it's based on an actual workflow, but all of the names were changed and the
actions were all replaced by simple shell actions). One of our support guys
said it took 1-2 hours, but on my computer it was taking {color:red}*15+
hours*{color} (I had to cancel it).
While this example doesn't have any nested forks, those can take a long
time too.
It's easy to verify that it's the forkjoin validation code that's taking so
long by looking at a jstack of the Oozie server and seeing deep recursive calls
to {{org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin}}.
I also noticed a lot of time being spent in calls to {{LinkedList.contains}}.
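To give a feel for why a chain of forks could blow up like this, here is a rough sketch. It is not the Oozie code: the class {{ForkChainPaths}} and its methods are made up for illustration, and it rests on an unconfirmed assumption that a recursive traversal ends up walking every distinct path through the workflow. Under that assumption, 20 forks in series with 6 actions each gives 6^20 paths, while remembering the result per node visits each node only once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Oozie.
public class ForkChainPaths {

    // Builds fork0 -> {action0_0 .. action0_(n-1)} -> join0 -> fork1 -> ... -> end
    static Map<String, List<String>> buildChain(int forks, int actionsPerFork) {
        Map<String, List<String>> graph = new HashMap<>();
        for (int f = 0; f < forks; f++) {
            String join = "join" + f;
            List<String> branches = new ArrayList<>();
            for (int a = 0; a < actionsPerFork; a++) {
                String action = "action" + f + "_" + a;
                branches.add(action);
                graph.put(action, List.of(join));
            }
            graph.put("fork" + f, branches);
            String next = (f + 1 < forks) ? "fork" + (f + 1) : "end";
            graph.put(join, List.of(next));
        }
        graph.put("end", List.of());
        return graph;
    }

    // Naive recursion: does work proportional to the number of distinct paths.
    static long countPathsNaive(Map<String, List<String>> graph, String node) {
        List<String> next = graph.get(node);
        if (next.isEmpty()) {
            return 1;
        }
        long total = 0;
        for (String n : next) {
            total += countPathsNaive(graph, n);
        }
        return total;
    }

    // Memoized recursion: each node is expanded once, then served from the cache.
    static long countPathsMemo(Map<String, List<String>> graph, String node,
                               Map<String, Long> memo) {
        Long cached = memo.get(node);
        if (cached != null) {
            return cached;
        }
        List<String> next = graph.get(node);
        long total = next.isEmpty() ? 1 : 0;
        for (String n : next) {
            total += countPathsMemo(graph, n, memo);
        }
        memo.put(node, total);
        return total;
    }

    public static void main(String[] args) {
        // Small enough for the naive version: 6^5 = 7776 paths.
        System.out.println(countPathsNaive(buildChain(5, 6), "fork0"));
        // Same shape as the attached workflow (20 forks x 6 actions): 6^20 paths,
        // counted here without walking them one by one.
        System.out.println(countPathsMemo(buildChain(20, 6), "fork0", new HashMap<>()));
    }
}
```

The naive version is only called on the 5-fork chain here; on the 20-fork chain it would have to make on the order of 6^20 (about 3.7 * 10^15) recursive calls, which would be consistent with the multi-hour runtimes if the validation behaves in a similar way.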
I think we have 3 options:
# See if we can make the existing code faster somehow. Perhaps there's a way
to parallelize it? Maybe there's some redundant checking that we can identify
and skip? Change some data structures? etc.
# See if we can write a new way to do this validation. I had originally
completely rewritten this code a while ago, and we've since made a few fixes to
catch edge cases and things. Perhaps it needs another rewrite?
# Try to identify when it's taking a long time and at least let the user know
what's happening. Right now, it just appears that the Oozie CLI
has hung and the job doesn't show up in the Oozie server. Most users aren't
going to wait more than a minute or two.
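For the "change some data structures" part of option 1, one possible direction is sketched below. It assumes the {{LinkedList.contains}} calls are membership checks on a list of already-seen node names, which I haven't confirmed; the {{VisitedPath}} class is hypothetical and not part of Oozie. The idea is to keep the ordered path but back it with a {{HashSet}}, so that each membership check no longer scans the whole list.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not an existing Oozie class.
public class VisitedPath {
    // Keeps insertion order, so the path can still be unwound in reverse.
    private final Deque<String> order = new ArrayDeque<>();
    // Mirrors the deque contents for constant-time (on average) lookups.
    private final Set<String> members = new HashSet<>();

    // Adds the node to the end of the path; returns false if it was already present.
    public boolean push(String node) {
        if (!members.add(node)) {
            return false;
        }
        order.addLast(node);
        return true;
    }

    // Removes and returns the most recently added node.
    public String pop() {
        String node = order.removeLast();
        members.remove(node);
        return node;
    }

    // Hash lookup instead of the linear scan done by LinkedList.contains.
    public boolean contains(String node) {
        return members.contains(node);
    }

    public int size() {
        return order.size();
    }
}
```

This only helps with the cost of each individual check. If the number of recursive calls itself is the real problem, it would need to be combined with skipping redundant work.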