[ 
https://issues.apache.org/jira/browse/OOZIE-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064458#comment-14064458
 ] 

Mona Chitnis commented on OOZIE-1938:
-------------------------------------

More context: all actions have completed, some via server 1 and others via 
server 2. 

1) Checking the SignalXCommand code against the WF_ACTIONS table for all 
actions of this job: every action has pending=0. This probably explains why 
they weren't recovered by ActionCheckerRunnable.

2) As each forked action finishes, two signals are sent: signal value OK and 
signal value :sync:. The :sync: is needed to maintain the fork-join count: the 
count is incremented when the initial forks send :sync: and decremented when 
the joins send :sync:. I think that while one of the servers was down, these 
:sync: signals were lost or failed to get processed. We don't see this problem 
in a different scenario, where both servers were up before the actions 
finished and started signaling :sync:.
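
To make the counting argument concrete, here is a rough sketch of the 
bookkeeping described in 2). It is not the Oozie implementation; the class 
ForkJoinSyncSketch and its methods are hypothetical names, and the only 
assumption is that the join may run once the count returns to zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the fork-join :sync: count; not Oozie's actual code. */
public class ForkJoinSyncSketch {
    // Number of forked paths that have started but not yet reported to the join.
    private final AtomicInteger openPaths = new AtomicInteger();

    /** Called once per path when the fork sends :sync:. */
    public void onForkSync() {
        openPaths.incrementAndGet();
    }

    /** Called when a finished path sends :sync: to the join; true if the join may run. */
    public boolean onJoinSync() {
        return openPaths.decrementAndGet() == 0;
    }

    public int openPaths() {
        return openPaths.get();
    }

    public static void main(String[] args) {
        ForkJoinSyncSketch job = new ForkJoinSyncSketch();
        for (int i = 0; i < 3; i++) {
            job.onForkSync();          // fork starts three paths
        }
        job.onJoinSync();              // path 1 reports to the join
        job.onJoinSync();              // path 2 reports to the join
        // The :sync: for path 3 is assumed lost while a server was down.
        System.out.println("open paths = " + job.openPaths()
                + ", join can run = " + (job.openPaths() == 0));
        // prints: open paths = 1, join can run = false
    }
}
```

With one decrement missing, the count never reaches zero, which matches a job 
that stays RUNNING even though every forked action has completed.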

I'm not very confident about changing the way we handle :sync:, so I would 
like to discuss the best approach here. The easier approach would be to set 
the action's pending flag in this process so that recovery will pick up the 
action and help restore the correct :sync: count.
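
A minimal sketch of that idea, assuming recovery only considers actions whose 
pending flag is set (as the pending=0 observation in 1) suggests). The class 
PendingRecoverySketch, the Action holder and the syncDelivered field are 
hypothetical and only illustrate the proposal; they are not Oozie classes or 
columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the pending-flag proposal; not Oozie's code. */
public class PendingRecoverySketch {

    static final class Action {
        final String name;
        int pending;                 // mirrors the pending flag discussed above
        final boolean syncDelivered; // assumed marker: was the join-side :sync: processed?

        Action(String name, int pending, boolean syncDelivered) {
            this.name = name;
            this.pending = pending;
            this.syncDelivered = syncDelivered;
        }
    }

    /** Recovery is assumed to skip every action with pending == 0. */
    static List<String> recoverable(List<Action> actions) {
        List<String> names = new ArrayList<>();
        for (Action a : actions) {
            if (a.pending > 0) {
                names.add(a.name);
            }
        }
        return names;
    }

    /** Proposed change: keep an action pending until its :sync: has been processed. */
    static void markPendingUntilSynced(List<Action> actions) {
        for (Action a : actions) {
            if (!a.syncDelivered) {
                a.pending = 1;
            }
        }
    }

    public static void main(String[] args) {
        List<Action> actions = Arrays.asList(
                new Action("path-a", 0, true),
                new Action("path-b", 0, false)); // :sync: assumed lost during failover
        System.out.println(recoverable(actions)); // prints: []
        markPendingUntilSynced(actions);
        System.out.println(recoverable(actions)); // prints: [path-b]
    }
}
```

The open question is where in the signal handling the flag would actually be 
set and cleared; the sketch only shows why a non-zero flag makes the action 
visible to recovery again.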

Feedback/corrections?

> Fork-join job does not execute join node sometimes during HA failover
> ---------------------------------------------------------------------
>
>                 Key: OOZIE-1938
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1938
>             Project: Oozie
>          Issue Type: Bug
>          Components: HA
>    Affects Versions: trunk
>            Reporter: Mona Chitnis
>             Fix For: trunk
>
>
> Reported by [~mchiang].
> Scenario: (2 Oozie HA servers)
> 21:38:56 submit job at oozie client
> 21:41:42 shut down server1
> 21:46:52 shut down server2
> 21:47:30 start server1
> 22:15:05 start server2
> the last fork path end time is 21:52:53.
> 22:36:48 the job is still RUNNING, not moving to join node.
> Digging into the logs, the locking part seems to work fine with forked action 
> processing distributed amongst the two servers when both running or when one 
> of them is down. The issue seems to be why even RecoveryService fails to pick 
> up the job after all the forks had completed



--
This message was sent by Atlassian JIRA
(v6.2#6252)
