from:"Peter Bacsko \(JIRA\)"

[jira] [Commented] (OOZIE-3578) MapReduce counters cannot be used over 120

2020-01-08 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010531#comment-17010531
 ] 

Peter Bacsko commented on OOZIE-3578:
-

Thanks [~dionusos], IMO the patch is good.

Just a couple of nits:
 # Instead of hardcoding "mapreduce.job.counters.max", you can use 
{{MRJobConfig.COUNTERS_MAX_KEY}}
 # Instead of {{get("mapreduce.job.counters.max")}}, you can use {{getInt()}} 
so you don't need {{Integer.parseInt()}}


> MapReduce counters cannot be used over 120
> --
>
> Key: OOZIE-3578
> URL: https://issues.apache.org/jira/browse/OOZIE-3578
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 5.1.0
>Reporter: Denes Bodo
>Assignee: Denes Bodo
>Priority: Critical
> Attachments: OOZIE-3578-001.patch, OOZIE-3578-002.patch
>
>
> When a MapReduce action creates more than 120 counters, the following 
> exception is thrown:
> {noformat}
> org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101)
> org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108)
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78)
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95)
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounterImpl(AbstractCounterGroup.java:123)
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:113)
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:130)
> org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:155)
> org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:264)
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:383)
> org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859)
> org.apache.hadoop.mapreduce.Job$8.run(Job.java:820)
> org.apache.hadoop.mapreduce.Job$8.run(Job.java:817)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:422)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
> org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817)
> org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:379)
> org.apache.oozie.action.hadoop.MapReduceActionExecutor.end(MapReduceActionExecutor.java:252)
> org.apache.oozie.command.wf.ActionEndXCommand.execute(ActionEndXCommand.java:183)
> org.apache.oozie.command.wf.ActionEndXCommand.execute(ActionEndXCommand.java:62)
> org.apache.oozie.command.XCommand.call(XCommand.java:291)
> org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:244)
> org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:56)
> org.apache.oozie.command.XCommand.call(XCommand.java:291)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:210)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> It turned out that when Oozie is used with Hadoop 3, the MR class {{Limits}} 
> is not initialised properly and falls back to its default values: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/counters/Limits.java#L40
> Setting "mapreduce.job.counters.max" to 500 in mapred-site.xml or in 
> core-site.xml has no effect.
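To illustrate the failure mode above, here is a minimal, self-contained Java sketch. It does not use Hadoop: {{CounterLimitSketch}}, {{init()}}, {{getInt()}} and {{checkCounters()}} are hypothetical stand-ins for a JVM-wide counter limit (like the one held by Hadoop's {{Limits}} class) that keeps its default of 120 until something initialises it from the configuration. Initialising the limit before counters are read is an assumed fix direction, not necessarily what the attached patches do; the {{getInt()}}-style lookup follows the review nit about avoiding {{Integer.parseInt(get(...))}}.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a JVM-wide counter limit such as Hadoop's Limits
// class. This is NOT Hadoop code; it only shows why a limit that is never
// initialised from the job configuration stays at its default.
class CounterLimitSketch {
    // Property name and default value as quoted in the issue.
    static final String COUNTERS_MAX_KEY = "mapreduce.job.counters.max";
    static final int COUNTERS_MAX_DEFAULT = 120;

    // Static state shared by the whole JVM, like a class-level limit.
    private static volatile int countersMax = COUNTERS_MAX_DEFAULT;

    // Typed lookup with a default, instead of Integer.parseInt(conf.get(key)).
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Must be called with the job's configuration, otherwise the default applies.
    static synchronized void init(Map<String, String> conf) {
        countersMax = getInt(conf, COUNTERS_MAX_KEY, COUNTERS_MAX_DEFAULT);
    }

    // Throws once the number of counters exceeds the current limit.
    static void checkCounters(int count) {
        if (count > countersMax) {
            throw new IllegalStateException(
                    "Too many counters: " + count + " max=" + countersMax);
        }
    }

    // Simulates creating n counters one after another.
    static void createCounters(int n) {
        for (int i = 1; i <= n; i++) {
            checkCounters(i);
        }
    }

    public static void main(String[] args) {
        try {
            createCounters(121);
            System.out.println("121 counters without init: ok");
        } catch (IllegalStateException e) {
            System.out.println("121 counters without init: " + e.getMessage());
        }

        Map<String, String> conf = new HashMap<>();
        conf.put(COUNTERS_MAX_KEY, "500");
        init(conf);
        createCounters(121);
        System.out.println("121 counters after init: ok");
    }
}
```

Running {{main()}} first reports "Too many counters: 121 max=120", and succeeds once {{init()}} has been called with a configuration that sets the limit to 500.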



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-27 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983414#comment-16983414
 ] 

Peter Bacsko commented on OOZIE-3561:
-

Good to hear.

[~asalamon74] please review patch v3 and commit it if you think it's good.

> Forkjoin validation is slow when there are many actions in chain
> 
>
> Key: OOZIE-3561
> URL: https://issues.apache.org/jira/browse/OOZIE-3561
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 5.1.0
>Reporter: Denes Bodo
>Assignee: Denes Bodo
>Priority: Critical
>  Labels: performance
> Attachments: OOZIE-3561-002.patch, OOZIE-3561-003.patch, 
> OOZIE-3561_001.patch
>
>
> If we have a workflow with, let's say, 80 actions one after another:
> {{a1 -> a2 -> ... a80}}
> then the validator code "never" finishes.
> Currently the validation (in my understanding) does depth-first checks from 
> the start node and runs in O(n!) time. This is confirmed by the fact that when 
> we split this huge workflow into two 40-element workflows, we get 2x ~40! steps 
> in validation instead of ~80! steps.
> Guys, could you please confirm or disprove my theory?
> Thanks




[jira] [Updated] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-26 Thread Peter Bacsko (Jira)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3561:

Attachment: (was: OOZIE-3561-004.patch)


[jira] [Updated] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-26 Thread Peter Bacsko (Jira)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3561:

Attachment: OOZIE-3561-003.patch


[jira] [Commented] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-25 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981480#comment-16981480
 ] 

Peter Bacsko commented on OOZIE-3561:
-

_"Just store the {{NodeDef}} object in the set, not a string. That should 
exhibit the exact same behavior."_

After doing this locally, it turned out that this alone is not sufficient. We 
have to:
 # Move the memoization part a bit
 # Don't store End and Join nodes

Now all existing tests pass. 


[jira] [Updated] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-25 Thread Peter Bacsko (Jira)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3561:

Attachment: OOZIE-3561-002.patch


[jira] [Commented] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-22 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980123#comment-16980123
 ] 

Peter Bacsko commented on OOZIE-3561:
-

[~dionusos] thanks for the patch, I believe this is the approach that we need.
As we discussed in person, let's improve this further:

1. Just store the {{NodeDef}} object in the set, not a string. That should 
exhibit the exact same behavior.
2. Call the set something like "seenNodes" or "visitedNodes".
3. Next week, let's come up with some more edge cases, e.g. "errorTo" of a node 
inside a fork points to a node which is located in another fork.


[jira] [Comment Edited] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-21 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979310#comment-16979310
 ] 

Peter Bacsko edited comment on OOZIE-3561 at 11/22/19 6:59 AM:
---

I refactored the validator 3 years ago, so I had to check again how it works:

1. Basic validation makes sure that the workflow is acyclic. That's definitely 
fast.
2. Fork-join validation: this was trickier. Multiple fork-joins did cause 
problems because paths were re-walked unnecessarily; this had exponential 
runtime with regard to the number of fork-join pairs. However, OOZIE-1978 made 
sure that no unnecessary walks take place by stopping the recursion when we 
encounter a join. 

Right now I don't see what could go wrong.


was (Author: pbacsko):
I refactored the validator 3 years ago, so I had to check again how it works:

1. Basic validation makes sure that the workflow is acyclic. That's definitely 
fast.
2. Fork-join validation: this was trickier. Multiple fork-joins did cause 
problems because paths were re-walked unnecessarily; this had exponential 
runtime with regard to the number of fork-join pairs. However, OOZIE-1978 made 
sure that no unnecessary walks take place by stopping the recursion when we 
encounter a join. 

Right now I don't see what could go wrong.


[jira] [Commented] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-21 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979354#comment-16979354
 ] 

Peter Bacsko commented on OOZIE-3561:
-

So, as we discussed in private, the problem is that the "error" path might lead 
back to the workflow. Usually it's a very short sequence of actions, e.g. 
sending an email and then killing the execution. When the flow is redirected 
back to the "normal" path from an action node, then essentially every 
subsequent node is reachable via two different paths.

So in your example, "a4" is reachable in 8 different ways ([ok, ok, ok], [ok, 
ok, error], [ok, error, ok], ... [error, error, error]). So we have an 
exponential runtime, which is pretty sad. I believe we have to use memoization: 
simply store the nodes that have already been validated. But we have to be 
careful and think about edge cases.
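The effect described above can be reproduced with a small, self-contained Java sketch. It assumes a simplified graph model with a hypothetical {{Node}} class (not Oozie's {{NodeDef}}): every action's "ok" transition goes to the next action, and its "error" transition goes to a one-step error handler that rejoins the chain at the next action. A naive depth-first walk then re-walks every combination of ok/error choices, while keeping a set of already visited node objects walks each node once. It deliberately ignores the real validator's fork/join bookkeeping, including the later finding that End and Join nodes must not be memoized.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ValidationWalkSketch {
    // Hypothetical graph node; NOT Oozie's NodeDef. Identity equals/hashCode,
    // so a Set<Node> stores node objects rather than their names.
    static final class Node {
        final String name;
        final List<Node> next = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Builds a1 -> a2 -> ... -> an -> end. Each action's "error" transition goes
    // to a one-step handler (e.g. "send email") that rejoins the chain at the
    // next action, so every later node is reachable via two routes per action.
    static Node buildChain(int n) {
        Node following = new Node("end");
        for (int i = n; i >= 1; i--) {
            Node action = new Node("a" + i);
            Node errorHandler = new Node("e" + i);
            errorHandler.next.add(following);
            action.next.add(following);    // ok transition
            action.next.add(errorHandler); // error transition
            following = action;
        }
        return following;
    }

    // Naive depth-first walk: every path is re-walked, visits grow like 2^n.
    static long naiveVisits(Node node) {
        long visits = 1;
        for (Node child : node.next) {
            visits += naiveVisits(child);
        }
        return visits;
    }

    // Memoized walk: a node already in visitedNodes is not walked again.
    static long memoizedVisits(Node node, Set<Node> visitedNodes) {
        if (!visitedNodes.add(node)) {
            return 0;
        }
        long visits = 1;
        for (Node child : node.next) {
            visits += memoizedVisits(child, visitedNodes);
        }
        return visits;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 10, 20}) {
            Node start = buildChain(n);
            System.out.println("n=" + n + " naive=" + naiveVisits(start)
                    + " memoized=" + memoizedVisits(start, new HashSet<>()));
        }
    }
}
```

In this model the naive walk makes 3 * 2^n - 2 node visits (22 for n = 3, 3145726 for n = 20), whereas the memoized walk visits each of the 2n + 1 nodes exactly once (7 and 41). That matches the 2^n growth described above rather than n!.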


[jira] [Commented] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-21 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979310#comment-16979310
 ] 

Peter Bacsko commented on OOZIE-3561:
-

I refactored the validator 3 years ago, so I had to check again how it works:

1. Basic validation makes sure that the workflow is acyclic. That's definitely 
fast.
2. Fork-join validation: this was trickier. Multiple fork-joins did cause 
problems because paths were re-walked unnecessarily; this had exponential 
runtime with regard to the number of fork-join pairs. However, OOZIE-1978 made 
sure that no unnecessary walks take place by stopping the recursion when we 
encounter a join. 

Right now I don't see what could go wrong.


[jira] [Commented] (OOZIE-3561) Forkjoin validation is slow when there are many actions in chain

2019-11-21 Thread Peter Bacsko (Jira)



[ 
https://issues.apache.org/jira/browse/OOZIE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979305#comment-16979305
 ] 

Peter Bacsko commented on OOZIE-3561:
-

[~dionusos] I don't exactly understand the theory. In your example, you have a 
graph of 80 nodes, which is basically a list without forks. There's no way that 
the runtime is O(n!). What do the nodes represent in your example? What is 
"a1", "a2", etc?


[jira] [Updated] (OOZIE-2586) Eliminate Thread.sleep() calls from TestCoordPushDependencyCheckXCommand

2019-09-12 Thread Peter Bacsko (Jira)



 [ 
https://issues.apache.org/jira/browse/OOZIE-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2586:

Parent: OOZIE-3111
Issue Type: Sub-task  (was: Bug)

> Eliminate Thread.sleep() calls from TestCoordPushDependencyCheckXCommand
> 
>
> Key: OOZIE-2586
> URL: https://issues.apache.org/jira/browse/OOZIE-2586
> Project: Oozie
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
>
> In the test class TestCoordPushDependencyCheckXCommand, there are a couple of 
> Thread.sleep(100) calls to wait for certain state transitions. On slower 
> machines, this is not long enough.
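A common way to replace a fixed {{Thread.sleep(100)}} is to poll for the expected condition with a generous upper bound, so fast machines continue immediately and slow machines get more time. The sketch below is self-contained and assumes nothing about Oozie's test utilities; {{WaitForSketch.waitFor()}} is a hypothetical helper, and the 300 ms "transition" in {{main()}} is simulated.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class WaitForSketch {
    // Polls the condition every pollMillis until it is true or timeoutMillis
    // has elapsed. Returns whether the condition was met.
    static boolean waitFor(long timeoutMillis, long pollMillis, BooleanSupplier condition) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out; the caller should fail with a clear message
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated asynchronous transition that takes ~300 ms, i.e. longer than
        // a fixed Thread.sleep(100) would allow for.
        long start = System.nanoTime();
        BooleanSupplier transitioned =
                () -> System.nanoTime() - start >= TimeUnit.MILLISECONDS.toNanos(300);
        System.out.println("transition observed: " + waitFor(5_000, 20, transitioned));
    }
}
```

The timeout only bounds the worst case; on a fast machine the wait ends as soon as the condition holds.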



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Comment Edited] (OOZIE-3512) Flaky test TestActionStartXCommand.testActionWithEscapedStringAndCDATA

2019-07-22 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890154#comment-16890154
 ] 

Peter Bacsko edited comment on OOZIE-3512 at 7/22/19 3:37 PM:
--

Usually an application stays in ACCEPTED state if there are not enough 
resources (vcores / memory). Another problem is when a node manager becomes 
UNHEALTHY: we set up the Mini YARN cluster with a single NM, so if that 
happens, we can't run any applications. This happened many times before due to 
disk space issues: the disk checker detected a low amount of free space and 
marked the NM as "unhealthy", so we ended up with 0 NMs. This was addressed by 
raising the threshold to 99% or 100%. Anyway, I'd examine the RM or NM output 
from the Mini cluster to see why this wasn't scheduled.


was (Author: pbacsko):
Usually an application stays in ACCEPTED state if there are not enough 
resources (vcores / memory). Another problem is when a node manager becomes 
UNHEALTHY: we set up the Mini YARN cluster with a single NM, so if that 
happens, we can't run any applications. This happened many times before due to 
disk space issues: the disk checker detected a low amount of free space and 
marked the NM as "unhealthy", so we ended up with 0 NMs. This was addressed by 
raising the threshold to 99% or 100%. Anyway, I'd examine the RM or NM output 
from the Mini cluster to see why this wasn't scheduled.

> Flaky test TestActionStartXCommand.testActionWithEscapedStringAndCDATA
> --
>
> Key: OOZIE-3512
> URL: https://issues.apache.org/jira/browse/OOZIE-3512
> Project: Oozie
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: trunk
>Reporter: Andras Salamon
>Assignee: duan xiong
>Priority: Major
>
> {{TestActionStartXCommand.testActionWithEscapedStringAndCDATA}} is flaky, 
> sometimes (for instance: 
> https://issues.apache.org/jira/browse/OOZIE-3470?focusedCommentId=16817901&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16817901
>  ) it fails with the following error message:
> {noformat}junit.framework.AssertionFailedError: YARN App state for app 
> application_1559489642789_0018 expected: but was:
>   at junit.framework.Assert.fail(Assert.java:57)
>   at junit.framework.Assert.failNotEquals(Assert.java:329)
>   at junit.framework.Assert.assertEquals(Assert.java:78)
>   at junit.framework.TestCase.assertEquals(TestCase.java:244)
>   at 
> org.apache.oozie.test.XTestCase.waitUntilYarnAppDoneAndAssertSuccess(XTestCase.java:1358)
>   at 
> org.apache.oozie.command.wf.TestActionStartXCommand.testActionWithEscapedStringAndCDATA(TestActionStartXCommand.java:235)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Commented] (OOZIE-2566) TestCoordActionInputCheckXCommand.testCoordActionInputCheckXCommandUniqueness() is flaky

2019-07-22 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890156#comment-16890156
 ] 

Peter Bacsko commented on OOZIE-2566:
-

[~asalamon74] I don't have a definite answer to these questions right now. It 
could be that there's no added value, but we have to double-check this. If it's 
unnecessary, then we can just remove it. Let's examine this together next week.

> TestCoordActionInputCheckXCommand.testCoordActionInputCheckXCommandUniqueness()
>  is flaky
> 
>
> Key: OOZIE-2566
> URL: https://issues.apache.org/jira/browse/OOZIE-2566
> Project: Oozie
>  Issue Type: Sub-task
>  Components: core
>Reporter: Peter Bacsko
>Assignee: Andras Salamon
>Priority: Major
> Attachments: OOZIE-2566-01.patch
>
>
> The testcase testCoordActionInputCheckXCommandUniqueness is unstable.
> We add three XCommands with the same actionId (entityKeys are different) into 
> the CallableQueueService. Only the first XCommand is expected to run.
> The reason why either the 2nd or 3rd XCommand sometimes executes is that, 
> as soon as the first one starts to run, it is removed from the 
> {{uniqueCallables}} map immediately. If the first scheduled task runs quickly, 
> then either the 2nd or 3rd XCommand has the chance to get scheduled.
> Step by step:
> 1. Schedule first XCommand
> 2. XCommand is added to {{uniqueCallables}}
> 3. Schedule second XCommand
> 4. First XCommand starts to run in the thread pool and removes itself from 
> {{uniqueCallables}} (see {{CallableWrapper.run()}})
> 5. Second XCommand can successfully add itself to {{uniqueCallables}}
> 6. Second XCommand starts to run
> Please clarify whether this is the expected behavior of CallableQueueService.
> If not, then moving {{removeFromUniqueCallables()}} to the finally block 
> solves the problem.
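The proposed fix can be sketched with plain {{java.util.concurrent}} classes. This is a hypothetical, simplified model, not Oozie's {{CallableQueueService}}: {{UniqueQueueSketch}} keeps a set named after {{uniqueCallables}} and removes a key only in a {{finally}} block after the task has finished, so duplicates submitted while the first command is still running are rejected. Whether rejecting them is the intended semantics is exactly the open question in the issue.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class UniqueQueueSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    // Keys of commands that are queued or still running (hypothetical model).
    private final Set<String> uniqueCallables = ConcurrentHashMap.newKeySet();

    // Queues the task unless a task with the same key is queued or running.
    boolean queue(String key, Runnable task) {
        if (!uniqueCallables.add(key)) {
            return false; // duplicate: dropped
        }
        try {
            pool.execute(() -> {
                try {
                    task.run();
                } finally {
                    // Removing the key only after run() completes means a duplicate
                    // cannot slip in while the first command is still executing.
                    uniqueCallables.remove(key);
                }
            });
        } catch (RuntimeException e) {
            uniqueCallables.remove(key); // e.g. rejected execution: don't leak the key
            throw e;
        }
        return true;
    }

    void shutdownAndWait() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    // Submits the same key three times while the first task runs; returns how many ran.
    static int runScenario() {
        UniqueQueueSketch queue = new UniqueQueueSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger runs = new AtomicInteger();
        Runnable command = () -> {
            runs.incrementAndGet();
            started.countDown();
            try {
                release.await(); // keep the first command "running"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            queue.queue("action-1", command);
            started.await();                  // first command is now executing
            queue.queue("action-1", command); // rejected: key still present
            queue.queue("action-1", command); // rejected: key still present
            release.countDown();
            queue.shutdownAndWait();
        } catch (InterruptedException e) {
            release.countDown();
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("executions: " + runScenario());
    }
}
```

{{runScenario()}} holds the first command on a latch, submits two duplicates while it is running and returns the number of executions, which is 1 with the {{finally}}-based removal.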




[jira] [Commented] (OOZIE-3512) Flaky test TestActionStartXCommand.testActionWithEscapedStringAndCDATA

2019-07-22 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890154#comment-16890154
 ] 

Peter Bacsko commented on OOZIE-3512:
-

Usually an application stays in ACCEPTED state if there are not enough 
resources (vcores / memory). Another problem is when a node manager becomes 
UNHEALTHY: we set up the Mini YARN cluster with a single NM, so if that 
happens, we can't run any applications. This happened many times before due to 
disk space issues: the disk checker detected a low amount of free space and 
marked the NM as "unhealthy", so we ended up with 0 NMs. This was addressed by 
raising the threshold to 99% or 100%. Anyway, I'd examine the RM or NM output 
from the Mini cluster to see why this wasn't scheduled.


[jira] [Commented] (OOZIE-2566) TestCoordActionInputCheckXCommand.testCoordActionInputCheckXCommandUniqueness() is flaky

2019-07-22 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890073#comment-16890073
 ] 

Peter Bacsko commented on OOZIE-2566:
-

[~asalamon74] can't you replace this 200ms delay with some more reliable 
wait/notify logic? In the past, this kind of static delay caused a lot of 
headaches. I know it solves the problem here, but if there's a better way, we 
had better try that.
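For the 200ms delay discussed above, a wait/notify-style alternative is to let the code under test signal a {{CountDownLatch}} and have the test await it with a timeout. The sketch assumes the component offers some completion hook; {{Worker}} and {{runAndWait()}} are hypothetical names, not existing Oozie classes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchInsteadOfSleepSketch {
    // Hypothetical component that signals a listener when its background work is done.
    static final class Worker {
        private final ExecutorService pool = Executors.newSingleThreadExecutor();
        private volatile String result;

        void start(Runnable onDone) {
            pool.execute(() -> {
                result = "done";
                onDone.run(); // notify waiters instead of letting them guess a delay
            });
        }

        String result() {
            return result;
        }

        void shutdown() {
            pool.shutdown();
        }
    }

    // Waits for the worker's signal (up to timeoutMillis) instead of sleeping a fixed 200 ms.
    static String runAndWait(long timeoutMillis) {
        Worker worker = new Worker();
        CountDownLatch done = new CountDownLatch(1);
        worker.start(done::countDown);
        try {
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new AssertionError("worker did not finish within " + timeoutMillis + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
        return worker.result();
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(5_000));
    }
}
```

Unlike a fixed delay, the test proceeds as soon as the signal arrives and fails with an explicit message if it never does.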


[jira] [Commented] (OOZIE-3478) Oozie needs execute permission on the submitting users home directory

2019-05-06 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833978#comment-16833978
 ] 

Peter Bacsko commented on OOZIE-3478:
-

+1 LGTM

> Oozie needs execute permission on the submitting users home directory
> -
>
> Key: OOZIE-3478
> URL: https://issues.apache.org/jira/browse/OOZIE-3478
> Project: Oozie
>  Issue Type: Bug
>  Components: action, security
>Affects Versions: 5.1.0
>Reporter: Andras Salamon
>Assignee: Andras Salamon
>Priority: Major
> Attachments: OOZIE-3478-01-wip.patch, OOZIE-3478-02.patch
>
>
> On a secure cluster, the oozie user needs execute permission on the submitting 
> user's home directory. The bug affects multiple actions (probably all that 
> are based on JavaActionExecutor). The easiest way to reproduce it is to use a 
> shell action where the {{workflow.xml}} contains the following action:
> {noformat}
> 
> ${resourceManager}
> ${nameNode}
> 
> 
> mapred.job.queue.name
> ${queueName}
> 
> 
> test.sh
> /user/systest/test.sh#test.sh
> 
> 
> 
> 
> 
> {noformat}
> If the directory has the following permissions:
> {noformat}drwx------   - systest supergroup  0 2019-04-16 08:19 
> /user/systest
> {noformat}
> then running the workflow fails with error code JA009 and the following exception:
> {noformat}ozie-oozi-W@shell-node] Error starting action [shell-node]. 
> ErrorType [TRANSIENT], ErrorCode [JA009], Message [JA009: Permission denied: 
> user=oozie, access=EXECUTE, 
> inode="/user/systest":systest:supergroup:drwx------
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:400)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:316)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:243)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:605)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1804)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1822)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:674)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3060)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1151)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:940)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> ]
> org.apache.oozie.action.ActionExecutorException: JA009: Permission denied: 
> user=oozie, access=EXECUTE, 
> inode="/user/systest":systest:supergroup:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:400)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:316)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:243)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:605)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirect

[jira] [Commented] (OOZIE-3478) Oozie needs execute permission on the submitting users home directory

2019-05-03 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832552#comment-16832552
 ] 

Peter Bacsko commented on OOZIE-3478:
-

I suggest using {{UserGroupInformationService.getProxyUser()}}. Oozie is 
already logged in and authenticated at this point, so you don't have to mess 
around with the credentials: Oozie will be able to enter the target directory 
on behalf of the user.

I'd also add some extra comments describing why you're doing this.

> Oozie needs execute permission on the submitting users home directory
> -
>
> Key: OOZIE-3478
> URL: https://issues.apache.org/jira/browse/OOZIE-3478
> Project: Oozie
>  Issue Type: Bug
>  Components: action, security
>Affects Versions: 5.1.0
>Reporter: Andras Salamon
>Assignee: Andras Salamon
>Priority: Major
> Attachments: OOZIE-3478-01-wip.patch
>
>
> On a secure cluster, the oozie user needs execute permission on the submitting 
> user's home directory. The bug affects multiple actions (probably all actions 
> based on JavaActionExecutor). The easiest way to reproduce it is to use a shell 
> action, where the {{workflow.xml}} contains the following action:
> {noformat}
> 
> ${resourceManager}
> ${nameNode}
> 
> 
> mapred.job.queue.name
> ${queueName}
> 
> 
> test.sh
> /user/systest/test.sh#test.sh
> 
> 
> 
> 
> 
> {noformat}
> If the directory has the following permissions:
> {noformat}drwx--   - systest supergroup  0 2019-04-16 08:19 
> /user/systest
> {noformat}
> then running the workflow gives JA009 error code with the following exception:
> {noformat}ozie-oozi-W@shell-node] Error starting action [shell-node]. 
> ErrorType [TRANSIENT], ErrorCode [JA009], Message [JA009: Permission denied: 
> user=oozie, access=EXECUTE, 
> inode="/user/systest":systest:supergroup:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:400)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:316)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:243)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:605)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1804)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1822)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:674)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3060)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1151)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:940)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> ]
> org.apache.oozie.action.ActionExecutorException: JA009: Permission denied: 
> user=oozie, access=EXECUTE, 
> inode="/user/systest":systest:supergroup:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:400)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:316)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:243)
> at 
> org.apache.hadoop.hdf

[jira] [Comment Edited] (OOZIE-3350) Forkjoin validation does not fail if a node is reachable from two different forks

2018-12-17 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723087#comment-16723087
 ] 

Peter Bacsko edited comment on OOZIE-3350 at 12/17/18 3:37 PM:
---

[~andras.piros] basically the idea is what I outlined above: for every {{fork}} 
node, we maintain a list of nodes seen so far. So, we introduce a map like 
{{Map>}}. Every node which is not a fork should 
belong to one and only one {{ForkNodeDef}}. If it belongs to two, then we have 
a problem.
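A minimal sketch of the fork-to-seen-nodes idea, assuming a simplified graph model (node name mapped to the names of its successors) instead of Oozie's {{NodeDef}}/{{ForkNodeDef}} classes. It walks forward from each fork, stops at joins, and records which fork each node was first reached from; a node reached from a second fork is reported. Simplification: a nested fork's body is attributed only to that nested fork, and the walk from the outer fork does not continue past the nested fork's join.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ForkOwnershipCheck {

    /**
     * Returns the first node found to be reachable from two different forks, or null if every
     * non-fork node belongs to at most one fork.
     */
    static String findNodeInTwoForks(Map<String, List<String>> transitions,
                                     Set<String> forks, Set<String> joins) {
        Map<String, String> ownerFork = new HashMap<>(); // node -> fork it was first seen from
        for (String fork : forks) {
            Deque<String> toVisit = new ArrayDeque<>(transitions.getOrDefault(fork, List.of()));
            Set<String> visited = new HashSet<>();
            while (!toVisit.isEmpty()) {
                String node = toVisit.pop();
                // Joins close the fork's region; nested forks are walked in their own iteration
                if (joins.contains(node) || forks.contains(node) || !visited.add(node)) {
                    continue;
                }
                String previous = ownerFork.putIfAbsent(node, fork);
                if (previous != null && !previous.equals(fork)) {
                    return node; // the node belongs to two different forks
                }
                toVisit.addAll(transitions.getOrDefault(node, List.of()));
            }
        }
        return null;
    }
}
```

For the nested-fork case discussed on this issue (fork1 branching to A and fork2, with C reachable both from A and from B under fork2), the method returns "C" regardless of the order in which the forks are visited.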


was (Author: pbacsko):
[~andras.piros] basically the idea is what I outlined above: for every {{fork}} 
node, we maintain a list of nodes seen so far. So, we introduce a map like 
{{Map}}. Every node which is not a fork should belong 
to one and only one {{ForkNodeDef}}. If it belongs to two, then we have a 
problem.

> Forkjoin validation does not fail if a node is reachable from two different 
> forks
> -
>
> Key: OOZIE-3350
> URL: https://issues.apache.org/jira/browse/OOZIE-3350
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 4.3.1
>Reporter: wang jinyin
>Assignee: Julia Kinga Marton
>Priority: Major
> Fix For: trunk
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> When multiple "ok" transitions under a decision node lead to the same node, 
> fork/join validation does not fail as expected.
>  
> In the example below, 'action_C' and 'action_D' both transition to 
> 'action_E'.
> But because they are under the same topDecisionParent, 'decision_A', the 
> validator does not throw any exception.
>  
> {quote}
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> {quote}




[jira] [Commented] (OOZIE-3350) Forkjoin validation does not fail if a node is reachable from two different forks

2018-12-17 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723087#comment-16723087
 ] 

Peter Bacsko commented on OOZIE-3350:
-

[~andras.piros] basically the idea is what I outlined above: for every {{fork}} 
node, we maintain a list of nodes seen so far. So, we introduce a map like 
{{Map}}. Every node which is not a fork should belong 
to one and only one {{ForkNodeDef}}. If it belongs to two, then we have a 
problem.

> Forkjoin validation does not fail if a node is reachable from two different 
> forks
> -
>
> Key: OOZIE-3350
> URL: https://issues.apache.org/jira/browse/OOZIE-3350
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 4.3.1
>Reporter: wang jinyin
>Assignee: Julia Kinga Marton
>Priority: Major
> Fix For: trunk
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> When multiple "ok" transitions under a decision node lead to the same node, 
> fork/join validation does not fail as expected.
>  
> In the example below, 'action_C' and 'action_D' both transition to 
> 'action_E'.
> But because they are under the same topDecisionParent, 'decision_A', the 
> validator does not throw any exception.
>  
> {quote}
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> {quote}




[jira] [Commented] (OOZIE-3252) Flaky test TestPurgeXCommand#testPurgeBundleWithCoordChildWithWFChild1

2018-12-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709304#comment-16709304
 ] 

Peter Bacsko commented on OOZIE-3252:
-

[~asalamon74] I saw this failure maybe only once during a test run, so it could 
be difficult to reproduce. The order of tests might also play a role; maybe 
it's not even possible to reproduce it in a distributed test environment. 

> Flaky test TestPurgeXCommand#testPurgeBundleWithCoordChildWithWFChild1
> --
>
> Key: OOZIE-3252
> URL: https://issues.apache.org/jira/browse/OOZIE-3252
> Project: Oozie
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The test case TestPurgeXCommand#testPurgeBundleWithCoordChildWithWFChild1 
> failed with the following error:
> {noformat}
> [ERROR] 
> testPurgeBundleWithCoordChildWithWFChild1(org.apache.oozie.command.TestPurgeXCommand)
>   Time elapsed: 0.606 s  <<< FAILURE!
> junit.framework.AssertionFailedError: Bundle Job should not have been purged
>   at 
> org.apache.oozie.command.TestPurgeXCommand.testPurgeBundleWithCoordChildWithWFChild1(TestPurgeXCommand.java:1387)
> {noformat}




[jira] [Commented] (OOZIE-3350) Forkjoin validation does not fail if a node is reachable from two different forks

2018-11-29 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703142#comment-16703142
 ] 

Peter Bacsko commented on OOZIE-3350:
-

[~andras.piros] we discussed this with Kinga weeks ago: the root cause is 
well understood, and we already have a proposed solution which should 
do the job. 

> Forkjoin validation does not fail if a node is reachable from two different 
> forks
> -
>
> Key: OOZIE-3350
> URL: https://issues.apache.org/jira/browse/OOZIE-3350
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 4.3.1
>Reporter: wang jinyin
>Assignee: Julia Kinga Marton
>Priority: Major
> Fix For: trunk
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> When multiple "ok" transitions under a decision node lead to the same node, 
> fork/join validation does not fail as expected.
>  
> In the example below, 'action_C' and 'action_D' both transition to 
> 'action_E'.
> But because they are under the same topDecisionParent, 'decision_A', the 
> validator does not throw any exception.
>  
> {quote}
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> {quote}




[jira] [Comment Edited] (OOZIE-3380) TestCoordMaterializeTransitionXCommand failure after DST change date

2018-11-12 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684016#comment-16684016
 ] 

Peter Bacsko edited comment on OOZIE-3380 at 11/12/18 4:15 PM:
---

{noformat}
private static final int TIME_IN_MIN = 60 * 1000;
private static final int TIME_IN_HOURS = TIME_IN_MIN * 60;
private static final int TIME_IN_DAY = TIME_IN_HOURS * 24;
{noformat}

Can't we just use {{java.util.concurrent.TimeUnit}} for such calculations? It's 
so much nicer:
{noformat}
private static final long TIME_IN_MIN = TimeUnit.SECONDS.toMillis(60);
private static final long TIME_IN_HOURS = TimeUnit.HOURS.toMillis(1);
private static final long TIME_IN_DAY = TimeUnit.DAYS.toMillis(1);
{noformat}
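For reference, the {{TimeUnit}} conversions are instance methods on the enum constants (for example {{TimeUnit.SECONDS.toMillis(60)}}), and they return {{long}}, so the constants have to be {{long}} (or explicitly cast). A minimal compilable version follows; the fields are package-private here only so that they can be checked from outside the class.

```java
import java.util.concurrent.TimeUnit;

public class TimeConstants {
    // toMillis(...) returns long, hence the long fields
    static final long TIME_IN_MIN = TimeUnit.SECONDS.toMillis(60);
    static final long TIME_IN_HOURS = TimeUnit.HOURS.toMillis(1);
    static final long TIME_IN_DAY = TimeUnit.DAYS.toMillis(1);

    public static void main(String[] args) {
        // Same values as the hand-written 60 * 1000 based arithmetic
        System.out.println(TIME_IN_MIN);   // 60000
        System.out.println(TIME_IN_HOURS); // 3600000
        System.out.println(TIME_IN_DAY);   // 86400000
    }
}
```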



was (Author: pbacsko):
{noformat}
private static final int TIME_IN_MIN = 60 * 1000;
private static final int TIME_IN_HOURS = TIME_IN_MIN * 60;
private static final int TIME_IN_DAY = TIME_IN_HOURS * 24;
{noformat}

Can't we just use {{java.util.concurrent.TimeUnit}} for such calculations? It's 
so much nicer:
{noformat}
private static final long TIME_IN_MIN = TimeUnit.SECONDS.toMillis(60);
private static final long TIME_IN_HOURS = TimeUnit.HOURS.toMillis(1);
private static final long TIME_IN_DAY = TimeUnit.DAYS.toMillis(1);
{noformat}


> TestCoordMaterializeTransitionXCommand failure after DST change date
> 
>
> Key: OOZIE-3380
> URL: https://issues.apache.org/jira/browse/OOZIE-3380
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: trunk
>Reporter: Andras Salamon
>Assignee: Andras Salamon
>Priority: Major
> Attachments: OOZIE-3380-01.patch, OOZIE-3380-02.patch
>
>
> TestCoordMaterializeTransitionXCommand.testMaterializationLookup failed for 
> OOZIE-3377 and OOZIE-3378. It also fails for the trunk:
> {noformat}
> junit.framework.AssertionFailedError: 
> Expected :Mon Nov 05 17:21:58 CET 2018
> Actual   :Sun Nov 04 17:21:58 CET 2018
>  
>   at junit.framework.Assert.fail(Assert.java:57)
>   at junit.framework.Assert.failNotEquals(Assert.java:329)
>   at junit.framework.Assert.assertEquals(Assert.java:78)
>   at junit.framework.Assert.assertEquals(Assert.java:86)
>   at junit.framework.TestCase.assertEquals(TestCase.java:253)
>   at 
> org.apache.oozie.command.coord.TestCoordMaterializeTransitionXCommand.testMaterializationLookup(TestCoordMaterializeTransitionXCommand.java:691)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at junit.framework.TestCase.runTest(TestCase.java:176)
>   at junit.framework.TestCase.runBare(TestCase.java:141)
>   at junit.framework.TestResult$1.protect(TestResult.java:122)
>   at junit.framework.TestResult.runProtected(TestResult.java:142)
>   at junit.framework.TestResult.run(TestResult.java:125)
>   at junit.framework.TestCase.run(TestCase.java:129)
>   at junit.framework.TestSuite.runTest(TestSuite.java:255)
>   at junit.framework.TestSuite.run(TestSuite.java:250)
>   at 
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {noformat}
> This test uses the following dates for testing:
> {noformat}
> startTime = new Date(new Date().getTime() - TIME_IN_DAY * 3);
> endTime = new Date(startTime.getTime() + TIME_IN_DAY * 3);   
> Date next = new Date(startTime.getTime() + TIME_IN_DAY * 3);
> {noformat}
> The start time is before the DST change date and the end time is after it. 
> If I shift the interval by two days (so that both start and end are after 
> the DST change date), the test works correctly.
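A fixed-length "day" of 86,400,000 ms is not the same as a calendar day when the interval crosses a DST switch. The sketch below uses {{java.time}} with Europe/Berlin as an assumed CET zone (DST there ended on 2018-10-28) and compares adding three days as a fixed number of milliseconds, as {{startTime.getTime() + TIME_IN_DAY * 3}} does, with adding three calendar days. It only illustrates why fixed-length days are fragile around a DST change; it does not by itself explain the one-day difference in the assertion above.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.concurrent.TimeUnit;

public class DstDayDemo {

    // Adds days as a fixed number of milliseconds on the instant timeline
    static ZonedDateTime addFixedDays(ZonedDateTime t, int days) {
        return t.plus(Duration.ofMillis(TimeUnit.DAYS.toMillis(days)));
    }

    // Adds calendar days in the value's own time zone
    static ZonedDateTime addCalendarDays(ZonedDateTime t, int days) {
        return t.plusDays(days);
    }

    public static void main(String[] args) {
        // 2018-10-27 12:00 local time, still in summer time (UTC+2)
        ZonedDateTime start = ZonedDateTime.of(2018, 10, 27, 12, 0, 0, 0, ZoneId.of("Europe/Berlin"));
        System.out.println(addFixedDays(start, 3));    // 2018-10-30T11:00+01:00[Europe/Berlin]
        System.out.println(addCalendarDays(start, 3)); // 2018-10-30T12:00+01:00[Europe/Berlin]
    }
}
```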




[jira] [Commented] (OOZIE-3380) TestCoordMaterializeTransitionXCommand failure after DST change date

2018-11-12 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684016#comment-16684016
 ] 

Peter Bacsko commented on OOZIE-3380:
-

{noformat}
private static final int TIME_IN_MIN = 60 * 1000;
private static final int TIME_IN_HOURS = TIME_IN_MIN * 60;
private static final int TIME_IN_DAY = TIME_IN_HOURS * 24;
{noformat}

Can't we just use {{java.util.concurrent.TimeUnit}} for such calculations? It's 
so much nicer:
{noformat}
private static final long TIME_IN_MIN = TimeUnit.SECONDS.toMillis(60);
private static final long TIME_IN_HOURS = TimeUnit.HOURS.toMillis(1);
private static final long TIME_IN_DAY = TimeUnit.DAYS.toMillis(1);
{noformat}


> TestCoordMaterializeTransitionXCommand failure after DST change date
> 
>
> Key: OOZIE-3380
> URL: https://issues.apache.org/jira/browse/OOZIE-3380
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: trunk
>Reporter: Andras Salamon
>Assignee: Andras Salamon
>Priority: Major
> Attachments: OOZIE-3380-01.patch, OOZIE-3380-02.patch
>
>
> TestCoordMaterializeTransitionXCommand.testMaterializationLookup failed for 
> OOZIE-3377 and OOZIE-3378. It also fails for the trunk:
> {noformat}
> junit.framework.AssertionFailedError: 
> Expected :Mon Nov 05 17:21:58 CET 2018
> Actual   :Sun Nov 04 17:21:58 CET 2018
>  
>   at junit.framework.Assert.fail(Assert.java:57)
>   at junit.framework.Assert.failNotEquals(Assert.java:329)
>   at junit.framework.Assert.assertEquals(Assert.java:78)
>   at junit.framework.Assert.assertEquals(Assert.java:86)
>   at junit.framework.TestCase.assertEquals(TestCase.java:253)
>   at 
> org.apache.oozie.command.coord.TestCoordMaterializeTransitionXCommand.testMaterializationLookup(TestCoordMaterializeTransitionXCommand.java:691)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at junit.framework.TestCase.runTest(TestCase.java:176)
>   at junit.framework.TestCase.runBare(TestCase.java:141)
>   at junit.framework.TestResult$1.protect(TestResult.java:122)
>   at junit.framework.TestResult.runProtected(TestResult.java:142)
>   at junit.framework.TestResult.run(TestResult.java:125)
>   at junit.framework.TestCase.run(TestCase.java:129)
>   at junit.framework.TestSuite.runTest(TestSuite.java:255)
>   at junit.framework.TestSuite.run(TestSuite.java:250)
>   at 
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {noformat}
> This test uses the following dates for testing:
> {noformat}
> startTime = new Date(new Date().getTime() - TIME_IN_DAY * 3);
> endTime = new Date(startTime.getTime() + TIME_IN_DAY * 3);   
> Date next = new Date(startTime.getTime() + TIME_IN_DAY * 3);
> {noformat}
> The start time is before the DST change date and the end time is after it. 
> If I shift the interval by two days (so that both start and end are after 
> the DST change date), the test works correctly.




[jira] [Commented] (OOZIE-3379) Auth token cache file name should include Oozie URL

2018-11-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681400#comment-16681400
 ] 

Peter Bacsko commented on OOZIE-3379:
-

[~zuston] can you create a ReviewBoard review for this patch?

> Auth token cache file name should include Oozie URL
> ---
>
> Key: OOZIE-3379
> URL: https://issues.apache.org/jira/browse/OOZIE-3379
> Project: Oozie
>  Issue Type: Bug
>  Components: client
>Affects Versions: 5.0.0
>Reporter: ZhangJunfan
>Assignee: ZhangJunfan
>Priority: Major
> Attachments: oozie-3379-1.patch, oozie-3379-2.patch, 
> oozie-3379-3.patch, oozie-3379-4.patch
>
>
> We have a program that uses the Oozie client. When the client connects to 
> multiple clusters, the AuthOozieClient class frequently sends requests to the 
> KDC server because the authentication token cache file is invalid (the same 
> file is shared by all clusters).
> This blocks subsequent requests in our program, resulting in unstable 
> services.
> Therefore, the Oozie client's auth token cache file name should include the 
> Oozie URL.
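One way to make the cache file name depend on the Oozie URL is to hash the URL, which also keeps characters such as {{:}} and {{/}} out of the file name. The prefix and the hashing scheme below are hypothetical and are not taken from the attached patches; the sketch only shows that each Oozie server gets its own, stable file name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TokenCacheFileName {

    // Hypothetical prefix, not the name used by the real Oozie client
    private static final String PREFIX = ".oozie-auth-token-";

    /** Builds a per-server cache file name from the SHA-256 hash of the Oozie URL. */
    static String forUrl(String oozieUrl) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(oozieUrl.getBytes(StandardCharsets.UTF_8));
            StringBuilder name = new StringBuilder(PREFIX);
            for (byte b : digest) {
                name.append(String.format("%02x", b)); // hex-encode each byte
            }
            return name.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support SHA-256
            throw new IllegalStateException(e);
        }
    }
}
```

The same URL always maps to the same file, so a token cached for one server is never picked up when talking to another.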




[jira] [Commented] (OOZIE-3377) [docs] Remaining 5.1.0 documentation changes

2018-11-05 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675030#comment-16675030
 ] 

Peter Bacsko commented on OOZIE-3377:
-

+1

> [docs] Remaining 5.1.0 documentation changes
> 
>
> Key: OOZIE-3377
> URL: https://issues.apache.org/jira/browse/OOZIE-3377
> Project: Oozie
>  Issue Type: Task
>  Components: docs
>Affects Versions: 5.1.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Attachments: OOZIE-3377.001.patch
>
>
> Following documentation changes needed for 5.1.0:
>  * link {{DG_FluentJobAPI.md}} from {{index.md}}
>  * extract docs on Git action from {{WorkflowFunctionalSpecification.md}} to 
> its own action extension docs: {{DG_GitActionExtension.md}}




[jira] [Comment Edited] (OOZIE-3350) Forkjoin validation does not fail if a node is reachable from two different forks

2018-10-31 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670061#comment-16670061
 ] 

Peter Bacsko edited comment on OOZIE-3350 at 10/31/18 1:05 PM:
---

This looks like a valid problem.

It's true that we only store the "topDecisionParent" which is not enough if 
forks are involved.

The solution might not be trivial, because we also have to think about nested 
forks, for example (it's a bit lame ASCII drawing):
{noformat}
              +----------+
              |          |
      +-------+  fork1   +-----------+
      |       |          |           |
      |       +----------+           |
      |                              |
      |                              |
   +--v--+                      +----v-----+
   |  A  |                      |          |
   |     |                      |  fork2   +---------+
   +--+--+                      |          |         |
      |                         +----+-----+         |
      |                              |               |
      |                              |               |
      |                         +----v-----+     +---v---+
      |                         |    B     |     |   D   |
      |                         |          |     |       |
   +--v--+                      +----+-----+     +---+---+
   |  C  |                           |               |
   |     |                           |               |
   +--+--+                           |               |
      |                         +----v-----+     +---v---+
      |                         |    C     |     |   E   |
      |                         |          |     |       |
      |                         +----+-----+     +---+---+
      |                              |               |
      |                              |               |
      |                         +----v---------------v---+
      |                         |                        |
      |                         |         join2          |
      |                         |                        |
      |                         +-----------+------------+
      |                                     |
      |                                     |
      |         +----------+                |
      +-------->+          +<---------------+
                |  join1   |
                |          |
                +----------+
{noformat}

Here, "C" is reachable from two different forks as well. We probably have to 
maintain a {{forkNodes <--> seen nodes}} mapping and we have to make sure that 
every node is available only from a single fork.


was (Author: pbacsko):
This looks like a valid problem.

It's true that we only store the "topDecisionParent" which is not enough if 
forks are involved.

The solution might not be trivial, because we also have to think about nested 
forks, for example (it's a bit lame ASCII drawing):
{noformat}
              +----------+
              |          |
      +-------+  fork1   +-----------+
      |       |          |           |
      |       +----------+           |
      |                              |
      |                              |
   +--v--+                      +----v-----+
   |  A  |                      |          |
   |     |                      |  fork2   +---------+
   +--+--+                      |          |         |
      |                         +----+-----+         |
      |                              |               |
      |                              |               |
      |                         +----v-----+     +---v---+
      |                         |    B     |     |   D   |
      |                         |          |     |       |
   +--v--+                      +----+-----+     +---+---+
   |  C  |                           |               |
   |     |                           |               |
   +--+--+                           |               |
      |                         +----v-----+     +---v---+
      |                         |    C     |     |   E   |
      |                         |          |     |       |
      |                         +----+-----+     +---+---+
      |                              |               |
      |                              |               |
      |                         +----v---------------v---+
      |                         |                        |
      |                         |         join2          |
      |                         |                        |
      |                         +-----------+------------+
      |                                     |
      |                                     |
      |         +----------+                |
      +-------->+          +<---------------+
                |  join1   |
                |          |
                +----------+
{noformat}

Here, "C" is reachable from two different forks as well. We probably have to 
maintain a forkNodes<-->availableNodes mapping and we have to make sure that 
every node is available only from a single fork.

> Forkjoin validation does not fail if a node is reachable from two different 
> forks
> -
>
> Key: OOZIE-3350
> URL: https://issues.apache.org/jira/browse/OOZIE-3350
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 4.3.1
>Reporter: wang jinyin
>Priority: Major
> Fix For: trunk
>
>

[jira] [Updated] (OOZIE-3350) Forkjoin validation does not fail if a node is reachable from two different forks

2018-10-31 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3350:

Summary: Forkjoin validation does not fail if a node is reachable from two 
different forks  (was: forkjoin validation error when "multiple ok to same 
node" under decision node)

> Forkjoin validation does not fail if a node is reachable from two different 
> forks
> -
>
> Key: OOZIE-3350
> URL: https://issues.apache.org/jira/browse/OOZIE-3350
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 4.3.1
>Reporter: wang jinyin
>Priority: Major
> Fix For: trunk
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> When multiple "ok" transitions under a decision node lead to the same node, 
> fork/join validation does not fail as expected.
>  
> In the example below, 'action_C' and 'action_D' both transition to 
> 'action_E'.
> But because they are under the same topDecisionParent, 'decision_A', the 
> validator does not throw any exception.
>  
> {quote}
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> {quote}




[jira] [Commented] (OOZIE-3350) Forkjoin validation does not fail if a node is reachable from two different forks

2018-10-31 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670074#comment-16670074
 ] 

Peter Bacsko commented on OOZIE-3350:
-

Changed the title of the JIRA to be more accurate.

> Forkjoin validation does not fail if a node is reachable from two different 
> forks
> -
>
> Key: OOZIE-3350
> URL: https://issues.apache.org/jira/browse/OOZIE-3350
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 4.3.1
>Reporter: wang jinyin
>Priority: Major
> Fix For: trunk
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> When multiple "ok" transitions under a decision node lead to the same node, 
> fork/join validation does not fail as expected.
>  
> In the example below, 'action_C' and 'action_D' both transition to 
> 'action_E'.
> But because they are under the same topDecisionParent, 'decision_A', the 
> validator does not throw any exception.
>  
> {quote}
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> {quote}




[jira] [Commented] (OOZIE-3350) forkjoin validation error when "multiple ok to same node" under decision node

2018-10-31 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670069#comment-16670069
 ] 

Peter Bacsko commented on OOZIE-3350:
-

[~andras.piros] please join the discussion and share your thoughts.

> forkjoin validation error when "multiple ok to same node" under decision node
> -
>
> Key: OOZIE-3350
> URL: https://issues.apache.org/jira/browse/OOZIE-3350
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 4.3.1
>Reporter: wang jinyin
>Priority: Major
> Fix For: trunk
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> When multiple "ok" transitions under a decision node lead to the same node, 
> fork/join validation does not fail as expected.
>  
> In the example below, 'action_C' and 'action_D' both transition to 
> 'action_E'.
> But because they are under the same topDecisionParent, 'decision_A', the 
> validator does not throw any exception.
>  
> {quote}
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> 
>     
>     
> 
> {quote}




[jira] [Commented] (OOZIE-3350) forkjoin validation error when "multiple ok to same node" under decision node

2018-10-31 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670061#comment-16670061
 ] 

Peter Bacsko commented on OOZIE-3350:
-

This looks like a valid problem.

It's true that we only store the "topDecisionParent" which is not enough if 
forks are involved.

The solution might not be trivial, because we also have to think about nested 
forks, for example (it's a bit lame ASCII drawing):
{noformat}
              +----------+
              |          |
      +-------+  fork1   +-----------+
      |       |          |           |
      |       +----------+           |
      |                              |
      |                              |
   +--v--+                      +----v-----+
   |  A  |                      |          |
   |     |                      |  fork2   +---------+
   +--+--+                      |          |         |
      |                         +----+-----+         |
      |                              |               |
      |                              |               |
      |                         +----v-----+     +---v---+
      |                         |    B     |     |   D   |
      |                         |          |     |       |
   +--v--+                      +----+-----+     +---+---+
   |  C  |                           |               |
   |     |                           |               |
   +--+--+                           |               |
      |                         +----v-----+     +---v---+
      |                         |    C     |     |   E   |
      |                         |          |     |       |
      |                         +----+-----+     +---+---+
      |                              |               |
      |                              |               |
      |                         +----v---------------v---+
      |                         |                        |
      |                         |         join2          |
      |                         |                        |
      |                         +-----------+------------+
      |                                     |
      |                                     |
      |         +----------+                |
      +-------->+          +<---------------+
                |  join1   |
                |          |
                +----------+
{noformat}

Here, "C" is reachable from two different forks as well. We probably have to 
maintain a forkNodes<-->availableNodes mapping and make sure that every node 
is available (reachable) from only a single fork.
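The forkNodes<-->availableNodes idea could look roughly like the minimal Java sketch below. It assumes a simplified graph model (a map from node name to successor names, plus the sets of fork and join node names); the class and method names are hypothetical and not part of Oozie's workflow validator. Each fork is walked until a join or a nested fork is reached, and any node claimed by more than one fork is reported.

```java
import java.util.*;

/**
 * Hypothetical sketch of a fork -> reachable-nodes check over a simplified
 * workflow graph (node name -> successor names). Not Oozie's validator API.
 */
final class ForkReachabilityCheck {

    /** Returns each node reachable from more than one fork, mapped to those forks. */
    static Map<String, Set<String>> nodesSharedBetweenForks(
            Map<String, List<String>> transitions, Set<String> forks, Set<String> joins) {
        Map<String, Set<String>> owners = new TreeMap<>();
        for (String fork : forks) {
            Deque<String> toVisit = new ArrayDeque<>(transitions.getOrDefault(fork, List.of()));
            Set<String> visited = new HashSet<>();
            while (!toVisit.isEmpty()) {
                String node = toVisit.pop();
                // Stop at joins and at nested forks: a nested fork "owns" its own branches.
                if (forks.contains(node) || joins.contains(node) || !visited.add(node)) {
                    continue;
                }
                owners.computeIfAbsent(node, k -> new TreeSet<>()).add(fork);
                toVisit.addAll(transitions.getOrDefault(node, List.of()));
            }
        }
        // Keep only the nodes that more than one fork can reach.
        owners.values().removeIf(owningForks -> owningForks.size() < 2);
        return owners;
    }
}
```

With the graph from the drawing (and "C" transitioning to join2), the check reports "C" as shared by fork1 and fork2. Because this simplified walk stops at every join, nodes between a nested join and its enclosing join are not attributed to the outer fork; a fuller version would need to pair each fork with its join first.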


[jira] [Commented] (OOZIE-3371) TestSubWorkflowActionExecutor#testSubWorkflowRerun() is flaky

2018-10-25 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663521#comment-16663521
 ] 

Peter Bacsko commented on OOZIE-3371:
-

Looks fairly trivial, +1

> TestSubWorkflowActionExecutor#testSubWorkflowRerun() is flaky
> -
>
> Key: OOZIE-3371
> URL: https://issues.apache.org/jira/browse/OOZIE-3371
> Project: Oozie
>  Issue Type: Sub-task
>  Components: core, tests
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Attachments: OOZIE-3371.001.patch
>
>
> {noformat}
> 2018-10-19 09:43:39 [INFO] 
> ---
> 2018-10-19 09:43:39 [INFO]  T E S T S
> 2018-10-19 09:43:39 [INFO] 
> ---
> 2018-10-19 09:46:54 [INFO] Running 
> org.apache.oozie.action.oozie.TestSubWorkflowActionExecutor
> 2018-10-19 09:46:54 [ERROR] Tests run: 15, Failures: 1, Errors: 0, Skipped: 
> 0, Time elapsed: 194.156 s <<< FAILURE! - in 
> org.apache.oozie.action.oozie.TestSubWorkflowActionExecutor
> 2018-10-19 09:46:54 [ERROR] 
> testSubWorkflowRerun(org.apache.oozie.action.oozie.TestSubWorkflowActionExecutor)
>   Time elapsed: 103.736 s  <<< FAILURE!
> 2018-10-19 09:46:54 junit.framework.AssertionFailedError: 
> expected: but was:
> 2018-10-19 09:46:54   at junit.framework.Assert.fail(Assert.java:57)
> 2018-10-19 09:46:54   at junit.framework.Assert.failNotEquals(Assert.java:329)
> 2018-10-19 09:46:54   at junit.framework.Assert.assertEquals(Assert.java:78)
> 2018-10-19 09:46:54   at junit.framework.Assert.assertEquals(Assert.java:86)
> 2018-10-19 09:46:54   at 
> junit.framework.TestCase.assertEquals(TestCase.java:253)
> 2018-10-19 09:46:54   at 
> org.apache.oozie.action.oozie.TestSubWorkflowActionExecutor.testSubWorkflowRerun(TestSubWorkflowActionExecutor.java:580)
> ...
> 2018-10-19 09:46:54 [ERROR]   
> TestSubWorkflowActionExecutor.testSubWorkflowRerun:580 expected: 
> but was:
> {noformat}




[jira] [Commented] (OOZIE-3369) [core] Upgrade guru.nidi:graphviz-java to 0.7.0

2018-10-15 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650435#comment-16650435
 ] 

Peter Bacsko commented on OOZIE-3369:
-

+1

> [core] Upgrade guru.nidi:graphviz-java to 0.7.0
> ---
>
> Key: OOZIE-3369
> URL: https://issues.apache.org/jira/browse/OOZIE-3369
> Project: Oozie
>  Issue Type: Task
>  Components: core
>Affects Versions: 5.1.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Attachments: OOZIE-3369.001.patch
>
>
> There are some transitive dependencies of {{guru.nidi:graphviz-java:0.2.2}} 
> that are obsolete and/or subject to [security 
> vulnerabilities|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-8013].
>  Let's upgrade to the latest version, {{0.7.0}}.




[jira] [Updated] (OOZIE-3137) Add support for log4j2 in HiveMain

2018-10-10 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3137:

Attachment: OOZIE-3137-001.patch

> Add support for log4j2 in HiveMain
> --
>
> Key: OOZIE-3137
> URL: https://issues.apache.org/jira/browse/OOZIE-3137
> Project: Oozie
>  Issue Type: Sub-task
>Reporter: Attila Sasvari
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: OOZIE-3137-001.patch
>
>
> Hive 2.0 uses Log4j 2 (HIVE-11304). In order to support Hadoop 3, we 
> should add a mechanism to configure Log4j 2 in HiveMain.




[jira] [Assigned] (OOZIE-3137) Add support for log4j2 in HiveMain

2018-10-10 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned OOZIE-3137:
---

Assignee: Peter Bacsko  (was: Julia Kinga Marton)


[jira] [Comment Edited] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644531#comment-16644531
 ] 

Peter Bacsko edited comment on OOZIE-3136 at 10/10/18 6:56 AM:
---

So it looks like the plan is:
 # Add log4j-api12 dependency
 # Remove log4j-specific calls from XLogService, add new log4j2 setup code
 # Rewrite OozieRollingPolicy or rewrite streaming code
 # Adapt the remaining classes to log4j2:
 ** TestXLogService.java
 ** TestXLogStreamingService.java
 ** XLogUtil.java
 ** LocalOozie.java
 ** TestSignalXCommand.java
 ** ... some other test classes
 # Remove log4j dependency
 # Test like crazy, especially streaming and backward compatibility

Anything I missed? I think we're fine with the YARN/sharelib side. So it's only 
oozie-core.

Can we give a possible ETA to the Bigtop guys? I assume it's at least 4-5 
weeks, perhaps more.


was (Author: pbacsko):
So it looks like the plan is:

# Add log4j-api12 dependency
# Remove log4j-specific calls from XLogService, add new log4j2 setup code
# Rewrite OozieRollingPolicy or rewrite streaming code
# Adapt the remaining classes to log4j2:
** TestXLogService.java
** TestXLogStreamingService.java
** XLogUtil.java
** LocalOozie.java
** TestSignalXCommand.java
** ... some other test classes
# Remove log4j dependency
# Test like crazy, especially streaming and backward compatibility

Anything I missed?

Can we give a possible ETA to the Bigtop guys? I assume it's at least 4-5 
weeks, perhaps more.


> Upgrade from Log4j 1.x to 2.x
> -
>
> Key: OOZIE-3136
> URL: https://issues.apache.org/jira/browse/OOZIE-3136
> Project: Oozie
>  Issue Type: Sub-task
>Reporter: Attila Sasvari
>Assignee: Julia Kinga Marton
>Priority: Major
>
> {{5 August 2015 --The Apache Logging Services™ Project Management Committee 
> (PMC) has announced that the Log4j™ 1.x logging framework has reached its end 
> of life (EOL) and is no longer officially supported.}} 
> https://blogs.apache.org/foundation/entry/apache_logging_services_project_announces
> We should upgrade from Log4j 1.x to 2.x . Perhaps we could use slf4j .
> Related tickets: MAPREDUCE-6983, HADOOP-12956, OOZIE-3135




[jira] [Comment Edited] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644531#comment-16644531
 ] 

Peter Bacsko edited comment on OOZIE-3136 at 10/10/18 6:55 AM:
---

So it looks like the plan is:

# Add log4j-api12 dependency
# Remove log4j-specific calls from XLogService, add new log4j2 setup code
# Rewrite OozieRollingPolicy or rewrite streaming code
# Adapt the remaining classes to log4j2:
** TestXLogService.java
** TestXLogStreamingService.java
** XLogUtil.java
** LocalOozie.java
** TestSignalXCommand.java
** ... some other test classes
# Remove log4j dependency
# Test like crazy, especially streaming and backward compatibility

Anything I missed?

Can we give a possible ETA to the Bigtop guys? I assume it's at least 4-5 
weeks, perhaps more.



was (Author: pbacsko):
So it looks like the plan is:

# Add log4j-api12 dependency
# Remove log4j-specific calls from XLogService, add new log4j2 setup code
# Rewrite OozieRollingPolicy or rewrite streaming code
# Adapt the remaining classes to log4j2:
* TestXLogService.java
* TestXLogStreamingService.java
* XLogUtil.java
* LocalOozie.java
* TestSignalXCommand.java
* ... some other test classes
# Remove log4j dependency
# Test like crazy, especially streaming and backward compatibility

Anything I missed?

Can we give a possible ETA to the Bigtop guys? I assume it's at least 4-5 
weeks, perhaps more.



[jira] [Commented] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644531#comment-16644531
 ] 

Peter Bacsko commented on OOZIE-3136:
-

So it looks like the plan is:

# Add log4j-api12 dependency
# Remove log4j-specific calls from XLogService, add new log4j2 setup code
# Rewrite OozieRollingPolicy or rewrite streaming code
# Adapt the remaining classes to log4j2:
* TestXLogService.java
* TestXLogStreamingService.java
* XLogUtil.java
* LocalOozie.java
* TestSignalXCommand.java
* ... some other test classes
# Remove log4j dependency
# Test like crazy, especially streaming and backward compatibility

Anything I missed?

Can we give a possible ETA to the Bigtop guys? I assume it's at least 4-5 
weeks, perhaps more.



[jira] [Commented] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643106#comment-16643106
 ] 

Peter Bacsko commented on OOZIE-3136:
-

Well, yeah, we do access log4j directly:

https://github.com/apache/oozie/blob/ba665da34c23b1fa86bf1570724147e6f8c85b1d/core/src/main/java/org/apache/oozie/service/XLogService.java#L149

https://github.com/apache/oozie/blob/ba665da34c23b1fa86bf1570724147e6f8c85b1d/core/src/main/java/org/apache/oozie/service/XLogService.java#L174-L178

We must go through the whole codebase to see if there are any direct calls to 
Log4j classes, and then rewrite them appropriately.
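As a first pass over the codebase, a small stdlib-only scanner like the hypothetical one below could list the Java sources that import Log4j 1.x classes (package org.apache.log4j; Log4j 2 lives under org.apache.logging.log4j, so it is not matched). It only looks at plain import lines, so fully qualified references and static imports would still need a manual grep.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

/** Hypothetical helper: lists .java files under a root that import Log4j 1.x classes. */
final class Log4j1UsageScanner {
    private static final String LOG4J1_IMPORT = "import org.apache.log4j.";

    static List<Path> filesUsingLog4j1(Path sourceRoot) throws IOException {
        try (Stream<Path> paths = Files.walk(sourceRoot)) {
            return paths.filter(p -> p.toString().endsWith(".java"))
                    .filter(Log4j1UsageScanner::importsLog4j1)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    private static boolean importsLog4j1(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.map(String::trim).anyMatch(line -> line.startsWith(LOG4J1_IMPORT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```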


[jira] [Comment Edited] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643098#comment-16643098
 ] 

Peter Bacsko edited comment on OOZIE-3136 at 10/9/18 10:29 AM:
---

Regarding the bridge, check this out: 
https://logging.apache.org/log4j/log4j-2.2/log4j-1.2-api/index.html

"To use the Log4j Legacy Bridge just remove all the Log4j 1.x jars from the 
application and replace them with the bridge jar. Once in place all logging 
that uses Log4j 1.x will be routed to Log4j 2. *However, applications that 
attempt to modify legacy Log4j by adding Appenders, Filters, etc may experience 
problems if they try to verify the success of these actions as these methods 
are largely no-ops*."

Do we do anything like that? I mean, messing around with log4j classes directly. 
This must be checked.

If we directly use stuff like {{PropertyConfigurator}}, that's a no-op in the 
bridge. We must also make sure that initial {{Logger.getLogger()}} calls 
configure Log4j2 based on the original log4j properties file. I just 
checked this and it does not seem to be the case: there is a converter class, but 
it looks like we have to call it manually. 

So there are quite a few things to consider/validate here.


was (Author: pbacsko):
Regarding the bridge, check this out: 
https://logging.apache.org/log4j/log4j-2.2/log4j-1.2-api/index.html

"To use the Log4j Legacy Bridge just remove all the Log4j 1.x jars from the 
application and replace them with the bridge jar. Once in place all logging 
that uses Log4j 1.x will be routed to Log4j 2. *However, applications that 
attempt to modify legacy Log4j by adding Appenders, Filters, etc may experience 
problems if they try to verify the success of these actions as these methods 
are largely no-ops*."

Do we do anything like that? I mean, messing around with log4j classes directly. 
This must be checked.

If we directly use stuff like {{PropertyConfigurator}}, that's a no-op in the 
bridge. We must also make sure that initial {{Logger.getLogger()}} calls 
configure Log4j2 based on the original log4j properties file. I just 
checked this and it does not seem to be the case: there is a converter class, but 
it looks like we have to call it manually. 

So there are quite a few things to consider/validate here.


[jira] [Commented] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643098#comment-16643098
 ] 

Peter Bacsko commented on OOZIE-3136:
-

Regarding the bridge, check this out: 
https://logging.apache.org/log4j/log4j-2.2/log4j-1.2-api/index.html

"To use the Log4j Legacy Bridge just remove all the Log4j 1.x jars from the 
application and replace them with the bridge jar. Once in place all logging 
that uses Log4j 1.x will be routed to Log4j 2. *However, applications that 
attempt to modify legacy Log4j by adding Appenders, Filters, etc may experience 
problems if they try to verify the success of these actions as these methods 
are largely no-ops*."

Do we do anything like that? I mean, messing around with log4j classes directly. 
This must be checked.

If we directly use stuff like {{PropertyConfigurator}}, that's a no-op in the 
bridge. We must also make sure that initial {{Logger.getLogger()}} calls 
configure Log4j2 based on the original log4j properties file. I just 
checked this and it does not seem to be the case: there is a converter class, but 
it looks like we have to call it manually. 

So there are quite a few things to consider/validate here.


[jira] [Commented] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-09 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643051#comment-16643051
 ] 

Peter Bacsko commented on OOZIE-3136:
-

I do agree that we don't have to upgrade right now. That would be a much bigger 
undertaking, taking at least weeks to finish and then test properly. Backward 
compatibility must be preserved.

bq. if the user has a fat JAR consisting also of log4j12 classes that are 
versions incompatible w/ the Oozie server 

Fat JARs have always been a problem and right now there's not much we can do 
about it (we could mitigate it to a certain degree with better classloader 
isolation, but again, that's not straightforward). We've seen conflicting 
Guava versions as well. Also, the YARN side is a different story: we don't even use 
loggers there, just sysout. On a separate note, we should focus on those sysouts 
too, as it doesn't look good :)

bq.  I don't recall any issues with Oozie's logging performance

I do remember one instance, and I'm sure you do too: when we enabled full 
DEBUG logging in the tests, it completely choked log4j and the upstream tests 
timed out. Pretty nasty. Admittedly, a lot of output was generated, but still, such 
things should not happen with a well-performing logging library.

The fact alone that we depend on an ancient library that hasn't been actively 
maintained for the last 3-4 years is a good reason for an upgrade, 
probably in Oozie 6.0.


[jira] [Commented] (OOZIE-3135) Configure log4j2 in SqoopMain

2018-10-08 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641937#comment-16641937
 ] 

Peter Bacsko commented on OOZIE-3135:
-

Ya, I missed that. I thought we hadn't filed a JIRA for that yet.

> Configure log4j2 in SqoopMain
> -
>
> Key: OOZIE-3135
> URL: https://issues.apache.org/jira/browse/OOZIE-3135
> Project: Oozie
>  Issue Type: Sub-task
>Affects Versions: 5.0.0b1
>Reporter: Attila Sasvari
>Assignee: Julia Kinga Marton
>Priority: Major
> Attachments: OOZIE-3135-001.patch
>
>
> In Hadoop 3, MAPREDUCE-6983 switched to using slf4j & log4j2 in 
> {{org.apache.hadoop.mapreduce.Job}} (which prints out the MR job IDs needed by 
> Oozie). We need to set up log4j accordingly (this is also related to 
> HADOOP-12956).
> Without proper configuration in the Sqoop action, we won't be able to get 
> external job IDs (the SqoopActionExecutor unit tests and the real action would 
> also be affected).
>
> [The API for Log4j 2 is not compatible with Log4j 
> 1.x|https://logging.apache.org/log4j/2.x/], but we will need to support both 
> Hadoop 2 and Hadoop 3 profiles for a while. 
> We could use reflection to determine the type of the logger object in 
> {{org.apache.hadoop.mapreduce.Job}} and configure log4j settings based on it, 
> but there might be a better way.
> For example, we could do something like this:
> - add a new method for configuring log4j2:
> {code}
> private String setUpSqoopLog4J2(final String rootLogLevel) throws 
> IOException {
> System.out.println("Setting up log4j2");
> final String logFile = getSqoopLogFile();
> final File log4j2Xml = new File(SQOOP_LOG4J2_XML);
> try (Writer writer = new FileWriter(log4j2Xml))
> {
> final String logj2SettingsXml = " encoding=\"UTF-8\"?>\n" +
> "\n" +
> "\n" +
> " target=\"SYSTEM_OUT\">\n" +
> " [%t] %-5level %logger{36} - %msg%n\"/>\n" +
> "\n" +
> " "\">  \n" +
> " [%t] %-5level %logger{36} - %msg%n\"/>\n" +
> " \n" +
> "\n" +
> "\n" +
> " "\">\n" +
> "\n" +
> "\n" +
> "\n" +
> "\n" +
> "";
> writer.write(logj2SettingsXml);
> }
> System.out.printf("log4j2 configuration file created at %s%n", 
> log4j2Xml.getAbsolutePath());
> final LoggerContext context = (LoggerContext) LogManager.getContext(false);
> context.setConfigLocation(log4j2Xml.toURI()); // forces log4j2 reconfiguration
> return logFile;
> }
> {code}
> and call it in the {{run()}} method if the mapreduce client is using slf4j 
> for logging:
> {code}
> String logFile;
> // MAPREDUCE-6983 switches to slf4j & log4j2, so log4j must be set up accordingly
> if (org.apache.hadoop.mapreduce.Job.class.getDeclaredField("LOG").getType()
>         .isAssignableFrom(org.slf4j.Logger.class)) {
>     logFile = setUpSqoopLog4J2(rootLogLevel);
> } else {
>     logFile = setUpSqoopLog4J(rootLogLevel, logLevel);
> }
> {code}
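A possible variant of the reflection check in the description (a sketch, not the attached patch): compare the declared field type by its fully qualified name, so the check itself never references org.slf4j.Logger, and a missing field yields false instead of a NoSuchFieldException. The class and method names are hypothetical; note that an exact name match, unlike isAssignableFrom(), does not accept subtypes.

```java
import java.lang.reflect.Field;

/** Hypothetical helper for deciding between the log4j and log4j2 setup paths. */
final class LoggerTypeCheck {

    /** True if {@code owner} declares {@code fieldName} with exactly the given type name. */
    static boolean declaredFieldTypeIs(Class<?> owner, String fieldName, String typeName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            return typeName.equals(field.getType().getName());
        } catch (NoSuchFieldException e) {
            return false; // field not present: treat as "not that logger type"
        }
    }

    /** Hypothetical wrapper matching the check in the description. */
    static boolean usesSlf4jLogger(Class<?> owner) {
        return declaredFieldTypeIs(owner, "LOG", "org.slf4j.Logger");
    }
}
```

Usage would be along the lines of {{LoggerTypeCheck.usesSlf4jLogger(org.apache.hadoop.mapreduce.Job.class)}} in place of the {{getDeclaredField("LOG")...isAssignableFrom(...)}} expression.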




[jira] [Commented] (OOZIE-3136) Upgrade from Log4j 1.x to 2.x

2018-10-08 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641930#comment-16641930
 ] 

Peter Bacsko commented on OOZIE-3136:
-

HADOOP-12956 proposed a complete migration to Log4j2. I think we should do that 
as well. It is certainly more complex, but cleaner.

Having both the adapter layer and the original log4j on the classpath can cause 
strange issues. For example, if an application wants to use the legacy log4j but 
the classloader finds the adapter classes first, all of its logging calls will be 
redirected to log4j2. Depending on the situation this might or might not be 
what we want, but in general, having the same API in two different artifacts is 
simply a pain and a perfect recipe for hard-to-debug problems.
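One way to spot this kind of conflict is to ask the class loader for every copy of a class file on the classpath. The helper below is a hypothetical stdlib-only sketch built on ClassLoader.getResources(); if org.apache.log4j.Logger shows up at two URLs, most likely both the log4j 1.x jar and the log4j-1.2-api bridge (which also ships that class) are present.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical diagnostic: every classpath location that provides a given class. */
final class ClasspathDuplicates {

    static List<URL> locationsOf(String className, ClassLoader loader) throws IOException {
        // Class files are looked up as resources, e.g. org/apache/log4j/Logger.class
        String resource = className.replace('.', '/') + ".class";
        return Collections.list(loader.getResources(resource));
    }
}
```

For example, {{ClasspathDuplicates.locationsOf("org.apache.log4j.Logger", Thread.currentThread().getContextClassLoader())}} returning more than one URL would be a strong hint that the classloader ordering decides which implementation wins.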


[jira] [Commented] (OOZIE-3135) Configure log4j2 in SqoopMain

2018-10-08 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641915#comment-16641915
 ] 

Peter Bacsko commented on OOZIE-3135:
-

This is exactly what we have to do for Hive as well. They also switched to 
log4j2 and we have to detect whether it's log4j or log4j2. 
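Detecting which Log4j generation is available does not require calling either API. A minimal sketch, assuming that probing for org.apache.log4j.Logger and org.apache.logging.log4j.core.LoggerContext is a good enough signal (the class and enum names are hypothetical): Class.forName(name, false, loader) loads a class without running its static initializers, so the probe does not trigger any logger initialization. Keep in mind that the log4j-1.2-api bridge also provides org.apache.log4j.Logger, so that probe alone does not prove that real log4j 1.x is present.

```java
import java.util.EnumSet;

/** Hypothetical helper: which Log4j generations are visible to a class loader. */
final class LoggingBackendDetector {
    enum Backend { LOG4J1_API, LOG4J2_CORE }

    /** Checks class visibility without running static initializers. */
    static boolean isPresent(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    static EnumSet<Backend> detect(ClassLoader loader) {
        EnumSet<Backend> found = EnumSet.noneOf(Backend.class);
        // Also satisfied by the log4j-1.2-api bridge, not only by log4j 1.x itself.
        if (isPresent("org.apache.log4j.Logger", loader)) {
            found.add(Backend.LOG4J1_API);
        }
        if (isPresent("org.apache.logging.log4j.core.LoggerContext", loader)) {
            found.add(Backend.LOG4J2_CORE);
        }
        return found;
    }
}
```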


[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-10-03 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636705#comment-16636705
 ] 

Peter Bacsko commented on OOZIE-3160:
-

Please create a new JIRA for investigating why tasks sometimes get stuck in 
the executors.

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: trunk
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-006.patch, 
> OOZIE-3160-007.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, OOZIE-3160.amend.001.patch, 
> OOZIE-3160.amend.002.patch, OOZIE-3160.amend.003.patch, 
> OOZIE-3160.amend.004.patch, OOZIE-3160.amend.005.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes high CPU, around 10% on my machine. 
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping with sleep() instead of awaiting.
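For comparison, a blocking take() can wait on a Condition instead of sleeping and re-polling. The class below is a hypothetical, minimal delay queue (not Oozie's PriorityDelayQueue and not what the attached patches do): take() parks until put() signals or until the head element's delay expires. The JDK's java.util.concurrent.DelayQueue implements the same idea for elements that implement Delayed.

```java
import java.util.PriorityQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical minimal delay queue whose take() blocks instead of busy waiting. */
final class BlockingDelayQueue<E> {
    private static final class Entry<T> implements Comparable<Entry<T>> {
        final T value;
        final long readyAtNanos;

        Entry(T value, long readyAtNanos) {
            this.value = value;
            this.readyAtNanos = readyAtNanos;
        }

        @Override
        public int compareTo(Entry<T> other) {
            // nanoTime() values must be compared by their difference, not directly.
            return Long.signum(readyAtNanos - other.readyAtNanos);
        }
    }

    private final PriorityQueue<Entry<E>> heap = new PriorityQueue<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();

    void put(E value, long delay, TimeUnit unit) {
        lock.lock();
        try {
            heap.add(new Entry<>(value, System.nanoTime() + unit.toNanos(delay)));
            changed.signalAll(); // the new element may be due sooner than the old head
        } finally {
            lock.unlock();
        }
    }

    E take() throws InterruptedException {
        lock.lockInterruptibly();
        try {
            while (true) {
                Entry<E> head = heap.peek();
                if (head == null) {
                    changed.await(); // no busy loop: park until put() signals
                    continue;
                }
                long waitNanos = head.readyAtNanos - System.nanoTime();
                if (waitNanos <= 0) {
                    return heap.poll().value;
                }
                changed.awaitNanos(waitNanos); // wake when the head is due or the queue changes
            }
        } finally {
            lock.unlock();
        }
    }
}
```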




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-10-03 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636702#comment-16636702
 ] 

Peter Bacsko commented on OOZIE-3160:
-

+1 to the latest amend patch if Jenkins build passes.

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: trunk
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-006.patch, 
> OOZIE-3160-007.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, OOZIE-3160.amend.001.patch, 
> OOZIE-3160.amend.002.patch, OOZIE-3160.amend.003.patch, 
> OOZIE-3160.amend.004.patch, OOZIE-3160.amend.005.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> The code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.
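A minimal sketch of the blocking alternative the reporter alludes to, assuming a plain priority queue without per-element delays: take() parks on a Condition until put() signals it, instead of sleeping 10 ms and polling again. The class name is hypothetical and this is not the actual PriorityDelayQueue fix from the attached patches; a queue with delays would additionally need a timed wait (for example Condition#awaitNanos) until the head element becomes due.

```java
import java.util.PriorityQueue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a blocking priority queue without per-element delays.
class BlockingPriorityQueueSketch<E extends Comparable<? super E>> {
    private final PriorityQueue<E> queue = new PriorityQueue<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();

    public void put(E element) {
        lock.lock();
        try {
            queue.add(element);
            notEmpty.signal(); // wake up one consumer blocked in take()
        } finally {
            lock.unlock();
        }
    }

    public E take() throws InterruptedException {
        lock.lockInterruptibly();
        try {
            // The loop guards against spurious wakeups; await() releases the lock
            // and parks the thread, so an idle consumer burns no CPU.
            while (queue.isEmpty()) {
                notEmpty.await();
            }
            return queue.poll();
        } finally {
            lock.unlock();
        }
    }
}
```

The consumer only wakes up when an element has actually been added, so the idle cost drops to zero instead of one poll() every 10 ms.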




[jira] [Commented] (OOZIE-3354) [core] [SSH action] SSH action gets hung

2018-09-28 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631842#comment-16631842
 ] 

Peter Bacsko commented on OOZIE-3354:
-

+1

> [core] [SSH action] SSH action gets hung
> 
>
> Key: OOZIE-3354
> URL: https://issues.apache.org/jira/browse/OOZIE-3354
> Project: Oozie
>  Issue Type: Bug
>  Components: action, core
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: OOZIE-3354.001.patch, OOZIE-3354.002.patch, 
> OOZIE-3354.003.patch
>
>
> In OOZIE-3183, {{SshActionExecutor#drainBuffers()}} was changed. Previously, 
> it called {{Process#exitValue()}}, which returns immediately, either with 
> the exit code or by throwing an {{IllegalThreadStateException}} if the 
> process is still running.
> The current implementation introduced by OOZIE-3183 uses {{Process#waitFor()}}, 
> which blocks until the process finishes. Since 
> {{SshActionExecutor#check()}} sometimes calls {{ssh ... cat stdout}}, and this 
> SSH process can get stuck even after {{cat stdout}} has finished on the 
> target host, {{SshActionExecutor#drainBuffers()}} may wait 
> indefinitely without a chance to gather any {{stdout}} or {{stderr}} logs. 
> This makes the change a compatibility break with the existing 
> SSH action behavior.
> Let's re-introduce the former behavior in 
> {{SshActionExecutor#drainBuffers()}}, which keeps polling 
> {{Process#exitValue()}} and reading the progress on {{stdout}} and {{stderr}} 
> until the process finishes, for backwards compatibility.
> [This 
> article|https://www.javaworld.com/article/2071275/core-java/when-runtime-exec---won-t.html]
>  covers the behavioral details of {{Process#waitFor()}}.
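The polling behavior described above can be sketched as follows, assuming the caller owns an already started Process. The class and method names are hypothetical and this is not the actual SshActionExecutor patch. It copies only the bytes that InputStream#available() reports, so it never blocks on a pipe that some lingering writer (such as a stuck SSH session) keeps open, and it treats the IllegalThreadStateException thrown by Process#exitValue() as the "still running" signal.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a non-blocking drain loop; not Oozie's actual code.
class PollingDrainer {

    // Copies only the bytes that are readable right now, so this never blocks.
    static void drainAvailable(InputStream in, ByteArrayOutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int available;
        while ((available = in.available()) > 0) {
            int read = in.read(buffer, 0, Math.min(buffer.length, available));
            if (read < 0) {
                break; // end of stream
            }
            out.write(buffer, 0, read);
        }
    }

    // Polls exitValue() instead of calling waitFor(), draining stdout/stderr in between.
    static int drainUntilExit(Process process, ByteArrayOutputStream stdout,
                              ByteArrayOutputStream stderr, long pollMillis)
            throws IOException, InterruptedException {
        while (true) {
            drainAvailable(process.getInputStream(), stdout);
            drainAvailable(process.getErrorStream(), stderr);
            try {
                int exitValue = process.exitValue(); // throws while the process is running
                // Pick up anything written between the last drain and the exit.
                drainAvailable(process.getInputStream(), stdout);
                drainAvailable(process.getErrorStream(), stderr);
                return exitValue;
            } catch (IllegalThreadStateException stillRunning) {
                Thread.sleep(pollMillis);
            }
        }
    }
}
```

The poll interval is a parameter here; a review comment elsewhere in this thread suggests 500 ms rather than 100 ms.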




[jira] [Commented] (OOZIE-3354) [core] [SSH action] SSH action gets hung

2018-09-28 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631681#comment-16631681
 ] 

Peter Bacsko commented on OOZIE-3354:
-

OK, this one looks good. Let's wait for Jenkins, then I'll +1 it if there are no 
errors.

> [core] [SSH action] SSH action gets hung
> 
>
> Key: OOZIE-3354
> URL: https://issues.apache.org/jira/browse/OOZIE-3354
> Project: Oozie
>  Issue Type: Bug
>  Components: action, core
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: OOZIE-3354.001.patch, OOZIE-3354.002.patch, 
> OOZIE-3354.003.patch
>
>
> In OOZIE-3183, {{SshActionExecutor#drainBuffers()}} was changed. Previously, 
> it called {{Process#exitValue()}}, which returns immediately, either with 
> the exit code or by throwing an {{IllegalThreadStateException}} if the 
> process is still running.
> The current implementation introduced by OOZIE-3183 uses {{Process#waitFor()}}, 
> which blocks until the process finishes. Since 
> {{SshActionExecutor#check()}} sometimes calls {{ssh ... cat stdout}}, and this 
> SSH process can get stuck even after {{cat stdout}} has finished on the 
> target host, {{SshActionExecutor#drainBuffers()}} may wait 
> indefinitely without a chance to gather any {{stdout}} or {{stderr}} logs. 
> This makes the change a compatibility break with the existing 
> SSH action behavior.
> Let's re-introduce the former behavior in 
> {{SshActionExecutor#drainBuffers()}}, which keeps polling 
> {{Process#exitValue()}} and reading the progress on {{stdout}} and {{stderr}} 
> until the process finishes, for backwards compatibility.
> [This 
> article|https://www.javaworld.com/article/2071275/core-java/when-runtime-exec---won-t.html]
>  covers the behavioral details of {{Process#waitFor()}}.




[jira] [Commented] (OOZIE-3354) [core] [SSH action] SSH action gets hung

2018-09-28 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631675#comment-16631675
 ] 

Peter Bacsko commented on OOZIE-3354:
-

Thoughts from me:

1. Add a short comment explaining that the current solution is a stopgap for the 
problem that arises when {{Process.waitFor()}} is used, or something along 
those lines.
2. Increase the sleep to 500ms; 100ms might be too tight.


> [core] [SSH action] SSH action gets hung
> 
>
> Key: OOZIE-3354
> URL: https://issues.apache.org/jira/browse/OOZIE-3354
> Project: Oozie
>  Issue Type: Bug
>  Components: action, core
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: OOZIE-3354.001.patch, OOZIE-3354.002.patch
>
>
> In OOZIE-3183, {{SshActionExecutor#drainBuffers()}} was changed. Previously, 
> it called {{Process#exitValue()}}, which returns immediately, either with 
> the exit code or by throwing an {{IllegalThreadStateException}} if the 
> process is still running.
> The current implementation introduced by OOZIE-3183 uses {{Process#waitFor()}}, 
> which blocks until the process finishes. Since 
> {{SshActionExecutor#check()}} sometimes calls {{ssh ... cat stdout}}, and this 
> SSH process can get stuck even after {{cat stdout}} has finished on the 
> target host, {{SshActionExecutor#drainBuffers()}} may wait 
> indefinitely without a chance to gather any {{stdout}} or {{stderr}} logs. 
> This makes the change a compatibility break with the existing 
> SSH action behavior.
> Let's re-introduce the former behavior in 
> {{SshActionExecutor#drainBuffers()}}, which keeps polling 
> {{Process#exitValue()}} and reading the progress on {{stdout}} and {{stderr}} 
> until the process finishes, for backwards compatibility.
> [This 
> article|https://www.javaworld.com/article/2071275/core-java/when-runtime-exec---won-t.html]
>  covers the behavioral details of {{Process#waitFor()}}.




[jira] [Comment Edited] (OOZIE-3353) Add support for WebHDFS token provider

2018-09-27 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628566#comment-16628566
 ] 

Peter Bacsko edited comment on OOZIE-3353 at 9/27/18 2:23 PM:
--

I'll add an example later.

For instance, using DistCp with WebHDFS does not work in a secure 
environment.


was (Author: pbacsko):
I'll add an example later.

But for example, using DistCp with webhdfs will not work in a secure 
environment.

> Add support for WebHDFS token provider
> --
>
> Key: OOZIE-3353
> URL: https://issues.apache.org/jira/browse/OOZIE-3353
> Project: Oozie
>  Issue Type: New Feature
>  Components: core
>Reporter: Peter Bacsko
>Priority: Major
>
> Oozie currently doesn't support fetching delegation tokens for WebHDFS.
> Ordinary HDFS tokens are not adequate because the token kind is different: 
> WEBHDFS_TOKEN_KIND vs HDFS_DELEGATION_TOKEN.
> We need to use REST calls to retrieve the tokens.




[jira] [Commented] (OOZIE-3353) Add support for WebHDFS token provider

2018-09-26 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628566#comment-16628566
 ] 

Peter Bacsko commented on OOZIE-3353:
-

I'll add an example later.

For example, using DistCp with WebHDFS does not work in a secure 
environment.

> Add support for WebHDFS token provider
> --
>
> Key: OOZIE-3353
> URL: https://issues.apache.org/jira/browse/OOZIE-3353
> Project: Oozie
>  Issue Type: New Feature
>  Components: core
>Reporter: Peter Bacsko
>Priority: Major
>
> Oozie currently doesn't support fetching delegation tokens for WebHDFS.
> Ordinary HDFS tokens are not adequate because the token kind is different: 
> WEBHDFS_TOKEN_KIND vs HDFS_DELEGATION_TOKEN.
> We need to use REST calls to retrieve the tokens.




[jira] [Created] (OOZIE-3353) Add support for WebHDFS token provider

2018-09-26 Thread Peter Bacsko (JIRA)

Peter Bacsko created OOZIE-3353:
---

 Summary: Add support for WebHDFS token provider
 Key: OOZIE-3353
 URL: https://issues.apache.org/jira/browse/OOZIE-3353
 Project: Oozie
  Issue Type: New Feature
  Components: core
Reporter: Peter Bacsko


Oozie currently doesn't support fetching delegation tokens for WebHDFS.

Ordinary HDFS tokens are not adequate because the token kind is different: 
WEBHDFS_TOKEN_KIND vs HDFS_DELEGATION_TOKEN.

We need to use REST calls to retrieve the tokens.
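As a sketch of that REST call, assuming the standard WebHDFS GETDELEGATIONTOKEN operation: the request is GET /webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=<user>, and the JSON response carries the encoded token in Token.urlString. The helper class, host, and port are hypothetical. The code only builds the URI and extracts urlString with a regular expression (a JSON library would be more robust); on a secure cluster the HTTP request itself must be SPNEGO/Kerberos-authenticated, which is left out. On the Hadoop side, Token#decodeFromUrlString can turn the string back into a token.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the WebHDFS GETDELEGATIONTOKEN REST call (needs Java 10+).
class WebHdfsTokenRequest {
    private static final Pattern URL_STRING =
            Pattern.compile("\"urlString\"\\s*:\\s*\"([^\"]+)\"");

    // Builds e.g. http://host:port/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=oozie
    static URI tokenUri(String scheme, String host, int port, String renewer) {
        String query = "op=GETDELEGATIONTOKEN&renewer="
                + URLEncoder.encode(renewer, StandardCharsets.UTF_8);
        return URI.create(scheme + "://" + host + ":" + port + "/webhdfs/v1/?" + query);
    }

    // Extracts Token.urlString from a response such as {"Token":{"urlString":"..."}}.
    static String extractUrlString(String json) {
        Matcher m = URL_STRING.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no urlString in response: " + json);
        }
        return m.group(1);
    }
}
```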




[jira] [Commented] (OOZIE-3340) [fluent-job] Create error handler ACTION only if needed

2018-09-25 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627126#comment-16627126
 ] 

Peter Bacsko commented on OOZIE-3340:
-

+1 LGTM if Jenkins build passes.

> [fluent-job] Create error handler ACTION only if needed
> ---
>
> Key: OOZIE-3340
> URL: https://issues.apache.org/jira/browse/OOZIE-3340
> Project: Oozie
>  Issue Type: Improvement
>  Components: fluent-job
>Reporter: Andras Salamon
>Assignee: Julia Kinga Marton
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: OOZIE-3340-001.patch, OOZIE-3340.002.patch
>
>
> The Shell and MultipleShellActions examples of the Fluent Job API generate 
> multiple actions with the same name ({{email-on-error}}), which results in the 
> {{E0705}} error code.
> For MultipleShellActions the generated XML:
> {noformat}
> Workflow job definition generated from API jar: 
> 
>  xmlns:workflow="uri:oozie:workflow:1.0" 
> xmlns:shell="uri:oozie:shell-action:1.0" name="shell-example">
> 
> 
> Action failed, error 
> message[${wf:errorMessage(wf:lastErrorNode())}]
> 
> 
> 
> someb...@apache.org
> Workflow error
> Shell action failed, error 
> message[${wf:errorMessage(wf:lastErrorNode())}]
> 
> 
> 
> 
> 
> ...
> 
> 
>...
> 
> 
> ...
> 
> 
> ...
> 
> 
> ...
> 
> ...
> {noformat}
> The error message:
> {noformat}
> bin/oozie job -oozie http://localhost:11000/oozie -runjar fluenttest.jar 
> -config job.properties -verbose
> ...
> Error: E0705 : E0705: Nnode already defined, node [email-on-error]
> {noformat}
> The Shell example also creates an XML with multiple {{email-on-error}} 
> actions.
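The failure mode can be illustrated with a small, hypothetical registry (this is not the Fluent Job API): node names must be unique, similar in spirit to the E0705 check, so every action that needs the shared error handler should look it up and create it only on first use rather than emitting its own copy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of workflow nodes keyed by name; not the Fluent Job API.
class WorkflowNodes {
    private final Map<String, String> nodesByName = new LinkedHashMap<>();

    // Rejects duplicate names, like the "node already defined" validation.
    void define(String name, String definition) {
        if (nodesByName.putIfAbsent(name, definition) != null) {
            throw new IllegalStateException("node already defined, node [" + name + "]");
        }
    }

    // Creates the shared error handler only if it does not exist yet.
    String errorHandler(String name, String definition) {
        nodesByName.putIfAbsent(name, definition);
        return name;
    }

    int size() {
        return nodesByName.size();
    }
}
```

Two shell actions that both call errorHandler("email-on-error", ...) end up sharing a single node, whereas defining that name twice fails.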




[jira] [Comment Edited] (OOZIE-3307) [core][oya] Limit heap usage of LauncherAM

2018-09-25 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627053#comment-16627053
 ] 

Peter Bacsko edited comment on OOZIE-3307 at 9/25/18 9:53 AM:
--

Test failures are unrelated, +1 for this.


was (Author: pbacsko):
Test failure are unrelated, +1 for this.

> [core][oya] Limit heap usage of LauncherAM
> --
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Assignee: Andras Piros
>Priority: Critical
> Fix For: 5.1.0
>
> Attachments: OOZIE-3307.001.patch, OOZIE-3307.002.patch, 
> OOZIE-3307.003.patch, OOZIE-3307.004.patch, OOZIE-3307.005.patch, 
> OOZIE-3307.006.patch
>
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|
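The figures in these diagnostics can be reproduced from the process-tree dump, assuming a 4 KiB page size (typical on x86_64 Linux; YARN obtains the real value from the OS): physical usage is the sum of RSSMEM_USAGE(PAGES) times the page size, and virtual usage is the sum of VMEM_USAGE(BYTES) over the tree. The 10 GB virtual limit against 2 GB physical implies a yarn.nodemanager.vmem-pmem-ratio of 5 on this cluster; that ratio is inferred from the log, not stated in it.

```java
import java.util.Locale;

// Recomputes the YARN container memory figures from the process-tree dump above.
class ContainerMemoryMath {
    static final long PAGE_SIZE_BYTES = 4096L; // assumption: 4 KiB pages

    static long totalRssBytes(long... rssPages) {
        long total = 0;
        for (long pages : rssPages) {
            total += pages * PAGE_SIZE_BYTES;
        }
        return total;
    }

    static long totalVmemBytes(long... vmemBytes) {
        long total = 0;
        for (long bytes : vmemBytes) {
            total += bytes;
        }
        return total;
    }

    public static void main(String[] args) {
        // bash wrapper (PID 11516) and LauncherAM java process (PID 11755)
        long rss = totalRssBytes(682L, 119576L);
        long vmem = totalVmemBytes(115863552L, 10658242560L);
        System.out.println(String.format(Locale.ROOT, "physical: %.1f MB", rss / (1024.0 * 1024.0)));
        System.out.println(String.format(Locale.ROOT, "virtual: %.1f GB", vmem / (1024.0 * 1024.0 * 1024.0)));
    }
}
```

This prints 469.8 MB and 10.0 GB, matching the "Current usage" line; almost all of the virtual memory belongs to the LauncherAM java process, while its resident size stays far below the 2 GB physical limit.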




[jira] [Commented] (OOZIE-3307) [core][oya] Limit heap usage of LauncherAM

2018-09-25 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627053#comment-16627053
 ] 

Peter Bacsko commented on OOZIE-3307:
-

Test failures are unrelated, +1 for this.

> [core][oya] Limit heap usage of LauncherAM
> --
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Assignee: Andras Piros
>Priority: Critical
> Fix For: 5.1.0
>
> Attachments: OOZIE-3307.001.patch, OOZIE-3307.002.patch, 
> OOZIE-3307.003.patch, OOZIE-3307.004.patch, OOZIE-3307.005.patch, 
> OOZIE-3307.006.patch
>
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Updated] (OOZIE-3352) TestCallableQueueService#testPriorityExecutionOrder() is failing with ConcurrentModification

2018-09-24 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3352:

Attachment: OOZIE-3352-002.patch

> TestCallableQueueService#testPriorityExecutionOrder() is failing with 
> ConcurrentModification
> 
>
> Key: OOZIE-3352
> URL: https://issues.apache.org/jira/browse/OOZIE-3352
> Project: Oozie
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: trunk
>Reporter: Julia Kinga Marton
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: OOZIE-3352-001.patch, OOZIE-3352-002.patch
>
>
> During the last few precommit runs, 
> {{org.apache.oozie.service.TestCallableQueueService#testPriorityExecutionOrder()}}
>  failed with {{ConcurrentModificationException}}:
> {noformat}
> [ERROR] 
> testPriorityExecutionOrder(org.apache.oozie.service.TestCallableQueueService) 
> Time elapsed: 20.999 s <<< ERROR!
>  java.util.ConcurrentModificationException
>  at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>  ...
>  at java.util.Collections.min(Collections.java:599)
>  at 
> org.apache.oozie.service.TestCallableQueueService.testPriorityExecutionOrder(TestCallableQueueService.java:993)
>  ...
>  {noformat}
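One common way to avoid this exception when another thread may be adding to the list while the test computes its minimum, sketched here as a generic pattern rather than the fix attached to this issue: wrap the list with Collections.synchronizedList, copy it while holding its lock, then run Collections.min on the private copy. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: compute a minimum without racing concurrent writers.
class SnapshotMin {
    // syncList must come from Collections.synchronizedList(...); its monitor guards iteration.
    static <T extends Comparable<? super T>> T minOfSnapshot(List<T> syncList) {
        List<T> snapshot;
        synchronized (syncList) {
            snapshot = new ArrayList<>(syncList);
        }
        // Iterates the private copy, so concurrent add() calls cannot trigger
        // a ConcurrentModificationException here.
        return Collections.min(snapshot);
    }
}
```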




[jira] [Updated] (OOZIE-3352) TestCallableQueueService#testPriorityExecutionOrder() is failing with ConcurrentModification

2018-09-24 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3352:

Attachment: OOZIE-3352-001.patch

> TestCallableQueueService#testPriorityExecutionOrder() is failing with 
> ConcurrentModification
> 
>
> Key: OOZIE-3352
> URL: https://issues.apache.org/jira/browse/OOZIE-3352
> Project: Oozie
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: trunk
>Reporter: Julia Kinga Marton
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: OOZIE-3352-001.patch
>
>
> During the last few precommit runs, 
> {{org.apache.oozie.service.TestCallableQueueService#testPriorityExecutionOrder()}}
>  failed with {{ConcurrentModificationException}}:
> {noformat}
> [ERROR] 
> testPriorityExecutionOrder(org.apache.oozie.service.TestCallableQueueService) 
> Time elapsed: 20.999 s <<< ERROR!
>  java.util.ConcurrentModificationException
>  at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>  ...
>  at java.util.Collections.min(Collections.java:599)
>  at 
> org.apache.oozie.service.TestCallableQueueService.testPriorityExecutionOrder(TestCallableQueueService.java:993)
>  ...
>  {noformat}




[jira] [Assigned] (OOZIE-3352) TestCallableQueueService#testPriorityExecutionOrder() is failing with ConcurrentModification

2018-09-24 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned OOZIE-3352:
---

Assignee: Peter Bacsko

> TestCallableQueueService#testPriorityExecutionOrder() is failing with 
> ConcurrentModification
> 
>
> Key: OOZIE-3352
> URL: https://issues.apache.org/jira/browse/OOZIE-3352
> Project: Oozie
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: trunk
>Reporter: Julia Kinga Marton
>Assignee: Peter Bacsko
>Priority: Major
>
> During the last few precommit runs, 
> {{org.apache.oozie.service.TestCallableQueueService#testPriorityExecutionOrder()}}
>  failed with {{ConcurrentModificationException}}:
> {noformat}
> [ERROR] 
> testPriorityExecutionOrder(org.apache.oozie.service.TestCallableQueueService) 
> Time elapsed: 20.999 s <<< ERROR!
>  java.util.ConcurrentModificationException
>  at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>  ...
>  at java.util.Collections.min(Collections.java:599)
>  at 
> org.apache.oozie.service.TestCallableQueueService.testPriorityExecutionOrder(TestCallableQueueService.java:993)
>  ...
>  {noformat}




[jira] [Commented] (OOZIE-3352) TestCallableQueueService#testPriorityExecutionOrder() is failing with ConcurrentModification

2018-09-24 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625745#comment-16625745
 ] 

Peter Bacsko commented on OOZIE-3352:
-

I'm converting this to a unit test; I will upload a patch soon.

> TestCallableQueueService#testPriorityExecutionOrder() is failing with 
> ConcurrentModification
> 
>
> Key: OOZIE-3352
> URL: https://issues.apache.org/jira/browse/OOZIE-3352
> Project: Oozie
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: trunk
>Reporter: Julia Kinga Marton
>Priority: Major
>
> During the last few precommit runs, 
> {{org.apache.oozie.service.TestCallableQueueService#testPriorityExecutionOrder()}}
>  failed with {{ConcurrentModificationException}}:
> {noformat}
> [ERROR] 
> testPriorityExecutionOrder(org.apache.oozie.service.TestCallableQueueService) 
> Time elapsed: 20.999 s <<< ERROR!
>  java.util.ConcurrentModificationException
>  at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>  ...
>  at java.util.Collections.min(Collections.java:599)
>  at 
> org.apache.oozie.service.TestCallableQueueService.testPriorityExecutionOrder(TestCallableQueueService.java:993)
>  ...
>  {noformat}




[jira] [Commented] (OOZIE-3352) TestCallableQueueService#testPriorityExecutionOrder() is failing with ConcurrentModification

2018-09-24 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625713#comment-16625713
 ] 

Peter Bacsko commented on OOZIE-3352:
-

This is probably another test which should be a pure unit test. Besides, I'm 
not 100% convinced that we need it. I'll think about this. 

> TestCallableQueueService#testPriorityExecutionOrder() is failing with 
> ConcurrentModification
> 
>
> Key: OOZIE-3352
> URL: https://issues.apache.org/jira/browse/OOZIE-3352
> Project: Oozie
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: trunk
>Reporter: Julia Kinga Marton
>Priority: Major
>
> During the last few precommit runs, 
> {{org.apache.oozie.service.TestCallableQueueService#testPriorityExecutionOrder()}}
>  failed with {{ConcurrentModificationException}}:
> {noformat}
> [ERROR] 
> testPriorityExecutionOrder(org.apache.oozie.service.TestCallableQueueService) 
> Time elapsed: 20.999 s <<< ERROR!
>  java.util.ConcurrentModificationException
>  at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>  ...
>  at java.util.Collections.min(Collections.java:599)
>  at 
> org.apache.oozie.service.TestCallableQueueService.testPriorityExecutionOrder(TestCallableQueueService.java:993)
>  ...
>  {noformat}




[jira] [Assigned] (OOZIE-3351) Flaky test TestMemoryLocks#testWriteLockSameThreadNoWait()

2018-09-21 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned OOZIE-3351:
---

Assignee: Peter Bacsko

> Flaky test TestMemoryLocks#testWriteLockSameThreadNoWait()
> --
>
> Key: OOZIE-3351
> URL: https://issues.apache.org/jira/browse/OOZIE-3351
> Project: Oozie
>  Issue Type: Sub-task
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: OOZIE-3351-001.patch
>
>
> The unit test {{TestMemoryLocks#testWriteLockSameThreadNoWait()}} is flaky:
> {noformat}
> 2018-09-21 09:58:36 [INFO] 
> ---
> 2018-09-21 09:58:36 [INFO]  T E S T S
> 2018-09-21 09:58:36 [INFO] 
> ---
> 2018-09-21 09:59:01 [INFO] Running org.apache.oozie.lock.TestMemoryLocks
> 2018-09-21 09:59:01 [ERROR] Tests run: 12, Failures: 1, Errors: 0, Skipped: 
> 0, Time elapsed: 24.06 s <<< FAILURE! - in 
> org.apache.oozie.lock.TestMemoryLocks
> 2018-09-21 09:59:01 [ERROR] 
> testWriteLockSameThreadNoWait(org.apache.oozie.lock.TestMemoryLocks)  Time 
> elapsed: 0.219 s  <<< FAILURE!
> 2018-09-21 09:59:01 junit.framework.ComparisonFailure: expected: a:1-L2 a:1-U1 a:2-N] a:1-U2> but was:
> 2018-09-21 09:59:01   at junit.framework.Assert.assertEquals(Assert.java:100)
> 2018-09-21 09:59:01   at junit.framework.Assert.assertEquals(Assert.java:107)
> 2018-09-21 09:59:01   at 
> junit.framework.TestCase.assertEquals(TestCase.java:269)
> 2018-09-21 09:59:01   at 
> org.apache.oozie.lock.TestMemoryLocks.testWriteLockSameThreadNoWait(TestMemoryLocks.java:301)
> ...
> {noformat}




[jira] [Updated] (OOZIE-3351) Flaky test TestMemoryLocks#testWriteLockSameThreadNoWait()

2018-09-21 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3351:

Attachment: OOZIE-3351-001.patch

> Flaky test TestMemoryLocks#testWriteLockSameThreadNoWait()
> --
>
> Key: OOZIE-3351
> URL: https://issues.apache.org/jira/browse/OOZIE-3351
> Project: Oozie
>  Issue Type: Sub-task
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Priority: Major
> Attachments: OOZIE-3351-001.patch
>
>
> The unit test {{TestMemoryLocks#testWriteLockSameThreadNoWait()}} is flaky:
> {noformat}
> 2018-09-21 09:58:36 [INFO] 
> ---
> 2018-09-21 09:58:36 [INFO]  T E S T S
> 2018-09-21 09:58:36 [INFO] 
> ---
> 2018-09-21 09:59:01 [INFO] Running org.apache.oozie.lock.TestMemoryLocks
> 2018-09-21 09:59:01 [ERROR] Tests run: 12, Failures: 1, Errors: 0, Skipped: 
> 0, Time elapsed: 24.06 s <<< FAILURE! - in 
> org.apache.oozie.lock.TestMemoryLocks
> 2018-09-21 09:59:01 [ERROR] 
> testWriteLockSameThreadNoWait(org.apache.oozie.lock.TestMemoryLocks)  Time 
> elapsed: 0.219 s  <<< FAILURE!
> 2018-09-21 09:59:01 junit.framework.ComparisonFailure: expected: a:1-L2 a:1-U1 a:2-N] a:1-U2> but was:
> 2018-09-21 09:59:01   at junit.framework.Assert.assertEquals(Assert.java:100)
> 2018-09-21 09:59:01   at junit.framework.Assert.assertEquals(Assert.java:107)
> 2018-09-21 09:59:01   at 
> junit.framework.TestCase.assertEquals(TestCase.java:269)
> 2018-09-21 09:59:01   at 
> org.apache.oozie.lock.TestMemoryLocks.testWriteLockSameThreadNoWait(TestMemoryLocks.java:301)
> ...
> {noformat}




[jira] [Commented] (OOZIE-3351) Flaky test TestMemoryLocks#testWriteLockSameThreadNoWait()

2018-09-21 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623506#comment-16623506
 ] 

Peter Bacsko commented on OOZIE-3351:
-

This must be an ordering problem in {{SameThreadWriteLocker}}:


{code:java}
if (token != null) {
  coordinator.lockAcquireDone();

  log.info("Got lock [{0}]", nameIndex);
  sb.append(nameIndex + "-L1 ");
  if (token2 != null) {
    sb.append(nameIndex + "-L2 ");
  }
  sb.append(nameIndex + "-U1 ");
...{code}

We should call {{lockAcquireDone()}} only after updating the StringBuffer, just 
like in the abstract {{Locker}} class:


{code:java}
LockToken token = getLock();
if (token != null) {
  log.info("Got lock [{0}]", nameIndex);
  sb.append(nameIndex + "-L ");

  coordinator.lockAcquireDone();{code}
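The difference between the two orderings can be shown with a small, self-contained sketch (hypothetical names; a CountDownLatch stands in for the test's coordinator): when the locker thread appends to the shared StringBuffer before signaling, the main thread's later append is guaranteed to come after it. With the two calls swapped, as in SameThreadWriteLocker, the main thread may append first and the recorded sequence becomes timing-dependent.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: record the event first, then signal the waiting thread.
class SignalAfterRecord {
    static String run() {
        StringBuffer sb = new StringBuffer();
        CountDownLatch lockAcquired = new CountDownLatch(1);
        Thread locker = new Thread(() -> {
            sb.append("a:1-L ");       // record first...
            lockAcquired.countDown();  // ...then let the other thread continue
        });
        locker.start();
        try {
            lockAcquired.await();      // the append above happens-before this point
            sb.append("a:2-N ");
            locker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sb.toString().trim();
    }
}
```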

> Flaky test TestMemoryLocks#testWriteLockSameThreadNoWait()
> --
>
> Key: OOZIE-3351
> URL: https://issues.apache.org/jira/browse/OOZIE-3351
> Project: Oozie
>  Issue Type: Sub-task
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Priority: Major
>
> The unit test {{TestMemoryLocks#testWriteLockSameThreadNoWait()}} is flaky:
> {noformat}
> 2018-09-21 09:58:36 [INFO] 
> ---
> 2018-09-21 09:58:36 [INFO]  T E S T S
> 2018-09-21 09:58:36 [INFO] 
> ---
> 2018-09-21 09:59:01 [INFO] Running org.apache.oozie.lock.TestMemoryLocks
> 2018-09-21 09:59:01 [ERROR] Tests run: 12, Failures: 1, Errors: 0, Skipped: 
> 0, Time elapsed: 24.06 s <<< FAILURE! - in 
> org.apache.oozie.lock.TestMemoryLocks
> 2018-09-21 09:59:01 [ERROR] 
> testWriteLockSameThreadNoWait(org.apache.oozie.lock.TestMemoryLocks)  Time 
> elapsed: 0.219 s  <<< FAILURE!
> 2018-09-21 09:59:01 junit.framework.ComparisonFailure: expected: a:1-L2 a:1-U1 a:2-N] a:1-U2> but was:
> 2018-09-21 09:59:01   at junit.framework.Assert.assertEquals(Assert.java:100)
> 2018-09-21 09:59:01   at junit.framework.Assert.assertEquals(Assert.java:107)
> 2018-09-21 09:59:01   at 
> junit.framework.TestCase.assertEquals(TestCase.java:269)
> 2018-09-21 09:59:01   at 
> org.apache.oozie.lock.TestMemoryLocks.testWriteLockSameThreadNoWait(TestMemoryLocks.java:301)
> ...
> {noformat}




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-21 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623360#comment-16623360
 ] 

Peter Bacsko commented on OOZIE-3160:
-

Regarding {{TestCallableQueueService#testQueueSizeWhenMaxConcurrencyIsReached()}}: 
yes, I understand why it's flaky. It's difficult to make this test fully 
stable.

Basically we want to verify that after submitting 1 elements, the queue 
size is close to this number. By the time we read the size, some elements 
have already been removed, depending on how fast the machine is. It's not 
obvious how to choose a good threshold; >9000 looks too high. Maybe 
something like >5000? 

I can think of a better approach: use barriers or latches in the XCallables and 
let them run only after we have examined the queue size. This would give us a 
stable result of queue size == 1.
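The latch idea can be sketched with a plain java.util.concurrent.ThreadPoolExecutor standing in for Oozie's CallableQueueService and Runnables standing in for XCallables (both substitutions are assumptions; the class name is hypothetical). Every task blocks on a gate, so the workers cannot drain the queue while its size is being read, and the result no longer depends on machine speed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: hold every task on a latch so the queue cannot drain
// while its size is being checked.
class LatchGatedQueueSize {
    static int queuedWhileGated(int threads, int tasks) {
        CountDownLatch gate = new CountDownLatch(1);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        try {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    try {
                        gate.await(); // each running task blocks here until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The first `threads` tasks were handed straight to new workers and are
            // parked on the gate, so all remaining tasks are still in the queue.
            return executor.getQueue().size();
        } finally {
            gate.countDown();
            executor.shutdown();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

With this setup, queuedWhileGated(4, 100) returns 96 on every run, so the assertion can be an exact equality instead of a speed-dependent threshold.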

 

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: trunk
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 5.1.0
>
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-006.patch, 
> OOZIE-3160-007.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, OOZIE-3160.amend.001.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.
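The usual alternative to the sleep-and-poll loop above is to park the consumer on a lock {{Condition}} and signal it from {{put()}}. A minimal sketch, assuming a simple unbounded priority queue; the class name {{BlockingTakeQueue}} is hypothetical, and the sketch leaves out the delay and per-priority handling of Oozie's real {{PriorityDelayQueue}}, so it is not the OOZIE-3160 patch. (For the plain case, {{java.util.concurrent.PriorityBlockingQueue}} already provides a blocking {{take()}}.)

```java
import java.util.PriorityQueue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical priority queue whose take() blocks on a Condition instead of polling. */
public class BlockingTakeQueue<E extends Comparable<E>> {
    private final PriorityQueue<E> heap = new PriorityQueue<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();

    public void put(E element) {
        lock.lock();
        try {
            heap.add(element);
            notEmpty.signal(); // wake one consumer parked in take(), if any
        } finally {
            lock.unlock();
        }
    }

    public E take() throws InterruptedException {
        lock.lockInterruptibly();
        try {
            // The loop re-checks after spurious wake-ups. While waiting, the thread
            // is parked and burns no CPU, unlike Thread.sleep(10) + poll().
            while (heap.isEmpty()) {
                notEmpty.await();
            }
            return heap.poll(); // smallest element first
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingTakeQueue<Integer> queue = new BlockingTakeQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                System.out.println("took " + queue.take()); // waits for put() below
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        Thread.sleep(100); // give the consumer time to park in await()
        queue.put(42);
        consumer.join();
    }
}
```

Using {{signal()}} rather than {{signalAll()}} is enough here because each {{put()}} adds exactly one element, so waking one waiter per insertion is sufficient.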




[jira] [Commented] (OOZIE-3349) Test cases in oozie fail with java.net.ConnectException

2018-09-20 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621826#comment-16621826
 ] 

Peter Bacsko commented on OOZIE-3349:
-

Alright, thanks.

> Test cases in oozie fail with java.net.ConnectException
> ---
>
> Key: OOZIE-3349
> URL: https://issues.apache.org/jira/browse/OOZIE-3349
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: trunk
>Reporter: Alisha Prabhu
>Assignee: Andras Piros
>Priority: Major
>  Labels: ppc64le, x86_64
> Fix For: 5.1.0
>
>
> Maven command used: mvn test -fn
> Error :
> {code:java}
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor232.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
>   at 
> {code}




[jira] [Comment Edited] (OOZIE-3349) Test cases in oozie fail with java.net.ConnectException

2018-09-20 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621713#comment-16621713
 ] 

Peter Bacsko edited comment on OOZIE-3349 at 9/20/18 9:18 AM:
--

[~andras.piros] but we still have to figure out why {{false}} causes test 
problems. Or will you create a new JIRA for that?


was (Author: pbacsko):
[~andras.piros] but we still have to figure out why false causes test problems. 
Or will you create a new JIRA for that?

> Test cases in oozie fail with java.net.ConnectException
> ---
>
> Key: OOZIE-3349
> URL: https://issues.apache.org/jira/browse/OOZIE-3349
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: trunk
>Reporter: Alisha Prabhu
>Assignee: Andras Piros
>Priority: Major
>  Labels: ppc64le, x86_64
> Fix For: 5.1.0
>
>
> Maven command used: mvn test -fn
> Error :
> {code:java}
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor232.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
>   at 
> {code}




[jira] [Commented] (OOZIE-3349) Test cases in oozie fail with java.net.ConnectException

2018-09-20 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621713#comment-16621713
 ] 

Peter Bacsko commented on OOZIE-3349:
-

[~andras.piros] but we still have to figure out why false causes test problems. 
Or will you create a new JIRA for that?

> Test cases in oozie fail with java.net.ConnectException
> ---
>
> Key: OOZIE-3349
> URL: https://issues.apache.org/jira/browse/OOZIE-3349
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: trunk
>Reporter: Alisha Prabhu
>Assignee: Andras Piros
>Priority: Major
>  Labels: ppc64le, x86_64
> Fix For: 5.1.0
>
>
> Maven command used: mvn test -fn
> Error :
> {code:java}
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor232.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
>   at 
> {code}




[jira] [Comment Edited] (OOZIE-3349) Test cases in oozie fail with java.net.ConnectException

2018-09-19 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621185#comment-16621185
 ] 

Peter Bacsko edited comment on OOZIE-3349 at 9/19/18 8:49 PM:
--

I can't fathom how that commit could possibly cause such an error, but if it's 
consistently reproducible and not trivial to find the root cause, we can revert 
it and then study the problem without affecting trunk builds.

I would check what happens if 
{{oozie.service.CallableQueueService.queue.oldImpl}} is set to true. 


was (Author: pbacsko):
I can't fathom how that commit could possibly cause such an error, but if it's 
consistently reproducible and not trivial to find the root cause, we can roll 
it back and then see what goes wrong.

I would check what happens if 
{{oozie.service.CallableQueueService.queue.oldImpl}} is set to true. 

> Test cases in oozie fail with java.net.ConnectException
> ---
>
> Key: OOZIE-3349
> URL: https://issues.apache.org/jira/browse/OOZIE-3349
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: trunk
>Reporter: Alisha Prabhu
>Assignee: Andras Piros
>Priority: Major
>  Labels: ppc64le, x86_64
> Fix For: 5.1.0
>
>
> Maven command used: mvn test -fn
> Error :
> {code:java}
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor232.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
>   at 
> {code}




[jira] [Commented] (OOZIE-3349) Test cases in oozie fail with java.net.ConnectException

2018-09-19 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621185#comment-16621185
 ] 

Peter Bacsko commented on OOZIE-3349:
-

I can't fathom how that commit could possibly cause such an error, but if it's 
consistently reproducible and not trivial to find the root cause, we can roll 
it back and then see what goes wrong.

I would check what happens if 
{{oozie.service.CallableQueueService.queue.oldImpl}} is set to true. 

> Test cases in oozie fail with java.net.ConnectException
> ---
>
> Key: OOZIE-3349
> URL: https://issues.apache.org/jira/browse/OOZIE-3349
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: trunk
>Reporter: Alisha Prabhu
>Assignee: Andras Piros
>Priority: Major
>  Labels: ppc64le, x86_64
> Fix For: 5.1.0
>
>
> Maven command used: mvn test -fn
> Error :
> {code:java}
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor232.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy33.getFileInfo(Unknown Source)
>   at 
> {code}




[jira] [Commented] (OOZIE-3298) OYA: external ID is not filled properly and failing MR job is treated as SUCCEEDED

2018-09-10 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609366#comment-16609366
 ] 

Peter Bacsko commented on OOZIE-3298:
-

+1 from me; just take care of that Findbugs error.

> OYA: external ID is not filled properly and failing MR job is treated as 
> SUCCEEDED
> --
>
> Key: OOZIE-3298
> URL: https://issues.apache.org/jira/browse/OOZIE-3298
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: Peter Bacsko
>Assignee: Andras Piros
>Priority: Blocker
> Fix For: 5.1.0
>
> Attachments: OOZIE-3298.001.patch, OOZIE-3298.002.patch, 
> OOZIE-3298.003.patch, OOZIE-3298.004.patch, OOZIE-3298.005.patch, 
> OOZIE-3298.007.patch, OOZIE-3298.008.patch, OOZIE-3298.009.patch, 
> OOZIE-3298.010.patch, OOZIE-3298.011.patch, OOZIE-3298.012.patch, 
> OOZIE-3298.013.patch
>
>
> When a MapReduce action is launched from Oozie (OYA), we don't properly fill 
> the external ID field. It gets populated with the YARN id of the LauncherAM, 
> not with the id of the actual MR job. If the MR job is successfully submitted 
> but then fails, it will be treated as a successfully executed action, which is 
> very misleading and can potentially confuse Oozie users.




[jira] [Updated] (OOZIE-3307) Limit heap usage of LauncherAM

2018-09-07 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3307:

Component/s: core

> Limit heap usage of LauncherAM
> --
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>  Components: core
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Assignee: Andras Piros
>Priority: Critical
> Fix For: 5.1.0
>
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-06 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605731#comment-16605731
 ] 

Peter Bacsko commented on OOZIE-3160:
-

The stability test on a 4-node cluster has passed. I think we're good to go 
ahead and commit this.

Thanks everyone for the review, committed to master!

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-006.patch, 
> OOZIE-3160-007.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-05 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604272#comment-16604272
 ] 

Peter Bacsko commented on OOZIE-3160:
-

Uploaded patch v7 because I accidentally removed some Findbugs annotations.

[~andras.piros] there's a stability test running on a 4-node cluster. Let's 
hope it will pass.

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-006.patch, 
> OOZIE-3160-007.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-05 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-007.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-006.patch, 
> OOZIE-3160-007.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-04 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-006.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-006.patch, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, 
> PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.




[jira] [Commented] (OOZIE-2877) Oozie Git Action

2018-09-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603000#comment-16603000
 ] 

Peter Bacsko commented on OOZIE-2877:
-

+1 for the latest patch.

I think it can be committed to master in its current state.

> Oozie Git Action
> 
>
> Key: OOZIE-2877
> URL: https://issues.apache.org/jira/browse/OOZIE-2877
> Project: Oozie
>  Issue Type: Sub-task
>  Components: action
>Affects Versions: 5.0.0
>Reporter: Clay B.
>Assignee: Clay B.
>Priority: Major
>  Labels: action
> Fix For: 5.1.0
>
> Attachments: 0001-OOZIE-2877-Oozie-Git-Action.patch, 
> 0002-OOZIE-2877-Oozie-Git-Action.patch, 
> 0003-OOZIE-2877-Oozie-Git-Action.patch, 
> 0004-OOZIE-2877-Oozie-Git-Action.patch, 
> 0005-OOZIE-2877-Oozie-Git-Action.patch, 
> 0006-OOZIE-2877-Oozie-Git-Action.patch, 
> 0007-OOZIE-2877-Oozie-Git-Action.patch, 
> 0008-OOZIE-2877-Oozie-Git-Action.patch, 
> 0009-OOZIE-2877-Oozie-Git-Action.patch, OOZIE-2877.010.patch, 
> OOZIE-2877.011.patch, OOZIE-2877.012.patch, OOZIE-2877.013-1.patch, 
> OOZIE-2877.013.patch, OOZIE-2877.014-1.patch, OOZIE-2877.014-2.patch, 
> OOZIE-2877.014-3.patch, OOZIE-2877.015.patch, OOZIE-2877.016.patch, 
> OOZIE-2877.017.patch
>
>
> To aid in deploying ASCII artifacts to clusters, let's provide a tie-in for 
> a source-code management system. Git would be my preferred choice.
> Ideally, this could handle a user's key material (e.g. an SSH key) needed to pull 
> down from a secured repository. This would free users from handling their own 
> key staging and clean-up on YARN nodes and only require them to store the key 
> secured in HDFS.




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602944#comment-16602944
 ] 

Peter Bacsko commented on OOZIE-3160:
-

Uploaded patch v5 which contains some minor cleanup.

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-POC01.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, 
> OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-04 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-005.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-005.patch, OOZIE-3160-POC01.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, 
> OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-09-04 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-004.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-004.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU, around 10% on my machine.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> Code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps looping instead of awaiting.




[jira] [Comment Edited] (OOZIE-3307) oozie workflow gets failed throwing error virtual memory limits

2018-09-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602855#comment-16602855
 ] 

Peter Bacsko edited comment on OOZIE-3307 at 9/4/18 10:17 AM:
--

[~andras.piros] could you take care of this?
 # Set {{-Xmx}} to be 80% of the container memory size
 # If it's already defined as a JVM option by the user, let that setting take 
precedence and don't override it (optional: we can still parse it and print a 
warning if the settings can lead to the situation described in this ticket)

Right now we use 2 GB as the default in the container request, which looks like 
a reasonable value to me.
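A minimal sketch of the two rules above, assuming the launcher's JVM options arrive as a single string and the container size is known in MB. {{LauncherHeapOpts}}, {{withDefaultHeap}} and the warning text are hypothetical names for illustration, not Oozie's actual LauncherAM code. With the 2 GB default container it produces {{-Xmx1638m}}:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not Oozie code): derive -Xmx from the container size. */
public class LauncherHeapOpts {
    static final int HEAP_PERCENT = 80;
    // Matches a standalone -Xmx option such as "-Xmx1g", "-Xmx1536m" or "-Xmx2147483648"
    private static final Pattern XMX = Pattern.compile("(?<!\\S)-Xmx(\\d+)([kKmMgG]?)");

    public static String withDefaultHeap(String userJavaOpts, int containerMemoryMb) {
        String opts = userJavaOpts == null ? "" : userJavaOpts.trim();
        long heapMb = (long) containerMemoryMb * HEAP_PERCENT / 100; // integer math: 2048 -> 1638

        long userXmxMb = -1;
        Matcher m = XMX.matcher(opts);
        while (m.find()) { // HotSpot uses the last -Xmx it sees, so keep the last match
            userXmxMb = toMb(Long.parseLong(m.group(1)), m.group(2));
        }
        if (userXmxMb >= 0) {
            // Rule 2: a user-defined -Xmx takes precedence; only warn if it leaves
            // less than (100 - HEAP_PERCENT)% of the container for non-heap memory.
            if (userXmxMb > heapMb) {
                System.err.printf("WARN: -Xmx of %d MB exceeds %d%% of the %d MB container;"
                        + " the container may be killed%n", userXmxMb, HEAP_PERCENT, containerMemoryMb);
            }
            return opts;
        }
        // Rule 1: no user setting, so append -Xmx sized to HEAP_PERCENT of the container.
        String xmx = "-Xmx" + heapMb + "m";
        return opts.isEmpty() ? xmx : opts + " " + xmx;
    }

    private static long toMb(long value, String unit) {
        switch (unit.toLowerCase(Locale.ROOT)) {
            case "g": return value * 1024;
            case "m": return value;
            case "k": return value / 1024;
            default:  return value / (1024 * 1024); // no suffix means bytes
        }
    }

    public static void main(String[] args) {
        System.out.println(withDefaultHeap("-Dfoo=bar", 2048)); // prints -Dfoo=bar -Xmx1638m
        System.out.println(withDefaultHeap("-Xmx1g", 2048));    // prints -Xmx1g
    }
}
```

Appending rather than prepending {{-Xmx}} is harmless here because the method only appends when no user value exists; the warning path covers the optional "parse it and print a warning" idea.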


was (Author: pbacsko):
[~andras.piros] could you take care of this?
 # Set {{-Xmx}} to be 80% of the container memory size
 # If it's already defined in java-opts, let that setting take precedence

Right now we use 2 GB as the default in the container request, which looks like 
a reasonable value to me.

> oozie workflow gets failed throwing error virtual memory limits
> ---
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Assignee: Andras Piros
>Priority: Critical
> Fix For: 5.1.0
>
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Comment Edited] (OOZIE-3307) oozie workflow gets failed throwing error virtual memory limits

2018-09-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602855#comment-16602855
 ] 

Peter Bacsko edited comment on OOZIE-3307 at 9/4/18 10:16 AM:
--

[~andras.piros] could you take care of this?
 # Set {{-Xmx}} to be 80% of the container memory size
 # If it's already defined in java-opts, let that setting take precedence

Right now we use 2 GB as the default in the container request, which looks like 
a reasonable value to me.


was (Author: pbacsko):
[~andras.piros] could you take care of this?
 # Set {{-Xmx}} to be 80% of the container memory size
 # If it's already defined in java-opts, let that setting take precedence

Right now we use 2 GB as the default in the container request, which to me is a 
reasonable value.

> oozie workflow gets failed throwing error virtual memory limits
> ---
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Assignee: Andras Piros
>Priority: Critical
> Fix For: 5.1.0
>
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Commented] (OOZIE-3307) oozie workflow gets failed throwing error virtual memory limits

2018-09-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602855#comment-16602855
 ] 

Peter Bacsko commented on OOZIE-3307:
-

[~andras.piros] could you take care of this?
 # Set {{-Xmx}} to 80% of the container memory size
 # If it's already defined in the java-opts, let that setting take precedence

Right now we use 2GB as the default in the container request; to me this is a 
reasonable value.

> oozie workflow gets failed throwing error virtual memory limits
> ---
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Assignee: Andras Piros
>Priority: Critical
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Assigned] (OOZIE-3307) oozie workflow gets failed throwing error virtual memory limits

2018-09-04 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned OOZIE-3307:
---

Assignee: Andras Piros

> oozie workflow gets failed throwing error virtual memory limits
> ---
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Assignee: Andras Piros
>Priority: Critical
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Comment Edited] (OOZIE-3307) oozie workflow gets failed throwing error virtual memory limits

2018-09-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602847#comment-16602847
 ] 

Peter Bacsko edited comment on OOZIE-3307 at 9/4/18 10:07 AM:
--

That's right, we do set {{-Xmx}} automatically in MR-based Oozie: 
[https://github.com/apache/oozie/blob/branch-4.3/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java#L382-L415]
 

It's missing from the Oozie-on-YARN implementation. Perhaps, based on the 
resource request, we could limit {{-Xmx}} to 80-90% of the requested memory 
to make sure that an OOME happens before YARN kills the container.


was (Author: pbacsko):
That's right, we do set {{-Xmx}} automatically in MR-based Oozie: 
[https://github.com/apache/oozie/blob/branch-4.3/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java#L382-L415]

 

It's missing from the Oozie-on-YARN implementation. Perhaps, based on the 
resource request, we could limit {{-Xmx}} to 80-90% of the requested memory 
to make sure that an OOME happens before YARN kills the container.

> oozie workflow gets failed throwing error virtual memory limits
> ---
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Priority: Critical
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Commented] (OOZIE-3307) oozie workflow gets failed throwing error virtual memory limits

2018-09-04 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602847#comment-16602847
 ] 

Peter Bacsko commented on OOZIE-3307:
-

That's right, we do set {{-Xmx}} automatically in MR-based Oozie: 
[https://github.com/apache/oozie/blob/branch-4.3/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java#L382-L415]

 

It's missing from the Oozie-on-YARN implementation. Perhaps, based on the 
resource request, we could limit {{-Xmx}} to 80-90% of the requested memory 
to make sure that an OOME happens before YARN kills the container.

> oozie workflow gets failed throwing error virtual memory limits
> ---
>
> Key: OOZIE-3307
> URL: https://issues.apache.org/jira/browse/OOZIE-3307
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.0.0
>Reporter: Sabir Naikwadi
>Priority: Critical
>
> Application application_1531909575787_0039 failed 2 times due to AM Container 
> for appattempt_1531909575787_0039_02 exited with exitCode: -103
>  Failing this attempt.Diagnostics: Container 
> [pid=11516,containerID=container_1531909575787_0039_02_01] is running 
> beyond virtual memory limits. Current usage: 469.8 MB of 2 GB physical memory 
> used; 10.0 GB of 10 GB virtual memory used. Killing container.
>  Dump of the process-tree for container_1531909575787_0039_02_01 :
> | - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE|
> | - 11516 11514 11516 11516 (bash) 1 3 115863552 682 /bin/bash -c 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM 
> 1>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stdout
>  
> 2>/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01/stderr|
> | - 11755 11516 11516 11516 (java) 1142 71 10658242560 119576 
> /usr/lib/jvm/java-openjdk/bin/java 
> -Dlog4j.configuration=container-log4j.properties -Dlog4j.debug=true 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1531909575787_0039/container_1531909575787_0039_02_01
>  -Dyarn.app.container.log.filesize=1048576 -Dhadoop.root.logger=INFO,CLA 
> -Dhadoop.root.logfile=syslog -Dsubmitter.user=dev 
> org.apache.oozie.action.hadoop.LauncherAM
>  Container killed on request. Exit code is 143
>  Container exited with a non-zero exit code 143
>  For more detailed output, check the application tracking page: 
> [http://ip-10-20-201-36.us-gov-west-1.compute.internal:8088/cluster/app/application_1531909575787_0039]
>  Then click on links to logs of each attempt.
>  . Failing the application.|




[jira] [Comment Edited] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-08-17 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583753#comment-16583753
 ] 

Peter Bacsko edited comment on OOZIE-3160 at 8/17/18 10:44 AM:
---

[~rohini] [~puru] [~satishsaley] - do you guys have some time to take a look at 
this patch? It's a pretty substantial change (which is why the original 
code is kept, so users can switch back if they experience any issues). I really 
should have involved you earlier, sorry about that - but the code is not on 
master yet.


was (Author: pbacsko):
[~rohini] [~puru] [~satishsaley] - do you guys have some time to take a look at 
this patch? It's a pretty substantial change (which is why the original 
code is kept, so users can switch back if they experience any issues).

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, 
> PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.
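The busy-wait in the quoted take() can be replaced by a blocking wait. A minimal sketch, assuming a simplified queue: `BlockingPriorityQueue` is a hypothetical class, not Oozie's PriorityDelayQueue (which also handles delays). Here take() parks on a `Condition` that put() signals, so a waiting consumer uses no CPU instead of waking up every 10 ms to poll.

```java
import java.util.PriorityQueue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical simplified queue: take() blocks on a Condition instead of sleep-and-poll. */
class BlockingPriorityQueue<E extends Comparable<E>> {
    private final PriorityQueue<E> queue = new PriorityQueue<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();

    /** Adds an element and wakes up one consumer waiting in take(). */
    public void put(E e) {
        lock.lock();
        try {
            queue.add(e);
            notEmpty.signal();
        } finally {
            lock.unlock();
        }
    }

    /** Non-blocking: returns the head, or null if the queue is empty. */
    public E poll() {
        lock.lock();
        try {
            return queue.poll();
        } finally {
            lock.unlock();
        }
    }

    /** Blocks until an element is available; no CPU is used while waiting. */
    public E take() throws InterruptedException {
        lock.lockInterruptibly();
        try {
            while (queue.isEmpty()) {
                notEmpty.await(); // releases the lock and parks until signalled
            }
            return queue.poll();
        } finally {
            lock.unlock();
        }
    }
}
```

The `while` loop around `await()` guards against spurious wakeups. The JDK's own `PriorityBlockingQueue` and `DelayQueue` use the same lock-and-condition pattern; for delayed elements, `DelayQueue` waits with `awaitNanos` until the head's delay expires rather than polling.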




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-08-17 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583753#comment-16583753
 ] 

Peter Bacsko commented on OOZIE-3160:
-

[~rohini] [~puru] [~satishsaley] - do you guys have some time to take a look at 
this patch? It's a pretty substantial change (which is why the original 
code is kept, so users can switch back if they experience any issues).

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, 
> PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-08-17 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-003.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-003.patch, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, 
> PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-08-16 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-002.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-002.patch, OOZIE-3160-POC01.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, 
> OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.




[jira] [Commented] (OOZIE-3324) Cannot compile with findbugs check

2018-08-15 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581022#comment-16581022
 ] 

Peter Bacsko commented on OOZIE-3324:
-

+1

Let's ignore the missing JIRA comment for now - let's hope it'll get fixed 
after moving findbugs-filter.xml.

Thanks [~asalamon74], committed to master.

> Cannot compile with findbugs check
> --
>
> Key: OOZIE-3324
> URL: https://issues.apache.org/jira/browse/OOZIE-3324
> Project: Oozie
>  Issue Type: Bug
>Affects Versions: 5.1.0
>Reporter: Andras Salamon
>Assignee: Andras Salamon
>Priority: Critical
> Attachments: OOZIE-3324-1.patch
>
>
> Latest snapshot compilation fails because of missing findbugs-filter.xml file:
> {noformat}
> $ mvn clean install -DskipTests
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 11.852 s
> [INFO] Finished at: 2018-08-09T09:11:41+02:00
> [INFO] 
> 
> [ERROR] Could not find resource 
> '/Users/andrassalamon/src/oozie/fluent-job/fluent-job-api/findbugs-filter.xml'.
>  -> [Help 1]{noformat}




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-08-15 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-001.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-001.patch, OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC02.patch, OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, 
> OOZIE-3160-POC05.patch, PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.




[jira] [Commented] (OOZIE-2877) Oozie Git Action

2018-08-07 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571592#comment-16571592
 ] 

Peter Bacsko commented on OOZIE-2877:
-

I added some comments to patch 14.3. It now looks much better and is more 
readable, so good job [~clayb].

> Oozie Git Action
> 
>
> Key: OOZIE-2877
> URL: https://issues.apache.org/jira/browse/OOZIE-2877
> Project: Oozie
>  Issue Type: Sub-task
>  Components: action
>Reporter: Clay B.
>Assignee: Clay B.
>Priority: Major
>  Labels: action
> Fix For: trunk
>
> Attachments: 0001-OOZIE-2877-Oozie-Git-Action.patch, 
> 0002-OOZIE-2877-Oozie-Git-Action.patch, 
> 0003-OOZIE-2877-Oozie-Git-Action.patch, 
> 0004-OOZIE-2877-Oozie-Git-Action.patch, 
> 0005-OOZIE-2877-Oozie-Git-Action.patch, 
> 0006-OOZIE-2877-Oozie-Git-Action.patch, 
> 0007-OOZIE-2877-Oozie-Git-Action.patch, 
> 0008-OOZIE-2877-Oozie-Git-Action.patch, 
> 0009-OOZIE-2877-Oozie-Git-Action.patch, OOZIE-2877.010.patch, 
> OOZIE-2877.011.patch, OOZIE-2877.012.patch, OOZIE-2877.013-1.patch, 
> OOZIE-2877.013.patch, OOZIE-2877.014-1.patch, OOZIE-2877.014-2.patch, 
> OOZIE-2877.014-3.patch
>
>
> To aid in deploying ASCII artifacts to clusters, let's provide a tie-in for 
> a source-code management system. Git would be my preferred choice.
> Ideally, this could handle a user's key material e.g. for an ssh key to pull 
> down from a secured repository. This would free users from handling their own 
> key staging and clean-up on YARN nodes and only require them to store the key 
> secured in HDFS.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-08-06 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-POC05.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, OOZIE-3160-POC05.patch, 
> PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.




[jira] [Created] (OOZIE-3314) Remove findbugs-filter.xml and convert its contents to annotations

2018-08-02 Thread Peter Bacsko (JIRA)

Peter Bacsko created OOZIE-3314:
---

 Summary: Remove findbugs-filter.xml and convert its contents to 
annotations
 Key: OOZIE-3314
 URL: https://issues.apache.org/jira/browse/OOZIE-3314
 Project: Oozie
  Issue Type: Bug
  Components: core
Reporter: Peter Bacsko
Assignee: Andras Salamon


In oozie-core, we have a file called "findbugs-filter.xml" which tells findbugs 
that it should ignore certain problems in a couple of classes.

However, if we try to compile a sub-module or run findbugs directly (let's say 
in a sharelib project), the build will fail because it won't be able to open 
findbugs-filter.xml. It's not straightforward to define the path of the XML so 
that it is accessible regardless of which module you're compiling.

It's better to just convert its contents to annotations - especially since we 
already use that approach elsewhere.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-07-31 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-POC04.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, OOZIE-3160-POC04.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-07-12 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541944#comment-16541944
 ] 

Peter Bacsko commented on OOZIE-3160:
-

Added a queue dump implementation and fixed a problem that appeared during 
testConcurrencyReachedAndChooseNextEligible.

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>
> The Oozie process always consumes a lot of CPU; on my machine, around 10%.
> I checked the source code and found the take() method in the PriorityDelayQueue class.
> code:
> {code:java}
> public QueueElement take() throws InterruptedException {
>     QueueElement e = poll();
>     while (e == null) {
>         Thread.sleep(10);
>         e = poll();
>     }
>     return e;
> }
> {code}
> I think this is the cause of the problem: it keeps spinning in a while loop instead of awaiting.




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-07-12 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-POC03.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> OOZIE-3160-POC03.patch, PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>




[jira] [Updated] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-07-11 Thread Peter Bacsko (JIRA)



 [ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-3160:

Attachment: OOZIE-3160-POC02.patch

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, OOZIE-3160-POC02.patch, 
> PriorityDelayQueue improvement - OOZIE-3160.pdf
>
>




[jira] [Commented] (OOZIE-3160) PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting

2018-07-11 Thread Peter Bacsko (JIRA)



[ 
https://issues.apache.org/jira/browse/OOZIE-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540122#comment-16540122
 ] 

Peter Bacsko commented on OOZIE-3160:
-

[~rkanter] [~andras.piros] please add your comments to the POC.

Who else should we include? Rohini might be interested, I guess.

> PriorityDelayQueue put()/take() can cause significant CPU load due to busy 
> waiting
> --
>
> Key: OOZIE-3160
> URL: https://issues.apache.org/jira/browse/OOZIE-3160
> Project: Oozie
>  Issue Type: Bug
>  Components: core
> Environment: all platforms
>Reporter: jj
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: 11.png, 22.png, 
> OOZIE-3160-POC01.patch, OOZIE-3160-POC02.patch, PriorityDelayQueue 
> improvement - OOZIE-3160.pdf
>
>



