[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2015-10-20 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966019#comment-14966019
 ] 

Mona Chitnis commented on OOZIE-1976:
-

Thanks [~puru] for your patch. I did a first pass as well and have a few 
comments. Waiting for your replies.

> Specifying coordinator input datasets in more logical ways
> --
>
> Key: OOZIE-1976
> URL: https://issues.apache.org/jira/browse/OOZIE-1976
> Project: Oozie
>  Issue Type: New Feature
>  Components: coordinator
>Affects Versions: trunk
>Reporter: Mona Chitnis
>Assignee: Purshotam Shah
> Fix For: trunk
>
> Attachments: Input-check.docx, OOZIE-1976-WIP.patch, 
> OOZIE-1976-rough-design-2.pdf, OOZIE-1976-rough-design.pdf
>
>
> All dataset instances specified as input to coordinator, currently work on 
> AND logic i.e. ALL of them should be available for workflow to start. We 
> should enhance this to include more logical ways of specifying availability 
> criteria e.g.
>  * OR between instances
>  * minimum N out of K instances
>  * delta datasets (process data incrementally)
> Use-cases for this:
>  * Different datasets are BCP, and workflow can run with either, whichever 
> arrives earlier.
>  * Data is not guaranteed, and while $coord:latest allows skipping to 
> available ones, workflow will never trigger unless mentioned number of 
> instances are found.
>  * Workflow is like a ‘refining’ algorithm which should run after minimum 
> required datasets are ready, and should only process the delta for efficiency.
> This JIRA is to discuss the design and then the review the implementation for 
> some or all of the above features.
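To make these criteria concrete, such logic might eventually be expressed in the coordinator XML along these lines (purely illustrative sketch; the element and attribute names below are hypothetical, since the actual syntax is exactly what this JIRA is designing):

```xml
<!-- Hypothetical sketch only: element/attribute names are illustrative -->
<input-logic>
  <or>
    <data-in dataset="logs-colo1"/>  <!-- BCP datasets: either one suffices -->
    <data-in dataset="logs-colo2"/>
  </or>
  <data-in dataset="clicks" min="3"/>  <!-- at least 3 of K instances -->
</input-logic>
```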



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2015-04-26 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513118#comment-14513118
 ] 

Mona Chitnis commented on OOZIE-1976:
-

Thanks for taking it up, Jaydeep. I will keep a watch on this JIRA when it's 
ready for review.

 Specifying coordinator input datasets in more logical ways
 --

 Key: OOZIE-1976
 URL: https://issues.apache.org/jira/browse/OOZIE-1976
 Project: Oozie
  Issue Type: New Feature
  Components: coordinator
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Jaydeep Vishwakarma
 Fix For: trunk

 Attachments: OOZIE-1976-WIP.patch, OOZIE-1976-rough-design-2.pdf, 
 OOZIE-1976-rough-design.pdf


 All dataset instances specified as input to coordinator, currently work on 
 AND logic i.e. ALL of them should be available for workflow to start. We 
 should enhance this to include more logical ways of specifying availability 
 criteria e.g.
  * OR between instances
  * minimum N out of K instances
  * delta datasets (process data incrementally)
 Use-cases for this:
  * Different datasets are BCP, and workflow can run with either, whichever 
 arrives earlier.
  * Data is not guaranteed, and while $coord:latest allows skipping to 
 available ones, workflow will never trigger unless mentioned number of 
 instances are found.
  * Workflow is like a ‘refining’ algorithm which should run after minimum 
 required datasets are ready, and should only process the delta for efficiency.
 This JIRA is to discuss the design and then the review the implementation for 
 some or all of the above features.





[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly

2014-11-03 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195508#comment-14195508
 ] 

Mona Chitnis commented on OOZIE-1913:
-

The Review Board revision is fairly up to date except for a couple of unit tests. I 
will be updating those and would then appreciate a review.

 Devise a way to turn off SLA alerts for bundle/coordinator flexibly
 ---

 Key: OOZIE-1913
 URL: https://issues.apache.org/jira/browse/OOZIE-1913
 Project: Oozie
  Issue Type: Improvement
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


 From user:
 Need to turn off the SLA miss alerts in jobs when the bundle is suspended for
 grid upgrades and similar work so that when it's resumed we aren't flooded 
 with a bunch of alerts.





[jira] [Commented] (OOZIE-2034) Disable SSLv3 (POODLEbleed vulnerability)

2014-10-24 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183074#comment-14183074
 ] 

Mona Chitnis commented on OOZIE-2034:
-

Starting to look at this now.

 Disable SSLv3 (POODLEbleed vulnerability)
 -

 Key: OOZIE-2034
 URL: https://issues.apache.org/jira/browse/OOZIE-2034
 Project: Oozie
  Issue Type: Bug
  Components: security
Affects Versions: 4.0.1
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker
 Fix For: 4.1.0

 Attachments: OOZIE-2034.patch, OOZIE-2034.patch


 We should disable SSLv3 to protect against the POODLEbleed vulnerability.
 See 
 [CVE-2014-3566|http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566]
 We have {{sslProtocol=TLS}} set to only allow TLS in ssl-server.xml, but 
 when I checked, I could still connect with SSLv3.  From what I can tell, 
 there's some ambiguity in the tomcat configs between {{sslProtocol}}, 
 {{sslProtocols}}, and {{sslEnabledProtocols}} so we probably have the wrong 
 thing here.
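For reference, the Tomcat-side distinction looks roughly like this in a connector definition (a sketch; exact attribute behavior varies by Tomcat version, and the port and keystore values here are placeholders): {{sslProtocol}} selects the JSSE context, while {{sslEnabledProtocols}} restricts which protocol versions clients may actually negotiate.

```xml
<!-- server.xml connector sketch: values are placeholders -->
<Connector port="11443" scheme="https" secure="true" SSLEnabled="true"
           sslProtocol="TLS"
           sslEnabledProtocols="TLSv1"
           keystoreFile="${catalina.base}/conf/keystore.jks"/>
```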





[jira] [Commented] (OOZIE-2034) Disable SSLv3 (POODLEbleed vulnerability)

2014-10-24 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183087#comment-14183087
 ] 

Mona Chitnis commented on OOZIE-2034:
-

+1. Pretty straightforward. Thanks for checking the bit about support of TLSv1, 
not TLSv1.1. Can you paste your doc references here for the record? 

 Disable SSLv3 (POODLEbleed vulnerability)
 -

 Key: OOZIE-2034
 URL: https://issues.apache.org/jira/browse/OOZIE-2034
 Project: Oozie
  Issue Type: Bug
  Components: security
Affects Versions: 4.0.1
Reporter: Robert Kanter
Assignee: Robert Kanter
Priority: Blocker
 Fix For: 4.1.0

 Attachments: OOZIE-2034.patch, OOZIE-2034.patch


 We should disable SSLv3 to protect against the POODLEbleed vulnerability.
 See 
 [CVE-2014-3566|http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566]
 We have {{sslProtocol=TLS}} set to only allow TLS in ssl-server.xml, but 
 when I checked, I could still connect with SSLv3.  From what I can tell, 
 there's some ambiguity in the tomcat configs between {{sslProtocol}}, 
 {{sslProtocols}}, and {{sslEnabledProtocols}} so we probably have the wrong 
 thing here.





[jira] [Commented] (OOZIE-1954) Add a way for the MapReduce action to be configured by Java code

2014-09-30 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153721#comment-14153721
 ] 

Mona Chitnis commented on OOZIE-1954:
-

Good work Robert!

 Add a way for the MapReduce action to be configured by Java code
 

 Key: OOZIE-1954
 URL: https://issues.apache.org/jira/browse/OOZIE-1954
 Project: Oozie
  Issue Type: New Feature
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Robert Kanter
 Fix For: trunk

 Attachments: OOZIE-1954.patch, OOZIE-1954.patch, OOZIE-1954.patch


 With certain other components (e.g. Avro, HFileOutputFormat (HBase), etc.), it 
 becomes impractical to use the MapReduce action and users must instead use 
 the Java action. The problem is that these components require a lot of extra 
 configuration that is often hidden from the user in Java code (e.g. 
 {{HFileOutputFormat.configureIncrementalLoad(job, table)}}), which can also 
 include decision logic, serialization, and other things that we can't do in 
 an XML file directly.
 One way to solve this problem is to allow the user to give the MR action some 
 Java code that would do this configuration, similar to how we allow the 
 {{job-xml}} field to specify an external XML file of configuration 
 properties.
 In more detail, we could have an interface; something like this:
 {code}
 public interface OozieActionConfigurator {
  public void updateOozieActionConfiguration(Configuration conf);
 }
 {code}
 that the user can implement, create a jar, and include with their MR action 
 (i.e. add a {{config-class}} field that lets them specify the class 
 name). To protect the Oozie server from running user code (which could do 
 anything it wants really), it would have to be run in the Launcher Job. The 
 Launcher Job could call this method after it loads the configuration prepared 
 by the Oozie server.
 This will also help users who use the Java action to launch MR jobs and 
 expect a bunch of things to be done for them that are not 
 (e.g. delegation token propagation, config loading, returning the hadoop job 
 to Oozie, etc.). These are all done by the MR action, so the more users we 
 can move from the Java action to the MR action, the less they'll run into 
 these difficulties.
 Some of this may change slightly as I try to actually implement this (e.g. 
 have to handle throwing exceptions etc).  And one thing I may do is keep this 
 general enough that it should be compatible with all action types in case we 
 want to add this to any of them in the future; though for now, the schema 
 would only accept it for the MapReduce action.
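A user-side implementation of the proposed interface might look like the following sketch (the interface and the {{config-class}} field are the proposal in this issue; the class name and configuration keys below are illustrative assumptions, not real Oozie or Hadoop keys):

```java
// Hypothetical user implementation of the proposed OozieActionConfigurator;
// the "my.app.*" configuration keys are illustrative placeholders.
import org.apache.hadoop.conf.Configuration;

public class MyActionConfigurator implements OozieActionConfigurator {
    @Override
    public void updateOozieActionConfiguration(Configuration conf) {
        // decision logic and serialization that can't live in workflow XML
        if (conf.getBoolean("my.app.compress", false)) {
            conf.set("my.app.codec", "snappy");
        }
        conf.set("my.app.prepared", "true");
    }
}
```

The user would bundle such a class in a jar with the action and reference it via the proposed {{config-class}} field; per the proposal, the Launcher Job, not the Oozie server, would invoke it.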





[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-09-25 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147856#comment-14147856
 ] 

Mona Chitnis commented on OOZIE-1976:
-

Thanks [~rkanter] for the comments. 
* We are thinking of using a serialize/deserialize technique (protobuf is 
one option) to convert the object back and forth. I've created a class 
LogicalDependencySet for this object, which contains either of the subclass objects 
LogicalDependencyAndSet or LogicalDependencyOrSet; the leaf level is 
Dependency, which holds the lists of resolved and unresolved instances. We have yet 
to see what the cost of protobuf serde is here.
* Yes, it is possible to do nested combinations, but we will limit nesting to a depth 
of 2, i.e. both your examples are depth 2 and are the most common cases that we 
should satisfy in the first go. An important thing to note here is that the 
OR case can have two 'strategies':
** 'Combined': In the case of {{A || B}}, instances of A and B can be 
interleaved to give the final combined set of total instances. The 
requirement for this is that the user considers both as equivalent and they have the same 
frequency, initial instance, etc.
** 'Exclusive': In the same case as above, either A is used completely 
or B is used completely. No interleaving.
* Yes, better API output would be to display which OR datasets' instances 
the action is waiting on.
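The class structure described in the first bullet can be sketched as a small composite (based only on the class names mentioned in this comment; the isSatisfied() method and the field names are illustrative assumptions, not the actual patch):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the composite described above. Class names come from the
// comment; isSatisfied() and the fields are illustrative assumptions.
abstract class LogicalDependencySet {
    abstract boolean isSatisfied();
}

// Leaf: one dataset's resolved and unresolved instances.
class Dependency extends LogicalDependencySet {
    final List<String> resolved = new ArrayList<>();
    final List<String> unresolved = new ArrayList<>();
    boolean isSatisfied() { return unresolved.isEmpty(); }
}

// AND-set: every child must be satisfied.
class LogicalDependencyAndSet extends LogicalDependencySet {
    final List<LogicalDependencySet> children = new ArrayList<>();
    boolean isSatisfied() {
        for (LogicalDependencySet c : children) {
            if (!c.isSatisfied()) return false;
        }
        return true;
    }
}

// OR-set: any one satisfied child suffices.
class LogicalDependencyOrSet extends LogicalDependencySet {
    final List<LogicalDependencySet> children = new ArrayList<>();
    boolean isSatisfied() {
        for (LogicalDependencySet c : children) {
            if (c.isSatisfied()) return true;
        }
        return false;
    }
}
```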

 Specifying coordinator input datasets in more logical ways
 --

 Key: OOZIE-1976
 URL: https://issues.apache.org/jira/browse/OOZIE-1976
 Project: Oozie
  Issue Type: New Feature
  Components: coordinator
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1976-WIP.patch, OOZIE-1976-rough-design-2.pdf, 
 OOZIE-1976-rough-design.pdf


 All dataset instances specified as input to coordinator, currently work on 
 AND logic i.e. ALL of them should be available for workflow to start. We 
 should enhance this to include more logical ways of specifying availability 
 criteria e.g.
  * OR between instances
  * minimum N out of K instances
  * delta datasets (process data incrementally)
 Use-cases for this:
  * Different datasets are BCP, and workflow can run with either, whichever 
 arrives earlier.
  * Data is not guaranteed, and while $coord:latest allows skipping to 
 available ones, workflow will never trigger unless mentioned number of 
 instances are found.
  * Workflow is like a ‘refining’ algorithm which should run after minimum 
 required datasets are ready, and should only process the delta for efficiency.
 This JIRA is to discuss the design and then the review the implementation for 
 some or all of the above features.





[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-09-24 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1976:

Attachment: OOZIE-1976-WIP.patch

Attaching the WIP patch for record's sake. I will upload the v1 patch when I have a 
fairly working version ready by tomorrow.

 Specifying coordinator input datasets in more logical ways
 --

 Key: OOZIE-1976
 URL: https://issues.apache.org/jira/browse/OOZIE-1976
 Project: Oozie
  Issue Type: New Feature
  Components: coordinator
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1976-WIP.patch, OOZIE-1976-rough-design-2.pdf, 
 OOZIE-1976-rough-design.pdf


 All dataset instances specified as input to coordinator, currently work on 
 AND logic i.e. ALL of them should be available for workflow to start. We 
 should enhance this to include more logical ways of specifying availability 
 criteria e.g.
  * OR between instances
  * minimum N out of K instances
  * delta datasets (process data incrementally)
 Use-cases for this:
  * Different datasets are BCP, and workflow can run with either, whichever 
 arrives earlier.
  * Data is not guaranteed, and while $coord:latest allows skipping to 
 available ones, workflow will never trigger unless mentioned number of 
 instances are found.
  * Workflow is like a ‘refining’ algorithm which should run after minimum 
 required datasets are ready, and should only process the delta for efficiency.
 This JIRA is to discuss the design and then the review the implementation for 
 some or all of the above features.





[jira] [Commented] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-09-22 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143595#comment-14143595
 ] 

Mona Chitnis commented on OOZIE-1932:
-

{quote}
. -1 the patch does not add/modify any testcase
{quote}
This is a simple config change in oozie-default.xml; there is no applicable 
test case for checking the relative order in which services load.

{quote}
. The patch failed the following testcases:

. 
testBundleStatusNotTransitionFromKilled(org.apache.oozie.service.TestStatusTransitService)
. 
testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService)
{quote}
These test failures are unrelated to my patch. I reran the tests in my local 
env and they pass consistently

Committed patch to trunk and branch-4.1. Thanks Puru for review!

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1932-2.patch, OOZIE-1932-3.patch, 
 OOZIE-1932-4.patch, OOZIE-1932-addendum.patch, OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService 
 destroy sets it to null, and commands are still executing due to 
 CallableQueueService still active, they all encounter NPEs during locking. 
 This is a simple fix in oozie-default.xml to set MemoryLocksService before in 
 the order of services loading.
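The described fix amounts to reordering the entries of the oozie.services property in oozie-default.xml (abbreviated sketch; the real service list is much longer, and all other entries are omitted here):

```xml
<property>
    <name>oozie.services</name>
    <value>
        org.apache.oozie.service.MemoryLocksService,
        org.apache.oozie.service.CallableQueueService
    </value>
</property>
```

Because services are destroyed in reverse order of initialization, listing MemoryLocksService before CallableQueueService means the queue is torn down first, so in-flight commands can still take locks during shutdown.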





[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-09-22 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1932:

Attachment: OOZIE-1932-4-amendment.patch

{code}
2014-09-22 22:02:34,148  INFO ShareLibService:539 [main] - USER[-] GROUP[-]
oozie-hadoop-utils-2.3.0.oozie-4.4.1.1.jar uploaded to
hdfs:/tmp/hdfs_shared_lib_path/launcher_2014090233/oozie
2014-09-22 22:02:34,198  INFO ShareLibService:539 [main] - USER[-] GROUP[-]
oozie-sharelib-hcatalog-4.4.1.1.jar uploaded to
hdfs:/tmp/hdfs_shared_lib_path/launcher_2014090233/oozie
2014-09-22 22:02:34,199 ERROR ShareLibService:536 [main] - USER[-] GROUP[-]
Sharelib initialization fails
java.lang.NullPointerException
    at org.apache.oozie.service.ShareLibService.setupLauncherLibPath(ShareLibService.java:178)
    at org.apache.oozie.service.ShareLibService.updateLauncherLib(ShareLibService.java:158)
    at org.apache.oozie.service.ShareLibService.init(ShareLibService.java:111)
    at org.apache.oozie.service.Services.setServiceInternal(Services.java:368)
{code}

ShareLibService is dependent on ActionService:

{code}
private void setupLauncherLibPath(FileSystem fs, Path tmpLauncherLibPath) throws IOException {
    ActionService actionService = Services.get().get(ActionService.class);
    List<Class> classes = JavaActionExecutor.getCommonLauncherClasses();
    Path baseDir = new Path(tmpLauncherLibPath, JavaActionExecutor.OOZIE_COMMON_LIBDIR);
    copyJarContainingClasses(classes, fs, baseDir, JavaActionExecutor.OOZIE_COMMON_LIBDIR);
    Set<String> actionTypes = actionService.getActionTypes();
{code}

Attaching amendment patch

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1932-2.patch, OOZIE-1932-3.patch, 
 OOZIE-1932-4-amendment.patch, OOZIE-1932-4.patch, OOZIE-1932-addendum.patch, 
 OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService 
 destroy sets it to null, and commands are still executing due to 
 CallableQueueService still active, they all encounter NPEs during locking. 
 This is a simple fix in oozie-default.xml to set MemoryLocksService before in 
 the order of services loading.





[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-09-09 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1932:

Attachment: OOZIE-1932-3.patch

uploaded new patch OOZIE-1932-3.patch. 

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1932-2.patch, OOZIE-1932-3.patch, 
 OOZIE-1932-addendum.patch, OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService 
 destroy sets it to null, and commands are still executing due to 
 CallableQueueService still active, they all encounter NPEs during locking. 
 This is a simple fix in oozie-default.xml to set MemoryLocksService before in 
 the order of services loading.





[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-09-09 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1932:

Attachment: OOZIE-1932-4.patch

addressed Puru's comment

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1932-2.patch, OOZIE-1932-3.patch, 
 OOZIE-1932-4.patch, OOZIE-1932-addendum.patch, OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService 
 destroy sets it to null, and commands are still executing due to 
 CallableQueueService still active, they all encounter NPEs during locking. 
 This is a simple fix in oozie-default.xml to set MemoryLocksService before in 
 the order of services loading.





[jira] [Created] (OOZIE-1984) SLACalculator in HA mode performs duplicate operations on records with completed jobs

2014-08-28 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1984:
---

 Summary: SLACalculator in HA mode performs duplicate operations on 
records with completed jobs
 Key: OOZIE-1984
 URL: https://issues.apache.org/jira/browse/OOZIE-1984
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk, 4.1.0


Scenario:

The SLA periodic run has already processed start, duration, and end for a job's SLA 
entry, but the job notification for that job arrived after this and triggered the SLA 
listener.

Buggy part:
{code}
SLACalculatorMemory.java

else if (Services.get().get(JobsConcurrencyService.class).isHighlyAvailableMode()) {
    // jobid might not exist in slaMap in HA Setting
    SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get(
            SLARegQuery.GET_SLA_REG_ALL, jobId);
    if (slaRegBean != null) { // filter out jobs picked by SLA job event listener
                              // but not actually configured for SLA
        SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(
                SLASummaryQuery.GET_SLA_SUMMARY, jobId);
        slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
        if (slaCalc.getEventProcessed() < 7) {
            slaMap.put(jobId, slaCalc);
        }
    }
}
}
if (slaCalc != null) {
    ..
    Object eventProcObj = ((SLASummaryQueryExecutor) SLASummaryQueryExecutor.getInstance())
            .getSingleValue(SLASummaryQuery.GET_SLA_SUMMARY_EVENTPROCESSED, jobId);
    byte eventProc = ((Byte) eventProcObj).byteValue();
    ..
    processJobEndSuccessSLA(slaCalc, startTime, endTime);
{code}

The method processJobEndSuccessSLA goes ahead and checks the second LSB of eventProc 
and sends the duration event _again_. So the bug here is two-fold:
 * even if all events are already processed, this function is still invoked
 * eventProcessed is 8 (binary 1000), so the second LSB is unset and hence the duration 
event is processed again.

Fix: do not invoke the function when eventProc is already 1000 (binary), i.e. 8.
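The bit arithmetic above can be illustrated with a self-contained sketch (names and the bit layout are inferred from this description: bit 0 = start, bit 1 = duration, bit 2 = end, with 8, binary 1000, marking the record fully processed; this is not the actual Oozie code):

```java
// Illustrative sketch of the eventProcessed bitmask described above;
// names and layout are inferred from the bug description.
public class EventProcDemo {
    static boolean isDurationProcessed(byte eventProc) {
        // the "second LSB" check that resends the duration event when unset
        return ((eventProc >> 1) & 1) == 1;
    }

    static boolean isFullyProcessed(byte eventProc) {
        return eventProc == 8; // binary 1000
    }

    public static void main(String[] args) {
        byte eventProc = 8;
        // bit 1 is unset in binary 1000, so a naive check would resend duration
        System.out.println(isDurationProcessed(eventProc)); // prints "false"
        System.out.println(isFullyProcessed(eventProc));    // prints "true"
    }
}
```

This is exactly the two-fold bug: a record with eventProc 8 is fully processed, yet the duration bit alone reads as "not yet sent", so the guard must check for full processing first.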



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (OOZIE-1984) SLACalculator in HA mode performs duplicate operations on records with completed jobs

2014-08-28 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis resolved OOZIE-1984.
-

Resolution: Fixed

Committed to trunk and 4.1.0. Thanks for the review, Ryota.

 SLACalculator in HA mode performs duplicate operations on records with 
 completed jobs
 -

 Key: OOZIE-1984
 URL: https://issues.apache.org/jira/browse/OOZIE-1984
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk, 4.1.0

 Attachments: OOZIE-1984-1.patch, OOZIE-1984.patch


 Scenario:
 SLA periodic run has already processed start,duration and end for a job's sla 
 entry. But job notification for that job came after this, and triggers the 
 sla listener.
 Buggy part:
 {code}
 SLACalculatorMemory.java
 else if 
 (Services.get().get(JobsConcurrencyService.class).isHighlyAvailableMode()) {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 if (slaRegBean != null) { // filter out jobs picked by SLA 
 job event listener
   // but not actually configured for 
 SLA
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(
 SLASummaryQuery.GET_SLA_SUMMARY, jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 if (slaCalc.getEventProcessed() < 7) {
 slaMap.put(jobId, slaCalc);
 }
 }
 }
 }
 if (slaCalc != null) {
 ..
 Object eventProcObj = ((SLASummaryQueryExecutor) 
 SLASummaryQueryExecutor.getInstance())
 
 .getSingleValue(SLASummaryQuery.GET_SLA_SUMMARY_EVENTPROCESSED, jobId);
 byte eventProc = ((Byte) eventProcObj).byteValue();
 ..
 processJobEndSuccessSLA(slaCalc, startTime, endTime);
 {code}
 method processJobEndSuccesSLA goes ahead and checks second LSB bit of 
 eventProc and sends duration event _again_. So the bug here is two-fold:
  * if all events are already processed, still invokes this function
  * event processed is 8 (1000), so second LSB bit is unset and hence duration 
 processed.
 Fix - not invoke function when eventProc = 1000





[jira] [Commented] (OOZIE-1847) HA - Oozie servers should shutdown (or go in safe mode) in case of ZK failure

2014-08-27 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112331#comment-14112331
 ] 

Mona Chitnis commented on OOZIE-1847:
-

Pretty straightforward patch, and I agree it's needed. But in addition to printing 
in the logs, should we bubble it up to the action error message too? That way the reason 
for a workflow failing can be pulled up from any of the client-facing APIs too 
- e.g. job-info, web console, RESTful API, etc.

 HA - Oozie servers should shutdown (or go in safe mode) in case of ZK failure
 -

 Key: OOZIE-1847
 URL: https://issues.apache.org/jira/browse/OOZIE-1847
 Project: Oozie
  Issue Type: Bug
  Components: HA
Reporter: Purshotam Shah
Assignee: Purshotam Shah
 Attachments: OOZIE-1847-V1.patch








[jira] [Commented] (OOZIE-1847) HA - Oozie servers should shutdown (or go in safe mode) in case of ZK failure

2014-08-27 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112333#comment-14112333
 ] 

Mona Chitnis commented on OOZIE-1847:
-

^^ in case of a timeout > 3 seconds resulting in server shutdown and job failure

 HA - Oozie servers should shutdown (or go in safe mode) in case of ZK failure
 -

 Key: OOZIE-1847
 URL: https://issues.apache.org/jira/browse/OOZIE-1847
 Project: Oozie
  Issue Type: Bug
  Components: HA
Reporter: Purshotam Shah
Assignee: Purshotam Shah
 Attachments: OOZIE-1847-V1.patch








[jira] [Commented] (OOZIE-1885) Query optimization for StatusTransitService

2014-08-27 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112357#comment-14112357
 ] 

Mona Chitnis commented on OOZIE-1885:
-

A join query is always more CPU- and memory-intensive, but it will probably cut 
down on the overall time taken, given the multiple queries in a loop 
right now. The approach is fine, but we should vet it with end-to-end 
performance tests.

 Query optimization for StatusTransitService
 ---

 Key: OOZIE-1885
 URL: https://issues.apache.org/jira/browse/OOZIE-1885
 Project: Oozie
  Issue Type: Bug
Reporter: Purshotam Shah

 {code}
 private void coordTransit() throws JPAExecutorException, CommandException {
     List<CoordinatorJobBean> pendingJobCheckList = null;
     if (lastInstanceStartTime == null) {
         LOG.info("Running coordinator status service first instance");
         // this is the first instance, we need to check for all pending jobs;
         pendingJobCheckList = jpaService.execute(new CoordJobsGetPendingJPAExecutor(limit));
     }
     else {
         LOG.info("Running coordinator status service from last instance time = "
                 + DateUtils.formatDateOozieTZ(lastInstanceStartTime));
         // this is not the first instance, we should only check jobs
         // that have actions or jobs been updated >= start time of last service run;
         List<CoordinatorActionBean> actionsList = CoordActionQueryExecutor.getInstance().getList(
                 CoordActionQuery.GET_COORD_ACTIONS_BY_LAST_MODIFIED_TIME, lastInstanceStartTime);
         Set<String> coordIds = new HashSet<String>();
         for (CoordinatorActionBean action : actionsList) {
             coordIds.add(action.getJobId());
         }
         pendingJobCheckList = new ArrayList<CoordinatorJobBean>();
         for (String coordId : coordIds.toArray(new String[coordIds.size()])) {
             CoordinatorJobBean coordJob;
             try {
                 coordJob = CoordJobQueryExecutor.getInstance().get(CoordJobQuery.GET_COORD_JOB, coordId);
             }
             catch (JPAExecutorException jpaee) {
                 if (jpaee.getErrorCode().equals(ErrorCode.E0604)) {
                     LOG.warn("Exception happened during StatusTransitRunnable; Coordinator Job doesn't exist", jpaee);
                     continue;
                 } else {
                     throw jpaee;
                 }
             }
             // Running coord job might have pending false
             Job.Status coordJobStatus = coordJob.getStatus();
             if ((coordJob.isPending() || coordJobStatus.equals(Job.Status.PAUSED)
                     || coordJobStatus.equals(Job.Status.RUNNING)
                     || coordJobStatus.equals(Job.Status.RUNNINGWITHERROR)
                     || coordJobStatus.equals(Job.Status.PAUSEDWITHERROR))
                     && !coordJobStatus.equals(Job.Status.IGNORED)) {
                 pendingJobCheckList.add(coordJob);
             }
         }
         pendingJobCheckList.addAll(CoordJobQueryExecutor.getInstance().getList(
                 CoordJobQuery.GET_COORD_JOBS_CHANGED, lastInstanceStartTime));
     }
     aggregateCoordJobsStatus(pendingJobCheckList);
 }
 {code}
 This could be done in one SQL query, something like:
 select w.id, w.status, w.pending from CoordinatorJobBean w where
 w.startTimestamp <= :matTime AND (w.statusStr = 'PREP' OR w.statusStr =
 'RUNNING' OR w.statusStr = 'RUNNINGWITHERROR' OR w.statusStr =
 'PAUSEDWITHERROR') AND w.statusStr <> 'IGNORED' AND w.id in (select a.jobId
 from CoordinatorActionBean a where a.lastModifiedTimestamp >=
 :lastModifiedTime group by a.jobId)
 Same for bundleTransit().





[jira] [Commented] (OOZIE-1940) StatusTransitService has race condition

2014-08-27 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112359#comment-14112359
 ] 

Mona Chitnis commented on OOZIE-1940:
-

Linking this as dependent on OOZIE-1885.

 StatusTransitService has race condition
 ---

 Key: OOZIE-1940
 URL: https://issues.apache.org/jira/browse/OOZIE-1940
 Project: Oozie
  Issue Type: Bug
Reporter: Purshotam Shah

 StatusTransitService doesn't acquire lock while updating DB. 
 We noticed one such issue while doing HA testing, thanks to [~mchiang]
 We issued a change command to change the pause time, which got executed on one 
 server. While the change command was running on that server, the other server 
 started executing StatusTransitService.
 Server 1 log
 {code}
 2014-07-16 17:28:05,268  INFO StatusTransitService$StatusTransitRunnable:539 
 [pool-1-thread-13] - USER[-] GROUP[-] Acquired lock for 
 [org.apache.oozie.service.StatusTransitService]
 2014-07-16 17:28:09,694  INFO StatusTransitService$StatusTransitRunnable:539 
 [pool-1-thread-13] - USER[-] GROUP[-] Set coordinator job 
 [0011385-140716042555-oozie-oozi-C] status to 'SUCCEEDED' from 'RUNNING' 
 2014-07-16 17:28:15,416  INFO StatusTransitService$StatusTransitRunnable:539 
 [pool-1-thread-13] - USER[-] GROUP[-] Released lock for 
 [org.apache.oozie.service.StatusTransitService]
 {code}
 Server 2 log
 {code}
 2014-07-16 17:28:06,499 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - 
 USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] 
 JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] New pause/end date is : Wed 
 Jul 16 17:30:00 UTC 2014 and last action number is : 3
 2014-07-16 17:28:06,508  INFO CoordChangeXCommand:539 [http-0.0.0.0-4443-5] - 
 USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] 
 JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] ENDED CoordChangeXCommand 
 for jobId=0011385-140716042555-oozie-oozi-C
 {code}
 CoordMaterializeTransitionXCommand had created all actions (a few were in 
 waiting and a few in running state) and set doneMaterialization to true.
 The change command deleted all waiting actions, except the 3 running/SUCCEEDED 
 ones, and reset doneMaterialization.
 StatusTransitService first loads a set of pending jobs, and for each job it 
 makes DB calls to check coord action status. Coord jobs are loaded only once, 
 at the beginning.
 This is what happened:
 1. StatusTransitService loads the coord job, whose doneMaterialization is set 
 to true, at 17:28:05,268 (server 1)
 2. The change command deletes the waiting actions and resets doneMaterialization 
 at 17:28:06,508 (server 2)
 3. StatusTransitService loads the actions for the job, only 3 and all in 
 SUCCEEDED status. It never reloads doneMaterialization, at 17:28:09,694 (server 1)
 StatusTransitService then sets the job status to SUCCEEDED, because 
 doneMaterialization is true and all actions are SUCCEEDED.
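 The stale-flag hazard can be reduced to a tiny decision function (names illustrative; the real check lives in StatusTransitService): the SUCCEEDED transition is only safe if doneMaterialization is re-read after the change command has had a chance to reset it.

```java
// Sketch of the status decision at the heart of the race. If STS caches
// doneMaterialization from before the change command runs, it decides
// SUCCEEDED on stale data; with a fresh read the job stays RUNNING.
public class StatusDecision {

    static String nextStatus(boolean doneMaterialization, boolean allActionsSucceeded, String current) {
        if (doneMaterialization && allActionsSucceeded) {
            return "SUCCEEDED"; // only safe if doneMaterialization is fresh
        }
        return current; // more actions may still materialize
    }

    public static void main(String[] args) {
        // Stale view (flag still true): wrongly concludes SUCCEEDED.
        System.out.println(nextStatus(true, true, "RUNNING"));  // SUCCEEDED
        // Fresh view after the change command reset the flag.
        System.out.println(nextStatus(false, true, "RUNNING")); // RUNNING
    }
}
```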





[jira] [Commented] (OOZIE-1940) StatusTransitService has race condition

2014-08-27 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112362#comment-14112362
 ] 

Mona Chitnis commented on OOZIE-1940:
-

Agree with the approach. Currently each run of StatusTransitService takes 
multiple seconds, I believe. If it is going to hold the lock for that long, we 
have to assess the consequences for the other commands waiting on the lock, 
e.g. the change command appearing to hang on the user-facing CLI because it is 
synchronously trying to acquire the lock held by STS. OOZIE-1885 should ideally 
reduce the overall time the lock is held by STS.



[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-08-20 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104227#comment-14104227
 ] 

Mona Chitnis commented on OOZIE-1976:
-

Thanks Puru and Ryota. Will incorporate your comments and come up with a new 
design specification. As for the 'explain', this can be done as part of the 
'info' command displaying the missing dependencies itself, rather than 
introducing another command.

 Specifying coordinator input datasets in more logical ways
 --

 Key: OOZIE-1976
 URL: https://issues.apache.org/jira/browse/OOZIE-1976
 Project: Oozie
  Issue Type: New Feature
  Components: coordinator
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1976-rough-design.pdf


 All dataset instances specified as input to coordinator, currently work on 
 AND logic i.e. ALL of them should be available for workflow to start. We 
 should enhance this to include more logical ways of specifying availability 
 criteria e.g.
  * OR between instances
  * minimum N out of K instances
  * delta datasets (process data incrementally)
 Use-cases for this:
  * Different datasets are BCP, and workflow can run with either, whichever 
 arrives earlier.
  * Data is not guaranteed, and while $coord:latest allows skipping to 
 available ones, workflow will never trigger unless mentioned number of 
 instances are found.
  * Workflow is like a ‘refining’ algorithm which should run after minimum 
 required datasets are ready, and should only process the delta for efficiency.
 This JIRA is to discuss the design and then the review the implementation for 
 some or all of the above features.





[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-08-20 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104618#comment-14104618
 ] 

Mona Chitnis commented on OOZIE-1976:
-

For Ryota's comment about priority, I think it complicates the missing-
dependencies field: we would now require a structure to indicate something like 
{{P0=dep1,dep2#P1=dep3,dep4}}, which in turn is nested under the AND/OR 
structure. So when dependencies are checked and found to exist, the action will 
start only when all P0's are satisfied, etc. I think this is essentially the 
same as putting them in the AND block instead of the optional OR block. For the 
N out of M case, the action will start when any >= N instances are available, 
using all M if all are there, and not limiting itself to N.

Good pointer about EL functions; that one's going to be important and will 
probably need a few new ones.
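The N-out-of-M behavior described above could be sketched as follows (standalone and illustrative only; class and method names are hypothetical):

```java
import java.util.List;

// Illustrative sketch of "min N out of M": trigger once at least N
// instances are available, but consume ALL available instances rather
// than truncating to the first N.
public class MinNOfM {

    static boolean shouldTrigger(List<String> available, int minN) {
        return available.size() >= minN;
    }

    static List<String> instancesToUse(List<String> available, int minN) {
        // Everything that is there, not a first-N subset.
        return shouldTrigger(available, minN) ? available : List.of();
    }

    public static void main(String[] args) {
        List<String> avail = List.of("ds-1", "ds-2", "ds-3");
        System.out.println(shouldTrigger(avail, 2));          // true
        System.out.println(instancesToUse(avail, 2).size());  // 3 (all of M)
    }
}
```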



[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-08-20 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1976:


Attachment: OOZIE-1976-rough-design-2.pdf

New design spec uploaded (rough-design-2) with additions about:
 * Wait-for in action
 * EL functions initial thoughts - implementation details will follow in the 
code patch
 * HCatDependencyCache changes (for the in-memory push-based HCat dependencies)
 * Job info API (coord-action) changes for displaying Missing Dependencies. 
This runs the risk of being verbose if an optional dataset has a lot of 
instances; needs thought about how to truncate there.
 



[jira] [Created] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-08-18 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1976:
---

 Summary: Specifying coordinator input datasets in more logical ways
 Key: OOZIE-1976
 URL: https://issues.apache.org/jira/browse/OOZIE-1976
 Project: Oozie
  Issue Type: New Feature
  Components: coordinator
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


All dataset instances specified as input to coordinator, currently work on AND 
logic i.e. ALL of them should be available for workflow to start. We should 
enhance this to include more logical ways of specifying availability criteria 
e.g.
 * OR between instances
 * minimum N out of K instances
 * delta datasets (process data incrementally)

This JIRA is to discuss the design and then the review the implementation for 
some or all of the above features.





[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-08-18 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1976:


Description: 
All dataset instances specified as input to coordinator, currently work on AND 
logic i.e. ALL of them should be available for workflow to start. We should 
enhance this to include more logical ways of specifying availability criteria 
e.g.
 * OR between instances
 * minimum N out of K instances
 * delta datasets (process data incrementally)

Use-cases for this:
Different datasets are BCP, and workflow can run with either, whichever arrives 
earlier.
Data is not guaranteed, and while $coord:latest allows skipping to available 
ones, workflow will never trigger unless mentioned number of instances are 
found.
Workflow is like a ‘refining’ algorithm which should run after minimum required 
datasets are ready, and should only process the delta for efficiency.

This JIRA is to discuss the design and then the review the implementation for 
some or all of the above features.

  was:
All dataset instances specified as input to coordinator, currently work on AND 
logic i.e. ALL of them should be available for workflow to start. We should 
enhance this to include more logical ways of specifying availability criteria 
e.g.
 * OR between instances
 * minimum N out of K instances
 * delta datasets (process data incrementally)

This JIRA is to discuss the design and then the review the implementation for 
some or all of the above features.




[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-08-18 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1976:


Description: 
All dataset instances specified as input to coordinator, currently work on AND 
logic i.e. ALL of them should be available for workflow to start. We should 
enhance this to include more logical ways of specifying availability criteria 
e.g.
 * OR between instances
 * minimum N out of K instances
 * delta datasets (process data incrementally)

Use-cases for this:
 * Different datasets are BCP, and workflow can run with either, whichever 
arrives earlier.
 * Data is not guaranteed, and while $coord:latest allows skipping to available 
ones, workflow will never trigger unless mentioned number of instances are 
found.
 * Workflow is like a ‘refining’ algorithm which should run after minimum 
required datasets are ready, and should only process the delta for efficiency.

This JIRA is to discuss the design and then the review the implementation for 
some or all of the above features.

  was:
All dataset instances specified as input to coordinator, currently work on AND 
logic i.e. ALL of them should be available for workflow to start. We should 
enhance this to include more logical ways of specifying availability criteria 
e.g.
 * OR between instances
 * minimum N out of K instances
 * delta datasets (process data incrementally)

Use-cases for this:
Different datasets are BCP, and workflow can run with either, whichever arrives 
earlier.
Data is not guaranteed, and while $coord:latest allows skipping to available 
ones, workflow will never trigger unless mentioned number of instances are 
found.
Workflow is like a ‘refining’ algorithm which should run after minimum required 
datasets are ready, and should only process the delta for efficiency.

This JIRA is to discuss the design and then the review the implementation for 
some or all of the above features.




[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

2014-08-18 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1976:


Attachment: OOZIE-1976-rough-design.pdf

Attaching rough design doc (pdf)



[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly

2014-08-13 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096118#comment-14096118
 ] 

Mona Chitnis commented on OOZIE-1913:
-

Want to mention another point:

This API also allows disabling ALL SLA instances for a coordinator or bundle. 
For a bundle, that would mean all actions of all its coordinators. 
SLARegistrationBean stores 'parentId' if the SLA object pertains to a 
coord-action/wf-action/bundle-action. To avoid a heavy DB query in the 
suspend-ALL-for-bundle(s) case, I want to change this 'parentId' to point to 
the bundle jobId directly, if the coordinator is part of a bundle. If not, it 
will be the coord job id, as it is now.

The impact this has is in JMSSLAEventListener, where topicName is set to this 
parentId. So topicName will get set to the top-level bundle id, and the user 
will have to change the topic name being listened to. Please give feedback on 
whether this is a reasonable approach. I will make sure appropriate JMS 
selector options are available if the user gives this bundle id as topicName 
but still wants to limit per coordinator job id.
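The proposed parentId/topic resolution boils down to something like the following sketch (class and method names are hypothetical, not Oozie's actual API):

```java
// Sketch of the topic-name resolution described above: if the coordinator
// belongs to a bundle, the topic becomes the bundle job id; otherwise it
// remains the coordinator job id.
public class SlaTopicResolver {

    static String topicName(String coordJobId, String bundleJobIdOrNull) {
        return bundleJobIdOrNull != null ? bundleJobIdOrNull : coordJobId;
    }

    public static void main(String[] args) {
        System.out.println(topicName("0000001-X-C", "0000009-X-B")); // bundle id wins
        System.out.println(topicName("0000001-X-C", null));          // coord job id
    }
}
```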

 Devise a way to turn off SLA alerts for bundle/coordinator flexibly
 ---

 Key: OOZIE-1913
 URL: https://issues.apache.org/jira/browse/OOZIE-1913
 Project: Oozie
  Issue Type: Improvement
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


 From user:
 Need to turn off the SLA miss alerts in jobs when the bundle is suspended for
 grid upgrades and similar work so that when it's resumed we aren't flooded 
 with a bunch of alerts.





[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly

2014-08-08 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090999#comment-14090999
 ] 

Mona Chitnis commented on OOZIE-1913:
-

Okay, let me remove the -id requirement. Regarding treating it as a job 
operation, I think it becomes ambiguous what type of alerts it means, so it is 
better to be clear with an 'sla' command. Also, it removes the need to add an 
additional 'actions' param.

But I can rework this if there's a consensus on which API usage is more 
intuitive. Asking for feedback from users too.



[jira] [Updated] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly

2014-08-07 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1913:


Summary: Devise a way to turn off SLA alerts for bundle/coordinator 
flexibly  (was: Devise a way to turn off SLA alerts when bundle/coordinator 
suspended)



[jira] [Commented] (OOZIE-1939) Incorrect job information is set while logging

2014-08-01 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082317#comment-14082317
 ] 

Mona Chitnis commented on OOZIE-1939:
-

Yet it will work with thread-local params too. The fix was done to minimize the 
overall change: just clear the prefix and set it to whatever object the thread 
is handling now. The same will apply with thread-local params.

 Incorrect job information is set while logging
 --

 Key: OOZIE-1939
 URL: https://issues.apache.org/jira/browse/OOZIE-1939
 Project: Oozie
  Issue Type: Bug
Reporter: Purshotam Shah
Assignee: Azrael
 Attachments: OOZIE-1939.1.patch, OOZIE-1939.2.patch


 {code}
 2014-07-16 17:28:06,422 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - 
 USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB236] 
 JOB[0011514-140716042555-oozie-oozi-C] ACTION[-] Acquired lock for 
 [0011385-140716042555-oozie-oozi-C] in [coord_change]
 2014-07-16 17:28:06,422 TRACE CoordChangeXCommand:548 [http-0.0.0.0-4443-5] - 
 USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB236] 
 JOB[0011514-140716042555-oozie-oozi-C] ACTION[-] Load state for 
 [0011385-140716042555-oozie-oozi-C]
 {code}
 {code}
 protected void loadState() throws CommandException {
 jpaService = Services.get().get(JPAService.class);
 if (jpaService == null) {
 LOG.error(ErrorCode.E0610);
 }
 try {
 coordJob = 
 CoordJobQueryExecutor.getInstance().get(CoordJobQuery.GET_COORD_JOB_MATERIALIZE,
  jobId);
 prevStatus = coordJob.getStatus();
 }
 catch (JPAExecutorException jex) {
 throw new CommandException(jex);
 }
 // calculate start materialize and end materialize time
 calcMatdTime();
 LogUtils.setLogInfo(coordJob, logInfo);
 }
 {code}
 Most of the commands set jobinfo after loadstate, because of that few log 
 statements ( like acquiring lock, load state) logs with previous jobinfo. 





[jira] [Updated] (OOZIE-1811) Current test failures in trunk

2014-08-01 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1811:


Attachment: OOZIE-1811-3.patch

Addressed review comments and fixed a couple of classes missed in the earlier 
patch (BatchQueryExecutor and the SLA*QueryExecutors) to be consistent.

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch, 
 OOZIE-1811-3.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:...11921-oozie-rkan-C@1[] but was:...11921-oozie-rkan-C@1[2]
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:DONEWITHERROR but was:KILLED
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:RUNNING but was:KILLED
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}





[jira] [Updated] (OOZIE-1811) Current test failures in trunk

2014-08-01 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1811:


Attachment: (was: OOZIE-1811-3.patch)



[jira] [Updated] (OOZIE-1811) Current test failures in trunk

2014-08-01 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1811:


Attachment: OOZIE-1811-3.patch

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch, 
 OOZIE-1811-3.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1811) Current test failures in trunk

2014-08-01 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1811:


Attachment: OOZIE-1811-3.patch

good catch! uploaded new patch

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch, 
 OOZIE-1811-3.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1811) Current test failures in trunk

2014-08-01 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1811:


Attachment: (was: OOZIE-1811-3.patch)

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch, 
 OOZIE-1811-3.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-08-01 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082590#comment-14082590
 ] 

Mona Chitnis commented on OOZIE-1932:
-

okay thanks. will revise the order 

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1932-2.patch, OOZIE-1932-addendum.patch, 
 OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService's 
 destroy() sets it to null while commands are still executing (because 
 CallableQueueService is still active), they all encounter NPEs during locking. 
 The simple fix is to move MemoryLocksService earlier in the services loading 
 order in oozie-default.xml.
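The reverse-order teardown described above can be sketched as follows. This is a minimal illustration with simplified stand-in names, not the actual Oozie Services API: because destruction mirrors the load order, loading MemoryLocksService before CallableQueueService guarantees the queue (and the commands it is still running) is torn down before the lock service it depends on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of reverse-order service shutdown. Service names are the ones
// discussed above, but the classes here are simplified stand-ins.
public class ShutdownOrder {

    // Given the load order, the destroy order is simply the reverse.
    static List<String> destroyOrder(List<String> loadOrder) {
        List<String> reversed = new ArrayList<>(loadOrder);
        Collections.reverse(reversed);
        return reversed;
    }

    public static void main(String[] args) {
        List<String> loadOrder =
                List.of("MemoryLocksService", "CallableQueueService");
        // CallableQueueService is destroyed first, so no command can hit
        // a nulled-out MemoryLocksService during locking.
        for (String service : destroyOrder(loadOrder)) {
            System.out.println("destroying " + service);
        }
    }
}
```

With the corrected load order, the queue drains before the lock service goes away, which is exactly the NPE scenario the fix avoids.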



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (OOZIE-1811) Current test failures in trunk

2014-07-31 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081820#comment-14081820
 ] 

Mona Chitnis commented on OOZIE-1811:
-

{{-1 the patch contains 2 line(s) with trailing spaces}}: located and fixed in 
the XML file {{coord-action-sla.xml}}

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1811) Current test failures in trunk

2014-07-29 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1811:


Attachment: OOZIE-1811-2.patch

updated patch to apply cleanly to trunk HEAD

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-07-29 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis reopened OOZIE-1932:
-


Reopening this issue to fix a similar problem: URIHandlerService should be loaded 
before CallableQueueService, so that CallableQueueService is closed before it. 
This JIRA's scope now includes a permanent fix to the services ordering that works 
for all cases, avoiding NPEs and other issues with the services during server 
shutdown/startup.

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1932-2.patch, OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService's 
 destroy() sets it to null while commands are still executing (because 
 CallableQueueService is still active), they all encounter NPEs during locking. 
 The simple fix is to move MemoryLocksService earlier in the services loading 
 order in oozie-default.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (OOZIE-1811) Current test failures in trunk

2014-07-29 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078158#comment-14078158
 ] 

Mona Chitnis commented on OOZIE-1811:
-

The above failures are due to a strange network error on the host. This happened 
before at https://builds.apache.org/job/oozie-trunk-precommit-build/1363/ too.

Ran the whole suite locally and only 1 failed, which, as I've mentioned, is going 
to be part of OOZIE-1952.

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (OOZIE-1952) Cleanup duplicate/obsolete code - Command, StoreService

2014-07-28 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1952:
---

 Summary: Cleanup duplicate/obsolete code - Command, StoreService
 Key: OOZIE-1952
 URL: https://issues.apache.org/jira/browse/OOZIE-1952
 Project: Oozie
  Issue Type: Task
Reporter: Mona Chitnis


StoreService has been superseded by JPAService, and Command has been superseded 
by XCommand. These old classes have been lying around long enough and are probably 
only referenced through unit tests, creating some confusion when tests have to 
be fixed for flaky failures.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1811) Current test failures in trunk

2014-07-28 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1811:


Attachment: OOZIE-1811-1.patch

Attaching a patch which fixes the QueryExecutors and TestSLAEventGeneration. 
Errors related to StoreService usage in tests can be fixed as part of the overall 
StoreService fix in OOZIE-1952.

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical
 Attachments: OOZIE-1811-1.patch


 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (OOZIE-1811) Current test failures in trunk

2014-07-25 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074117#comment-14074117
 ] 

Mona Chitnis commented on OOZIE-1811:
-

I'd suggest getting rid of the static reference to JPAService in each of the 
Query Executors. We can always get the reference to it from the Services 
singleton, while executing the query. By keeping another static reference and 
manipulating it through the constructor and destroy(), we run the risk of 
nullifying it inadvertently. This is why suddenly so many tests are becoming 
flaky and it is very tough to detect exact patterns or even fix tests in a 
foolproof way. 

I ran the whole suite with the static reference removed and only 2 tests failed, 
which is quite an improvement!
{code}
Results :

Failed tests:   
testBundleId(org.apache.oozie.servlet.TestBulkMonitorWebServiceAPI): 
expected:<SUCCEEDED> but was:<RUNNING>

Tests in error: 
  testSucCoordPurgeXCommand(org.apache.oozie.command.TestPurgeXCommand): E0604: 
Job does not exist [000-140724213655573-oozie-chit-C]
{code}

Test #2 here is failing with the error "StoreService cannot work without JPAService". 
We can replace usage of StoreService completely, as it is superseded by 
JPAService anyway.
Test #1 doesn't really have an error beyond a random assert failure, and this test 
is not usually flaky, so we can ignore it.
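The suggestion above (look the service up from the Services singleton at execution time instead of caching a static reference) can be sketched like this. All class and method names here are simplified stand-ins for illustration, not Oozie's actual Services/JPAService API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: fetch the service from a singleton registry on each use, so a
// static field nulled by destroy() can never leak a stale reference.
public class ServiceLookup {

    static class JPAService { /* placeholder for the real JPA service */ }

    static class Services {
        private static final Services INSTANCE = new Services();
        private final Map<Class<?>, Object> registry = new HashMap<>();

        static Services get() { return INSTANCE; }

        <T> void register(Class<T> type, T impl) { registry.put(type, impl); }

        @SuppressWarnings("unchecked")
        <T> T get(Class<T> type) { return (T) registry.get(type); }
    }

    static class QueryExecutor {
        // No static JPAService field: resolve it while executing the query.
        String execute(String query) {
            JPAService jpa = Services.get().get(JPAService.class);
            if (jpa == null) {
                throw new IllegalStateException("JPAService not loaded");
            }
            return query + " executed";
        }
    }

    public static void main(String[] args) {
        Services.get().register(JPAService.class, new JPAService());
        // prints "GET_SLA_REG_ALL executed"
        System.out.println(new QueryExecutor().execute("GET_SLA_REG_ALL"));
    }
}
```

Because nothing caches the service, a test that tears Services down and rebuilds it always sees the current instance, which is exactly the flakiness the comment is trying to eliminate.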


 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical

 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:<...11921-oozie-rkan-C@1[]> but was:<...11921-oozie-rkan-C@1[2]>
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:<DONEWITHERROR> but was:<KILLED>
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:<RUNNING> but was:<KILLED>
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf

2014-07-24 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1944:


Attachment: OOZIE-1944-2.patch

Adding a null check for configDefault, which was causing TestWorkflowAppParser 
tests to fail.

 Recursive variable resolution broken when same parameter name in 
 config-default and action conf
 ---

 Key: OOZIE-1944
 URL: https://issues.apache.org/jira/browse/OOZIE-1944
 Project: Oozie
  Issue Type: Bug
  Components: workflow
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1944-1.patch, OOZIE-1944-2.patch


 Hitting error
 {code}
 can not create DagEngine for submitting jobs
 org.apache.oozie.DagEngineException: E0803: IO error, Variable
 substitution depth too large: 20 ${param}/000
 {code}
 when config-default.xml has
 {{param=default}}
 and action conf has
 {code}
 <action name="my-wf-action">
 ...
 <configuration>
 <property>
 <name>param</name>
 <value>${param}/000</value>
 </property>
 </configuration>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1872) TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime is failing for past couple of builds

2014-07-23 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1872:


Component/s: tests

 TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime
  is failing for past couple of builds
 ---

 Key: OOZIE-1872
 URL: https://issues.apache.org/jira/browse/OOZIE-1872
 Project: Oozie
  Issue Type: Bug
  Components: tests
Affects Versions: trunk, 4.1.0
Reporter: Rohini Palaniswamy
 Fix For: trunk, 4.1.0

 Attachments: OOZIE-1872-1.patch


 https://builds.apache.org/job/oozie-trunk-precommit-build/1291/testReport/junit/org.apache.oozie.command.coord/TestCoordActionInputCheckXCommand/testActionInputCheckLatestActionCreationTime/



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1872) TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime is failing for past couple of builds

2014-07-23 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1872:


Fix Version/s: (was: trunk)

 TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime
  is failing for past couple of builds
 ---

 Key: OOZIE-1872
 URL: https://issues.apache.org/jira/browse/OOZIE-1872
 Project: Oozie
  Issue Type: Bug
  Components: tests
Affects Versions: trunk, 4.1.0
Reporter: Rohini Palaniswamy
 Fix For: 4.1.0

 Attachments: OOZIE-1872-1.patch


 https://builds.apache.org/job/oozie-trunk-precommit-build/1291/testReport/junit/org.apache.oozie.command.coord/TestCoordActionInputCheckXCommand/testActionInputCheckLatestActionCreationTime/



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf

2014-07-23 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1944:


Fix Version/s: (was: trunk)

 Recursive variable resolution broken when same parameter name in 
 config-default and action conf
 ---

 Key: OOZIE-1944
 URL: https://issues.apache.org/jira/browse/OOZIE-1944
 Project: Oozie
  Issue Type: Bug
  Components: workflow
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0


 Hitting error
 {code}
 can not create DagEngine for submitting jobs
 org.apache.oozie.DagEngineException: E0803: IO error, Variable
 substitution depth too large: 20 ${param}/000
 {code}
 when config-default.xml has
 {{param=default}}
 and action conf has
 {code}
 <action name="my-wf-action">
 ...
 <configuration>
 <property>
 <name>param</name>
 <value>${param}/000</value>
 </property>
 </configuration>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf

2014-07-23 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1944:


Attachment: OOZIE-1944-1.patch

Attaching patch. The approach is to switch from the 
XConfiguration.injectDefaults() method to copy(), since the former does a 
Configuration.get(), which tries to recursively resolve params. So we simply copy 
over defaults, global, and finally action configuration, in that order of 
precedence.
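The copy-in-precedence-order merge described above can be sketched with plain maps. This is only an illustration of the merge order, assuming simple key/value maps as stand-ins for Hadoop Configuration / Oozie XConfiguration; crucially, no variable resolution happens during the merge, so a raw `${param}` value survives intact:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: build the action configuration by copying maps in precedence
// order, so later sources override earlier ones without any ${...}
// substitution being triggered during the merge itself.
public class ConfMerge {

    static Map<String, String> merge(Map<String, String> defaults,
                                     Map<String, String> global,
                                     Map<String, String> action) {
        Map<String, String> merged = new HashMap<>();
        merged.putAll(defaults); // lowest precedence (config-default.xml)
        merged.putAll(global);   // global section
        merged.putAll(action);   // highest precedence (action conf)
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of("param", "default");
        Map<String, String> global = Map.of();
        Map<String, String> action = Map.of("param", "${param}/000");
        // The unresolved value wins; substitution happens later, once.
        // prints "${param}/000"
        System.out.println(merge(defaults, global, action).get("param"));
    }
}
```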

 Recursive variable resolution broken when same parameter name in 
 config-default and action conf
 ---

 Key: OOZIE-1944
 URL: https://issues.apache.org/jira/browse/OOZIE-1944
 Project: Oozie
  Issue Type: Bug
  Components: workflow
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: 4.1.0

 Attachments: OOZIE-1944-1.patch


 Hitting error
 {code}
 can not create DagEngine for submitting jobs
 org.apache.oozie.DagEngineException: E0803: IO error, Variable
 substitution depth too large: 20 ${param}/000
 {code}
 when config-default.xml has
 {{param=default}}
 and action conf has
 {code}
 <action name="my-wf-action">
 ...
 <configuration>
 <property>
 <name>param</name>
 <value>${param}/000</value>
 </property>
 </configuration>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1536) Coordinator action reruns start a new workflow

2014-07-22 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1536:


Assignee: (was: Mona Chitnis)

 Coordinator action reruns start a new workflow
 --

 Key: OOZIE-1536
 URL: https://issues.apache.org/jira/browse/OOZIE-1536
 Project: Oozie
  Issue Type: Improvement
Reporter: Srikanth Sundarrajan

 Coordinator action reruns start a new workflow, and if the existing workflow for 
 the action is in a running state, that is not checked. Coord rerun could possibly 
 do a workflow re-run instead to prevent this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (OOZIE-1945) NPE in JaveActionExecutor#check()

2014-07-22 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1945:
---

 Summary: NPE in JaveActionExecutor#check()
 Key: OOZIE-1945
 URL: https://issues.apache.org/jira/browse/OOZIE-1945
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Priority: Trivial
 Fix For: trunk, 4.1.0


in method check()
{code}
 String errorCode = props.getProperty("error.code");
 if (errorCode.equals("0")) {
     errorCode = "JA018";
 }
 if (errorCode.equals("-1")) {
     errorCode = "JA019";
 }
 errorReason = props.getProperty("error.reason");
{code}
if "error.code" is null, this leads to NPEs
easy fix:
{code}
if ("0".equals(errorCode))
...
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-21 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis resolved OOZIE-1933.
-

Resolution: Fixed

failing unit tests fix committed to trunk after review. Thanks!

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, 
 OOZIE-1933-unit-tests-fix.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-21 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Attachment: (was: sla_unit_tests.patch)

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, 
 OOZIE-1933-unit-tests-fix.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-21 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Attachment: (was: sla_unit_tests-1.patch)

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, 
 OOZIE-1933-unit-tests-fix.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-21 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Attachment: OOZIE-1933-unit-tests-fix.patch

updated patch to apply cleanly to trunk

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, 
 OOZIE-1933-unit-tests-fix.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Created] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf

2014-07-21 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1944:
---

 Summary: Recursive variable resolution broken when same parameter 
name in config-default and action conf
 Key: OOZIE-1944
 URL: https://issues.apache.org/jira/browse/OOZIE-1944
 Project: Oozie
  Issue Type: Bug
  Components: workflow
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


Hitting error
{code}
can not create DagEngine for submitting jobs
org.apache.oozie.DagEngineException: E0803: IO error, Variable
substitution depth too large: 20 ${param}/000
{code}

when config-default.xml has
{{param=default}}
and action conf has
{code}
<action name="my-wf-action">
    ...
    <configuration>
        <property>
            <name>param</name>
            <value>${param}/000</value>
        </property>
    </configuration>
{code}
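The error can be reproduced with a naive recursive resolver (a standalone sketch, not Oozie's actual substitution code): because `param` in the merged configuration now resolves to the action conf's own value rather than config-default's `default`, every pass re-expands `${param}` until the depth cap is hit.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VarSubstitution {
    static final int MAX_DEPTH = 20; // the cap the E0803 message reports
    static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Repeatedly expand ${var} references; unknown variables expand to "".
    // Throws once MAX_DEPTH substitutions are exceeded, mirroring E0803.
    static String resolve(String value, Map<String, String> conf) {
        String original = value;
        int depth = 0;
        Matcher m = VAR.matcher(value);
        while (m.find()) {
            if (++depth > MAX_DEPTH) {
                throw new IllegalStateException(
                        "Variable substitution depth too large: " + MAX_DEPTH + " " + original);
            }
            value = value.substring(0, m.start())
                    + conf.getOrDefault(m.group(1), "")
                    + value.substring(m.end());
            m = VAR.matcher(value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Normal case terminates.
        System.out.println(resolve("${dir}/out", Map.of("dir", "/data"))); // /data/out

        // Self-reference, as when the action conf's param shadows
        // config-default's param=default: expansion never terminates.
        try {
            resolve("${param}/000", Map.of("param", "${param}/000"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
            // Variable substitution depth too large: 20 ${param}/000
        }
    }
}
```

A resolver that looked up `param` in config-default before the merged action conf would terminate here, which is roughly the behavior the fix needs to restore.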





[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf

2014-07-21 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1944:


Fix Version/s: 4.1.0

 Recursive variable resolution broken when same parameter name in 
 config-default and action conf
 ---

 Key: OOZIE-1944
 URL: https://issues.apache.org/jira/browse/OOZIE-1944
 Project: Oozie
  Issue Type: Bug
  Components: workflow
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk, 4.1.0


 Hitting error
 {code}
 can not create DagEngine for submitting jobs
 org.apache.oozie.DagEngineException: E0803: IO error, Variable
 substitution depth too large: 20 ${param}/000
 {code}
 when config-default.xml has
 {{param=default}}
 and action conf has
 {code}
 <action name="my-wf-action">
     ...
     <configuration>
         <property>
             <name>param</name>
             <value>${param}/000</value>
         </property>
     </configuration>
 {code}





[jira] [Commented] (OOZIE-1811) Current test failures in trunk

2014-07-18 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066771#comment-14066771
 ] 

Mona Chitnis commented on OOZIE-1811:
-

{{org.apache.oozie.command.coord.TestCoordActionInputCheckXCommand.testActionInputCheckLatestCurrentTime}}
 is also failing because JPAService is null. The same class's test using latest 
calculation with respect to action creation time (old behavior), 
{{org.apache.oozie.command.coord.TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime}},
 is however failing with a dependency mismatch problem - OOZIE-1872

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical

 There's a bunch of test failures currently in trunk; I'm not sure what 
 commit(s) is the cause, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBundleChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:...11921-oozie-rkan-C@1[] but was:...11921-oozie-rkan-C@1[2]
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:DONEWITHERROR but was:KILLED
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:RUNNING but was:KILLED
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}





[jira] [Resolved] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-17 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis resolved OOZIE-1933.
-

Resolution: Fixed

committed to trunk. thanks for review!

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Reopened] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-17 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis reopened OOZIE-1933:
-


adding test cases broken by the patch

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-17 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Attachment: sla_unit_tests.patch

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, 
 sla_unit_tests.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-17 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Attachment: sla_unit_tests-1.patch

updated patch to include another broken testcase. All other failed tests pass 
locally and are known to be flaky

Test run:
{code}
Results :

Failed tests:   
testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration)
  testRecovery(org.apache.oozie.action.hadoop.TestJavaActionExecutor): 
expected:[SUCCEED]ED but was:[FAILED/KILL]ED
  
testCoordStatusTransitServiceBackwardSupport(org.apache.oozie.service.TestStatusTransitService)

Tests in error: 
  testOnJobEvent(org.apache.oozie.sla.TestSLAJobEventListener): invalid child 
id [wa1]
  
testActionReuseWfJobAppPath(org.apache.oozie.command.wf.TestActionStartXCommand):
 E0607: Other error in operation [action.start], null
  testWorkflowRun(org.apache.oozie.command.wf.TestLastModified): 
org.apache.oozie.DagEngineException: E0607: Other error in operation [start], 
null
  testSucJobPurgeXCommand(org.apache.oozie.command.TestPurgeXCommand): E0604: 
Job does not exist [001-140717193440158-oozie-chit-W]
  testSucCoordPurgeXCommand(org.apache.oozie.command.TestPurgeXCommand): E0604: 
Job does not exist [000-140717193442386-oozie-chit-C]
{code}


 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, 
 sla_unit_tests-1.patch, sla_unit_tests.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Created] (OOZIE-1938) Fork-join job does not execute join node sometimes during HA failover

2014-07-16 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1938:
---

 Summary: Fork-join job does not execute join node sometimes during 
HA failover
 Key: OOZIE-1938
 URL: https://issues.apache.org/jira/browse/OOZIE-1938
 Project: Oozie
  Issue Type: Bug
  Components: HA
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk


Reported by Michelle Chiang (Yahoo Oozie QE)

Scenario: (2 Oozie HA servers)
21:38:56 submit job at oozie client
21:41:42 shut down server1
21:46:52 shut down server2
21:47:30 start server1
22:15:05 start server2

the last fork path end time is 21:52:53.
22:36:48 the job is still RUNNING, not moving to join node.

Digging into the logs, the locking part seems to work fine, with forked action 
processing distributed between the two servers whether both are running or one 
of them is down. The open question is why even RecoveryService fails to pick up 
the job after all the forks have completed.





[jira] [Resolved] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)

2014-07-16 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis resolved OOZIE-1935.
-

Resolution: Fixed

committed to trunk. thanks for review!

 Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
 ---

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch


 Demoting log level in ActionStartXCommand for non-WARN type messages.
 And,
 Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions 
 section for coord:databaseIn, coord:databaseOut, coord:tableIn and 
 coord:tableOut





[jira] [Resolved] (OOZIE-1775) TestEventGeneration.testCoordinatorActionEvent is failing and CoordRerunX should generate event

2014-07-16 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis resolved OOZIE-1775.
-

Resolution: Fixed

All flaky tests will be fixed as part of OOZIE-1811

 TestEventGeneration.testCoordinatorActionEvent is failing and CoordRerunX 
 should generate event
 ---

 Key: OOZIE-1775
 URL: https://issues.apache.org/jira/browse/OOZIE-1775
 Project: Oozie
  Issue Type: Task
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1175-2.patch, OOZIE-1775-1.patch, 
 OOZIE-1775-3.patch, OOZIE-1775-4.patch


 test is not flaky but consistently failing. this is due to 
 CoordResumeXCommand not publishing job event afterwards, for the resumed 
 coord actions.
 Also fixed flaky test 
 {{TestSLAEventGeneration#testCoordinatorActionCommandSubmitAndStart}} by 
 limiting the materialization





[jira] [Updated] (OOZIE-1842) publish oozie artifacts

2014-07-16 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1842:


Assignee: (was: Mona Chitnis)

 publish oozie artifacts
 ---

 Key: OOZIE-1842
 URL: https://issues.apache.org/jira/browse/OOZIE-1842
 Project: Oozie
  Issue Type: New Feature
Reporter: Bowen Zhang







[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-16 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Attachment: OOZIE-1933-3.patch

attaching patch reviewed and updated from ReviewBoard

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-16 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Attachment: OOZIE-1933-4-1.patch

 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Commented] (OOZIE-1938) Fork-join job does not execute join node sometimes during HA failover

2014-07-16 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064458#comment-14064458
 ] 

Mona Chitnis commented on OOZIE-1938:
-

More context - all actions completed, some via server 1, others via server 2.

1) Checking the SignalXCommand code against the WF_ACTIONS table for all 
actions of this job, all of them have pending=0. This probably explains why 
they weren't recovered by ActionCheckerRunnable.

2) As each forked action finishes, two signals are sent - signal value OK and 
signal value :sync:. The :sync: is needed to maintain the fork-join count: it 
is incremented when the initial forks send :sync:, and decremented when joins 
send :sync:. I think because one of the servers was down at the time, some of 
these :sync: signals were lost or failed to get processed. We don't see this 
problem in the scenario where both servers were up before the actions 
finished and started signaling :sync:.

Not very confident about changing the way we handle :sync:, so I would like 
to discuss the best approach here. The easier approach would be to set the 
action's pending flag in this process, so that recovery picks up the action 
and helps restore the correct :sync: count.

Feedback/corrections?
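The counting scheme described in (2) can be modeled with a trivial counter (a sketch of the failure mode, not Oozie's SignalXCommand logic): the join may only fire once every forked path has delivered its :sync:, so a lost signal leaves the count above zero forever.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ForkJoinSyncModel {
    // Outstanding forked paths; the join node may only execute at zero.
    private final AtomicInteger pending = new AtomicInteger();

    void onFork(int paths) {
        pending.addAndGet(paths); // increment when the fork fans out
    }

    // Decrement on each :sync:; returns true when this signal releases the join.
    boolean onSync() {
        return pending.decrementAndGet() == 0;
    }

    public static void main(String[] args) {
        ForkJoinSyncModel wf = new ForkJoinSyncModel();
        wf.onFork(3);
        System.out.println(wf.onSync()); // false - two paths outstanding
        System.out.println(wf.onSync()); // false - one path outstanding
        // If this last :sync: were lost while a server was down, pending
        // would stay at 1 and the join would never run - the reported hang.
        System.out.println(wf.onSync()); // true - join can fire
    }
}
```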

 Fork-join job does not execute join node sometimes during HA failover
 -

 Key: OOZIE-1938
 URL: https://issues.apache.org/jira/browse/OOZIE-1938
 Project: Oozie
  Issue Type: Bug
  Components: HA
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk


 Reported by [~mchiang].
 Scenario: (2 Oozie HA servers)
 21:38:56 submit job at oozie client
 21:41:42 shut down server1
 21:46:52 shut down server2
 21:47:30 start server1
 22:15:05 start server2
 the last fork path end time is 21:52:53.
 22:36:48 the job is still RUNNING, not moving to join node.
 Digging into the logs, the locking part seems to work fine with forked action 
 processing distributed amongst the two servers when both running or when one 
 of them is down. The issue seems to be why even RecoveryService fails to pick 
 up the job after all the forks had completed





[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1932:


Attachment: OOZIE-1932.patch

Attaching patch

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService's 
 destroy sets it to null while commands are still executing because 
 CallableQueueService is still active, they all encounter NPEs during locking. 
 The fix is simple: in oozie-default.xml, place MemoryLocksService earlier in 
 the services loading order.
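The shutdown ordering can be seen in a small model (a sketch of reverse-order teardown, not Oozie's Services class): because destruction reverses initialization, loading MemoryLocksService first guarantees it is destroyed last, after CallableQueueService has stopped dispatching work.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class ServiceOrder {
    // Destroy services in reverse order of initialization.
    static List<String> destroyOrder(List<String> initOrder) {
        Deque<String> stack = new ArrayDeque<>();
        for (String s : initOrder) {
            stack.push(s); // last initialized ends up on top
        }
        return List.copyOf(stack); // iterates top-first, i.e. reversed
    }

    public static void main(String[] args) {
        // MemoryLocksService loaded first => destroyed last, so commands
        // still draining from CallableQueueService can always take locks.
        System.out.println(destroyOrder(
                List.of("MemoryLocksService", "CallableQueueService")));
        // [CallableQueueService, MemoryLocksService]
    }
}
```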





[jira] [Created] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-15 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1933:
---

 Summary: SLACalculatorMemory HA changes assume SLARegistrationBean 
exists for all jobs
 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


SLACalculatorMemory.addJobStatus()
{code}
else {
// jobid might not exist in slaMap in HA Setting
SLARegistrationBean slaRegBean = 
SLARegistrationQueryExecutor.getInstance().get(
SLARegQuery.GET_SLA_REG_ALL, jobId);
SLASummaryBean slaSummaryBean = 
SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
jobId);
slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
{code}

Because of SLA Listener, job notification event triggers this even for jobs 
with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
annoying exception stacktraces in logs





[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1932:


Attachment: OOZIE-1932-2.patch

removed testcase in updated patch

 Services should load CallableQueueService after MemoryLocksService
 --

 Key: OOZIE-1932
 URL: https://issues.apache.org/jira/browse/OOZIE-1932
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1932-2.patch, OOZIE-1932.patch


 This is not a problem during startup but is during shutdown, as services are 
 destroyed in reverse order of initialization. Hence, when MemoryLocksService's 
 destroy sets it to null while commands are still executing because 
 CallableQueueService is still active, they all encounter NPEs during locking. 
 The fix is simple: in oozie-default.xml, place MemoryLocksService earlier in 
 the services loading order.





[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1933:


Description: 
SLACalculatorMemory.addJobStatus()
{code}
else {
// jobid might not exist in slaMap in HA Setting
SLARegistrationBean slaRegBean = 
SLARegistrationQueryExecutor.getInstance().get(
SLARegQuery.GET_SLA_REG_ALL, jobId);
SLASummaryBean slaSummaryBean = 
SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
jobId);
slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
{code}

Because of SLA Listener, job notification event triggers this even for jobs 
with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
annoying exception stacktraces in logs

Patch to also include log prefix addition to some SLACalculator log line

  was:
SLACalculatorMemory.addJobStatus()
{code}
else {
// jobid might not exist in slaMap in HA Setting
SLARegistrationBean slaRegBean = 
SLARegistrationQueryExecutor.getInstance().get(
SLARegQuery.GET_SLA_REG_ALL, jobId);
SLASummaryBean slaSummaryBean = 
SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
jobId);
slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
{code}

Because of SLA Listener, job notification event triggers this even for jobs 
with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
annoying exception stacktraces in logs


 SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
 -

 Key: OOZIE-1933
 URL: https://issues.apache.org/jira/browse/OOZIE-1933
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


 SLACalculatorMemory.addJobStatus()
 {code}
 else {
 // jobid might not exist in slaMap in HA Setting
 SLARegistrationBean slaRegBean = 
 SLARegistrationQueryExecutor.getInstance().get(
 SLARegQuery.GET_SLA_REG_ALL, jobId);
 SLASummaryBean slaSummaryBean = 
 SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY,
 jobId);
 slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
 {code}
 Because of SLA Listener, job notification event triggers this even for jobs 
 with no SLA configured - leading to NPE in the SLACalcStatus constructor and 
 annoying exception stacktraces in logs
 Patch to also include log prefix addition to some SLACalculator log line





[jira] [Created] (OOZIE-1935) Demoting log level in ActionStartXCommand for non-WARN type messages

2014-07-15 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1935:
---

 Summary: Demoting log level in ActionStartXCommand for non-WARN 
type messages
 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial








[jira] [Updated] (OOZIE-1935) Demoting log level in ActionStartXCommand for non-WARN type messages

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1935:


Attachment: OOZIE-1935.patch

Attached patch. Trivial change demoting two log statements from warn to info.

 Demoting log level in ActionStartXCommand for non-WARN type messages
 

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch








[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1935:


Summary: Log level (ActionStartXCommand) and Doc fix 
(CoordinatorFunctionalSpec)  (was: Demoting log level in ActionStartXCommand 
for non-WARN type messages)

 Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
 ---

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch








[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1935:


Attachment: (was: OOZIE-1935.patch)

 Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
 ---

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch


 Demoting log level in ActionStartXCommand for non-WARN type messages.
 And,
 Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions 
 section for coord:databaseIn, coord:databaseOut, coord:tableIn and 
 coord:tableOut





[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1935:


Attachment: OOZIE-1935.patch

reattaching patch with both code and doc change

 Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
 ---

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch


 Demoting log level in ActionStartXCommand for non-WARN type messages.
 And,
 Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions 
 section for coord:databaseIn, coord:databaseOut, coord:tableIn and 
 coord:tableOut





[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1935:


Description: 
Demoting log level in ActionStartXCommand for non-WARN type messages.
And,
Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions 
section for coord:databaseIn, coord:databaseOut, coord:tableIn and 
coord:tableOut

 Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
 ---

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch


 Demoting log level in ActionStartXCommand for non-WARN type messages.
 And,
 Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions 
 section for coord:databaseIn, coord:databaseOut, coord:tableIn and 
 coord:tableOut





[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1935:


Attachment: (was: OOZIE-1935.patch)

 Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
 ---

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch


 Demoting log level in ActionStartXCommand for non-WARN type messages.
 And,
 Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions 
 section for coord:databaseIn, coord:databaseOut, coord:tableIn and 
 coord:tableOut





[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)

2014-07-15 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1935:


Attachment: OOZIE-1935.patch

fixing patch to include both code and doc and fixed example

 Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
 ---

 Key: OOZIE-1935
 URL: https://issues.apache.org/jira/browse/OOZIE-1935
 Project: Oozie
  Issue Type: Bug
Reporter: Mona Chitnis
Assignee: Mona Chitnis
Priority: Trivial
 Attachments: OOZIE-1935.patch


 Demoting log level in ActionStartXCommand for non-WARN type messages.
 And,
 Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions 
 section for coord:databaseIn, coord:databaseOut, coord:tableIn and 
 coord:tableOut





[jira] [Commented] (OOZIE-1911) SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration'

2014-07-15 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062883#comment-14062883
 ] 

Mona Chitnis commented on OOZIE-1911:
-

Linking OOZIE-1933, which has fixes over this JIRA regarding the logging

 SLA calculation in HA mode does wrong bit comparison for 'start' and 
 'duration'
 ---

 Key: OOZIE-1911
 URL: https://issues.apache.org/jira/browse/OOZIE-1911
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1911-4.patch


 In chronological order:
 Server 1:
 Job's SLA eventProcessed set to 0101 = Start and End sla processed.
 Server 2:
 Receives above job's status event, processes remaining 'duration' sla. 
 eventProcessed now = 0111, but incremented to 1000 due to
 {code}
 SLACalculatorMemory.addJobStatus() : 762
 if (slaCalc.getEventProcessed() == 7) {
   slaInfo.setEventProcessed(8);
  slaMap.remove(jobId);
 }
 {code}
 Back to Server 1: (doing periodic SLA checks)
 {code}
 SLACalculatorMemory.updateJobSla() : 483
 if ((eventProc & 1) == 0) { // first bit (start-processed) unset
    if (reg.getExpectedStart() != null) {
      if (reg.getExpectedStart().getTime() + jobEventLatency < 
 System.currentTimeMillis()) {
// goes ahead and enqueues another START_MISS event and 
 DURATION_MET event
 {code}
 Conclusion, need to fix that check for least significant bit (and next to it) 
 for 'start' and 'duration' to avoid duplicate events





[jira] [Commented] (OOZIE-1934) coordinator action repeatedly picked up by cachePurgeWorker of PartitionDependencyManagerService

2014-07-15 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062891#comment-14062891
 ] 

Mona Chitnis commented on OOZIE-1934:
-

Doesn't this encounter a ConcurrentModificationException due to removing from 
the collection during the while-loop iteration?
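To illustrate the concern, here is a minimal sketch (hypothetical names, not the patch's actual code) of the difference between removing through the iterator, which is safe, and removing from the collection directly, which can throw ConcurrentModificationException mid-iteration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class SafeRemoval {
    public static void main(String[] args) {
        Set<String> actions = new HashSet<>(Arrays.asList("a@1", "a@2", "b@1"));

        // Removing via the iterator is safe; calling actions.remove(...)
        // directly inside the loop risks ConcurrentModificationException.
        Iterator<String> it = actions.iterator();
        while (it.hasNext()) {
            if (it.next().startsWith("a")) {
                it.remove();
            }
        }
        System.out.println(actions); // [b@1]
    }
}
```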

 coordinator action repeatedly picked up by cachePurgeWorker of 
 PartitionDependencyManagerService
 

 Key: OOZIE-1934
 URL: https://issues.apache.org/jira/browse/OOZIE-1934
 Project: Oozie
  Issue Type: Bug
Reporter: Ryota Egashira
Assignee: Ryota Egashira
 Attachments: OOZIE-1934.patch


 cachePurgeWorker keeps trying to remove the same coordinator action 
 repeatedly. Also, a log message needs to be added for better debugging





[jira] [Commented] (OOZIE-1921) Curator client reports connection loss to ZK under high load

2014-07-10 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058087#comment-14058087
 ] 

Mona Chitnis commented on OOZIE-1921:
-

Okay, it makes sense to fix any issues in Curator itself. But we have not yet 
isolated the exact scenarios in which such connection losses start. Will do 
that and update this JIRA.

 Curator client reports connection loss to ZK under high load
 

 Key: OOZIE-1921
 URL: https://issues.apache.org/jira/browse/OOZIE-1921
 Project: Oozie
  Issue Type: Bug
  Components: HA
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk


 Seeing two types of Connection Loss exceptions via Curator when running Oozie 
 in high load (specifically workflows with ~80 forked actions)
 h5. [1] (znode transaction type: delete)
 {code}
 org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
 ConnectionLoss
 at 
 org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
 at 
 org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
 at 
 org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
 {code}
 h5. [2]
 {code}
 org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss for
 /oozie/locks/0037706-140704041907-oozie-oozi-W
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
 {code}
 Tracking a particular job between the ZK trace logs reporting NoNode 
 KeeperExceptions and Oozie logs, found that after encountering the zookeeper 
 exceptions with 'delete' of job
 lock znode, that particular job never succeeds in getting lock and proceeding.
 Not that familiar with when Oozie via Curator tries to delete znodes. 
 OOZIE-1906 will introduce the Reaper.
 Exception stacktrace pointing to Curator code:
 ConnectionState.getZookeeper() {
 ...
 boolean localIsConnected = isConnected.get();
 if ( !localIsConnected )
 {
 checkTimeouts();
 }
 ..
 }
 isConnected is FALSE and so exception is getting thrown from checkTimeouts(). 
 Wasn't able to find any good docs or benchmarks explaining timeout issues 
 Curator would face due to high load. My suspicion is Curator might have 
 limitations in how many concurrent requests for same lock it can handle. In 
 this particular stress test, there are 85 forked actions all contending for 
 same job lock. Hence we should implement some fallback mechanism in Oozie 
 while invoking Curator APIs.





[jira] [Updated] (OOZIE-1911) SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration'

2014-07-09 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1911:


Attachment: OOZIE-1911-4.patch

Final patch attached reviewed from Reviewboard

 SLA calculation in HA mode does wrong bit comparison for 'start' and 
 'duration'
 ---

 Key: OOZIE-1911
 URL: https://issues.apache.org/jira/browse/OOZIE-1911
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk

 Attachments: OOZIE-1911-4.patch


 In chronological order:
 Server 1:
 Job's SLA eventProcessed set to 0101 = Start and End sla processed.
 Server 2:
 Receives above job's status event, processes remaining 'duration' sla. 
 eventProcessed now = 0111, but incremented to 1000 due to
 {code}
 SLACalculatorMemory.addJobStatus() : 762
 if (slaCalc.getEventProcessed() == 7) {
   slaInfo.setEventProcessed(8);
  slaMap.remove(jobId);
 }
 {code}
 Back to Server 1: (doing periodic SLA checks)
 {code}
 SLACalculatorMemory.updateJobSla() : 483
 if ((eventProc & 1) == 0) { // first bit (start-processed) unset
    if (reg.getExpectedStart() != null) {
      if (reg.getExpectedStart().getTime() + jobEventLatency < 
 System.currentTimeMillis()) {
// goes ahead and enqueues another START_MISS event and 
 DURATION_MET event
 {code}
 Conclusion, need to fix that check for least significant bit (and next to it) 
 for 'start' and 'duration' to avoid duplicate events





[jira] [Commented] (OOZIE-1906) Service to periodically remove ZK lock

2014-07-09 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14056918#comment-14056918
 ] 

Mona Chitnis commented on OOZIE-1906:
-

typo in property name.
+public static final String REAPING_THRESHOLD = CONF_PREFIX + 
"ZKLocksService.locks.reaper.threshlod";

Can you make the test more transparent by checking the children of the lock 
znode once you request the lock, and then comparing it with zero children 
after 'x' threshold seconds?

 Service to periodically remove ZK lock
 --

 Key: OOZIE-1906
 URL: https://issues.apache.org/jira/browse/OOZIE-1906
 Project: Oozie
  Issue Type: Bug
  Components: HA
Reporter: Purshotam Shah
Assignee: Purshotam Shah
 Attachments: OOZIE-1906-V1.patch


 We might have a case where locks are held and never released because of an 
 exception. A service should remove those locks.





[jira] [Commented] (OOZIE-1921) Curator client reports connection loss to ZK under high load

2014-07-09 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14056929#comment-14056929
 ] 

Mona Chitnis commented on OOZIE-1921:
-

[~rkanter] did you happen to face similar issues? 

 Curator client reports connection loss to ZK under high load
 

 Key: OOZIE-1921
 URL: https://issues.apache.org/jira/browse/OOZIE-1921
 Project: Oozie
  Issue Type: Bug
  Components: HA
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk


 Seeing two types of Connection Loss exceptions via Curator when running Oozie 
 in high load
 h5. [1] (znode transaction type: delete)
 {code}
 org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
 ConnectionLoss
 at 
 org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
 at 
 org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
 at 
 org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
 {code}
 h5. [2]
 {code}
 org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss for
 /oozie/locks/0037706-140704041907-oozie-oozi-W
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
 {code}
 We should probably implement a fallback approach in Oozie while invoking 
 Curator library to handle any inherent limitations. But not able to find much 
 documentation about Curator benchmarks.





[jira] [Created] (OOZIE-1921) Curator client reports connection loss to ZK under high load

2014-07-09 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1921:
---

 Summary: Curator client reports connection loss to ZK under high 
load
 Key: OOZIE-1921
 URL: https://issues.apache.org/jira/browse/OOZIE-1921
 Project: Oozie
  Issue Type: Bug
  Components: HA
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk


Seeing two types of Connection Loss exceptions via Curator when running Oozie 
in high load

h5. [1] (znode transaction type: delete)
{code}
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
{code}

h5. [2]
{code}
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for
/oozie/locks/0037706-140704041907-oozie-oozi-W
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
{code}

We should probably implement a fallback approach in Oozie while invoking 
Curator library to handle any inherent limitations. But not able to find much 
documentation about Curator benchmarks.






[jira] [Updated] (OOZIE-1921) Curator client reports connection loss to ZK under high load

2014-07-09 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1921:


Description: 
Seeing two types of Connection Loss exceptions via Curator when running Oozie 
in high load (specifically workflows with ~80 forked actions)

h5. [1] (znode transaction type: delete)
{code}
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
{code}

h5. [2]
{code}
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for
/oozie/locks/0037706-140704041907-oozie-oozi-W
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
{code}

Tracking a particular job between the ZK trace logs reporting NoNode 
KeeperExceptions and the Oozie logs, I found that after encountering the 
ZooKeeper exceptions on 'delete' of the job
lock znode, that particular job never succeeds in getting the lock and proceeding.
I am not that familiar with when Oozie, via Curator, tries to delete znodes. 
OOZIE-1906 will introduce the Reaper.

Exception stacktrace pointing to Curator code:

ConnectionState.getZookeeper() {
...
boolean localIsConnected = isConnected.get();
if ( !localIsConnected )
{
checkTimeouts();
}
..
}

isConnected is FALSE, so the exception is thrown from checkTimeouts(). I 
wasn't able to find any good docs or benchmarks explaining the timeout issues 
Curator would face under high load. My suspicion is that Curator might have 
limitations in how many concurrent requests for the same lock it can handle. In 
this particular stress test, there are 85 forked actions all contending for the 
same job lock. Hence we should implement some fallback mechanism in Oozie when 
invoking Curator APIs.
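One possible shape for such a fallback — all names and parameters here are illustrative, not Oozie's actual API — is a generic retry wrapper with exponential backoff around the flaky call:

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Hypothetical fallback wrapper: retry a flaky operation (e.g. a Curator
    // lock acquire) with exponential backoff instead of failing on the first
    // ConnectionLoss. Sleeps baseSleepMs, 2x, 4x, ... between attempts.
    static <T> T withRetries(Callable<T> op, int maxRetries, long baseSleepMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(baseSleepMs << attempt);
            }
        }
        throw last; // exhausted retries: surface the last failure
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulate two ConnectionLoss failures followed by success.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("ConnectionLoss");
            }
            return "lock-acquired";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Curator's own retry policies cover reconnects, but a wrapper like this lets the caller requeue the command instead of blocking all 85 contenders on one lock znode.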




  was:
Seeing two types of Connection Loss exceptions via Curator when running Oozie 
in high load

h5. [1] (znode transaction type: delete)
{code}
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
{code}

h5. [2]
{code}
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for
/oozie/locks/0037706-140704041907-oozie-oozi-W
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
{code}

We should probably implement a fallback approach in Oozie while invoking 
Curator library to handle any inherent limitations. But not able to find much 
documentation about Curator benchmarks.



 Curator client reports connection loss to ZK under high load
 

 Key: OOZIE-1921
 URL: https://issues.apache.org/jira/browse/OOZIE-1921
 Project: Oozie
  Issue Type: Bug
  Components: HA
Affects Versions: trunk
Reporter: Mona Chitnis
 Fix For: trunk


 Seeing two types of Connection Loss exceptions via Curator when running Oozie 
 in high load (specifically workflows with ~80 forked actions)
 h5. [1] (znode transaction type: delete)
 {code}
 org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
 ConnectionLoss
 at 
 org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
 at 
 org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
 at 
 org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
 {code}
 h5. [2]
 {code}
 org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss for
 /oozie/locks/0037706-140704041907-oozie-oozi-W
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
 {code}
 Tracking a particular job between the ZK trace logs reporting NoNode 
 KeeperExceptions and Oozie logs, found that after encountering the zookeeper 
 exceptions with 'delete' of job
 lock znode, that particular job never succeeds in getting lock and 

[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended

2014-07-02 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050625#comment-14050625
 ] 

Mona Chitnis commented on OOZIE-1913:
-

Some discussion points:

h5. Approach 1:
Change SLA behavior for all jobs on suspend. i.e. not track SLA for suspended 
jobs. However this was originally put into place because users need to be 
notified of their job SLAs in the event of suspension caused by system (Oozie 
server restart/ transient errors from Hadoop cluster). So making this change 
across all suspended jobs would not be ideal.

h5. Approach 2:
Add a command line option like {{-ignoresla}} along with suspend command, which 
will flag it accordingly in the memory map of the SLA calculator. This then 
entails two sub-approaches

h6. 2A]
On seeing {{-ignoresla}}, mark the eventProcessed byte of the SLA entry to 
{{1000 (8) }} to remove it from being tracked anymore for SLA. The resume 
command will also need an option like {{-resumesla}} to then add this job back 
into SLA map for tracking, along with more options for revised expected end 
time and expected duration of job.

h6. 2B]
If we don't wish to change the eventProcessed byte (so that we don't have to 
recalculate it), we can add a flag to the job indicating that SLA should be 
ignored for this job until the flag is unset. However, this requires adding a 
column to the SLA_SUMMARY table schema to retain this information across Oozie 
server restarts and in HA mode.

2A seems to be preferable to me. Thoughts?


 Devise a way to turn off SLA alerts when bundle/coordinator suspended
 -

 Key: OOZIE-1913
 URL: https://issues.apache.org/jira/browse/OOZIE-1913
 Project: Oozie
  Issue Type: Improvement
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


 From user:
 Need to turn off the SLA miss alerts in jobs when the bundle is suspended for
 grid upgrades and similar work so that when it's resumed we aren't flooded 
 with a bunch of alerts.





[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended

2014-07-02 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050910#comment-14050910
 ] 

Mona Chitnis commented on OOZIE-1913:
-

Expanding the scope of this problem:

Following are the scenarios and use-cases which can be tied in with turning off 
SLA alerts:

h6. [1] Suspend: 
User-initiated suspend of a bundle/coordinator, with an option to turn off 
SLA alerts. For a coordinator, this can be followed by a list of coordinator 
actions, or it will be applied to ALL. For a bundle, there is currently no way 
to enumerate coordinators, so SLA alerting will be turned off for all child 
coordinators.

h6. [2] Rerun: For reprocessing purposes, the same option as above can be 
given while rerunning a coordinator, turning off SLA alerts for some or all of 
its actions.

h6. [3] Catchup jobs: In backlogged situations, similar to comment#5, the SLA 
service should identify that the coordinator is catching up and disable 
alerting automatically. Of course, there should be a job-level minimum 
threshold in terms of time for SLA to mark jobs as catchup, e.g. only turn off 
alerts if the nominal time is more than 3 days old, or something to that 
effect. Need to think about whether to specify this threshold, like 
oozie.coordinator.sla.alert.disable.threshold, as part of job.properties at 
submission time.

In cases [1]-[3], SLA calculation will still mark the eventual MET/MISS 
status for the jobs; only no alerts will be generated. Use case [4] offers an 
option to resume SLA tracking for jobs that you are re-processing.

h6. [4] On-the-fly update of SLA expected-start, expected-end and 
expected-duration: Similar to the new feature in Oozie where you can change 
certain coordinator config, such as concurrency and throttle, on the fly, one 
should be able to change the given SLA limits. This can be offered as options 
to 'Rerun' of terminated coordinators, 'Resume' of suspended coordinators, or 
the 'Change' command.

Tagging a certain SLA entry for 'disabling alerts' is not going to require 
any XML changes on the part of the user. This is due to a good design choice we 
made while implementing SLA.
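The catchup threshold in case [3] could be sketched as follows; the helper and the oozie.coordinator.sla.alert.disable.threshold property are proposals from this comment, not existing Oozie code:

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

public class CatchupCheck {
    // Hypothetical check: treat an action as "catchup" (and skip SLA alerts)
    // when its nominal time is older than the configured threshold.
    static boolean isCatchup(Date nominalTime, long thresholdMs, long nowMs) {
        return nowMs - nominalTime.getTime() > thresholdMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long threshold = TimeUnit.DAYS.toMillis(3); // "3 days old" example above

        Date fourDaysOld = new Date(now - TimeUnit.DAYS.toMillis(4));
        Date oneHourOld = new Date(now - TimeUnit.HOURS.toMillis(1));

        System.out.println(isCatchup(fourDaysOld, threshold, now)); // true
        System.out.println(isCatchup(oneHourOld, threshold, now));  // false
    }
}
```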

 Devise a way to turn off SLA alerts when bundle/coordinator suspended
 -

 Key: OOZIE-1913
 URL: https://issues.apache.org/jira/browse/OOZIE-1913
 Project: Oozie
  Issue Type: Improvement
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


 From user:
 Need to turn off the SLA miss alerts in jobs when the bundle is suspended for
 grid upgrades and similar work so that when it's resumed we aren't flooded 
 with a bunch of alerts.





[jira] [Comment Edited] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended

2014-07-02 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050910#comment-14050910
 ] 

Mona Chitnis edited comment on OOZIE-1913 at 7/3/14 12:24 AM:
--

Expanding the scope of this problem:

Following are the scenarios and use-cases which can be tied in with turning off 
SLA alerts:

h6. [1] Suspend: 
User-initiated suspend of bundle/coordinator and specifying option to turn off 
sla alerts. For coordinator, this can be followed by a list of coordinator 
actions or will be applied to ALL. For bundle, currently no way to enumerate 
coordinators so sla alerting to be turned off for all children coordinators'.

h6. [2] Rerun:
For reprocessing purposes, same option as above can be given while rerunning a 
coordinator and turning off sla-alerts for some or all of its actions.

h6. [3] Catchup jobs:
In backlogged situations, similar to comment#5, SLA service should identify 
that coordinator is catching up and disable alerting automatically. Of course, 
there should be a job-level minimum threshold in terms of time for SLA to 
mark jobs as catchup e.g.  only turn off alerts if nominal time more than 3 
days old, or something to that effect. Need to think about whether to specify 
this threshold like oozie.coordinator.sla.alert.disable.threshold as part of 
job.properties at submission time.

In cases [1]-[3], SLA calculation will go ahead and mark eventual MET/MISS 
status for the jobs, only no alerts will be generated. Use case [4] enlists an 
option if you would like to resume SLA tracking for jobs that you are 
re-processing.

h6. [4] On-the-fly update of SLA expected-start, expected-end and 
expected-duration:
Similar to the new feature in Oozie where you can change certain coordinator 
config such as concurrency, throttle etc on the fly, one should be able to 
change the SLA limits given. This can be made as options to 'Rerun' of 
terminated coordinators, 'Resume' of suspended coordinators, or 'Change' 
command.

Tagging a certain SLA entry for 'disabling alerts' is not going to require 
any XML changes on the part of the user. This is due to a good design choice we 
made while implementing SLA.


was (Author: chitnis):
Expanding the scope of this problem:

Following are the scenarios and use-cases which can be tied in with turning off 
SLA alerts:

h6. [1] Suspend: 
User-initiated suspend of bundle/coordinator and specifying option to turn off 
sla alerts. For coordinator, this can be followed by a list of coordinator 
actions or will be applied to ALL. For bundle, currently no way to enumerate 
coordinators so sla alerting to be turned off for all children coordinators'.

h6. [2] Rerun: For reprocessing purposes, same option as above can be given 
while rerunning a coordinator and turning off sla-alerts for some or all of its 
actions.

h6. [3] Catchup jobs: In backlogged situations, similar to comment#5, SLA 
service should identify that coordinator is catching up and disable alerting 
automatically. Of course, there should be a job-level minimum threshold in 
terms of time for SLA to mark jobs as catchup e.g.  only turn off alerts if 
nominal time more than 3 days old, or something to that effect. Need to think 
about whether to specify this threshold like 
oozie.coordinator.sla.alert.disable.threshold as part of job.properties at 
submission time.

In cases [1]-[3], SLA calculation will go ahead and mark eventual MET/MISS 
status for the jobs, only no alerts will be generated. Use case [4] enlists an 
option if you would like to resume SLA tracking for jobs that you are 
re-processing.

h6. [4] On-the-fly update of SLA expected-start, expected-end and 
expected-duration: Similar to the new feature in Oozie where you can change 
certain coordinator config such as concurrency, throttle etc on the fly, one 
should be able to change the SLA limits given. This can be made as options to 
'Rerun' of terminated coordinators, 'Resume' of suspended coordinators, or 
'Change' command.

Tagging a certain SLA entry for 'disabling alerts' is not going to require 
any XML changes on the part of the user. This is due to a good design choice we 
made while implementing SLA.

 Devise a way to turn off SLA alerts when bundle/coordinator suspended
 -

 Key: OOZIE-1913
 URL: https://issues.apache.org/jira/browse/OOZIE-1913
 Project: Oozie
  Issue Type: Improvement
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


 From user:
 Need to turn off the SLA miss alerts in jobs when the bundle is suspended for
 grid upgrades and similar work so that when it's resumed we aren't flooded 
 with a bunch of alerts.





[jira] [Commented] (OOZIE-1909) log prefix information missing in JavaActionExecutor.check

2014-07-01 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049172#comment-14049172
 ] 

Mona Chitnis commented on OOZIE-1909:
-

Checked that this is needed because check() can be invoked directly from 
ActionCheckXCommand. For check() invoked via start(), this wouldn't occur.

+1

 log prefix information missing in JavaActionExecutor.check
 --

 Key: OOZIE-1909
 URL: https://issues.apache.org/jira/browse/OOZIE-1909
 Project: Oozie
  Issue Type: Bug
Reporter: Ryota Egashira
Assignee: Ryota Egashira
 Attachments: OOZIE-1909.patch


 log prefix information (e.g. USER, GROUP, TOKEN, APP, JOB) is missing, as 
 shown in the example. Due to this issue, this log message does not show up 
 in the CLI command output ($ oozie job -log 0003855-140611203720-oozie-oozi-W )
 {noformat}
 2014-06-12 05:00:45,270  WARN PigActionExecutor:542 [pool-2-thread-214] -
 USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-]
 ACTION[0003855-140611203720-oozie-oozi-W@pig1] Launcher exception: Creation of
 partition is not supported for
 hcat://.yahoo.com:/outdatabase1402549124488/outtable1402549124488/datestamp=201302152320;region=usa
 {noformat}





[jira] [Created] (OOZIE-1911) SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration'

2014-07-01 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1911:
---

 Summary: SLA calculation in HA mode does wrong bit comparison for 
'start' and 'duration'
 Key: OOZIE-1911
 URL: https://issues.apache.org/jira/browse/OOZIE-1911
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


In chronological order:

Server 1:
Job's SLA eventProcessed set to 0101 = Start and End sla processed.

Server 2:
Receives above job's status event, processes remaining 'duration' sla. 
eventProcessed now = 0111, but incremented to 1000 due to
{code}
SLACalculatorMemory.addJobStatus() : 762
if (slaCalc.getEventProcessed() == 7) {
  slaInfo.setEventProcessed(8);
 slaMap.remove(jobId);
}
{code}

Back to Server 1: (doing periodic SLA checks)
{code}
SLACalculatorMemory.updateJobSla() : 483
if ((eventProc & 1) == 0) { // first bit (start-processed) unset
   if (reg.getExpectedStart() != null) {
 if (reg.getExpectedStart().getTime() + jobEventLatency < 
System.currentTimeMillis()) {
   // goes ahead and enqueues another START_MISS event and 
DURATION_MET event
{code}

Conclusion, need to fix that check for least significant bit (and next to it) 
for 'start' and 'duration' to avoid duplicate events
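The per-bit checks described above can be sketched as follows; this is a hypothetical helper for illustration, not Oozie's actual SLACalculatorMemory code:

```java
public class SlaBits {
    // eventProcessed bit layout per the description:
    // bit 0 = start, bit 1 = duration, bit 2 = end; 8 (1000) = done tracking.
    static final int START_BIT = 0, DURATION_BIT = 1, END_BIT = 2;

    // True if the given SLA event has already been processed.
    static boolean isProcessed(int eventProcessed, int bit) {
        return ((eventProcessed >> bit) & 1) == 1;
    }

    public static void main(String[] args) {
        int eventProcessed = 0b0101; // start and end processed, duration pending
        System.out.println(isProcessed(eventProcessed, START_BIT));    // true
        System.out.println(isProcessed(eventProcessed, DURATION_BIT)); // false
        System.out.println(isProcessed(eventProcessed, END_BIT));      // true

        eventProcessed |= 1 << DURATION_BIT; // server 2 processes 'duration'
        if (eventProcessed == 7) {
            eventProcessed = 8;              // 1000: remove from slaMap
        }
        System.out.println(eventProcessed);  // 8
    }
}
```

Checking each bit with a shift-and-mask like this, rather than comparing the whole byte, is what avoids the duplicate START_MISS/DURATION_MET events described above.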





[jira] [Created] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended

2014-07-01 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1913:
---

 Summary: Devise a way to turn off SLA alerts when 
bundle/coordinator suspended
 Key: OOZIE-1913
 URL: https://issues.apache.org/jira/browse/OOZIE-1913
 Project: Oozie
  Issue Type: Improvement
Affects Versions: trunk
Reporter: Mona Chitnis
Assignee: Mona Chitnis
 Fix For: trunk


From user:
Need to turn off the SLA miss alerts in jobs when the bundle is suspended for
grid upgrades and similar work so that when it's resumed we aren't flooded with 
a bunch of alerts.





[jira] [Commented] (OOZIE-1811) Current test failures in trunk

2014-07-01 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049506#comment-14049506
 ] 

Mona Chitnis commented on OOZIE-1811:
-

Taking this up. Thanks for shortlisting the subset of tests to reproduce the 
issue, Robert :)

 Current test failures in trunk
 --

 Key: OOZIE-1811
 URL: https://issues.apache.org/jira/browse/OOZIE-1811
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Mona Chitnis
Priority: Critical

 There's a bunch of test failures currently in trunk; I'm not sure which 
 commit(s) caused them, but I think it was somewhat recent.
 e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/
 Reproducible by running these tests, instead of having to run them all, which 
 takes a lot longer :)
 {noformat}
 mvn clean test 
 -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
 {noformat}
 {noformat}
 Results :
 Failed tests:   
 testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration):
  expected:...11921-oozie-rkan-C@1[] but was:...11921-oozie-rkan-C@1[2]
   
 testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService):
  expected:DONEWITHERROR but was:KILLED
   
 testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService):
  expected:RUNNING but was:KILLED
 Tests in error: 
   testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
   testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
 Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
 {noformat}





[jira] [Commented] (OOZIE-1881) Docs fixes

2014-06-20 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038960#comment-14038960
 ] 

Mona Chitnis commented on OOZIE-1881:
-

Yes, let's make this central. Point 2 is right, but for shell and java actions, 
since they run in the launcher, the docs should indicate more prominently that 
the 'oozie.launcher...' properties must be used, since plain 'mapred.x.y.z' 
properties will have no effect there

 Docs fixes
 --

 Key: OOZIE-1881
 URL: https://issues.apache.org/jira/browse/OOZIE-1881
 Project: Oozie
  Issue Type: Task
  Components: docs
Affects Versions: trunk, 4.0.0, 4.0.1
Reporter: Mona Chitnis
  Labels: hcatalog
 Fix For: trunk


 Couple of things to include
 [1] HCatalog integration doc should include walkthrough of example and useful 
 FAQ
 [2] Shell action doc should mention inline conf prefix oozie.launcher





[jira] [Created] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size

2014-06-19 Thread Mona Chitnis (JIRA)
Mona Chitnis created OOZIE-1892:
---

 Summary: Oozie web-console WF actions window should be auto-size 
and not fixed size
 Key: OOZIE-1892
 URL: https://issues.apache.org/jira/browse/OOZIE-1892
 Project: Oozie
  Issue Type: Bug
Affects Versions: 4.0.1, trunk
Reporter: Mona Chitnis
 Fix For: trunk


A user reported the following issues with the Oozie web console UI:
1. Clicking on a pig action opens a window with no body.
2. If a coordinator has more than 15 instances invoked, there is no scroller to
see rows below the 15th; the window had to be resized (very non-intuitive) to 
get the scrollbar. The same issue occurs on both Chrome and Firefox.
3. Oddly, the window also has no horizontal scrollbar. My mouse supports 
horizontal scrolling, and even with that I can't scroll far enough right to 
see the vertical scrollbar.

Some suggestions from another UI developer:
The problems are largely due to the static height and width applied to some of
the divs wrapping the whole table.  Playing around in Chrome's developer tools
I was able to get the scrollbars working correctly by:
 - remove fixed width from div with class x-panel
 - remove fixed height and width from div with class x-panel-body
 - remove fixed height and width from div with class x-grid3
 - remove fixed width from div with class x-grid3-scroller

We can fix the oozie-console.js to incorporate either the above suggestions or 
others.
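The suggested fix can be sketched as a small helper for oozie-console.js that maps each wrapper class from the list above to the fixed dimensions to relax. The function name and structure are illustrative assumptions; only the class names come from the suggestion:

```javascript
// Hypothetical sketch for oozie-console.js: for each ExtJS wrapper class
// named in the suggestion, list the fixed dimensions to reset to 'auto'.
function dimensionsToRelax(cls) {
  var relax = {
    'x-panel':          ['width'],
    'x-panel-body':     ['width', 'height'],
    'x-grid3':          ['width', 'height'],
    'x-grid3-scroller': ['width']
  };
  return relax[cls] || [];
}

// Applying it to one element, e.g. in Chrome's developer tools:
//   dimensionsToRelax(el.className).forEach(function (p) {
//     el.style[p] = 'auto';
//   });
```

This keeps the class-to-dimension mapping in one place, so adjusting which divs are un-fixed later is a one-line change.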





[jira] [Updated] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size

2014-06-19 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1892:


Description: 
A user reported the following:
facing some issue with oozie web console UI
1. If I click on a pig action it opens a window with no body
2. If a coordinator has more than 15 instances invoked there is no scroller to
see the below 15th row - had to resize the window (very non-intuitive) to get 
the scroller bar
facing same issue on two browsers chrome and firefox.
3. Weird that the window also does not have a horizontal scrollbar.  My mouse
supports horizontal scrolling and even with that I can't scroll right enough to
see the vertical scrollbar.

Some suggestion from another UI developer (Kendall Thrapp):

The problems are largely due to the static height and width applied to some of
the divs wrapping the whole table.  Playing around in Chrome's developer tools
I was able to get the scrollbars working correctly by:
 - remove fixed width from div with class x-panel
 - remove fixed height and width from div with class x-panel-body
 - remove fixed height and width from div with class x-grid3
 - remove fixed width from div with class x-grid3-scroller

We can fix the oozie-console.js to incorporate either the above suggestions or 
others.

  was:
A user reported the following:
facing some issue with oozie web console UI
1. If I click on a pig action it opens a window with no body
2. If a coordinator has more than 15 instances invoked there is no scroller to
see the below 15th row - had to resize the window (very non-intuitive) to get 
the scroller bar
facing same issue on two browsers chrome and firefox.
3. Weird that the window also does not have a horizontal scrollbar.  My mouse
supports horizontal scrolling and even with that I can't scroll right enough to
see the vertical scrollbar.

Some suggestion from another UI developer:
The problems are largely due to the static height and width applied to some of
the divs wrapping the whole table.  Playing around in Chrome's developer tools
I was able to get the scrollbars working correctly by:
 - remove fixed width from div with class x-panel
 - remove fixed height and width from div with class x-panel-body
 - remove fixed height and width from div with class x-grid3
 - remove fixed width from div with class x-grid3-scroller

We can fix the oozie-console.js to incorporate either the above suggestions or 
others.


 Oozie web-console WF actions window should be auto-size and not fixed size
 --

 Key: OOZIE-1892
 URL: https://issues.apache.org/jira/browse/OOZIE-1892
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk, 4.0.1
Reporter: Mona Chitnis
 Fix For: trunk


 A user reported the following:
 facing some issue with oozie web console UI
 1. If I click on a pig action it opens a window with no body
 2. If a coordinator has more than 15 instances invoked there is no scroller to
 see the below 15th row - had to resize the window (very non-intuitive) to get 
 the scroller bar
 facing same issue on two browsers chrome and firefox.
 3. Weird that the window also does not have a horizontal scrollbar.  My mouse
 supports horizontal scrolling and even with that I can't scroll right enough 
 to
 see the vertical scrollbar.
 Some suggestion from another UI developer (Kendall Thrapp):
 The problems are largely due to the static height and width applied to some of
 the divs wrapping the whole table.  Playing around in Chrome's developer tools
 I was able to get the scrollbars working correctly by:
  - remove fixed width from div with class x-panel
  - remove fixed height and width from div with class x-panel-body
  - remove fixed height and width from div with class x-grid3
  - remove fixed width from div with class x-grid3-scroller
 We can fix the oozie-console.js to incorporate either the above suggestions 
 or others.





[jira] [Comment Edited] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size

2014-06-19 Thread Mona Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/OOZIE-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037691#comment-14037691
 ] 

Mona Chitnis edited comment on OOZIE-1892 at 6/19/14 7:06 PM:
--

Attaching some screenshots

!window_without_body.JPG!
!window_without_scroller.JPG!


was (Author: chitnis):
Attaching some screenshots

!window_without_body!
!window_without_scroller!

 Oozie web-console WF actions window should be auto-size and not fixed size
 --

 Key: OOZIE-1892
 URL: https://issues.apache.org/jira/browse/OOZIE-1892
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk, 4.0.1
Reporter: Mona Chitnis
 Fix For: trunk

 Attachments: window_without_body.JPG, window_without_scroller.JPG


 A user reported the following:
 facing some issue with oozie web console UI
 1. If I click on a pig action it opens a window with no body
 2. If a coordinator has more than 15 instances invoked there is no scroller to
 see the below 15th row - had to resize the window (very non-intuitive) to get 
 the scroller bar
 facing same issue on two browsers chrome and firefox.
 3. Weird that the window also does not have a horizontal scrollbar.  My mouse
 supports horizontal scrolling and even with that I can't scroll right enough 
 to
 see the vertical scrollbar.
 Some suggestion from another UI developer (Kendall Thrapp):
 The problems are largely due to the static height and width applied to some of
 the divs wrapping the whole table.  Playing around in Chrome's developer tools
 I was able to get the scrollbars working correctly by:
  - remove fixed width from div with class x-panel
  - remove fixed height and width from div with class x-panel-body
  - remove fixed height and width from div with class x-grid3
  - remove fixed width from div with class x-grid3-scroller
 We can fix the oozie-console.js to incorporate either the above suggestions 
 or others.





[jira] [Updated] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size

2014-06-19 Thread Mona Chitnis (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mona Chitnis updated OOZIE-1892:


Attachment: window_without_scroller.JPG
window_without_body.JPG

Attaching some screenshots

!window_without_body!
!window_without_scroller!

 Oozie web-console WF actions window should be auto-size and not fixed size
 --

 Key: OOZIE-1892
 URL: https://issues.apache.org/jira/browse/OOZIE-1892
 Project: Oozie
  Issue Type: Bug
Affects Versions: trunk, 4.0.1
Reporter: Mona Chitnis
 Fix For: trunk

 Attachments: window_without_body.JPG, window_without_scroller.JPG


 A user reported the following:
 facing some issue with oozie web console UI
 1. If I click on a pig action it opens a window with no body
 2. If a coordinator has more than 15 instances invoked there is no scroller to
 see the below 15th row - had to resize the window (very non-intuitive) to get 
 the scroller bar
 facing same issue on two browsers chrome and firefox.
 3. Weird that the window also does not have a horizontal scrollbar.  My mouse
 supports horizontal scrolling and even with that I can't scroll right enough 
 to
 see the vertical scrollbar.
 Some suggestion from another UI developer (Kendall Thrapp):
 The problems are largely due to the static height and width applied to some of
 the divs wrapping the whole table.  Playing around in Chrome's developer tools
 I was able to get the scrollbars working correctly by:
  - remove fixed width from div with class x-panel
  - remove fixed height and width from div with class x-panel-body
  - remove fixed height and width from div with class x-grid3
  - remove fixed width from div with class x-grid3-scroller
 We can fix the oozie-console.js to incorporate either the above suggestions 
 or others.




