[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966019#comment-14966019 ]

Mona Chitnis commented on OOZIE-1976:
-------------------------------------

Thanks [~puru] for your patch. I did a first pass as well and have a few comments. Waiting for your replies.

> Specifying coordinator input datasets in more logical ways
> ----------------------------------------------------------
>
>              Key: OOZIE-1976
>              URL: https://issues.apache.org/jira/browse/OOZIE-1976
>          Project: Oozie
>       Issue Type: New Feature
>       Components: coordinator
> Affects Versions: trunk
>         Reporter: Mona Chitnis
>         Assignee: Purshotam Shah
>          Fix For: trunk
>
>      Attachments: Input-check.docx, OOZIE-1976-WIP.patch, OOZIE-1976-rough-design-2.pdf, OOZIE-1976-rough-design.pdf
>
> All dataset instances specified as input to a coordinator currently work on AND logic, i.e. ALL of them should be available for the workflow to start. We should enhance this to include more logical ways of specifying availability criteria, e.g.:
> * OR between instances
> * minimum N out of K instances
> * delta datasets (process data incrementally)
>
> Use-cases for this:
> * Different datasets are BCP, and the workflow can run with either, whichever arrives earlier.
> * Data is not guaranteed, and while $coord:latest allows skipping to available ones, the workflow will never trigger unless the mentioned number of instances are found.
> * The workflow is like a ‘refining’ algorithm which should run after the minimum required datasets are ready, and should only process the delta for efficiency.
>
> This JIRA is to discuss the design and then review the implementation for some or all of the above features.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
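The "minimum N out of K instances" criterion above can be made concrete with a small sketch. This is purely illustrative; the class and method names are hypothetical and not part of Oozie's implementation, which evaluates resolved/unresolved dataset instances rather than booleans:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a "min N of K" availability check over dataset instances.
public class MinNofK {
    // Returns true once at least minRequired of the K candidate instances exist.
    static boolean isSatisfied(List<Boolean> available, int minRequired) {
        long present = available.stream().filter(a -> a).count();
        return present >= minRequired;
    }

    public static void main(String[] args) {
        List<Boolean> instances = Arrays.asList(true, false, true, true);
        System.out.println(isSatisfied(instances, 3)); // 3 of 4 present -> true
        System.out.println(isSatisfied(instances, 4)); // only 3 of 4 -> false
    }
}
```

The OR and nested-combination cases discussed later in this thread would layer boolean operators over the same kind of per-dataset availability check.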
[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513118#comment-14513118 ]

Mona Chitnis commented on OOZIE-1976:
-------------------------------------

Thanks for taking it up, Jaydeep. I will keep a watch on this JIRA when it's ready for review.
[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly
[ https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195508#comment-14195508 ]

Mona Chitnis commented on OOZIE-1913:
-------------------------------------

The Reviewboard revision is fairly up-to-date except for a couple of unit tests. I will be updating that and would then appreciate a review.

Devise a way to turn off SLA alerts for bundle/coordinator flexibly
-------------------------------------------------------------------

             Key: OOZIE-1913
             URL: https://issues.apache.org/jira/browse/OOZIE-1913
         Project: Oozie
      Issue Type: Improvement
Affects Versions: trunk
        Reporter: Mona Chitnis
        Assignee: Mona Chitnis
         Fix For: trunk

From user: Need to turn off the SLA miss alerts in jobs when the bundle is suspended for grid upgrades and similar work, so that when it's resumed we aren't flooded with a bunch of alerts.
[jira] [Commented] (OOZIE-2034) Disable SSLv3 (POODLEbleed vulnerability)
[ https://issues.apache.org/jira/browse/OOZIE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183074#comment-14183074 ]

Mona Chitnis commented on OOZIE-2034:
-------------------------------------

Starting to look at this now.

Disable SSLv3 (POODLEbleed vulnerability)
-----------------------------------------

             Key: OOZIE-2034
             URL: https://issues.apache.org/jira/browse/OOZIE-2034
         Project: Oozie
      Issue Type: Bug
      Components: security
Affects Versions: 4.0.1
        Reporter: Robert Kanter
        Assignee: Robert Kanter
        Priority: Blocker
         Fix For: 4.1.0
     Attachments: OOZIE-2034.patch, OOZIE-2034.patch

We should disable SSLv3 to protect against the POODLEbleed vulnerability. See [CVE-2014-3566|http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566]. We have {{sslProtocol=TLS}} set to only allow TLS in ssl-server.xml, but when I checked, I could still connect with SSLv3. From what I can tell, there's some ambiguity in the Tomcat configs between {{sslProtocol}}, {{sslProtocols}}, and {{sslEnabledProtocols}}, so we probably have the wrong thing here.
[jira] [Commented] (OOZIE-2034) Disable SSLv3 (POODLEbleed vulnerability)
[ https://issues.apache.org/jira/browse/OOZIE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183087#comment-14183087 ]

Mona Chitnis commented on OOZIE-2034:
-------------------------------------

+1. Pretty straightforward. Thanks for checking the bit about support of TLSv1, not TLSv1.1. Can you paste your doc references here for the record?
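For context on the attribute ambiguity mentioned in the description, a hedged sketch of the kind of Tomcat connector setting that actually restricts the enabled protocols. This is an assumption about the shape of the fix, not the exact patch; attribute support varies by Tomcat version, and the port/keystore values are illustrative:

```xml
<!-- Connector sketch: sslProtocol only selects the SSLContext ("TLS" still
     permits SSLv3 handshakes on older JVMs), while sslEnabledProtocols
     restricts the protocols the connector will negotiate. -->
<Connector port="11443" scheme="https" secure="true" SSLEnabled="true"
           sslProtocol="TLS"
           sslEnabledProtocols="TLSv1"
           keystoreFile="${oozie.https.keystore.file}"
           keystorePass="${oozie.https.keystore.pass}"/>
```

The comment above about TLSv1 vs. TLSv1.1 matters here because the JVMs supported at the time did not all enable TLSv1.1/1.2 by default.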
[jira] [Commented] (OOZIE-1954) Add a way for the MapReduce action to be configured by Java code
[ https://issues.apache.org/jira/browse/OOZIE-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153721#comment-14153721 ]

Mona Chitnis commented on OOZIE-1954:
-------------------------------------

Good work Robert!

Add a way for the MapReduce action to be configured by Java code
----------------------------------------------------------------

             Key: OOZIE-1954
             URL: https://issues.apache.org/jira/browse/OOZIE-1954
         Project: Oozie
      Issue Type: New Feature
Affects Versions: trunk
        Reporter: Robert Kanter
        Assignee: Robert Kanter
         Fix For: trunk
     Attachments: OOZIE-1954.patch, OOZIE-1954.patch, OOZIE-1954.patch

With certain other components (e.g. Avro, HFileOutputFormat (HBase), etc.), it becomes impractical to use the MapReduce action, and users must instead use the Java action. The problem is that these components require a lot of extra configuration that is often hidden from the user in Java code (e.g. {{HFileOutputFormat.configureIncrementalLoad(job, table)}}), which can also include decision logic, serialization, and other things that we can't do in an XML file directly.

One way to solve this problem is to allow the user to give the MR action some Java code that would do this configuration, similar to how we allow the {{job-xml}} field to specify an external XML file of configuration properties. In more detail, we could have an interface; something like this:
{code}
public interface OozieActionConfigurator {
    public void updateOozieActionConfiguration(Configuration conf);
}
{code}
that the user can implement, create a jar, and include with their MR action (i.e. add a {{config-class}} field that lets them specify the class name). To protect the Oozie server from running user code (which could do anything it wants, really), it would have to be run in the Launcher Job. The Launcher Job could call this method after it loads the configuration prepared by the Oozie server.

This will also be helpful for users who use the Java action to launch MR jobs and expect a bunch of things to be done for them that are not (e.g. delegation token propagation, config loading, returning the hadoop job to Oozie, etc.). These are all done with the MR action, so the more users we can move to the MR action from the Java action, the less they'll run into these difficulties.

Some of this may change slightly as I try to actually implement this (e.g. have to handle throwing exceptions, etc.). And one thing I may do is keep this general enough that it should be compatible with all action types in case we want to add this to any of them in the future; though for now, the schema would only accept it for the MapReduce action.
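A sketch of what a user-supplied implementation of the proposed interface could look like. To keep the example self-contained and runnable without Hadoop on the classpath, {{Configuration}} here is a hypothetical stand-in for {{org.apache.hadoop.conf.Configuration}}, and the property names and class names are illustrative, not taken from the patch:

```java
import java.util.HashMap;

// Stand-in for org.apache.hadoop.conf.Configuration (hypothetical, for the sketch).
class Configuration extends HashMap<String, String> {
    void set(String k, String v) { put(k, v); }
}

// Interface as quoted in the issue description.
interface OozieActionConfigurator {
    void updateOozieActionConfiguration(Configuration conf);
}

// Example user class the Launcher Job could invoke after loading the
// server-prepared configuration (names are illustrative).
public class HFileConfigurator implements OozieActionConfigurator {
    public void updateOozieActionConfiguration(Configuration conf) {
        // Decision logic that cannot be expressed in workflow XML directly:
        conf.set("mapreduce.job.outputformat.class",
                 "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat");
        if (!conf.containsKey("mapreduce.job.reduces")) {
            conf.set("mapreduce.job.reduces", "10");
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        new HFileConfigurator().updateOozieActionConfiguration(conf);
        System.out.println(conf.get("mapreduce.job.reduces")); // prints "10"
    }
}
```

The user would compile such a class into a jar, ship it with the action, and name it via the proposed {{config-class}} field; running it in the Launcher Job keeps arbitrary user code off the Oozie server.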
[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147856#comment-14147856 ]

Mona Chitnis commented on OOZIE-1976:
-------------------------------------

Thanks [~rkanter] for the comments.
* We are thinking of using a serialize/deserialize technique (protobuf is one option) to convert back and forth from the object. I've created a class LogicalDependencySet for this object, which contains either the subclass object LogicalDependencyAndSet or LogicalDependencyOrSet, and the leaf level is Dependency, which has the lists of resolved and unresolved instances. Yet to see what the cost of protobuf serde is here.
* Yes, it is possible to do nested combinations, but we will limit it to a depth of 2, i.e. both your examples are depth 2 and the most common cases that we should satisfy in the first go. An important thing to note here is that the OR case can have two 'strategies':
** 'Combined': In the case of {{A || B}}, instances of A and B can be interleaved to give the final combined set of total instances. For this, the requirement is that the user considers both as equivalent, and they have the same frequency, initial instance, etc.
** 'Exclusive': In the same case as above, either A should be completely used or B completely used. No interleaving.
* Yes, a better API output would be to display which OR datasets' instances the action is waiting on.
[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mona Chitnis updated OOZIE-1976:
--------------------------------

    Attachment: OOZIE-1976-WIP.patch

Attaching WIP patch for the record. I will upload the v-1 patch when I have a fairly working version ready by tomorrow.
[jira] [Commented] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143595#comment-14143595 ]

Mona Chitnis commented on OOZIE-1932:
-------------------------------------

{quote}
-1 the patch does not add/modify any testcase
{quote}
This is a simple config change in oozie-default.xml, and there is no applicable test case for just checking the relative order of loading services.

{quote}
The patch failed the following testcases:
testBundleStatusNotTransitionFromKilled(org.apache.oozie.service.TestStatusTransitService)
testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService)
{quote}
These test failures are unrelated to my patch. I reran the tests in my local env and they pass consistently.

Committed patch to trunk and branch-4.1. Thanks Puru for review!

Services should load CallableQueueService after MemoryLocksService
------------------------------------------------------------------

             Key: OOZIE-1932
             URL: https://issues.apache.org/jira/browse/OOZIE-1932
         Project: Oozie
      Issue Type: Bug
Affects Versions: trunk
        Reporter: Mona Chitnis
        Assignee: Mona Chitnis
         Fix For: 4.1.0
     Attachments: OOZIE-1932-2.patch, OOZIE-1932-3.patch, OOZIE-1932-4.patch, OOZIE-1932-addendum.patch, OOZIE-1932.patch

This is not a problem during startup, but it is during shutdown, as services are destroyed in reverse order of initialization. Hence, when MemoryLocksService's destroy sets it to null while commands are still executing because CallableQueueService is still active, they all encounter NPEs during locking. This is a simple fix in oozie-default.xml to place MemoryLocksService earlier in the order of services loading.
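The fix is a reordering of the {{oozie.services}} list in oozie-default.xml. An abbreviated sketch of the shape of the change; the surrounding entries are elided and the exact list differs across Oozie versions:

```xml
<property>
    <name>oozie.services</name>
    <value>
        ...
        org.apache.oozie.service.MemoryLocksService,
        org.apache.oozie.service.CallableQueueService,
        ...
    </value>
    <!-- MemoryLocksService is listed before CallableQueueService so that,
         with services destroyed in reverse order of initialization, the
         lock service outlives the queue that still has running commands. -->
</property>
```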
[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mona Chitnis updated OOZIE-1932:
--------------------------------

    Attachment: OOZIE-1932-4-amendment.patch

{code}
2014-09-22 22:02:34,148 INFO ShareLibService:539 [main] - USER[-] GROUP[-] oozie-hadoop-utils-2.3.0.oozie-4.4.1.1.jar uploaded to hdfs:/tmp/hdfs_shared_lib_path/launcher_2014090233/oozie
2014-09-22 22:02:34,198 INFO ShareLibService:539 [main] - USER[-] GROUP[-] oozie-sharelib-hcatalog-4.4.1.1.jar uploaded to hdfs:/tmp/hdfs_shared_lib_path/launcher_2014090233/oozie
2014-09-22 22:02:34,199 ERROR ShareLibService:536 [main] - USER[-] GROUP[-] Sharelib initialization fails
java.lang.NullPointerException
    at org.apache.oozie.service.ShareLibService.setupLauncherLibPath(ShareLibService.java:178)
    at org.apache.oozie.service.ShareLibService.updateLauncherLib(ShareLibService.java:158)
    at org.apache.oozie.service.ShareLibService.init(ShareLibService.java:111)
    at org.apache.oozie.service.Services.setServiceInternal(Services.java:368)
{code}

ShareLibService is dependent on ActionService:

{code}
private void setupLauncherLibPath(FileSystem fs, Path tmpLauncherLibPath) throws IOException {
    ActionService actionService = Services.get().get(ActionService.class);
    List<Class> classes = JavaActionExecutor.getCommonLauncherClasses();
    Path baseDir = new Path(tmpLauncherLibPath, JavaActionExecutor.OOZIE_COMMON_LIBDIR);
    copyJarContainingClasses(classes, fs, baseDir, JavaActionExecutor.OOZIE_COMMON_LIBDIR);
    Set<String> actionTypes = actionService.getActionTypes();
{code}

Attaching amendment patch.
[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mona Chitnis updated OOZIE-1932:
--------------------------------

    Attachment: OOZIE-1932-3.patch

Uploaded new patch OOZIE-1932-3.patch.
[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mona Chitnis updated OOZIE-1932:
--------------------------------

    Attachment: OOZIE-1932-4.patch

Addressed Puru's comment.
[jira] [Created] (OOZIE-1984) SLACalculator in HA mode performs duplicate operations on records with completed jobs
Mona Chitnis created OOZIE-1984:
-----------------------------------

         Summary: SLACalculator in HA mode performs duplicate operations on records with completed jobs
             Key: OOZIE-1984
             URL: https://issues.apache.org/jira/browse/OOZIE-1984
         Project: Oozie
      Issue Type: Bug
Affects Versions: trunk
        Reporter: Mona Chitnis
         Fix For: trunk, 4.1.0

Scenario: an SLA periodic run has already processed start, duration, and end for a job's SLA entry, but the job notification for that job came after this and triggers the SLA listener.

Buggy part:
{code}
SLACalculatorMemory.java

else if (Services.get().get(JobsConcurrencyService.class).isHighlyAvailableMode()) {
    // jobid might not exist in slaMap in HA Setting
    SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get(
            SLARegQuery.GET_SLA_REG_ALL, jobId);
    if (slaRegBean != null) { // filter out jobs picked by SLA job event listener
                              // but not actually configured for SLA
        SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(
                SLASummaryQuery.GET_SLA_SUMMARY, jobId);
        slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
        if (slaCalc.getEventProcessed() < 7) {
            slaMap.put(jobId, slaCalc);
        }
    }
}
if (slaCalc != null) {
    ..
    Object eventProcObj = ((SLASummaryQueryExecutor) SLASummaryQueryExecutor.getInstance())
            .getSingleValue(SLASummaryQuery.GET_SLA_SUMMARY_EVENTPROCESSED, jobId);
    byte eventProc = ((Byte) eventProcObj).byteValue();
    ..
    processJobEndSuccessSLA(slaCalc, startTime, endTime);
{code}

The method processJobEndSuccessSLA goes ahead, checks the second LSB of eventProc, and sends the duration event _again_. So the bug here is two-fold:
* even if all events are already processed, this function is still invoked
* eventProcessed is 8 (binary 1000), so the second LSB is unset and hence duration is processed again

Fix: do not invoke the function when eventProc is 8 (binary 1000).
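The bit-flag logic described above can be sketched as follows. The constants and method names are hypothetical; the bit layout (start/duration/end in the three low bits, 8 marking a fully processed record) is taken from the issue description:

```java
// Sketch of the eventProcessed flags: bit 0 = start processed,
// bit 1 = duration processed, bit 2 = end processed; the value 8
// (binary 1000) marks a completed, fully processed record.
public class EventProc {
    static final int DURATION = 1 << 1;
    static final int ALL_DONE = 8;

    // Buggy check: only inspects the duration bit, so 8 (1000) slips
    // through and the duration event is sent again.
    static boolean durationPendingBuggy(byte eventProc) {
        return (eventProc & DURATION) == 0;
    }

    // Fixed check: skip entirely once the record is marked fully processed.
    static boolean durationPendingFixed(byte eventProc) {
        return eventProc < ALL_DONE && (eventProc & DURATION) == 0;
    }

    public static void main(String[] args) {
        System.out.println(durationPendingBuggy((byte) 8)); // true -> duplicate event
        System.out.println(durationPendingFixed((byte) 8)); // false -> no duplicate
    }
}
```

This makes the two-fold bug above concrete: 8 has the duration bit unset, so a check on that bit alone cannot distinguish "not yet processed" from "already completed".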
[jira] [Resolved] (OOZIE-1984) SLACalculator in HA mode performs duplicate operations on records with completed jobs
[ https://issues.apache.org/jira/browse/OOZIE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mona Chitnis resolved OOZIE-1984.
---------------------------------
    Resolution: Fixed

Committed to trunk and 4.1.0. Thanks for the review, Ryota.
[jira] [Commented] (OOZIE-1847) HA - Oozie servers should shutdown (or go in safe mode) in case of ZK failure
[ https://issues.apache.org/jira/browse/OOZIE-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112331#comment-14112331 ]

Mona Chitnis commented on OOZIE-1847:
-------------------------------------

Pretty straightforward patch, and I agree it's needed. But in addition to printing in the logs, should we bubble it up to the action error message too? That way the reason for a workflow failing can be pulled up from any of the client-facing APIs as well, e.g. job-info, web-console, REST API, etc.

HA - Oozie servers should shutdown (or go in safe mode) in case of ZK failure
-----------------------------------------------------------------------------

         Key: OOZIE-1847
         URL: https://issues.apache.org/jira/browse/OOZIE-1847
     Project: Oozie
  Issue Type: Bug
  Components: HA
    Reporter: Purshotam Shah
    Assignee: Purshotam Shah
 Attachments: OOZIE-1847-V1.patch
[jira] [Commented] (OOZIE-1847) HA - Oozie servers should shutdown (or go in safe mode) in case of ZK failure
[ https://issues.apache.org/jira/browse/OOZIE-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112333#comment-14112333 ]

Mona Chitnis commented on OOZIE-1847:
-------------------------------------

^^ in the case of a timeout > 3 seconds resulting in server shutdown and job failure.
[jira] [Commented] (OOZIE-1885) Query optimization for StatusTransitService
[ https://issues.apache.org/jira/browse/OOZIE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112357#comment-14112357 ]

Mona Chitnis commented on OOZIE-1885:
-------------------------------------

A join query is always more CPU- and memory-intensive. But it will probably cut down on the overall time it takes, because of the multiple queries in a loop right now. The approach is fine, but we should vet it with end-to-end performance tests.

Query optimization for StatusTransitService
-------------------------------------------

         Key: OOZIE-1885
         URL: https://issues.apache.org/jira/browse/OOZIE-1885
     Project: Oozie
  Issue Type: Bug
    Reporter: Purshotam Shah

{code}
private void coordTransit() throws JPAExecutorException, CommandException {
    List<CoordinatorJobBean> pendingJobCheckList = null;
    if (lastInstanceStartTime == null) {
        LOG.info("Running coordinator status service first instance");
        // this is the first instance, we need to check for all pending jobs
        pendingJobCheckList = jpaService.execute(new CoordJobsGetPendingJPAExecutor(limit));
    }
    else {
        LOG.info("Running coordinator status service from last instance time = "
                + DateUtils.formatDateOozieTZ(lastInstanceStartTime));
        // this is not the first instance, we should only check jobs
        // that have actions or jobs been updated >= start time of last service run
        List<CoordinatorActionBean> actionsList = CoordActionQueryExecutor.getInstance().getList(
                CoordActionQuery.GET_COORD_ACTIONS_BY_LAST_MODIFIED_TIME, lastInstanceStartTime);
        Set<String> coordIds = new HashSet<String>();
        for (CoordinatorActionBean action : actionsList) {
            coordIds.add(action.getJobId());
        }
        pendingJobCheckList = new ArrayList<CoordinatorJobBean>();
        for (String coordId : coordIds.toArray(new String[coordIds.size()])) {
            CoordinatorJobBean coordJob;
            try {
                coordJob = CoordJobQueryExecutor.getInstance().get(CoordJobQuery.GET_COORD_JOB, coordId);
            }
            catch (JPAExecutorException jpaee) {
                if (jpaee.getErrorCode().equals(ErrorCode.E0604)) {
                    LOG.warn("Exception happened during StatusTransitRunnable; Coordinator Job doesn't exist", jpaee);
                    continue;
                }
                else {
                    throw jpaee;
                }
            }
            // Running coord job might have pending false
            Job.Status coordJobStatus = coordJob.getStatus();
            if ((coordJob.isPending() || coordJobStatus.equals(Job.Status.PAUSED)
                    || coordJobStatus.equals(Job.Status.RUNNING)
                    || coordJobStatus.equals(Job.Status.RUNNINGWITHERROR)
                    || coordJobStatus.equals(Job.Status.PAUSEDWITHERROR))
                    && !coordJobStatus.equals(Job.Status.IGNORED)) {
                pendingJobCheckList.add(coordJob);
            }
        }
        pendingJobCheckList.addAll(CoordJobQueryExecutor.getInstance().getList(
                CoordJobQuery.GET_COORD_JOBS_CHANGED, lastInstanceStartTime));
    }
    aggregateCoordJobsStatus(pendingJobCheckList);
}
{code}

This could be done in one SQL, something like:
{code}
select w.id, w.status, w.pending from CoordinatorJobBean w
where w.startTimestamp <= :matTime
  AND (w.statusStr = 'PREP' OR w.statusStr = 'RUNNING' OR w.statusStr = 'RUNNINGWITHERROR' OR w.statusStr = 'PAUSEDWITHERROR')
  AND w.statusStr <> 'IGNORED'
  AND w.id in (select a.jobId from CoordinatorActionBean a
               where a.lastModifiedTimestamp >= :lastModifiedTime group by a.jobId)
{code}

Same for bundleTransit().
[jira] [Commented] (OOZIE-1940) StatusTransitService has race condition
[ https://issues.apache.org/jira/browse/OOZIE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112359#comment-14112359 ]

Mona Chitnis commented on OOZIE-1940:
-------------------------------------

Linking this as dependent on OOZIE-1885.

StatusTransitService has race condition
---------------------------------------

         Key: OOZIE-1940
         URL: https://issues.apache.org/jira/browse/OOZIE-1940
     Project: Oozie
  Issue Type: Bug
    Reporter: Purshotam Shah

StatusTransitService doesn't acquire a lock while updating the DB. We noticed one such issue while doing HA testing, thanks to [~mchiang]. We issued a change command to change the pause time, which got executed on one server. While the change command was running on one server, the other server started executing StatusTransitService.

Server 1 log:
{code}
2014-07-16 17:28:05,268 INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.StatusTransitService]
2014-07-16 17:28:09,694 INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Set coordinator job [0011385-140716042555-oozie-oozi-C] status to 'SUCCEEDED' from 'RUNNING'
2014-07-16 17:28:15,416 INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Released lock for [org.apache.oozie.service.StatusTransitService]
{code}

Server 2 log:
{code}
2014-07-16 17:28:06,499 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] New pause/end date is : Wed Jul 16 17:30:00 UTC 2014 and last action number is : 3
2014-07-16 17:28:06,508 INFO CoordChangeXCommand:539 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] ENDED CoordChangeXCommand for jobId=0011385-140716042555-oozie-oozi-C
{code}

CoordMaterializeTransitionXCommand had created all actions (a few were in waiting and a few were in running state) and set doneMaterialization to true. The change command deletes all waiting coord actions, except the 3 running/SUCCEEDED actions, and resets doneMaterialization. StatusTransitService first loads a set of pending jobs, and for each job it makes DB calls to check coord action status. Coord jobs are loaded only once at the beginning. This is what happened:
1. StatusTransitService loads the coord job, whose doneMaterialization is set to true, at 17:28:05,268 (server 1)
2. The change command deletes the waiting actions and resets doneMaterialization at 17:28:06,508 (server 2)
3. StatusTransitService loads the actions for the job, only 3 and in SUCCEEDED status; it never reloads doneMaterialization, at 17:28:09,694 (server 1)

StatusTransitService then sets the job status to SUCCEEDED, because doneMaterialization is set and all actions are SUCCEEDED.
[jira] [Commented] (OOZIE-1940) StatusTransitService has race condition
[ https://issues.apache.org/jira/browse/OOZIE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112362#comment-14112362 ] Mona Chitnis commented on OOZIE-1940: - Agree with the approach. Currently each run of StatusTransitService takes multiple seconds, I believe. If it is going to hold the lock for that long, we have to assess the consequences for the other commands waiting for the lock, e.g. the change command appearing to hang on the user-facing CLI because it's synchronously trying to acquire the lock held by STS. OOZIE-1885 should ideally reduce the overall time the lock is held by STS. -- This message was sent by Atlassian JIRA (v6.2#6252)
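The CLI-hang concern above can be bounded by acquiring the lock with a timeout instead of blocking indefinitely. A hedged sketch with java.util.concurrent (not the actual Oozie MemoryLocksService API):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {
    // Hedged sketch, not the actual Oozie MemoryLocksService API: a user-facing
    // command acquires the service lock with a timeout, so it can fail fast and
    // report "busy" instead of hanging while StatusTransitService holds the lock.
    static boolean runWithLock(ReentrantLock lock, long timeoutMs, Runnable work)
            throws InterruptedException {
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false; // lock still held elsewhere; caller can retry or report
        }
        try {
            work.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

A caller that gets false back can surface a clear "server busy, retry" message rather than appearing to hang.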
[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104227#comment-14104227 ] Mona Chitnis commented on OOZIE-1976: - Thanks Puru and Ryota. Will incorporate your comments and come up with a new design specification. As for the 'explain', this can be done as part of the 'info' command, displaying the missing dependency itself, rather than introducing another command. Specifying coordinator input datasets in more logical ways -- Key: OOZIE-1976 URL: https://issues.apache.org/jira/browse/OOZIE-1976 Project: Oozie Issue Type: New Feature Components: coordinator Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1976-rough-design.pdf All dataset instances specified as input to a coordinator currently work on AND logic, i.e. ALL of them should be available for the workflow to start. We should enhance this to include more logical ways of specifying availability criteria, e.g. * OR between instances * minimum N out of K instances * delta datasets (process data incrementally) Use-cases for this: * Different datasets are BCP, and the workflow can run with either, whichever arrives earlier. * Data is not guaranteed, and while $coord:latest allows skipping to available ones, the workflow will never trigger unless the mentioned number of instances is found. * The workflow is like a ‘refining’ algorithm which should run after the minimum required datasets are ready, and should only process the delta for efficiency. This JIRA is to discuss the design and then review the implementation for some or all of the above features. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14104618#comment-14104618 ] Mona Chitnis commented on OOZIE-1976: - For Ryota's comment about priority, I think it complicates the missing-dependencies field; we would now require a structure to indicate something like {{P0=dep1,dep2#P1=dep3,dep4}}, which in turn is nested under the AND/OR structure. So when dependencies are checked and found to exist, the action will start only when all P0s are satisfied, etc. I think this is essentially the same as putting them in the AND block instead of the optional OR block. For the N out of M case, it will start when _any_ >= N instances are available, using all M if all are there, and not limiting to N. Good pointer about EL functions; that one's going to be important and will probably need a few new ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
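The N-out-of-M trigger semantics described in the comment above can be sketched as a tiny predicate. This is an illustrative helper under assumed semantics, not the eventual coordinator implementation:

```java
import java.util.Collections;
import java.util.List;

public class MinAvailableSketch {
    // Hedged sketch of the "minimum N out of M" semantics discussed above
    // (illustrative helper, not the eventual coordinator implementation):
    // the action becomes runnable once at least n instances are available,
    // and it consumes every instance available at that moment, not just n.
    static List<String> readyInstances(List<String> available, int n) {
        return available.size() >= n ? available : Collections.emptyList();
    }
}
```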
[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1976: Attachment: OOZIE-1976-rough-design-2.pdf New design spec uploaded (rough-design-2) with additions about: * Wait-for in action * EL functions initial thoughts - implementation details will follow in the code patch * HCatDependencyCache changes (for the in-memory push-based hcat dependencies) * Job info API (coord-action) changes for displaying Missing Dependency. It runs the risk of being verbose if an optional dataset has a lot of instances; needs thought about how to truncate there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
Mona Chitnis created OOZIE-1976: --- Summary: Specifying coordinator input datasets in more logical ways Key: OOZIE-1976 URL: https://issues.apache.org/jira/browse/OOZIE-1976 Project: Oozie Issue Type: New Feature Components: coordinator Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk All dataset instances specified as input to coordinator, currently work on AND logic i.e. ALL of them should be available for workflow to start. We should enhance this to include more logical ways of specifying availability criteria e.g. * OR between instances * minimum N out of K instances * delta datasets (process data incrementally) This JIRA is to discuss the design and then the review the implementation for some or all of the above features. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1976: Description: All dataset instances specified as input to coordinator, currently work on AND logic i.e. ALL of them should be available for workflow to start. We should enhance this to include more logical ways of specifying availability criteria e.g. * OR between instances * minimum N out of K instances * delta datasets (process data incrementally) Use-cases for this: Different datasets are BCP, and workflow can run with either, whichever arrives earlier. Data is not guaranteed, and while $coord:latest allows skipping to available ones, workflow will never trigger unless mentioned number of instances are found. Workflow is like a ‘refining’ algorithm which should run after minimum required datasets are ready, and should only process the delta for efficiency. This JIRA is to discuss the design and then the review the implementation for some or all of the above features. was: All dataset instances specified as input to coordinator, currently work on AND logic i.e. ALL of them should be available for workflow to start. We should enhance this to include more logical ways of specifying availability criteria e.g. * OR between instances * minimum N out of K instances * delta datasets (process data incrementally) This JIRA is to discuss the design and then the review the implementation for some or all of the above features. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1976: Description: All dataset instances specified as input to coordinator, currently work on AND logic i.e. ALL of them should be available for workflow to start. We should enhance this to include more logical ways of specifying availability criteria e.g. * OR between instances * minimum N out of K instances * delta datasets (process data incrementally) Use-cases for this: * Different datasets are BCP, and workflow can run with either, whichever arrives earlier. * Data is not guaranteed, and while $coord:latest allows skipping to available ones, workflow will never trigger unless mentioned number of instances are found. * Workflow is like a ‘refining’ algorithm which should run after minimum required datasets are ready, and should only process the delta for efficiency. This JIRA is to discuss the design and then the review the implementation for some or all of the above features. was: All dataset instances specified as input to coordinator, currently work on AND logic i.e. ALL of them should be available for workflow to start. We should enhance this to include more logical ways of specifying availability criteria e.g. * OR between instances * minimum N out of K instances * delta datasets (process data incrementally) Use-cases for this: Different datasets are BCP, and workflow can run with either, whichever arrives earlier. Data is not guaranteed, and while $coord:latest allows skipping to available ones, workflow will never trigger unless mentioned number of instances are found. Workflow is like a ‘refining’ algorithm which should run after minimum required datasets are ready, and should only process the delta for efficiency. This JIRA is to discuss the design and then the review the implementation for some or all of the above features. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1976) Specifying coordinator input datasets in more logical ways
[ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1976: Attachment: OOZIE-1976-rough-design.pdf Attaching rough design doc (pdf) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly
[ https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096118#comment-14096118 ] Mona Chitnis commented on OOZIE-1913: - Want to mention another point: this API also allows disabling ALL SLA instances for a coordinator or bundle. For a bundle, that would mean all actions of all its coordinators. SLARegistrationBean stores a 'parentId' if the SLA object pertains to a coord-action/wf-action/bundle-action. To avoid a heavy DB query in the suspend-ALL-for-bundle(s) case, I want to change this 'parentId' to point to the bundle jobId directly, if the coordinator is part of a bundle. If not, it will be the coord job id, as it is now. The impact this has is in JMSSLAEventListener, where topicName is set to this parentId. So topicName will get set to the top-level bundle id, and the user will have to change the topic name being listened to. Please give feedback if this is a reasonable approach. I will make sure appropriate JMS selector options are available if the user gives this bundle-id topicName but still wants to limit to a particular coordinator job id. Devise a way to turn off SLA alerts for bundle/coordinator flexibly --- Key: OOZIE-1913 URL: https://issues.apache.org/jira/browse/OOZIE-1913 Project: Oozie Issue Type: Improvement Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk From the user: Need to turn off the SLA miss alerts in jobs when the bundle is suspended for grid upgrades and similar work, so that when it's resumed we aren't flooded with a bunch of alerts. -- This message was sent by Atlassian JIRA (v6.2#6252)
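The JMS-selector idea mentioned above can be sketched as a small helper. The "coordId" message property below is an assumption for illustration, not a documented Oozie JMS header:

```java
public class SlaTopicSketch {
    // Hypothetical helper for the approach above: with the JMS topic keyed by the
    // top-level bundle id, a listener that only wants one coordinator's SLA events
    // would narrow the subscription with a JMS message selector. The "coordId"
    // message property is an assumption for illustration, not a documented
    // Oozie JMS header.
    static String coordSelector(String coordJobId) {
        // JMS selectors use SQL92-style syntax; string literals are single-quoted.
        return "coordId = '" + coordJobId + "'";
    }
}
```

A consumer would then pass this to the standard JMS call, e.g. session.createConsumer(topic, SlaTopicSketch.coordSelector(jobId)), so the filtering happens on the broker side rather than in the client.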
[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly
[ https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090999#comment-14090999 ] Mona Chitnis commented on OOZIE-1913: - Okay, let me remove the -id requirement. Regarding treating it as a job operation, I think it becomes ambiguous what type of alerts it means, so it is better to be explicit with the 'sla' command. Also, it removes the need to add an additional 'actions' param. But I can rework this if there's consensus about which API usage is more intuitive. Asking for feedback from users too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1913) Devise a way to turn off SLA alerts for bundle/coordinator flexibly
[ https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1913: Summary: Devise a way to turn off SLA alerts for bundle/coordinator flexibly (was: Devise a way to turn off SLA alerts when bundle/coordinator suspended) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1939) Incorrect job information is set while logging
[ https://issues.apache.org/jira/browse/OOZIE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082317#comment-14082317 ] Mona Chitnis commented on OOZIE-1939: - Yes, it will work with thread-local params too. The fix was done to minimize the overall change: just clear the prefix and set it to whatever object the thread is handling now. The same will apply with thread-local params. Incorrect job information is set while logging -- Key: OOZIE-1939 URL: https://issues.apache.org/jira/browse/OOZIE-1939 Project: Oozie Issue Type: Bug Reporter: Purshotam Shah Assignee: Azrael Attachments: OOZIE-1939.1.patch, OOZIE-1939.2.patch
{code}
2014-07-16 17:28:06,422 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB236] JOB[0011514-140716042555-oozie-oozi-C] ACTION[-] Acquired lock for [0011385-140716042555-oozie-oozi-C] in [coord_change]
2014-07-16 17:28:06,422 TRACE CoordChangeXCommand:548 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB236] JOB[0011514-140716042555-oozie-oozi-C] ACTION[-] Load state for [0011385-140716042555-oozie-oozi-C]
{code}
{code}
protected void loadState() throws CommandException {
    jpaService = Services.get().get(JPAService.class);
    if (jpaService == null) {
        LOG.error(ErrorCode.E0610);
    }
    try {
        coordJob = CoordJobQueryExecutor.getInstance().get(CoordJobQuery.GET_COORD_JOB_MATERIALIZE, jobId);
        prevStatus = coordJob.getStatus();
    } catch (JPAExecutorException jex) {
        throw new CommandException(jex);
    }
    // calculate start materialize and end materialize time
    calcMatdTime();
    LogUtils.setLogInfo(coordJob, logInfo);
}
{code}
Most commands set the job info after loadState(); because of that, a few log statements (like acquiring the lock and loading state) log with the previous job info. -- This message was sent by Atlassian JIRA (v6.2#6252)
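The thread-local log-context pattern referred to above can be sketched in a few lines. Names here are illustrative, not the actual Oozie LogUtils/XLog API:

```java
public class LogPrefixSketch {
    // Sketch of the thread-local log-context pattern referred to above; names
    // are illustrative, not the actual Oozie LogUtils/XLog API. Each command
    // thread clears its prefix and re-sets it for the object it is handling,
    // so early statements (acquire lock, load state) carry the right job id.
    private static final ThreadLocal<String> PREFIX =
            ThreadLocal.withInitial(() -> "JOB[-]");

    static void setJob(String jobId) { PREFIX.set("JOB[" + jobId + "]"); }

    static void clear() { PREFIX.remove(); } // back to the "-" placeholder

    static String format(String msg) { return PREFIX.get() + " " + msg; }
}
```

Because the prefix is per-thread, a command that calls setJob as soon as it knows its target id cannot leak the previous job's prefix into another thread's log lines.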
[jira] [Updated] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1811: Attachment: OOZIE-1811-3.patch Addressed review comments and fixed a couple of classes missed in the earlier patch (BatchQueryExecutor, SLA*QueryExecutors) to be consistent. Current test failures in trunk -- Key: OOZIE-1811 URL: https://issues.apache.org/jira/browse/OOZIE-1811 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Robert Kanter Assignee: Mona Chitnis Priority: Critical Attachments: OOZIE-1811-1.patch, OOZIE-1811-2.patch, OOZIE-1811-3.patch There's a bunch of test failures currently in trunk; I'm not sure which commit(s) caused them, but I think it was somewhat recent. e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/ Reproducible by running these tests, instead of having to run them all, which takes a lot longer :)
{noformat}
mvn clean test -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration
{noformat}
{noformat}
Results :
Failed tests:
testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration): expected:...11921-oozie-rkan-C@1[] but was:...11921-oozie-rkan-C@1[2]
testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService): expected:DONEWITHERROR but was:KILLED
testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService): expected:RUNNING but was:KILLED
Tests in error:
testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor)
Tests run: 62, Failures: 3, Errors: 2, Skipped: 0
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1811: Attachment: (was: OOZIE-1811-3.patch) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1811: Attachment: OOZIE-1811-3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1811: Attachment: OOZIE-1811-3.patch Good catch! Uploaded new patch. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1811: Attachment: (was: OOZIE-1811-3.patch) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082590#comment-14082590 ] Mona Chitnis commented on OOZIE-1932: - Okay, thanks. Will revise the order. Services should load CallableQueueService after MemoryLocksService -- Key: OOZIE-1932 URL: https://issues.apache.org/jira/browse/OOZIE-1932 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: 4.1.0 Attachments: OOZIE-1932-2.patch, OOZIE-1932-addendum.patch, OOZIE-1932.patch This is not a problem during startup, but it is during shutdown, because services are destroyed in reverse order of initialization. Hence, when MemoryLocksService's destroy() sets it to null while commands are still executing (since CallableQueueService is still active), they all encounter NPEs during locking. This is a simple fix in oozie-default.xml to move MemoryLocksService earlier in the order of services loading. -- This message was sent by Atlassian JIRA (v6.2#6252)
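The ordering issue above can be sketched with a toy registry (the ServiceRegistry class and its shape are illustrative stand-ins, not Oozie's actual Services class): because services are destroyed in reverse order of initialization, loading MemoryLocksService earlier guarantees it outlives CallableQueueService during shutdown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of reverse-order shutdown (hypothetical registry, not
// Oozie's Services class): the last service initialized is destroyed first.
public class ServiceRegistry {
    private final List<String> initOrder = new ArrayList<>();

    public void init(String serviceName) {
        initOrder.add(serviceName);
    }

    // Services are destroyed in reverse order of initialization.
    public List<String> shutdownOrder() {
        List<String> reversed = new ArrayList<>(initOrder);
        Collections.reverse(reversed);
        return reversed;
    }
}
```

With MemoryLocksService initialized first, shutdownOrder() destroys CallableQueueService before it, so commands still draining from the queue can take and release locks without hitting a nulled-out lock service.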
[jira] [Commented] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081820#comment-14081820 ] Mona Chitnis commented on OOZIE-1811: - {{-1 the patch contains 2 line(s) with trailing spaces}}: located and fixed in the xml file {{coord-action-sla.xml}} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1811: Attachment: OOZIE-1811-2.patch updated patch to apply cleanly to trunk HEAD -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis reopened OOZIE-1932: - Reopening to fix a similar issue: URIHandlerService should be loaded before CallableQueueService, so that CallableQueueService is closed before it. This JIRA's scope now includes a permanent fix to the services ordering that works for all cases, avoiding NPEs and other issues with the services during server shutdown/startup. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078158#comment-14078158 ] Mona Chitnis commented on OOZIE-1811: - The above failures are due to a strange network error on the host. It happened before at https://builds.apache.org/job/oozie-trunk-precommit-build/1363/ too. Ran the whole suite locally and only 1 test failed, which, as I've mentioned, is going to be part of OOZIE-1952. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1952) Cleanup duplicate/obsolete code - Command, StoreService
Mona Chitnis created OOZIE-1952: --- Summary: Cleanup duplicate/obsolete code - Command, StoreService Key: OOZIE-1952 URL: https://issues.apache.org/jira/browse/OOZIE-1952 Project: Oozie Issue Type: Task Reporter: Mona Chitnis StoreService has been superseded by JPAService, and Command has been superseded by XCommand. These old classes have been lying around long enough and are probably only referenced through unit tests, creating some confusion when tests have to be fixed for flaky failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1811: Attachment: OOZIE-1811-1.patch Attaching a patch which fixes the QueryExecutors and TestSLAEventGeneration. Errors related to StoreService usage in tests can be fixed as part of the overall StoreService fix in OOZIE-1952. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074117#comment-14074117 ] Mona Chitnis commented on OOZIE-1811: - I'd suggest getting rid of the static reference to JPAService in each of the QueryExecutors. We can always get the reference from the Services singleton while executing the query. By keeping another static reference and manipulating it through the constructor and destroy(), we run the risk of nullifying it inadvertently. This is why so many tests are suddenly becoming flaky, and it is very tough to detect exact patterns or even fix the tests in a foolproof way. I ran the whole suite with the static reference removed and only 2 tests failed - which is quite an improvement!
{code}
Results :
Failed tests:
  testBundleId(org.apache.oozie.servlet.TestBulkMonitorWebServiceAPI): expected:SUCCEEDED but was:RUNNING
Tests in error:
  testSucCoordPurgeXCommand(org.apache.oozie.command.TestPurgeXCommand): E0604: Job does not exist [000-140724213655573-oozie-chit-C]
{code}
Test #2 here is failing with the error "StoreService cannot work without JPAService". We can replace the usage of StoreService completely, as it is superseded by JPAService anyway. Test #1 doesn't really have any error except a random assert failure, and since this test is not usually flaky, we can ignore it. -- This message was sent by Atlassian JIRA (v6.2#6252)
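The suggestion above can be sketched with hypothetical stand-in classes (ServiceLookup, its nested Services class, and the method names are illustrative, not Oozie's actual Services/JPAService API): resolving the service from the singleton at execution time always observes the current instance, whereas a cached static snapshot can go stale or null when tests recycle services.

```java
// Hypothetical stand-ins for the Services singleton pattern discussed above;
// names and shapes are illustrative, not Oozie's actual API.
public class ServiceLookup {
    public static class Services {
        private static final Services INSTANCE = new Services();
        private Object jpaService = new Object();

        public static Services get() { return INSTANCE; }
        public Object getJpaService() { return jpaService; }
        public void destroy() { jpaService = null; }      // e.g. test teardown
        public void reinit() { jpaService = new Object(); } // e.g. next test's setup
    }

    // Safer pattern: no cached static field; look the service up per call,
    // so we always see the live instance (or correctly observe its absence).
    public static boolean canExecuteQuery() {
        return Services.get().getJpaService() != null;
    }
}
```

A static field initialized once from the singleton would keep pointing at whatever existed at class-load time; the per-call lookup removes that entire class of flakiness.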
[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf
[ https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1944: Attachment: OOZIE-1944-2.patch Adding a null check for configDefault, which was causing the TestWorkflowAppParser tests to fail. Recursive variable resolution broken when same parameter name in config-default and action conf --- Key: OOZIE-1944 URL: https://issues.apache.org/jira/browse/OOZIE-1944 Project: Oozie Issue Type: Bug Components: workflow Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: 4.1.0 Attachments: OOZIE-1944-1.patch, OOZIE-1944-2.patch Hitting error
{code}
can not create DagEngine for submitting jobs
org.apache.oozie.DagEngineException: E0803: IO error, Variable substitution depth too large: 20 ${param}/000
{code}
when config-default.xml has {{param=default}} and action conf has
{code}
<action name="my-wf-action">
  ...
  <configuration>
    <property>
      <name>param</name>
      <value>${param}/000</value>
    </property>
  </configuration>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1872) TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime is failing for past couple of builds
[ https://issues.apache.org/jira/browse/OOZIE-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1872: Component/s: tests TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime is failing for past couple of builds --- Key: OOZIE-1872 URL: https://issues.apache.org/jira/browse/OOZIE-1872 Project: Oozie Issue Type: Bug Components: tests Affects Versions: trunk, 4.1.0 Reporter: Rohini Palaniswamy Fix For: trunk, 4.1.0 Attachments: OOZIE-1872-1.patch https://builds.apache.org/job/oozie-trunk-precommit-build/1291/testReport/junit/org.apache.oozie.command.coord/TestCoordActionInputCheckXCommand/testActionInputCheckLatestActionCreationTime/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1872) TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime is failing for past couple of builds
[ https://issues.apache.org/jira/browse/OOZIE-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1872: Fix Version/s: (was: trunk) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf
[ https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1944: Fix Version/s: (was: trunk) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf
[ https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1944: Attachment: OOZIE-1944-1.patch Attaching patch. The approach is to switch from the XConfiguration.injectDefaults() method to copy(), since the former does a Configuration.get(), which tries to recursively resolve params. So we simply copy over defaults, global, and finally action configuration, in that order of precedence. -- This message was sent by Atlassian JIRA (v6.2#6252)
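The difference can be sketched with a toy configuration class (Conf, resolve(), and copyFrom() are hypothetical stand-ins approximating Hadoop Configuration / XConfiguration semantics, not the real API): a resolving get() loops on a self-referencing value like {{param=${param}/000}} until a depth guard trips, while a raw copy transfers the value untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy configuration (hypothetical; approximates Hadoop/XConfiguration
// behavior): get() resolves ${var} references with a depth limit of 20,
// while copyFrom() transfers raw, unresolved values.
public class Conf {
    private final Map<String, String> raw = new LinkedHashMap<>();

    public void set(String key, String value) { raw.put(key, value); }
    public String getRaw(String key) { return raw.get(key); }

    // Later copies win, like copying defaults first and action conf last.
    public void copyFrom(Conf src) { raw.putAll(src.raw); }

    // Resolving get(), the behavior injectDefaults() effectively triggers.
    public String get(String key) { return resolve(raw.get(key), 0); }

    private String resolve(String value, int depth) {
        if (value == null) return null;
        if (depth > 20) {
            throw new IllegalStateException("Variable substitution depth too large: 20 " + value);
        }
        int start = value.indexOf("${");
        if (start < 0) return value;
        int end = value.indexOf('}', start);
        String var = value.substring(start + 2, end);
        String substituted = value.substring(0, start) + raw.get(var) + value.substring(end + 1);
        return resolve(substituted, depth + 1);
    }
}
```

With {{param=${param}/000}} shadowing {{param=default}}, get() throws the depth error, while copyFrom() plus getRaw() hands the value through unresolved, which is the behavior the patch relies on.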
[jira] [Updated] (OOZIE-1536) Coordinator action reruns start a new workflow
[ https://issues.apache.org/jira/browse/OOZIE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1536: Assignee: (was: Mona Chitnis) Coordinator action reruns start a new workflow -- Key: OOZIE-1536 URL: https://issues.apache.org/jira/browse/OOZIE-1536 Project: Oozie Issue Type: Improvement Reporter: Srikanth Sundarrajan Coordinator action reruns start a new workflow, and if the existing workflow for the action is in a running state, that is not checked. Coord rerun could instead do a workflow re-run to prevent this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1945) NPE in JavaActionExecutor#check()
Mona Chitnis created OOZIE-1945: --- Summary: NPE in JavaActionExecutor#check() Key: OOZIE-1945 URL: https://issues.apache.org/jira/browse/OOZIE-1945 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Priority: Trivial Fix For: trunk, 4.1.0 In method check():
{code}
String errorCode = props.getProperty("error.code");
if (errorCode.equals("0")) {
    errorCode = "JA018";
}
if (errorCode.equals("-1")) {
    errorCode = "JA019";
}
errorReason = props.getProperty("error.reason");
{code}
If error.code is null, this leads to NPEs. Easy fix:
{code}
if ("0".equals(errorCode)) ...
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
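A minimal sketch of the suggested fix (the class and method names below are illustrative, not the actual JavaActionExecutor code): putting the string constant on the left of equals() makes the comparison null-safe.

```java
// Constant-first equals() tolerates a null errorCode, avoiding the NPE
// described above. Class name, method name, and fallthrough are illustrative.
public class ErrorCodes {
    public static String mapErrorCode(String errorCode) {
        if ("0".equals(errorCode)) {
            return "JA018";
        }
        if ("-1".equals(errorCode)) {
            return "JA019";
        }
        return errorCode; // may be null; the caller must handle that
    }
}
```

Calling "0".equals(null) simply returns false, whereas null.equals("0") throws, which is exactly the failure mode in the issue.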
[jira] [Resolved] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis resolved OOZIE-1933. - Resolution: Fixed Failing unit tests fix committed to trunk after review. Thanks! SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs - Key: OOZIE-1933 URL: https://issues.apache.org/jira/browse/OOZIE-1933 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, OOZIE-1933-unit-tests-fix.patch SLACalculatorMemory.addJobStatus():
{code}
else { // jobid might not exist in slaMap in HA setting
    SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get(
            SLARegQuery.GET_SLA_REG_ALL, jobId);
    SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(
            SLASummaryQuery.GET_SLA_SUMMARY, jobId);
    slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean);
}
{code}
Because of the SLA listener, a job notification event triggers this even for jobs with no SLA configured - leading to an NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs. The patch also includes a log prefix addition to some SLACalculator log lines. -- This message was sent by Atlassian JIRA (v6.2#6252)
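The guard the patch describes can be sketched as follows (the SlaGuard class, its lookup map, and the method shape are hypothetical stand-ins, not the actual SLACalculatorMemory code): when no SLA registration exists for a job, skip it instead of constructing a status object from null beans.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the null guard: jobs without a configured SLA have
// no registration entry, so the lookup returns null and we skip them instead
// of building a status object (which would NPE).
public class SlaGuard {
    private final Map<String, String> slaRegistrations = new HashMap<>();

    public void register(String jobId, String slaConfig) {
        slaRegistrations.put(jobId, slaConfig);
    }

    // Returns true only if the job's status was accepted for SLA tracking.
    public boolean addJobStatus(String jobId) {
        String regBean = slaRegistrations.get(jobId);
        if (regBean == null) {
            return false; // no SLA configured for this job; nothing to track
        }
        return true;
    }
}
```

The key point is that the guard happens before any constructor runs, so a plain job notification for a non-SLA job becomes a quiet no-op rather than a stack trace.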
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Attachment: (was: sla_unit_tests.patch) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Attachment: (was: sla_unit_tests-1.patch) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Attachment: OOZIE-1933-unit-tests-fix.patch Updated patch to apply cleanly to trunk. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf
Mona Chitnis created OOZIE-1944: --- Summary: Recursive variable resolution broken when same parameter name in config-default and action conf Key: OOZIE-1944 URL: https://issues.apache.org/jira/browse/OOZIE-1944 Project: Oozie Issue Type: Bug Components: workflow Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1944) Recursive variable resolution broken when same parameter name in config-default and action conf
[ https://issues.apache.org/jira/browse/OOZIE-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1944: Fix Version/s: 4.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066771#comment-14066771 ] Mona Chitnis commented on OOZIE-1811: - {{org.apache.oozie.command.coord.TestCoordActionInputCheckXCommand.testActionInputCheckLatestCurrentTime}} is also failing because JPAService is null. The same class's test {{org.apache.oozie.command.coord.TestCoordActionInputCheckXCommand.testActionInputCheckLatestActionCreationTime}}, which uses latest calculation with respect to action creation time (the old behavior), is however failing with a dependency mismatch problem - OOZIE-1872. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis resolved OOZIE-1933. - Resolution: Fixed committed to trunk. thanks for review! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis reopened OOZIE-1933: - adding test cases broken by the patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Attachment: sla_unit_tests.patch SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs - Key: OOZIE-1933 URL: https://issues.apache.org/jira/browse/OOZIE-1933 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, sla_unit_tests.patch SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs Patch to also include log prefix addition to some SLACalculator log line -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Attachment: sla_unit_tests-1.patch updated patch to include another broken testcase. All other failed tests pass locally and are known to be flaky Test run: {code} Results : Failed tests: testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration) testRecovery(org.apache.oozie.action.hadoop.TestJavaActionExecutor): expected:[SUCCEED]ED but was:[FAILED/KILL]ED testCoordStatusTransitServiceBackwardSupport(org.apache.oozie.service.TestStatusTransitService) Tests in error: testOnJobEvent(org.apache.oozie.sla.TestSLAJobEventListener): invalid child id [wa1] testActionReuseWfJobAppPath(org.apache.oozie.command.wf.TestActionStartXCommand): E0607: Other error in operation [action.start], null testWorkflowRun(org.apache.oozie.command.wf.TestLastModified): org.apache.oozie.DagEngineException: E0607: Other error in operation [start], null testSucJobPurgeXCommand(org.apache.oozie.command.TestPurgeXCommand): E0604: Job does not exist [001-140717193440158-oozie-chit-W] testSucCoordPurgeXCommand(org.apache.oozie.command.TestPurgeXCommand): E0604: Job does not exist [000-140717193442386-oozie-chit-C] {code} SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs - Key: OOZIE-1933 URL: https://issues.apache.org/jira/browse/OOZIE-1933 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch, sla_unit_tests-1.patch, sla_unit_tests.patch SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = 
new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs Patch to also include log prefix addition to some SLACalculator log line -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1938) Fork-join job does not execute join node sometimes during HA failover
Mona Chitnis created OOZIE-1938: --- Summary: Fork-join job does not execute join node sometimes during HA failover Key: OOZIE-1938 URL: https://issues.apache.org/jira/browse/OOZIE-1938 Project: Oozie Issue Type: Bug Components: HA Affects Versions: trunk Reporter: Mona Chitnis Fix For: trunk Reported by Michelle Chiang (Yahoo Oozie QE) Scenario: (2 Oozie HA servers) 21:38:56 submit job at oozie client 21:41:42 shut down server1 21:46:52 shut down server2 21:47:30 start server1 22:15:05 start server2 the last fork path end time is 21:52:53. 22:36:48 the job is still RUNNING, not moving to join node. Digging into the logs, the locking part seems to work fine with forked action processing distributed amongst the two servers when both running or when one of them is down. The issue seems to be why even RecoveryService fails to pick up the job after all the forks had completed -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis resolved OOZIE-1935. - Resolution: Fixed committed to trunk. thanks for review! Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) --- Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch Demoting log level in ActionStartXCommand for non-WARN type messages. And, Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions section for coord:databaseIn, coord:databaseOut, coord:tableIn and coord:tableOut -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (OOZIE-1775) TestEventGeneration.testCoordinatorActionEvent is failing and CoordRerunX should generate event
[ https://issues.apache.org/jira/browse/OOZIE-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis resolved OOZIE-1775. - Resolution: Fixed All flaky tests will be fixed as part of OOZIE-1811 TestEventGeneration.testCoordinatorActionEvent is failing and CoordRerunX should generate event --- Key: OOZIE-1775 URL: https://issues.apache.org/jira/browse/OOZIE-1775 Project: Oozie Issue Type: Task Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1175-2.patch, OOZIE-1775-1.patch, OOZIE-1775-3.patch, OOZIE-1775-4.patch The test is not flaky but consistently failing. This is due to CoordResumeXCommand not publishing a job event afterwards for the resumed coord actions. Also fixed flaky test {{TestSLAEventGeneration#testCoordinatorActionCommandSubmitAndStart}} by limiting the materialization -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1842) publish oozie artifacts
[ https://issues.apache.org/jira/browse/OOZIE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1842: Assignee: (was: Mona Chitnis) publish oozie artifacts --- Key: OOZIE-1842 URL: https://issues.apache.org/jira/browse/OOZIE-1842 Project: Oozie Issue Type: New Feature Reporter: Bowen Zhang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Attachment: OOZIE-1933-3.patch attaching patch reviewed and updated from ReviewBoard SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs - Key: OOZIE-1933 URL: https://issues.apache.org/jira/browse/OOZIE-1933 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1933-3.patch SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs Patch to also include log prefix addition to some SLACalculator log line -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Attachment: OOZIE-1933-4-1.patch SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs - Key: OOZIE-1933 URL: https://issues.apache.org/jira/browse/OOZIE-1933 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1933-3.patch, OOZIE-1933-4-1.patch SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs Patch to also include log prefix addition to some SLACalculator log line -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1938) Fork-join job does not execute join node sometimes during HA failover
[ https://issues.apache.org/jira/browse/OOZIE-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064458#comment-14064458 ] Mona Chitnis commented on OOZIE-1938: - More context - all actions are completed, some via server 1, others via server 2. 1) Checking the SignalXCommand code also against the WF_ACTIONS table for all actions for this job, all of them have pending=0. This probably explains why they weren't recovered by ActionCheckerRunnable. 2) As each forked action finishes, two signals are sent - signal value OK and signal value :sync:. The 'sync' is needed to maintain the fork-join count, so increment on initial forks sending signal :sync:, and then decrement on joins sending signal :sync:. I think because of the time when one of the servers was down, these :sync:'s were lost or failed to get processed. We don't see this problem in a different scenario when both servers were up before actions finished and started signaling :sync:. Not very confident about changing the way we handle the :sync:, so would like to discuss the best approach here. The easier approach would be to set the action's pending flag in this process so that recovery will pick up the action and help restore the correct :sync: count. Feedback/corrections? Fork-join job does not execute join node sometimes during HA failover - Key: OOZIE-1938 URL: https://issues.apache.org/jira/browse/OOZIE-1938 Project: Oozie Issue Type: Bug Components: HA Affects Versions: trunk Reporter: Mona Chitnis Fix For: trunk Reported by [~mchiang]. Scenario: (2 Oozie HA servers) 21:38:56 submit job at oozie client 21:41:42 shut down server1 21:46:52 shut down server2 21:47:30 start server1 22:15:05 start server2 the last fork path end time is 21:52:53. 22:36:48 the job is still RUNNING, not moving to join node. Digging into the logs, the locking part seems to work fine with forked action processing distributed amongst the two servers when both running or when one of them is down. 
The issue seems to be why even RecoveryService fails to pick up the job after all the forks had completed -- This message was sent by Atlassian JIRA (v6.2#6252)
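The :sync: bookkeeping described in the comment above can be modeled as a simple counter: increment once per forked branch, decrement as each branch signals, and open the join only at zero. A toy sketch (illustrative only, not SignalXCommand's real implementation) showing how one lost decrement leaves the join gate closed until recovery replays it:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ForkJoinSyncSketch {
    // Toy model of the :sync: count: the fork increments once per branch,
    // each completed branch decrements, the join fires only at zero.
    final AtomicInteger syncCount = new AtomicInteger();

    void fork(int branches) { syncCount.addAndGet(branches); }
    void branchDone()       { syncCount.decrementAndGet(); }
    boolean joinReady()     { return syncCount.get() == 0; }

    public static void main(String[] args) {
        ForkJoinSyncSketch wf = new ForkJoinSyncSketch();
        wf.fork(3);
        wf.branchDone();
        wf.branchDone();
        // Third :sync: lost while a server was down: the join never becomes
        // ready, matching the stuck-RUNNING symptom in this issue.
        System.out.println(wf.joinReady()); // false
        // Re-marking the action pending (the "easier approach" above) would
        // let RecoveryService redeliver the lost signal:
        wf.branchDone();
        System.out.println(wf.joinReady()); // true
    }
}
```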
[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1932: Attachment: OOZIE-1932.patch Attaching patch Services should load CallableQueueService after MemoryLocksService -- Key: OOZIE-1932 URL: https://issues.apache.org/jira/browse/OOZIE-1932 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1932.patch This is not a problem during startup but is during shutdown, as services are destroyed in reverse order of initialization. Hence, when MemoryLocksService's destroy() nulls out the locks while commands are still executing (CallableQueueService is still active), they all encounter NPEs during locking. This is a simple fix in oozie-default.xml to list MemoryLocksService earlier in the service loading order. -- This message was sent by Atlassian JIRA (v6.2#6252)
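The OOZIE-1932 fix above is purely an ordering change to the oozie.services list in oozie-default.xml: since services are destroyed in reverse order of initialization, listing MemoryLocksService before CallableQueueService means the lock service is destroyed last, after queued commands have drained. A truncated sketch of the intent (surrounding service entries elided; the exact class list varies by Oozie version):

```xml
<property>
    <name>oozie.services</name>
    <value>
        ...
        org.apache.oozie.service.MemoryLocksService,
        org.apache.oozie.service.CallableQueueService,
        ...
    </value>
</property>
```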
[jira] [Created] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
Mona Chitnis created OOZIE-1933: --- Summary: SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs Key: OOZIE-1933 URL: https://issues.apache.org/jira/browse/OOZIE-1933 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1932) Services should load CallableQueueService after MemoryLocksService
[ https://issues.apache.org/jira/browse/OOZIE-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1932: Attachment: OOZIE-1932-2.patch removed testcase in updated patch Services should load CallableQueueService after MemoryLocksService -- Key: OOZIE-1932 URL: https://issues.apache.org/jira/browse/OOZIE-1932 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1932-2.patch, OOZIE-1932.patch This is not a problem during startup but is during shutdown, as services are destroyed in reverse order of initialization. Hence, when MemoryLocksService's destroy() nulls out the locks while commands are still executing (CallableQueueService is still active), they all encounter NPEs during locking. This is a simple fix in oozie-default.xml to list MemoryLocksService earlier in the service loading order. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1933) SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs
[ https://issues.apache.org/jira/browse/OOZIE-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1933: Description: SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs Patch to also include log prefix addition to some SLACalculator log line was: SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs SLACalculatorMemory HA changes assume SLARegistrationBean exists for all jobs - Key: OOZIE-1933 URL: https://issues.apache.org/jira/browse/OOZIE-1933 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk SLACalculatorMemory.addJobStatus() {code} else { // jobid might not exist in slaMap in HA Setting SLARegistrationBean slaRegBean = SLARegistrationQueryExecutor.getInstance().get( SLARegQuery.GET_SLA_REG_ALL, jobId); SLASummaryBean slaSummaryBean = 
SLASummaryQueryExecutor.getInstance().get(SLASummaryQuery.GET_SLA_SUMMARY, jobId); slaCalc = new SLACalcStatus(slaSummaryBean, slaRegBean); {code} Because of SLA Listener, job notification event triggers this even for jobs with no SLA configured - leading to NPE in the SLACalcStatus constructor and annoying exception stacktraces in logs Patch to also include log prefix addition to some SLACalculator log line -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1935) Demoting log level in ActionStartXCommand for non-WARN type messages
Mona Chitnis created OOZIE-1935: --- Summary: Demoting log level in ActionStartXCommand for non-WARN type messages Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1935) Demoting log level in ActionStartXCommand for non-WARN type messages
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1935: Attachment: OOZIE-1935.patch attached patch. trivial change for two statements from warn to info Demoting log level in ActionStartXCommand for non-WARN type messages Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1935: Summary: Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) (was: Demoting log level in ActionStartXCommand for non-WARN type messages) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) --- Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1935: Attachment: (was: OOZIE-1935.patch) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) --- Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch Demoting log level in ActionStartXCommand for non-WARN type messages. And, Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions section for coord:databaseIn, coord:databaseOut, coord:tableIn and coord:tableOut -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1935: Attachment: OOZIE-1935.patch reattaching patch with both code and doc change Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) --- Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch Demoting log level in ActionStartXCommand for non-WARN type messages. And, Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions section for coord:databaseIn, coord:databaseOut, coord:tableIn and coord:tableOut -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1935: Description: Demoting log level in ActionStartXCommand for non-WARN type messages. And, Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions section for coord:databaseIn, coord:databaseOut, coord:tableIn and coord:tableOut Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) --- Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch Demoting log level in ActionStartXCommand for non-WARN type messages. And, Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions section for coord:databaseIn, coord:databaseOut, coord:tableIn and coord:tableOut -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1935: Attachment: (was: OOZIE-1935.patch) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) --- Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch Demoting log level in ActionStartXCommand for non-WARN type messages. And, Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions section for coord:databaseIn, coord:databaseOut, coord:tableIn and coord:tableOut -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1935) Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec)
[ https://issues.apache.org/jira/browse/OOZIE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1935: Attachment: OOZIE-1935.patch fixing patch to include both code and doc and fixed example Log level (ActionStartXCommand) and Doc fix (CoordinatorFunctionalSpec) --- Key: OOZIE-1935 URL: https://issues.apache.org/jira/browse/OOZIE-1935 Project: Oozie Issue Type: Bug Reporter: Mona Chitnis Assignee: Mona Chitnis Priority: Trivial Attachments: OOZIE-1935.patch Demoting log level in ActionStartXCommand for non-WARN type messages. And, Doc clarification fix in CoordinatorFunctionalSpec - HCatalog EL functions section for coord:databaseIn, coord:databaseOut, coord:tableIn and coord:tableOut -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1911) SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration'
[ https://issues.apache.org/jira/browse/OOZIE-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062883#comment-14062883 ] Mona Chitnis commented on OOZIE-1911: - Linking OOZIE-1933 which has fixes over this JIRA regarding the logging SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration' --- Key: OOZIE-1911 URL: https://issues.apache.org/jira/browse/OOZIE-1911 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1911-4.patch In chronological order: Server 1: Job's SLA eventProcessed set to 0101 = Start and End sla processed. Server 2: Receives above job's status event, processes remaining 'duration' sla. eventProcessed now = 0111, but incremented to 1000 due to {code} SLACalculatorMemory.addJobStatus() : 762 if (slaCalc.getEventProcessed() == 7) { slaInfo.setEventProcessed(8); slaMap.remove(jobId); } {code} Back to Server 1: (doing periodic SLA checks) {code} SLACalculatorMemory.updateJobSla() : 483 if ((eventProc & 1) == 0) { // first bit (start-processed) unset if (reg.getExpectedStart() != null) { if (reg.getExpectedStart().getTime() + jobEventLatency < System.currentTimeMillis()) { // goes ahead and enqueues another START_MISS event and DURATION_MET event {code} Conclusion, need to fix that check for least significant bit (and next to it) for 'start' and 'duration' to avoid duplicate events -- This message was sent by Atlassian JIRA (v6.2#6252)
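The duplicate-event bug above comes down to masking individual bits of eventProcessed rather than comparing raw values. A small sketch of the intended checks (bit assignments inferred from the description: bit 0 = start, bit 1 = duration, bit 2 = end; 7 means all processed, 8 marks the entry as done and evicted; names are illustrative, not Oozie's):

```java
public class SlaBits {
    // Assumed bit layout, inferred from "0101 = Start and End sla processed".
    static final int START = 1, DURATION = 1 << 1, END = 1 << 2;
    static final int ALL = START | DURATION | END; // 7: everything processed
    static final int EVICTED = 1 << 3;             // 8: processed and removed from slaMap

    // The correct checks: mask the relevant bit, never compare the raw value.
    static boolean startProcessed(int eventProc)    { return (eventProc & START) != 0; }
    static boolean durationProcessed(int eventProc) { return (eventProc & DURATION) != 0; }

    public static void main(String[] args) {
        int eventProc = START | END;     // 0101: server 1 handled start and end
        eventProc |= DURATION;           // server 2 handles duration -> 0111
        System.out.println(eventProc == ALL); // true: now eligible for EVICTED (8)
    }
}
```

With an unmasked check, a value like 0101 could be misread and re-enqueue START_MISS/DURATION_MET events, which is exactly the duplicate-event conclusion the comment draws.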
[jira] [Commented] (OOZIE-1934) coordinator action repeatedly picked up by cachePurgeWorker of PartitionDependencyManagerService
[ https://issues.apache.org/jira/browse/OOZIE-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062891#comment-14062891 ] Mona Chitnis commented on OOZIE-1934: - Doesn't this encounter a ConcurrentModificationException due to removing from the collection during the while-loop iteration? coordinator action repeatedly picked up by cachePurgeWorker of PartitionDependencyManagerService Key: OOZIE-1934 URL: https://issues.apache.org/jira/browse/OOZIE-1934 Project: Oozie Issue Type: Bug Reporter: Ryota Egashira Assignee: Ryota Egashira Attachments: OOZIE-1934.patch cachePurgeWorker trying to remove the same coordinator action repeatedly. Also, log messages need to be added for better debugging -- This message was sent by Atlassian JIRA (v6.2#6252)
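The question above refers to the classic java.util pitfall: mutating a collection directly while iterating it throws ConcurrentModificationException, whereas Iterator.remove() is safe. A generic sketch of the safe purge pattern (names are illustrative, not the actual PartitionDependencyManagerService code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

public class SafeCacheRemoval {
    // Remove all entries matching target via the iterator, which is the only
    // safe way to remove from a collection mid-iteration. Calling
    // cache.remove(...) directly inside the loop would fail fast with
    // ConcurrentModificationException on the next it.next().
    public static int purge(Collection<String> cache, String target) {
        int removed = 0;
        for (Iterator<String> it = cache.iterator(); it.hasNext();) {
            if (it.next().equals(target)) {
                it.remove(); // safe removal through the iterator
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<String> actions = new ArrayList<>(Arrays.asList("a1", "a2", "a1"));
        System.out.println(purge(actions, "a1")); // 2
        System.out.println(actions);              // [a2]
    }
}
```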
[jira] [Commented] (OOZIE-1921) Curator client reports connection loss to ZK under high load
[ https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058087#comment-14058087 ] Mona Chitnis commented on OOZIE-1921: - Okay makes sense to fix any issues in curator itself. But currently not yet isolated exact scenarios when such connection losses start. Will do that and update on this JIRA. Curator client reports connection loss to ZK under high load Key: OOZIE-1921 URL: https://issues.apache.org/jira/browse/OOZIE-1921 Project: Oozie Issue Type: Bug Components: HA Affects Versions: trunk Reporter: Mona Chitnis Fix For: trunk Seeing two types of Connection Loss exceptions via Curator when running Oozie in high load (specifically workflows with ~80 forked actions) h5. [1] (znode transaction type: delete) {code} org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) {code} h5. [2] {code} org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) {code} Tracking a particular job between the ZK trace logs reporting NoNode KeeperExceptions and Oozie logs, found that after encountering the zookeeper exceptions with 'delete' of job lock znode, that particular job never succeeds in getting lock and proceeding. Not that familiar with when Oozie via Curator tries to delete znodes. OOZIE-1906 will introduce the Reaper. Exception stacktrace pointing to Curator code: ConnectionState.getZookeeper() { ... 
boolean localIsConnected = isConnected.get(); if ( !localIsConnected ) { checkTimeouts(); } .. } isConnected is FALSE and so exception is getting thrown from checkTimeouts(). Wasn't able to find any good docs or benchmarks explaining timeout issues Curator would face due to high load. My suspicion is Curator might have limitations in how many concurrent requests for same lock it can handle. In this particular stress test, there are 85 forked actions all contending for same job lock. Hence we should implement some fallback mechanism in Oozie while invoking Curator APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
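The "fallback mechanism" suggested above could be as simple as a bounded retry with backoff around lock acquisition. A generic sketch under that assumption; in practice Curator's own retry policies (e.g. ExponentialBackoffRetry) would be the natural knob to tune, and none of these names are Oozie's actual code:

```java
public class RetrySketch {
    interface Acquire { boolean tryOnce() throws Exception; }

    // Retry a lock acquisition up to maxAttempts with linear backoff.
    // Connection-loss style exceptions are swallowed and retried, mirroring
    // the CuratorConnectionLossException in the stack traces above.
    static boolean acquireWithRetry(Acquire op, int maxAttempts, long sleepMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (op.tryOnce()) return true;
            } catch (Exception connectionLoss) {
                // fall through and retry
            }
            try {
                Thread.sleep(sleepMs * attempt); // linear backoff between attempts
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // caller can fail the command or requeue it
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        boolean ok = acquireWithRetry(() -> ++calls[0] >= 3, 5, 1);
        System.out.println(ok);       // true
        System.out.println(calls[0]); // 3
    }
}
```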
[jira] [Updated] (OOZIE-1911) SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration'
[ https://issues.apache.org/jira/browse/OOZIE-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1911: Attachment: OOZIE-1911-4.patch Final patch attached, reviewed on ReviewBoard SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration' --- Key: OOZIE-1911 URL: https://issues.apache.org/jira/browse/OOZIE-1911 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk Attachments: OOZIE-1911-4.patch In chronological order: Server 1: Job's SLA eventProcessed set to 0101 = Start and End sla processed. Server 2: Receives above job's status event, processes remaining 'duration' sla. eventProcessed now = 0111, but incremented to 1000 due to {code} SLACalculatorMemory.addJobStatus() : 762 if (slaCalc.getEventProcessed() == 7) { slaInfo.setEventProcessed(8); slaMap.remove(jobId); } {code} Back to Server 1: (doing periodic SLA checks) {code} SLACalculatorMemory.updateJobSla() : 483 if ((eventProc & 1) == 0) { // first bit (start-processed) unset if (reg.getExpectedStart() != null) { if (reg.getExpectedStart().getTime() + jobEventLatency < System.currentTimeMillis()) { // goes ahead and enqueues another START_MISS event and DURATION_MET event {code} Conclusion, need to fix that check for least significant bit (and next to it) for 'start' and 'duration' to avoid duplicate events -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1906) Service to periodically remove ZK lock
[ https://issues.apache.org/jira/browse/OOZIE-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14056918#comment-14056918 ] Mona Chitnis commented on OOZIE-1906: - Typo in property name: +public static final String REAPING_THRESHOLD = CONF_PREFIX + ZKLocksService.locks.reaper.threshlod; Can you make the test more transparent by checking the children of the lock znode once you request the lock, and then comparing with zero children after 'x' threshold seconds? Service to periodically remove ZK lock -- Key: OOZIE-1906 URL: https://issues.apache.org/jira/browse/OOZIE-1906 Project: Oozie Issue Type: Bug Components: HA Reporter: Purshotam Shah Assignee: Purshotam Shah Attachments: OOZIE-1906-V1.patch We might have a case where locks are held up and never released because of an exception. The service should remove those locks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1921) Curator client reports connection loss to ZK under high load
[ https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14056929#comment-14056929 ] Mona Chitnis commented on OOZIE-1921: - [~rkanter] did you happen to face similar issues? Curator client reports connection loss to ZK under high load Key: OOZIE-1921 URL: https://issues.apache.org/jira/browse/OOZIE-1921 Project: Oozie Issue Type: Bug Components: HA Affects Versions: trunk Reporter: Mona Chitnis Fix For: trunk Seeing two types of Connection Loss exceptions via Curator when running Oozie in high load h5. [1] (znode transaction type: delete) {code} org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) {code} h5. [2] {code} org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) {code} We should probably implement a fallback approach in Oozie while invoking Curator library to handle any inherent limitations. But not able to find much documentation about Curator benchmarks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1921) Curator client reports connection loss to ZK under high load
Mona Chitnis created OOZIE-1921: --- Summary: Curator client reports connection loss to ZK under high load Key: OOZIE-1921 URL: https://issues.apache.org/jira/browse/OOZIE-1921 Project: Oozie Issue Type: Bug Components: HA Affects Versions: trunk Reporter: Mona Chitnis Fix For: trunk Seeing two types of Connection Loss exceptions via Curator when running Oozie in high load h5. [1] (znode transaction type: delete) {code} org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) {code} h5. [2] {code} org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) {code} We should probably implement a fallback approach in Oozie while invoking Curator library to handle any inherent limitations. But not able to find much documentation about Curator benchmarks. -- This message was sent by Atlassian JIRA (v6.2#6252)
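One possible shape for the fallback approach mentioned above, sketched with hypothetical names (Oozie's actual handling around Curator may differ): wrap the Curator invocation in a bounded retry loop with exponential backoff, so a transient ConnectionLoss under load is retried a few times rather than surfaced immediately.

```java
import java.util.function.Supplier;

public class RetryingInvoker {
    // Hypothetical fallback wrapper: retry a ZK/Curator call up to
    // maxAttempts times, sleeping with exponential backoff between attempts,
    // before rethrowing the last failure.
    static <T> T invokeWithRetry(Supplier<T> call, int maxAttempts, long baseSleepMs) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(baseSleepMs << attempt); // 1x, 2x, 4x, ...
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last;
    }
}
```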
[jira] [Updated] (OOZIE-1921) Curator client reports connection loss to ZK under high load
[ https://issues.apache.org/jira/browse/OOZIE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1921: Description: Seeing two types of Connection Loss exceptions via Curator when running Oozie under high load (specifically workflows with ~80 forked actions) h5. [1] (znode transaction type: delete) {code} org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) {code} h5. [2] {code} org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) {code} Tracking a particular job between the ZK trace logs reporting NoNode KeeperExceptions and the Oozie logs, I found that after encountering the ZooKeeper exceptions on 'delete' of the job lock znode, that particular job never succeeds in getting the lock and proceeding. I'm not that familiar with when Oozie, via Curator, tries to delete znodes. OOZIE-1906 will introduce the Reaper. Exception stacktrace pointing to Curator code: {code} ConnectionState.getZooKeeper() { ... boolean localIsConnected = isConnected.get(); if ( !localIsConnected ) { checkTimeouts(); } .. } {code} isConnected is FALSE, so the exception is thrown from checkTimeouts(). I wasn't able to find any good docs or benchmarks explaining timeout issues Curator would face due to high load. My suspicion is Curator might have limitations in how many concurrent requests for the same lock it can handle. 
In this particular stress test, there are 85 forked actions all contending for same job lock. Hence we should implement some fallback mechanism in Oozie while invoking Curator APIs. was: Seeing two types of Connection Loss exceptions via Curator when running Oozie in high load h5. [1] (znode transaction type: delete) {code} org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) {code} h5. [2] {code} org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) {code} We should probably implement a fallback approach in Oozie while invoking Curator library to handle any inherent limitations. But not able to find much documentation about Curator benchmarks. Curator client reports connection loss to ZK under high load Key: OOZIE-1921 URL: https://issues.apache.org/jira/browse/OOZIE-1921 Project: Oozie Issue Type: Bug Components: HA Affects Versions: trunk Reporter: Mona Chitnis Fix For: trunk Seeing two types of Connection Loss exceptions via Curator when running Oozie in high load (specifically workflows with ~80 forked actions) h5. [1] (znode transaction type: delete) {code} org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198) at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) {code} h5. 
[2] {code} org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) {code} Tracking a particular job between the ZK trace logs reporting NoNode KeeperExceptions and Oozie logs, found that after encountering the zookeeper exceptions with 'delete' of job lock znode, that particular job never succeeds in getting lock and
[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended
[ https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050625#comment-14050625 ] Mona Chitnis commented on OOZIE-1913: - Some discussion points: h5. Approach 1: Change SLA behavior for all jobs on suspend, i.e. do not track SLA for suspended jobs. However, tracking was originally put into place because users need to be notified of their job SLAs in the event of suspension caused by the system (Oozie server restart / transient errors from the Hadoop cluster). So making this change across all suspended jobs would not be ideal. h5. Approach 2: Add a command line option like {{-ignoresla}} along with the suspend command, which will flag the job accordingly in the memory map of the SLA calculator. This then entails two sub-approaches: h6. 2A] On seeing {{-ignoresla}}, set the eventProcessed byte of the SLA entry to {{1000 (8)}} to remove it from being tracked anymore for SLA. The resume command will also need an option like {{-resumesla}} to then add this job back into the SLA map for tracking, along with more options for revised expected end time and expected duration of the job. h6. 2B] If we don't wish to change the eventProcessed byte (so that we don't have to recalculate it), we can add a flag to the job to indicate that SLA should be ignored for this job until unset. However, this requires adding a column to the Sla_Summary table schema to be able to retain this information across Oozie server restarts and in HA mode. 2A seems preferable to me. Thoughts? Devise a way to turn off SLA alerts when bundle/coordinator suspended - Key: OOZIE-1913 URL: https://issues.apache.org/jira/browse/OOZIE-1913 Project: Oozie Issue Type: Improvement Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk From user: Need to turn off the SLA miss alerts in jobs when the bundle is suspended for grid upgrades and similar work so that when it's resumed we aren't flooded with a bunch of alerts. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
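Approach 2A above can be sketched roughly as follows (method and map names are assumptions, not Oozie's API): force the entry out of the SLA calculator's in-memory map and return the eventProcessed sentinel 1000 (8) to persist, so the job stays untracked across restarts and in HA mode.

```java
import java.util.Map;

public class IgnoreSlaSketch {
    // Sentinel meaning "all SLA events processed, stop tracking", matching
    // the eventProcessed == 8 convention quoted from SLACalculatorMemory.
    static final int ALL_PROCESSED = 8;

    // Hypothetical handler for -ignoresla: drop the job from the in-memory
    // SLA map and return the value to persist in the SLA summary row so the
    // job is not re-tracked after a server restart.
    static int disableSla(Map<String, Integer> slaMap, String jobId) {
        slaMap.remove(jobId);
        return ALL_PROCESSED;
    }
}
```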
[jira] [Commented] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended
[ https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050910#comment-14050910 ] Mona Chitnis commented on OOZIE-1913: - Expanding the scope of this problem: Following are the scenarios and use-cases which can be tied in with turning off SLA alerts: h6. [1] Suspend: User-initiated suspend of a bundle/coordinator, specifying an option to turn off SLA alerts. For a coordinator, this can be followed by a list of coordinator actions, or it will be applied to ALL. For a bundle, there is currently no way to enumerate coordinators, so SLA alerting will be turned off for all child coordinators. h6. [2] Rerun: For reprocessing purposes, the same option as above can be given while rerunning a coordinator, turning off SLA alerts for some or all of its actions. h6. [3] Catchup jobs: In backlogged situations, similar to comment#5, the SLA service should identify that the coordinator is catching up and disable alerting automatically. Of course, there should be a job-level minimum time threshold for SLA to mark jobs as catchup, e.g. only turn off alerts if the nominal time is more than 3 days old, or something to that effect. Need to think about whether to specify this threshold like oozie.coordinator.sla.alert.disable.threshold as part of job.properties at submission time. In cases [1]-[3], SLA calculation will go ahead and mark the eventual MET/MISS status for the jobs; only no alerts will be generated. Use case [4] describes an option for resuming SLA tracking for jobs that you are re-processing. h6. [4] On-the-fly update of SLA expected-start, expected-end and expected-duration: Similar to the new feature in Oozie where you can change certain coordinator config such as concurrency, throttle etc. on the fly, one should be able to change the SLA limits given. This can be made as options to 'Rerun' of terminated coordinators, 'Resume' of suspended coordinators, or the 'Change' command. 
Tagging a certain SLA entry for 'disabling alerts' is not going to require any XML changes on the part of the user. This is due to a good design choice we made while implementing SLA. Devise a way to turn off SLA alerts when bundle/coordinator suspended - Key: OOZIE-1913 URL: https://issues.apache.org/jira/browse/OOZIE-1913 Project: Oozie Issue Type: Improvement Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk From user: Need to turn off the SLA miss alerts in jobs when the bundle is suspended for grid upgrades and similar work so that when it's resumed we aren't flooded with a bunch of alerts. -- This message was sent by Atlassian JIRA (v6.2#6252)
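The catchup check in use case [3] can be sketched as below. The 3-day default and the idea of a configurable threshold come from the comment above; the class, method, and constant names are hypothetical.

```java
public class CatchupSketch {
    // Default value one might give a threshold property such as
    // oozie.coordinator.sla.alert.disable.threshold (assumed name).
    static final long THREE_DAYS_MS = 3L * 24 * 60 * 60 * 1000;

    // Treat an action as "catchup" (and suppress SLA alerts for it) when its
    // nominal time is older than the configured threshold.
    static boolean isCatchup(long nominalTimeMs, long nowMs, long thresholdMs) {
        return nowMs - nominalTimeMs > thresholdMs;
    }
}
```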
[jira] [Comment Edited] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended
[ https://issues.apache.org/jira/browse/OOZIE-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050910#comment-14050910 ] Mona Chitnis edited comment on OOZIE-1913 at 7/3/14 12:24 AM: -- Expanding the scope of this problem: Following are the scenarios and use-cases which can be tied in with turning off SLA alerts: h6. [1] Suspend: User-initiated suspend of bundle/coordinator and specifying option to turn off sla alerts. For coordinator, this can be followed by a list of coordinator actions or will be applied to ALL. For bundle, currently no way to enumerate coordinators so sla alerting to be turned off for all children coordinators'. h6. [2] Rerun: For reprocessing purposes, same option as above can be given while rerunning a coordinator and turning off sla-alerts for some or all of its actions. h6. [3] Catchup jobs: In backlogged situations, similar to comment#5, SLA service should identify that coordinator is catching up and disable alerting automatically. Of course, there should be a job-level minimum threshold in terms of time for SLA to mark jobs as catchup e.g. only turn off alerts if nominal time more than 3 days old, or something to that effect. Need to think about whether to specify this threshold like oozie.coordinator.sla.alert.disable.threshold as part of job.properties at submission time. In cases [1]-[3], SLA calculation will go ahead and mark eventual MET/MISS status for the jobs, only no alerts will be generated. Use case [4] enlists an option if you would like to resume SLA tracking for jobs that you are re-processing. h6. [4] On-the-fly update of SLA expected-start, expected-end and expected-duration: Similar to the new feature in Oozie where you can change certain coordinator config such as concurrency, throttle etc on the fly, one should be able to change the SLA limits given. This can be made as options to 'Rerun' of terminated coordinators, 'Resume' of suspended coordinators, or 'Change' command. 
Tagging a certain SLA entry for 'disabling alerts' is not going to require any XML changes on the part of the user. This is due to a good design choice we made while implementing SLA. was (Author: chitnis): Expanding the scope of this problem: Following are the scenarios and use-cases which can be tied in with turning off SLA alerts: h6. [1] Suspend: User-initiated suspend of bundle/coordinator and specifying option to turn off sla alerts. For coordinator, this can be followed by a list of coordinator actions or will be applied to ALL. For bundle, currently no way to enumerate coordinators so sla alerting to be turned off for all children coordinators'. h6. [2] Rerun: For reprocessing purposes, same option as above can be given while rerunning a coordinator and turning off sla-alerts for some or all of its actions. h6. [3] Catchup jobs: In backlogged situations, similar to comment#5, SLA service should identify that coordinator is catching up and disable alerting automatically. Of course, there should be a job-level minimum threshold in terms of time for SLA to mark jobs as catchup e.g. only turn off alerts if nominal time more than 3 days old, or something to that effect. Need to think about whether to specify this threshold like oozie.coordinator.sla.alert.disable.threshold as part of job.properties at submission time. In cases [1]-[3], SLA calculation will go ahead and mark eventual MET/MISS status for the jobs, only no alerts will be generated. Use case [4] enlists an option if you would like to resume SLA tracking for jobs that you are re-processing. h6. [4] On-the-fly update of SLA expected-start, expected-end and expected-duration: Similar to the new feature in Oozie where you can change certain coordinator config such as concurrency, throttle etc on the fly, one should be able to change the SLA limits given. This can be made as options to 'Rerun' of terminated coordinators, 'Resume' of suspended coordinators, or 'Change' command. 
Tagging a certain SLA entry for 'disabling alerts' is not going to require any XML changes on the part of the user. This is due to a good design choice we made while implementing SLA. Devise a way to turn off SLA alerts when bundle/coordinator suspended - Key: OOZIE-1913 URL: https://issues.apache.org/jira/browse/OOZIE-1913 Project: Oozie Issue Type: Improvement Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk From user: Need to turn off the SLA miss alerts in jobs when the bundle is suspended for grid upgrades and similar work so that when it's resumed we aren't flooded with a bunch of alerts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1909) log prefix information missing in JavaActionExecutor.check
[ https://issues.apache.org/jira/browse/OOZIE-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049172#comment-14049172 ] Mona Chitnis commented on OOZIE-1909: - Checked that this is needed because check() can be invoked directly from ActionCheckXCommand. For check() invoked via start(), this wouldn't occur. +1 log prefix information missing in JavaActionExecutor.check -- Key: OOZIE-1909 URL: https://issues.apache.org/jira/browse/OOZIE-1909 Project: Oozie Issue Type: Bug Reporter: Ryota Egashira Assignee: Ryota Egashira Attachments: OOZIE-1909.patch log prefix information (e.g. USER, GROUP, TOKEN, APP, JOB) is missing, as shown in the example. Due to this issue, this log message does not show up in the CLI command ($ oozie job -log 0003855-140611203720-oozie-oozi-W ) {noformat} 2014-06-12 05:00:45,270 WARN PigActionExecutor:542 [pool-2-thread-214] - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[0003855-140611203720-oozie-oozi-W@pig1] Launcher exception: Creation of partition is not supported for hcat://.yahoo.com:/outdatabase1402549124488/outtable1402549124488/datestamp=201302152320;region=usa {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1911) SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration'
Mona Chitnis created OOZIE-1911: --- Summary: SLA calculation in HA mode does wrong bit comparison for 'start' and 'duration' Key: OOZIE-1911 URL: https://issues.apache.org/jira/browse/OOZIE-1911 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk In chronological order: Server 1: Job's SLA eventProcessed set to 0101 = start and end SLA processed. Server 2: Receives the above job's status event, processes the remaining 'duration' SLA. eventProcessed now = 0111, but is incremented to 1000 due to {code} SLACalculatorMemory.addJobStatus() : 762 if (slaCalc.getEventProcessed() == 7) { slaInfo.setEventProcessed(8); slaMap.remove(jobId); } {code} Back to Server 1 (doing periodic SLA checks): {code} SLACalculatorMemory.updateJobSla() : 483 if ((eventProc & 1) == 0) { // first bit (start-processed) unset if (reg.getExpectedStart() != null) { if (reg.getExpectedStart().getTime() + jobEventLatency < System.currentTimeMillis()) { // goes ahead and enqueues another START_MISS event and DURATION_MET event {code} Conclusion: the checks on the least significant bit (and the one next to it) for 'start' and 'duration' need to be fixed to avoid duplicate events -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1913) Devise a way to turn off SLA alerts when bundle/coordinator suspended
Mona Chitnis created OOZIE-1913: --- Summary: Devise a way to turn off SLA alerts when bundle/coordinator suspended Key: OOZIE-1913 URL: https://issues.apache.org/jira/browse/OOZIE-1913 Project: Oozie Issue Type: Improvement Affects Versions: trunk Reporter: Mona Chitnis Assignee: Mona Chitnis Fix For: trunk From user: Need to turn off the SLA miss alerts in jobs when the bundle is suspended for grid upgrades and similar work so that when it's resumed we aren't flooded with a bunch of alerts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1811) Current test failures in trunk
[ https://issues.apache.org/jira/browse/OOZIE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049506#comment-14049506 ] Mona Chitnis commented on OOZIE-1811: - Taking this up. Thanks for shortlisting the subset of tests to reproduce the issue Robert :) Current test failures in trunk -- Key: OOZIE-1811 URL: https://issues.apache.org/jira/browse/OOZIE-1811 Project: Oozie Issue Type: Bug Affects Versions: trunk Reporter: Robert Kanter Assignee: Mona Chitnis Priority: Critical There's a bunch of test failures currently in trunk; I'm not sure what commit(s) is the cause, but I think it was somewhat recent. e.g. https://builds.apache.org/job/oozie-trunk-precommit-build/1199/ Reproducible by running these tests, instead of having to run them all, which takes a lot longer :) {noformat} mvn clean test -Dtest=TestSubWorkflowActionExecutor,TestBunldeChangeXCommand,TestCoordUpdateXCommand,TestCoordJobQueryExecutor,TestStatusTransitService,TestSLAEventGeneration {noformat} {noformat} Results : Failed tests: testCoordinatorActionCommandsSubmitAndStart(org.apache.oozie.sla.TestSLAEventGeneration): expected:...11921-oozie-rkan-C@1[] but was:...11921-oozie-rkan-C@1[2] testCoordStatusTransitServiceDoneWithError(org.apache.oozie.service.TestStatusTransitService): expected:DONEWITHERROR but was:KILLED testBundleStatusTransitRunningFromKilled(org.apache.oozie.service.TestStatusTransitService): expected:RUNNING but was:KILLED Tests in error: testGetList(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor) testInsert(org.apache.oozie.executor.jpa.TestCoordJobQueryExecutor) Tests run: 62, Failures: 3, Errors: 2, Skipped: 0 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (OOZIE-1881) Docs fixes
[ https://issues.apache.org/jira/browse/OOZIE-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038960#comment-14038960 ] Mona Chitnis commented on OOZIE-1881: - Yes, let's make this central. Point 2 is right, but for shell and java actions, since they run in the launcher, we should indicate more prominently to use 'oozie.launcher...' properties, as plain 'mapred.x.y.z' properties will have no effect. Docs fixes -- Key: OOZIE-1881 URL: https://issues.apache.org/jira/browse/OOZIE-1881 Project: Oozie Issue Type: Task Components: docs Affects Versions: trunk, 4.0.0, 4.0.1 Reporter: Mona Chitnis Labels: hcatalog Fix For: trunk Couple of things to include: [1] HCatalog integration doc should include a walkthrough of an example and a useful FAQ [2] Shell action doc should mention the inline conf prefix oozie.launcher -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size
Mona Chitnis created OOZIE-1892: --- Summary: Oozie web-console WF actions window should be auto-size and not fixed size Key: OOZIE-1892 URL: https://issues.apache.org/jira/browse/OOZIE-1892 Project: Oozie Issue Type: Bug Affects Versions: 4.0.1, trunk Reporter: Mona Chitnis Fix For: trunk A user reported the following: facing some issues with the oozie web console UI 1. If I click on a pig action it opens a window with no body 2. If a coordinator has more than 15 instances invoked there is no scroller to see below the 15th row - had to resize the window (very non-intuitive) to get the scroll bar; facing the same issue on two browsers, chrome and firefox. 3. Weird that the window also does not have a horizontal scrollbar. My mouse supports horizontal scrolling and even with that I can't scroll right enough to see the vertical scrollbar. Some suggestions from another UI developer: The problems are largely due to the static height and width applied to some of the divs wrapping the whole table. Playing around in Chrome's developer tools, I was able to get the scrollbars working correctly by: - removing the fixed width from the div with class x-panel - removing the fixed height and width from the div with class x-panel-body - removing the fixed height and width from the div with class x-grid3 - removing the fixed width from the div with class x-grid3-scroller We can fix oozie-console.js to incorporate either the above suggestions or others. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size
[ https://issues.apache.org/jira/browse/OOZIE-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1892: Description: A user reported the following: facing some issue with oozie web console UI 1. If I click on a pig action it opens a window with no body 2. If a coordinator has more than 15 instances invoked there is no scroller to see the below 15th row - had to resize the window (very non-intuitive) to get the scroller bar facing same issue on two browsers chrome and firefox. 3. Weird that the window also does not have a horizontal scrollbar. My mouse supports horizontal scrolling and even with that I can't scroll right enough to see the vertical scrollbar. Some suggestion from another UI developer (Kendall Thrapp): The problems are largely due to the static height and width applied to some of the divs wrapping the whole table. Playing around in Chrome's developer tools I was able to get the scrollbars working correctly by: - remove fixed width from div with class x-panel - remove fixed height and width from div with class x-panel-body - remove fixed height and width from div with class x-grid3 - remove fixed width from div with class x-grid3-scroller We can fix the oozie-console.js to incorporate either the above suggestions or others. was: A user reported the following: facing some issue with oozie web console UI 1. If I click on a pig action it opens a window with no body 2. If a coordinator has more than 15 instances invoked there is no scroller to see the below 15th row - had to resize the window (very non-intuitive) to get the scroller bar facing same issue on two browsers chrome and firefox. 3. Weird that the window also does not have a horizontal scrollbar. My mouse supports horizontal scrolling and even with that I can't scroll right enough to see the vertical scrollbar. 
Some suggestion from another UI developer: The problems are largely due to the static height and width applied to some of the divs wrapping the whole table. Playing around in Chrome's developer tools I was able to get the scrollbars working correctly by: - remove fixed width from div with class x-panel - remove fixed height and width from div with class x-panel-body - remove fixed height and width from div with class x-grid3 - remove fixed width from div with class x-grid3-scroller We can fix the oozie-console.js to incorporate either the above suggestions or others. Oozie web-console WF actions window should be auto-size and not fixed size -- Key: OOZIE-1892 URL: https://issues.apache.org/jira/browse/OOZIE-1892 Project: Oozie Issue Type: Bug Affects Versions: trunk, 4.0.1 Reporter: Mona Chitnis Fix For: trunk A user reported the following: facing some issue with oozie web console UI 1. If I click on a pig action it opens a window with no body 2. If a coordinator has more than 15 instances invoked there is no scroller to see the below 15th row - had to resize the window (very non-intuitive) to get the scroller bar facing same issue on two browsers chrome and firefox. 3. Weird that the window also does not have a horizontal scrollbar. My mouse supports horizontal scrolling and even with that I can't scroll right enough to see the vertical scrollbar. Some suggestion from another UI developer (Kendall Thrapp): The problems are largely due to the static height and width applied to some of the divs wrapping the whole table. Playing around in Chrome's developer tools I was able to get the scrollbars working correctly by: - remove fixed width from div with class x-panel - remove fixed height and width from div with class x-panel-body - remove fixed height and width from div with class x-grid3 - remove fixed width from div with class x-grid3-scroller We can fix the oozie-console.js to incorporate either the above suggestions or others. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size
[ https://issues.apache.org/jira/browse/OOZIE-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037691#comment-14037691 ] Mona Chitnis edited comment on OOZIE-1892 at 6/19/14 7:06 PM: -- Attaching some screenshots !window_without_body.JPG! !window_without_scroller.JPG! was (Author: chitnis): Attaching some screenshots !window_without_body! !window_without_scroller! Oozie web-console WF actions window should be auto-size and not fixed size -- Key: OOZIE-1892 URL: https://issues.apache.org/jira/browse/OOZIE-1892 Project: Oozie Issue Type: Bug Affects Versions: trunk, 4.0.1 Reporter: Mona Chitnis Fix For: trunk Attachments: window_without_body.JPG, window_without_scroller.JPG A user reported the following: facing some issue with oozie web console UI 1. If I click on a pig action it opens a window with no body 2. If a coordinator has more than 15 instances invoked there is no scroller to see the below 15th row - had to resize the window (very non-intuitive) to get the scroller bar facing same issue on two browsers chrome and firefox. 3. Weird that the window also does not have a horizontal scrollbar. My mouse supports horizontal scrolling and even with that I can't scroll right enough to see the vertical scrollbar. Some suggestion from another UI developer (Kendall Thrapp): The problems are largely due to the static height and width applied to some of the divs wrapping the whole table. Playing around in Chrome's developer tools I was able to get the scrollbars working correctly by: - remove fixed width from div with class x-panel - remove fixed height and width from div with class x-panel-body - remove fixed height and width from div with class x-grid3 - remove fixed width from div with class x-grid3-scroller We can fix the oozie-console.js to incorporate either the above suggestions or others. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (OOZIE-1892) Oozie web-console WF actions window should be auto-size and not fixed size
[ https://issues.apache.org/jira/browse/OOZIE-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mona Chitnis updated OOZIE-1892: Attachment: window_without_scroller.JPG window_without_body.JPG Attaching some screenshots !window_without_body! !window_without_scroller! Oozie web-console WF actions window should be auto-size and not fixed size -- Key: OOZIE-1892 URL: https://issues.apache.org/jira/browse/OOZIE-1892 Project: Oozie Issue Type: Bug Affects Versions: trunk, 4.0.1 Reporter: Mona Chitnis Fix For: trunk Attachments: window_without_body.JPG, window_without_scroller.JPG A user reported the following: facing some issue with oozie web console UI 1. If I click on a pig action it opens a window with no body 2. If a coordinator has more than 15 instances invoked there is no scroller to see the below 15th row - had to resize the window (very non-intuitive) to get the scroller bar facing same issue on two browsers chrome and firefox. 3. Weird that the window also does not have a horizontal scrollbar. My mouse supports horizontal scrolling and even with that I can't scroll right enough to see the vertical scrollbar. Some suggestion from another UI developer (Kendall Thrapp): The problems are largely due to the static height and width applied to some of the divs wrapping the whole table. Playing around in Chrome's developer tools I was able to get the scrollbars working correctly by: - remove fixed width from div with class x-panel - remove fixed height and width from div with class x-panel-body - remove fixed height and width from div with class x-grid3 - remove fixed width from div with class x-grid3-scroller We can fix the oozie-console.js to incorporate either the above suggestions or others. -- This message was sent by Atlassian JIRA (v6.2#6252)