[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497547#comment-14497547 ] Jeff Zhang commented on TEZ-714:

Thanks [~bikassaha]. Committed to master.

commit d932579b002f14b81836eeed75f4bf92d4ed7fbf (HEAD, master, TEZ-714)
Author: Jeff Zhang
Date: Thu Apr 16 06:39:34 2015 +0200

    TEZ-714. OutputCommitters should not run in the main AM dispatcher thread (zjffdu)

> OutputCommitters should not run in the main AM dispatcher thread
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, TEZ-714-3.patch, TEZ-714-4.patch,
> TEZ-714-5.patch, TEZ-714-6.patch, TEZ-714-7.patch, TEZ-714-8.patch, TEZ-714-9.patch,
> TEZ-714-10.patch, TEZ-714-11.patch, TEZ-714-12.patch, TEZ-714-13.patch, TEZ-714-14.patch,
> TEZ-714-15.patch, TEZ-714-16.patch, TEZ-714-17.patch, Vertex_2.pdf
>
> Follow-up JIRA from TEZ-41.
> 1) If there are multiple OutputCommitters on a Vertex, they can be run in parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event handling w.r.t. the DAG and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
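The two ideas in the issue description, moving commits off the dispatcher thread and running a vertex's multiple committers in parallel, can be sketched with a plain ExecutorService. The `Committer` interface and `commitAll` method below are illustrative stand-ins, not the actual Tez OutputCommitter API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch: run all committers of a vertex in parallel on a dedicated
// pool so the AM's central dispatcher thread is never blocked by a commit.
public class ParallelCommit {
    // Hypothetical committer interface standing in for Tez's OutputCommitter.
    public interface Committer {
        void commit() throws Exception;
    }

    // Submits every committer to a pool and waits for all of them,
    // rethrowing if any single commit failed.
    public static void commitAll(List<Committer> committers) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, committers.size()));
        try {
            List<Future<?>> results = new ArrayList<>();
            for (Committer c : committers) {
                results.add(pool.submit(() -> { c.commit(); return null; }));
            }
            for (Future<?> f : results) {
                f.get(); // throws ExecutionException if a commit failed
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the dispatcher thread only submits and never joins inline, DAG event handling continues while commits run, which is exactly the back-pressure problem points 2) and 3) describe.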
[jira] [Created] (TEZ-2328) Add tez.runtime.sorter.class & rename tez.runtime.sort.threads to tez.runtime.pipelinedsorter.sort.threads
Rajesh Balamohan created TEZ-2328:

Summary: Add tez.runtime.sorter.class & rename tez.runtime.sort.threads to tez.runtime.pipelinedsorter.sort.threads
Key: TEZ-2328
URL: https://issues.apache.org/jira/browse/TEZ-2328
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher
[ https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497415#comment-14497415 ] Siddharth Seth commented on TEZ-1897:

This is fine, as long as we don't move the central dispatcher itself to run on multiple threads. Alternatively, specific events could be moved to run on different dispatchers.

> Allow higher concurrency in AsyncDispatcher
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
> Currently, it processes events on a single thread. For events that can be executed in parallel, e.g. vertex manager events, allowing higher concurrency may be beneficial.
[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher
[ https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497417#comment-14497417 ] Siddharth Seth commented on TEZ-1897:

By the way, findbugs will likely report this as a new warning, since the checks are based on counts.
[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher
[ https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497409#comment-14497409 ] Bikas Saha commented on TEZ-1897:

From what I see, the drain code is dead, since no one calls setDrainEventsOnStop(), which is what would actually create a user for the draining logic. We can remove that code, I think.
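For context, the drain-on-stop idiom the comment refers to works roughly like this generic sketch (not the actual AsyncDispatcher code): the event loop keeps consuming after stop() only if a caller opted in, which is exactly why the flag is dead code when nothing calls setDrainEventsOnStop():

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Generic sketch of drain-on-stop. The drainEventsOnStop flag only matters
// if some caller opts in via setDrainEventsOnStop(); otherwise the flag and
// all its plumbing are dead code, as observed in the comment above.
public class DrainingDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean stopped = false;
    private volatile boolean drainEventsOnStop = false;

    public void setDrainEventsOnStop() { drainEventsOnStop = true; }

    public void dispatch(Runnable event) {
        if (!stopped) queue.add(event);
    }

    // Processes events until stopped; on stop, optionally drains whatever
    // is still queued before returning.
    public void run() throws InterruptedException {
        while (!stopped || (drainEventsOnStop && !queue.isEmpty())) {
            Runnable event = queue.poll(10, TimeUnit.MILLISECONDS);
            if (event != null) event.run();
        }
    }

    public void stop() { stopped = true; }
}
```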
[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher
[ https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497398#comment-14497398 ] Hitesh Shah commented on TEZ-1897:

Need to dig a bit deeper. The basic changes look OK, but the handling of the drained flag may need to be tweaked depending on how the event processing works with multiple threads in play.
[jira] [Commented] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs
[ https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497395#comment-14497395 ] Bikas Saha commented on TEZ-2323:

lgtm

> Fix TestOrderedWordcount to use MR memory configs
>
> Key: TEZ-2323
> URL: https://issues.apache.org/jira/browse/TEZ-2323
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Yesha Vora
> Assignee: Hitesh Shah
> Attachments: TEZ-2323.1.patch, TEZ-2323.2.patch
>
> TestOrderedWordcount takes a combination of configs from mapred-site.xml and tez-site.xml. Because it mixes mapred and tez configs, it fails with the error below.
> {noformat}
> 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from input stream, eventType=TASK_ATTEMPT_FINISHED
> 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be larger than 0 and should be less than the available task memory (MB):246
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>     at org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
>     at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
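The failing check in the trace above is a Guava-style precondition: the configured tez.runtime.io.sort.mb must be positive and smaller than the memory actually available to the task (512 MB configured vs. 246 MB available here). A self-contained sketch of the same validation follows; it mirrors the check made in ExternalSorter.getInitialMemoryRequirement but is not the verbatim Tez implementation:

```java
// Sketch of the sort-buffer validation that produces the error above.
// Throws IllegalArgumentException with the same style of message when the
// configured buffer does not fit in the task's available memory.
public class SortBufferCheck {
    static final long MB = 1024L * 1024L;

    // sortMb: value of tez.runtime.io.sort.mb; availableTaskBytes: memory
    // granted to the task. Returns the initial memory requirement in bytes.
    public static long initialMemoryRequirement(int sortMb, long availableTaskBytes) {
        long availableMb = availableTaskBytes / MB;
        if (sortMb <= 0 || sortMb >= availableMb) {
            throw new IllegalArgumentException(
                "tez.runtime.io.sort.mb " + sortMb
                + " should be larger than 0 and should be less than the"
                + " available task memory (MB):" + availableMb);
        }
        return sortMb * MB;
    }
}
```

This is why mixing mapred memory settings (small container) with tez settings (large sort buffer) fails at output initialization rather than at submit time.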
[jira] [Commented] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events
[ https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497377#comment-14497377 ] Mit Desai commented on TEZ-2282:

bq. it seems like we should probably add a timestamp to the log message

I thought having a start marker to differentiate attempts was the purpose, but I can definitely work on adding the timestamp. I will also do the same when the task completes, and in the DAG App Master start/stop logic.

> Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events
>
> Key: TEZ-2282
> URL: https://issues.apache.org/jira/browse/TEZ-2282
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Mit Desai
> Attachments: TEZ-2282.1.patch, TEZ-2282.2.patch, TEZ-2282.master.1.patch
>
> This could help with debugging in cases where logging is task specific. For example, when the GC log is going to stdout, it will be nice to see task attempt start/stop times.
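What the thread converges on, a delimiter line prefixed with a timestamp at attempt start and stop, could look like the sketch below. The method name and message format are illustrative, not the committed patch:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative delimiter for reused-container logs: a timestamped marker
// written to stdout/stderr when a task attempt starts or finishes, so GC
// and other task-specific output can be attributed to the right attempt.
// (SimpleDateFormat is not thread-safe; a real implementation would use a
// per-thread or immutable formatter.)
public class AttemptLogDelimiter {
    private static final SimpleDateFormat FMT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");

    // event is e.g. "starting" or "finished".
    public static String marker(String event, String attemptId, long timeMillis) {
        return FMT.format(new Date(timeMillis))
                + " Container task " + event + ", taskAttemptId=" + attemptId;
    }

    public static void main(String[] args) {
        System.out.println(marker("starting",
                "attempt_1428572510173_0219_1_08_000872_0",
                System.currentTimeMillis()));
    }
}
```

With the timestamp in the marker itself, a delimiter in stdout can be correlated against syslog and GC output even when the container is reused by many attempts.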
[jira] [Commented] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497375#comment-14497375 ] Rohini Palaniswamy commented on TEZ-2317:

+1 on the patch. Looks good. But let me test it out as well before you commit it.

> Successful task attempts getting killed
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Bikas Saha
> Fix For: 0.7.0
> Attachments: AM-taskkill.log, TEZ-2317.1.patch
[jira] [Commented] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497372#comment-14497372 ] Bikas Saha commented on TEZ-2317:

[~jeagles] [~hitesh] Please review.
[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher
[ https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497368#comment-14497368 ] Bikas Saha commented on TEZ-1897:

The findbugs warning is in existing code, not introduced in this patch, and is fine to ignore. Since there are no concerns about improving the dispatcher to schedule on multiple threads (even though this patch does not do so), let us proceed and review the patch. This essentially still runs the central dispatcher on a single thread, but instead of explicitly creating the thread ourselves we use a thread created in a thread pool. So everything stays the same. Having the code in place allows experimentation with scenarios where increasing the thread count may help; e.g. speculation events could be executed concurrently. [~zjffdu] [~sseth] [~hitesh] [~rajesh.balamohan] Please review.
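The refactor described, keeping a single logical dispatcher but sourcing its thread from a pool so the thread count can later be raised, can be sketched generically like this (a sketch of the idea, not the patch itself):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the change described above: the event loop is unchanged, but
// instead of "new Thread(eventLoop).start()" the loop is submitted to an
// executor. With poolSize == 1, behavior is identical to the old explicit
// thread; raising poolSize later allows experiments with concurrent
// handling (e.g. speculation events).
public class PooledDispatcher {
    private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
    private final ExecutorService pool;
    private volatile boolean stopped = false;

    public PooledDispatcher(int poolSize) {
        pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.submit(this::eventLoop);
        }
    }

    public void dispatch(Runnable event) { eventQueue.add(event); }

    private void eventLoop() {
        try {
            while (!stopped) {
                Runnable event = eventQueue.poll(10, TimeUnit.MILLISECONDS);
                if (event != null) event.run();
            }
        } catch (InterruptedException ignored) {
            // shutting down
        }
    }

    public void stop() throws InterruptedException {
        stopped = true;
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

With a single pool thread, event ordering is preserved exactly as before, which is why "everything stays the same" while the door is left open for higher concurrency.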
[jira] [Commented] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497352#comment-14497352 ] TezQA commented on TEZ-2317:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12725731/TEZ-2317.1.patch
against master revision 19378d5.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/472//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/472//console

This message is automatically generated.
Success: TEZ-2317 PreCommit Build #472
Jira: https://issues.apache.org/jira/browse/TEZ-2317
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/472/

### LAST 60 LINES OF THE CONSOLE ###
[...truncated 2764 lines...]
[INFO] Final Memory: 70M/988M
==
Adding comment to Jira.
==
Comment added.
dd0519c53a92d3e0ed3a9a3ec755d464331d87ac logged out
==
Finished build.
==
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #471
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2550489 bytes
Compression is 7.2%
Took 1.6 sec
Description set: TEZ-2317
Recording test results
Email was triggered for: Success
Sending email for trigger: Success

### FAILED TESTS (if any) ###
All tests passed
[jira] [Comment Edited] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events
[ https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497299#comment-14497299 ] Hitesh Shah edited comment on TEZ-2282 at 4/15/15 11:27 PM:

[~mitdesai] Looking at [~jeagles]'s description, it seems like we should probably add a timestamp to the log message. Maybe prefix the message with a timestamp? Also, it might be helpful to add a log whenever the task completes (after calling close()). In addition to this, I think it might be good to have the same logic in the DAG App Master for each dag start/stop.

was (Author: hitesh): [~mitdesai] Looking at [~jeagles]'s description, it seems like we should probably add a timestamp to the log message. Maybe prefix the message with a timestamp?
[jira] [Commented] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events
[ https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497299#comment-14497299 ] Hitesh Shah commented on TEZ-2282:

[~mitdesai] Looking at [~jeagles]'s description, it seems like we should probably add a timestamp to the log message. Maybe prefix the message with a timestamp?
[jira] [Updated] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2317:

Attachment: TEZ-2317.1.patch

Updating patch with test.
[jira] [Updated] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2317:

Attachment: (was: TEZ-2317.1.patch)
[jira] [Updated] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2317:

Attachment: TEZ-2317.1.patch

[~rohini] Can you please try with your Pig Processor fix and this patch applied? Both together should resolve all the unnecessary kills.
[jira] [Updated] (TEZ-2327) NPE in shuffle
[ https://issues.apache.org/jira/browse/TEZ-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated TEZ-2327:

Attachment: (was: am.log.gz)

> NPE in shuffle
>
> Key: TEZ-2327
> URL: https://issues.apache.org/jira/browse/TEZ-2327
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Siddharth Seth
>
> {noformat}
> 2015-04-15 15:19:46,529 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1428572510173_0219_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Reducer 2, taskAttemptId=attempt_1428572510173_0219_1_08_000872_0, startTime=1429136298733, finishTime=1429136386528, timeTaken=87795, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running task:java.lang.NullPointerException
>     at sun.net.www.http.KeepAliveStream.close(KeepAliveStream.java:93)
>     at java.io.FilterInputStream.close(FilterInputStream.java:181)
>     at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.close(HttpURLConnection.java:3395)
>     at java.io.BufferedInputStream.close(BufferedInputStream.java:483)
>     at java.io.FilterInputStream.close(FilterInputStream.java:181)
>     at org.apache.tez.runtime.library.common.shuffle.HttpConnection.cleanup(HttpConnection.java:278)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:644)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:634)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdown(Fetcher.java:629)
>     at org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager.shutdown(ShuffleManager.java:759)
>     at org.apache.tez.runtime.library.input.UnorderedKVInput.close(UnorderedKVInput.java:209)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:347)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:182)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This caused the task in question to fail.
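The trace shows the NPE originating inside the JDK's KeepAliveStream.close() while HttpConnection.cleanup() closes the input stream during shuffle shutdown. A common mitigation for this class of bug, sketched below as the general idiom rather than the actual fix committed for this JIRA, is to treat any exception thrown during cleanup as non-fatal, since the attempt is already tearing down:

```java
import java.io.Closeable;

// Sketch: close() during shutdown/cleanup can itself throw (here, an NPE
// from deep inside the JDK's keep-alive stream handling). Failures while
// closing should be logged and swallowed rather than allowed to fail the
// task attempt with FRAMEWORK_ERROR.
public class QuietCleanup {
    // Returns true if the resource closed cleanly, false if close() threw.
    public static boolean closeQuietly(Closeable c) {
        if (c == null) return true;
        try {
            c.close();
            return true;
        } catch (Exception e) { // covers IOException and runtime NPEs alike
            // In real code: LOG.warn("Ignoring error during cleanup", e);
            return false;
        }
    }
}
```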
[jira] [Commented] (TEZ-2327) NPE in shuffle
[ https://issues.apache.org/jira/browse/TEZ-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497166#comment-14497166 ] Sergey Shelukhin commented on TEZ-2327:

The logs are too big; I will share them separately.
[jira] [Updated] (TEZ-2327) NPE in shuffle
[ https://issues.apache.org/jira/browse/TEZ-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated TEZ-2327:

Attachment: am.log.gz

AM logs
[jira] [Updated] (TEZ-119) The AM-RM heartbeat interval should not be static
[ https://issues.apache.org/jira/browse/TEZ-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-119:

Assignee: (was: Bikas Saha)

> The AM-RM heartbeat interval should not be static
>
> Key: TEZ-119
> URL: https://issues.apache.org/jira/browse/TEZ-119
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
>
> AMs should be more aggressive in heartbeating to the RM - especially soon after job start to get the initial set of containers, and also in the general case where allocations are pending.
[jira] [Created] (TEZ-2327) NPE in shuffle
Sergey Shelukhin created TEZ-2327:

Summary: NPE in shuffle
Key: TEZ-2327
URL: https://issues.apache.org/jira/browse/TEZ-2327
Project: Apache Tez
Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth

{noformat}
2015-04-15 15:19:46,529 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1428572510173_0219_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Reducer 2, taskAttemptId=attempt_1428572510173_0219_1_08_000872_0, startTime=1429136298733, finishTime=1429136386528, timeTaken=87795, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running task:java.lang.NullPointerException
    at sun.net.www.http.KeepAliveStream.close(KeepAliveStream.java:93)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.close(HttpURLConnection.java:3395)
    at java.io.BufferedInputStream.close(BufferedInputStream.java:483)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at org.apache.tez.runtime.library.common.shuffle.HttpConnection.cleanup(HttpConnection.java:278)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:644)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:634)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdown(Fetcher.java:629)
    at org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager.shutdown(ShuffleManager.java:759)
    at org.apache.tez.runtime.library.input.UnorderedKVInput.close(UnorderedKVInput.java:209)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:347)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:182)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
This caused the task in question to fail.
[jira] [Commented] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497163#comment-14497163 ] Bikas Saha commented on TEZ-2317: -
bq. Optimize by not sending a commit go/no-go request if there is no hdfs output (DataSink) involved. In the above case, it is always intermediate output
Fix in Pig.
bq. Handle the commit go/no-go request after processing events in the event queue. May be something like ask the task to come back after some time
In this jira.
bq. We saw that for 3058 KilledTaskAttempts TA_KILL_REQUEST events was 383519. This is way high.
That is because each canCommit request from the task was resulting in a kill event being enqueued. Not killing (in this jira) will fix that.
bq. In the attached AM-taskkill.log which has grepped statements for a single task that was killed, it has 327 repeats of below message. Need to see why so much and fix that.
The log happens for each canCommit call from the task that gets denied because the AM task state is not running. Can change it to debug in this patch. The Pig processor is calling canCommit every 100ms.
> Successful task attempts getting killed > --- > > Key: TEZ-2317 > URL: https://issues.apache.org/jira/browse/TEZ-2317 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy >Assignee: Bikas Saha > Fix For: 0.7.0 > > Attachments: AM-taskkill.log > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
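The comment above notes that the Pig processor polls canCommit every 100ms, which flooded the AM with requests. A minimal sketch of client-side polling with exponential backoff follows; the `Umbilical` interface and method names are assumptions for illustration, not Tez's actual task-AM protocol:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: poll an AM "can I commit?" endpoint with exponential
// backoff instead of a fixed 100ms loop, reducing load on the AM while
// staying responsive early on.
public final class CommitPoller {
    interface Umbilical {
        boolean canCommit() throws InterruptedException; // assumed API shape
    }

    static boolean awaitCommitGoAhead(Umbilical am, int maxAttempts)
            throws InterruptedException {
        long waitMs = 100; // start where the fixed interval used to be
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (am.canCommit()) {
                return true;
            }
            TimeUnit.MILLISECONDS.sleep(waitMs);
            waitMs = Math.min(waitMs * 2, 5_000); // back off, cap at 5s
        }
        return false; // AM never said go; caller treats this as no-go
    }
}
```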
[jira] [Resolved] (TEZ-2324) Dynamic heartbeat intervals between RM and AM
[ https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-2324. - Resolution: Duplicate > Dynamic heartbeat intervals between RM and AM > - > > Key: TEZ-2324 > URL: https://issues.apache.org/jira/browse/TEZ-2324 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Bikas Saha > > Currently there is a config (10ms) which can be an issue for large clusters > with many jobs heartbeating to the RM. We should be able to scale it up and > down based on outstanding requests etc. e.g. if there are no outstanding > requests then ping on larger intervals. The heuristics may be more > sophisticated than that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-119) The AM-RM heartbeat interval should not be static
[ https://issues.apache.org/jira/browse/TEZ-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-119: --- Issue Type: Sub-task (was: Improvement) Parent: TEZ-753 > The AM-RM heartbeat interval should not be static > - > > Key: TEZ-119 > URL: https://issues.apache.org/jira/browse/TEZ-119 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Bikas Saha > > AMs should be more aggressive in heartbeating to the RM - especially soon > after job start to get the initial set of containers, and also in the general > case where allocations are pending. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2323 PreCommit Build #471
Jira: https://issues.apache.org/jira/browse/TEZ-2323 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/471/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2766 lines...] [INFO] Final Memory: 73M/1113M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725696/TEZ-2323.2.patch against master revision 19378d5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/471//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/471//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 816b60535179bfaa69f39786e660c5af148afb07 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #467 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2551229 bytes Compression is 7.2% Took 0.58 sec Description set: TEZ-2323 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs
[ https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497150#comment-14497150 ] TezQA commented on TEZ-2323: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725696/TEZ-2323.2.patch against master revision 19378d5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/471//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/471//console This message is automatically generated. > Fix TestOrderedWordcount to use MR memory configs > - > > Key: TEZ-2323 > URL: https://issues.apache.org/jira/browse/TEZ-2323 > Project: Apache Tez > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Hitesh Shah > Attachments: TEZ-2323.1.patch, TEZ-2323.2.patch > > > TestOrderedwordcount takes combination of configs from mapred-site.xml and > tez-site.xml. Due to considering mix of the mapred and tez configs, it fails > with below error. 
> {noformat} > 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from > input stream, eventType=TASK_ATTEMPT_FINISHED > 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from > input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, > taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, > finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, > errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running > task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be > larger than 0 and should be less than the available task memory (MB):246 > at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304) > at > org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
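The failure quoted above is a sanity check in ExternalSorter rejecting a 512MB sort buffer against only 246MB of available task memory. A minimal sketch of that kind of precondition is below (a simplified illustration, not the actual Tez code):

```java
// Sketch of the memory sanity check behind the failure above: the sort
// buffer must be positive and must fit inside the memory actually
// granted to the task.
public final class SortBufferCheck {
    static long initialMemoryRequirement(int sortMb, long availableTaskMemoryMb) {
        if (sortMb <= 0 || sortMb >= availableTaskMemoryMb) {
            throw new IllegalArgumentException(
                "tez.runtime.io.sort.mb " + sortMb
                + " should be larger than 0 and should be less than the"
                + " available task memory (MB):" + availableTaskMemoryMb);
        }
        return (long) sortMb << 20; // requirement in bytes
    }
}
```

With 512 configured and 246 available, the check fails exactly as in the quoted log; the fix in this jira is to make the test pick memory configs consistently rather than mixing mapred-site.xml and tez-site.xml values.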
[jira] [Commented] (TEZ-2320) GroupByOrderByMRRTest not functional in branch 0.6
[ https://issues.apache.org/jira/browse/TEZ-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497126#comment-14497126 ] Hitesh Shah commented on TEZ-2320: -- Ran this against the top of branch 0.6. Could not reproduce. [~tiwari] If you are building from source, can you apply the TEZ-2190 patch and re-try the run? > GroupByOrderByMRRTest not functional in branch 0.6 > --- > > Key: TEZ-2320 > URL: https://issues.apache.org/jira/browse/TEZ-2320 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > > Reported by [~tiwari] in TEZ-1581. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher
[ https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497117#comment-14497117 ] TezQA commented on TEZ-1897: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725692/TEZ-1897.3.patch against master revision 19378d5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/470//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/470//artifact/patchprocess/newPatchFindbugsWarningstez-common.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/470//console This message is automatically generated. > Allow higher concurrency in AsyncDispatcher > --- > > Key: TEZ-1897 > URL: https://issues.apache.org/jira/browse/TEZ-1897 > Project: Apache Tez > Issue Type: Task >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch > > > Currently, it processes events on a single thread. For events that can be > executed in parallel, e.g. vertex manager events, allowing higher concurrency > may be beneficial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-1897 PreCommit Build #470
Jira: https://issues.apache.org/jira/browse/TEZ-1897 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/470/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2769 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725692/TEZ-1897.3.patch against master revision 19378d5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/470//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/470//artifact/patchprocess/newPatchFindbugsWarningstez-common.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/470//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 46267d54a4e88d8295bd4d87f42b26a7f30b6575 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #467 Archived 44 artifacts Archive block size is 32768 Received 26 blocks and 1896974 bytes Compression is 31.0% Took 0.49 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Resolved] (TEZ-2326) Update branch 0.6 version to 0.6.1-SNAPSHOT
[ https://issues.apache.org/jira/browse/TEZ-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah resolved TEZ-2326. -- Resolution: Fixed Fix Version/s: 0.6.1 Committed to branch 0.6 > Update branch 0.6 version to 0.6.1-SNAPSHOT > --- > > Key: TEZ-2326 > URL: https://issues.apache.org/jira/browse/TEZ-2326 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Minor > Fix For: 0.6.1 > > Attachments: TEZ-2326.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2326) Update branch 0.6 version to 0.6.1-SNAPSHOT
[ https://issues.apache.org/jira/browse/TEZ-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2326: - Summary: Update branch 0.6 version to 0.6.1-SNAPSHOT (was: Update branch 0.6 version ) > Update branch 0.6 version to 0.6.1-SNAPSHOT > --- > > Key: TEZ-2326 > URL: https://issues.apache.org/jira/browse/TEZ-2326 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Minor > Attachments: TEZ-2326.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2326) Update branch 0.6 version
[ https://issues.apache.org/jira/browse/TEZ-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2326: - Attachment: TEZ-2326.1.patch Update to 0.6.1-SNAPSHOT > Update branch 0.6 version > -- > > Key: TEZ-2326 > URL: https://issues.apache.org/jira/browse/TEZ-2326 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah >Priority: Minor > Attachments: TEZ-2326.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2326) Update branch 0.6 version
Hitesh Shah created TEZ-2326: Summary: Update branch 0.6 version Key: TEZ-2326 URL: https://issues.apache.org/jira/browse/TEZ-2326 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Hitesh Shah Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs
[ https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2323: - Attachment: TEZ-2323.2.patch \cc [~rajesh.balamohan] > Fix TestOrderedWordcount to use MR memory configs > - > > Key: TEZ-2323 > URL: https://issues.apache.org/jira/browse/TEZ-2323 > Project: Apache Tez > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Hitesh Shah > Attachments: TEZ-2323.1.patch, TEZ-2323.2.patch > > > TestOrderedwordcount takes combination of configs from mapred-site.xml and > tez-site.xml. Due to considering mix of the mapred and tez configs, it fails > with below error. > {noformat} > 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from > input stream, eventType=TASK_ATTEMPT_FINISHED > 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from > input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, > taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, > finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, > errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running > task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be > larger than 0 and should be less than the available task memory (MB):246 > at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304) > at > org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1653) Dynamic heartbeat intervals between task and AM
[ https://issues.apache.org/jira/browse/TEZ-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1653: Issue Type: Sub-task (was: Task) Parent: TEZ-753 > Dynamic heartbeat intervals between task and AM > --- > > Key: TEZ-1653 > URL: https://issues.apache.org/jira/browse/TEZ-1653 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Bikas Saha > > The interval is fixed now. Based on load (number of running tasks etc) the > container/task heartbeat could be adjusted. The AM could return the next > heartbeat interval in the response. The interval could be small when num > tasks is small and large when num tasks is high. This will help with AM > scalability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2324) Dynamic heartbeat intervals between RM and AM
[ https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2324: Issue Type: Sub-task (was: Bug) Parent: TEZ-753 > Dynamic heartbeat intervals between RM and AM > - > > Key: TEZ-2324 > URL: https://issues.apache.org/jira/browse/TEZ-2324 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Bikas Saha > > Currently there is a config (10ms) which can be an issue for large clusters > with many jobs heartbeating to the RM. We should be able to scale it up and > down based on outstanding requests etc. e.g. if there are no outstanding > requests then ping on larger intervals. The heuristics may be more > sophisticated than that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2294) Add tez-site-template.xml with description of config properties
[ https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497056#comment-14497056 ] Jonathan Eagles commented on TEZ-2294: -- [~rajesh.balamohan], I think TEZ-963 is trying to accomplish the same effort. Can you close as duplicate if so? > Add tez-site-template.xml with description of config properties > --- > > Key: TEZ-2294 > URL: https://issues.apache.org/jira/browse/TEZ-2294 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2294.wip.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2325) Route status update event directly to the attempt
[ https://issues.apache.org/jira/browse/TEZ-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497053#comment-14497053 ] Siddharth Seth commented on TEZ-2325: - +1. Long pending, instead of routing everything to the Vertex. > Route status update event directly to the attempt > -- > > Key: TEZ-2325 > URL: https://issues.apache.org/jira/browse/TEZ-2325 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha > > Today, all events from the attempt heartbeat are routed to the vertex. then > the vertex routes (if any) status update events to the attempt. This is > unnecessary and potentially creates out of order scenarios. We could route > the status update events directly to attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2324) Dynamic heartbeat intervals between RM and AM
[ https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497052#comment-14497052 ] Siddharth Seth commented on TEZ-2324: - There's already a jira for this, and linked under the scalability improvement jira. > Dynamic heartbeat intervals between RM and AM > - > > Key: TEZ-2324 > URL: https://issues.apache.org/jira/browse/TEZ-2324 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha > > Currently there is a config (10ms) which can be an issue for large clusters > with many jobs heartbeating to the RM. We should be able to scale it up and > down based on outstanding requests etc. e.g. if there are no outstanding > requests then ping on larger intervals. The heuristics may be more > sophisticated than that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1897) Allow higher concurrency in AsyncDispatcher
[ https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1897: Attachment: TEZ-1897.3.patch > Allow higher concurrency in AsyncDispatcher > --- > > Key: TEZ-1897 > URL: https://issues.apache.org/jira/browse/TEZ-1897 > Project: Apache Tez > Issue Type: Task >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch > > > Currently, it processes events on a single thread. For events that can be > executed in parallel, e.g. vertex manager events, allowing higher concurrency > may be beneficial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
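The idea in TEZ-1897 is to let independent event types be handled concurrently while a single type keeps FIFO order. One way to sketch that (an illustration of the approach, not the attached patch) is a dispatcher with one single-threaded executor per event class:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch only: per-event-class "lanes" preserve ordering within a class
// (each lane is a single-threaded executor) while independent classes,
// e.g. vertex manager events, run concurrently with everything else.
public final class ConcurrentDispatcher {
    private final Map<Class<?>, ExecutorService> lanes = new ConcurrentHashMap<>();
    private final Map<Class<?>, Consumer<Object>> handlers = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public <T> void register(Class<T> type, Consumer<T> handler) {
        handlers.put(type, (Consumer<Object>) handler);
        lanes.put(type, Executors.newSingleThreadExecutor());
    }

    public void dispatch(Object event) {
        ExecutorService lane = lanes.get(event.getClass());
        Consumer<Object> handler = handlers.get(event.getClass());
        if (lane == null || handler == null) {
            throw new IllegalStateException("no handler for " + event.getClass());
        }
        lane.execute(() -> handler.accept(event)); // async, ordered per class
    }

    public void shutdown() {
        lanes.values().forEach(ExecutorService::shutdown);
    }
}
```

Events of the same class still arrive at their handler in submission order, which is the invariant the single-threaded AsyncDispatcher currently provides globally.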
[jira] [Commented] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs
[ https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496988#comment-14496988 ] TezQA commented on TEZ-2323: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725669/TEZ-2323.1.patch against master revision 19378d5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestSecureShuffle Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/469//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/469//console This message is automatically generated. > Fix TestOrderedWordcount to use MR memory configs > - > > Key: TEZ-2323 > URL: https://issues.apache.org/jira/browse/TEZ-2323 > Project: Apache Tez > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Hitesh Shah > Attachments: TEZ-2323.1.patch > > > TestOrderedwordcount takes combination of configs from mapred-site.xml and > tez-site.xml. Due to considering mix of the mapred and tez configs, it fails > with below error. 
> {noformat} > 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from > input stream, eventType=TASK_ATTEMPT_FINISHED > 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from > input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, > taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, > finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, > errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running > task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be > larger than 0 and should be less than the available task memory (MB):246 > at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304) > at > org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2323 PreCommit Build #469
Jira: https://issues.apache.org/jira/browse/TEZ-2323 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/469/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2525 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725669/TEZ-2323.1.patch against master revision 19378d5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestSecureShuffle Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/469//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/469//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 520e3e8e00548ce828beb1715efce99c313a6c02 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #467 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2555813 bytes Compression is 7.1% Took 0.83 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 2 tests failed. 
REGRESSION: org.apache.tez.test.TestSecureShuffle.testSecureShuffle[test[sslInCluster:true, resultWithTezSSL:0, resultWithoutTezSSL:1]]
Error Message: expected:<1> but was:<0>
Stack Trace:
java.lang.AssertionError: expected:<1> but was:<0>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.tez.test.TestSecureShuffle.baseTest(TestSecureShuffle.java:144)
	at org.apache.tez.test.TestSecureShuffle.testSecureShuffle(TestSecureShuffle.java:162)

REGRESSION: org.apache.tez.test.TestSecureShuffle.testSecureShuffle[test[sslInCluster:false, resultWithTezSSL:1, resultWithoutTezSSL:0]]
Error Message: expected:<1> but was:<0>
Stack Trace:
java.lang.AssertionError: expected:<1> but was:<0>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.tez.test.TestSecureShuffle.baseTest(TestSecureShuffle.java:144)
	at org.apache.tez.test.TestSecureShuffle.testSecureShuffle(TestSecureShuffle.java:157)
[jira] [Created] (TEZ-2325) Route status update event directly to the attempt
Bikas Saha created TEZ-2325: --- Summary: Route status update event directly to the attempt Key: TEZ-2325 URL: https://issues.apache.org/jira/browse/TEZ-2325 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Today, all events from the attempt heartbeat are routed to the vertex. then the vertex routes (if any) status update events to the attempt. This is unnecessary and potentially creates out of order scenarios. We could route the status update events directly to attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
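The change described above drops the vertex hop for status updates. A toy sketch of the routing difference follows; the `Attempt` interface and registry are illustrative, not Tez's actual event classes:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative routing sketch: a status update already carries its attempt
// id, so the heartbeat handler can dispatch it straight to the attempt
// instead of sending every event to the vertex for re-routing.
public final class StatusRouter {
    interface Attempt {
        void onStatusUpdate(String payload);
    }

    private final Map<String, Attempt> attempts = new HashMap<>();

    void registerAttempt(String attemptId, Attempt attempt) {
        attempts.put(attemptId, attempt);
    }

    // Direct route: heartbeat -> attempt, skipping the vertex hop that
    // could reorder events relative to other attempt events.
    void route(String attemptId, String payload) {
        Attempt a = attempts.get(attemptId);
        if (a != null) { // attempt may already have finished
            a.onStatusUpdate(payload);
        }
    }
}
```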
[jira] [Updated] (TEZ-2324) Dynamic heartbeat intervals between RM and AM
[ https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2324: Description: Currently there is a config (10ms) which can be an issue for large clusters with many jobs heartbeating to the RM. We should be able to scale it up and down based on outstanding requests etc. e.g. if there are no outstanding requests then ping on larger intervals. The heuristics may be more sophisticated than that. (was: Currently there is a config (10ms) which can be an issue for large clusters with many jobs heartbeating to the RM. We should be able to scale it up and down based on outstanding requests etc. e.g. if there are no outstanding requests then ping on larger intervals.) > Dynamic heartbeat intervals between RM and AM > - > > Key: TEZ-2324 > URL: https://issues.apache.org/jira/browse/TEZ-2324 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha > > Currently there is a config (10ms) which can be an issue for large clusters > with many jobs heartbeating to the RM. We should be able to scale it up and > down based on outstanding requests etc. e.g. if there are no outstanding > requests then ping on larger intervals. The heuristics may be more > sophisticated than that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2294) Add tez-site-template.xml with description of config properties
[ https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2294: - Attachment: TEZ-2294.wip.patch [~rajesh.balamohan] Please take a look. The assembly tarball will generate a conf dir with config files after applying this patch. Pending work: - use @private annotations to filter config properties - change TezConfiguration and TezRuntimeConfiguration to have descriptions in annotations so that they can be used by the maven plugin to generate the description tag in the config file. - make use of scope information - possibly combine both files into a single one > Add tez-site-template.xml with description of config properties > --- > > Key: TEZ-2294 > URL: https://issues.apache.org/jira/browse/TEZ-2294 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2294.wip.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2324) Dynamic heartbeat intervals between RM and AM
Bikas Saha created TEZ-2324: --- Summary: Dynamic heartbeat intervals between RM and AM Key: TEZ-2324 URL: https://issues.apache.org/jira/browse/TEZ-2324 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha Currently there is a config (10ms) which can be an issue for large clusters with many jobs heartbeating to the RM. We should be able to scale it up and down based on outstanding requests etc. e.g. if there are no outstanding requests then ping on larger intervals. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
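The heuristic described in TEZ-2324 (ping fast while requests are outstanding, back off when idle) could be sketched as below. Class and constant names are illustrative, not actual Tez APIs; the 10ms minimum mirrors the fixed config mentioned in the issue, and the cap is an assumed value.

```java
// Sketch of a dynamic RM heartbeat interval: reset to the minimum whenever
// there are outstanding requests, otherwise back off exponentially up to a cap.
public class AdaptiveHeartbeat {
    static final long MIN_INTERVAL_MS = 10;    // current fixed config value
    static final long MAX_INTERVAL_MS = 1000;  // assumed upper bound

    private long currentIntervalMs = MIN_INTERVAL_MS;

    /** Returns the next heartbeat delay given the number of outstanding requests. */
    public long nextIntervalMs(int outstandingRequests) {
        if (outstandingRequests > 0) {
            currentIntervalMs = MIN_INTERVAL_MS;  // pending work: ping fast
        } else {
            // Idle: double the interval each round, bounded by the cap.
            currentIntervalMs = Math.min(currentIntervalMs * 2, MAX_INTERVAL_MS);
        }
        return currentIntervalMs;
    }
}
```

More sophisticated heuristics (as the issue notes) could also factor in cluster size or recent allocation rates, but the reset-and-back-off shape is the core idea.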
[jira] [Updated] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs
[ https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2323: - Attachment: TEZ-2323.1.patch [~pramachandran] [~sseth] please review. > Fix TestOrderedWordcount to use MR memory configs > - > > Key: TEZ-2323 > URL: https://issues.apache.org/jira/browse/TEZ-2323 > Project: Apache Tez > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Hitesh Shah > Attachments: TEZ-2323.1.patch > > > TestOrderedwordcount takes combination of configs from mapred-site.xml and > tez-site.xml. Due to considering mix of the mapred and tez configs, it fails > with below error. > {noformat} > 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from > input stream, eventType=TASK_ATTEMPT_FINISHED > 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from > input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, > taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, > finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, > errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running > task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be > larger than 0 and should be less than the available task memory (MB):246 > at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304) > at > org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
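The failure in the trace above comes from a precondition that the requested sort buffer fit inside the memory actually granted to the task. A minimal sketch of that check, with a method name echoing `ExternalSorter.getInitialMemoryRequirement` but not taken from the Tez source:

```java
// Sketch of the validation behind the IllegalArgumentException in the trace:
// tez.runtime.io.sort.mb must be positive and strictly smaller than the
// available task memory. Here 512 MB was requested against a 246 MB task,
// so the check fails.
public class SortBufferCheck {
    public static long initialMemoryRequirement(int sortMb, long availableTaskMemoryMb) {
        if (sortMb <= 0 || sortMb >= availableTaskMemoryMb) {
            throw new IllegalArgumentException(
                "tez.runtime.io.sort.mb " + sortMb
                + " should be larger than 0 and should be less than the"
                + " available task memory (MB):" + availableTaskMemoryMb);
        }
        return sortMb * 1024L * 1024L;  // bytes to reserve for the sort buffer
    }
}
```

This is why mixing mapred-site.xml memory settings with a tez-site.xml sort buffer can fail: the MR configs shrink the task memory below the Tez sort buffer size.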
[jira] [Updated] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs
[ https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2323: - Fix Version/s: (was: 0.7.0) > Fix TestOrderedWordcount to use MR memory configs > - > > Key: TEZ-2323 > URL: https://issues.apache.org/jira/browse/TEZ-2323 > Project: Apache Tez > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Hitesh Shah > > TestOrderedwordcount takes combination of configs from mapred-site.xml and > tez-site.xml. Due to considering mix of the mapred and tez configs, it fails > with below error. > {noformat} > 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from > input stream, eventType=TASK_ATTEMPT_FINISHED > 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from > input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, > taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, > finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, > errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running > task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be > larger than 0 and should be less than the available task memory (MB):246 > at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304) > at > org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs
Yesha Vora created TEZ-2323: --- Summary: Fix TestOrderedWordcount to use MR memory configs Key: TEZ-2323 URL: https://issues.apache.org/jira/browse/TEZ-2323 Project: Apache Tez Issue Type: Bug Reporter: Yesha Vora Assignee: Hitesh Shah Fix For: 0.7.0 TestOrderedwordcount takes combination of configs from mapred-site.xml and tez-site.xml. Due to considering mix of the mapred and tez configs, it fails with below error. {noformat} 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from input stream, eventType=TASK_ATTEMPT_FINISHED 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be larger than 0 and should be less than the available task memory (MB):246 at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304) at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2319) DAG history in HDFS
[ https://issues.apache.org/jira/browse/TEZ-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496800#comment-14496800 ] Jason Lowe commented on TEZ-2319: - MR does not dump the final state all at once; rather, it is more like the SimpleHistoryLogger. The JobHistoryEventHandler logs job/task/attempt start/stop events to the .jhist avro file in the staging directory as the job runs. Once the job finishes, it copies that jhist file over to the done intermediate directory for the job history server to pick up. It does not dump it all at once from memory when the job completes. Note that the MR AM is building the state over time in memory, not because it's logging to the jhist file along the way but because it has to provide a UI while the job is running. It could dump the contents to the jhist file all at once when the job completes, but it also uses the jhist file as a recovery mechanism in case the AM crashes. I think we'd be OK dumping the events to a file as we get them in a similar way to how JobHistoryEventHandler works in the MR AM. The biggest concern is that adding multiple logging mechanisms adds to the failure potential. If we're generating events faster than the two loggers can process them, then we'll start buffering events and putting pressure on the AM heap. > DAG history in HDFS > --- > > Key: TEZ-2319 > URL: https://issues.apache.org/jira/browse/TEZ-2319 > Project: Apache Tez > Issue Type: New Feature >Reporter: Rohini Palaniswamy > > We have processes, that parse jobconf.xml and job history details (map and > reduce task details, etc) in avro files from HDFS and load them into hive > tables for analysis for mapreduce jobs. Would like to have Tez also make this > information written to a history file in HDFS when AM or each DAG completes > so that we can do analytics on Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
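The JobHistoryEventHandler approach Jason describes (append each event to a staging file as it arrives, then move the file to a "done" location on completion) can be sketched as follows. The class, file names, and JSON event shape are illustrative assumptions, not MR or Tez APIs.

```java
// Sketch of event-streaming history logging: write events incrementally so
// nothing must be buffered in AM memory, then publish the file on completion.
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamingHistoryLogger implements AutoCloseable {
    private final Path stagingFile;
    private final Path doneDir;
    private final Writer out;

    public StreamingHistoryLogger(Path stagingFile, Path doneDir) throws IOException {
        this.stagingFile = stagingFile;
        this.doneDir = doneDir;
        this.out = Files.newBufferedWriter(stagingFile);
    }

    /** Append one event as it happens, rather than dumping state at the end. */
    public void logEvent(String jsonEvent) throws IOException {
        out.write(jsonEvent);
        out.write('\n');
    }

    /** On DAG completion, move the file where downstream analytics can pick it up. */
    @Override
    public void close() throws IOException {
        out.close();
        Files.move(stagingFile, doneDir.resolve(stagingFile.getFileName()),
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```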
[jira] [Commented] (TEZ-1969) Stop the DAGAppMaster when a local mode client is stopped
[ https://issues.apache.org/jira/browse/TEZ-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496798#comment-14496798 ] Siddharth Seth commented on TEZ-1969: - Mostly looks good. Could you please add a comment in the close method on why this works, even though the framework client is shared between TezClient and DAGClient. Assuming this is making use of the fact that DAGClient close is not invoked, and a single instance of DAGClient will be associated with a TezClient. > Stop the DAGAppMaster when a local mode client is stopped > - > > Key: TEZ-1969 > URL: https://issues.apache.org/jira/browse/TEZ-1969 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Prakash Ramachandran > Attachments: TEZ-1969.1.patch, TEZ-1969.2.patch > > > https://issues.apache.org/jira/browse/TEZ-1661?focusedCommentId=14275366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14275366 > Running multiple local clients in a single JVM will leak DAGAppMaster and > related threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1482) Fix memory issues for Local Mode running concurrent tasks
[ https://issues.apache.org/jira/browse/TEZ-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496763#comment-14496763 ] Siddharth Seth commented on TEZ-1482: - Don't think this needs to go into 0.6. It's really an improvement to local mode, and IIRC there's multiple other dependent jiras which haven't been pulled into 0.6. > Fix memory issues for Local Mode running concurrent tasks > - > > Key: TEZ-1482 > URL: https://issues.apache.org/jira/browse/TEZ-1482 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Chen He >Assignee: Prakash Ramachandran > Fix For: 0.7.0 > > Attachments: TEZ-1482.1.patch, TEZ-1482.2.patch, TEZ-1482.3.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2319) DAG history in HDFS
[ https://issues.apache.org/jira/browse/TEZ-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496715#comment-14496715 ] Rohini Palaniswamy commented on TEZ-2319: - bq. Maybe this should be a primary ask for ATS v2 This is something we do not want to wait on ATS v2 for. But it would be good if they captured this as part of the design. bq. make the SimpleHistoryLogger ( to HDFS ) production-ready and tez should allow publishing to multiple loggers. This history only needs to capture the final state of the DAG, its tasks and counters. It does not need to capture intermediate data. I am not sure SimpleHistoryLogger in its current form is a good fit. The job history in MR is in avro format and gives the whole state of the job on its completion. If the AM has that in memory, then we can have a config to dump it into HDFS in some format (json/avro), which is the easiest thing. Else we will need another Logger to - build the state over time (not preferable as it will consume a lot of memory) and dump on completion. - or write events as they happen, then parse them and construct only the relevant information and write another file. Both options with another Logger are not efficient and I don't like the idea myself. [~jlowe]/[~jeagles], any better suggestions on how this can be done based on your experience with how it is currently done in MR? > DAG history in HDFS > --- > > Key: TEZ-2319 > URL: https://issues.apache.org/jira/browse/TEZ-2319 > Project: Apache Tez > Issue Type: New Feature >Reporter: Rohini Palaniswamy > > We have processes, that parse jobconf.xml and job history details (map and > reduce task details, etc) in avro files from HDFS and load them into hive > tables for analysis for mapreduce jobs. Would like to have Tez also make this > information written to a history file in HDFS when AM or each DAG completes > so that we can do analytics on Tez jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2310) AM Deadlock in VertexImpl
[ https://issues.apache.org/jira/browse/TEZ-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496654#comment-14496654 ] Bikas Saha commented on TEZ-2310: - Thanks for the verification [~daijy]. [~sseth] [~rajesh.balamohan] [~hitesh] Please review. The change is basically having notifications sent out to listeners on a separate thread. Potentially, we could do multiple of these concurrently via a thread pool but for now sticking to a single thread. Will open a separate jira to do this for task status updates. > AM Deadlock in VertexImpl > - > > Key: TEZ-2310 > URL: https://issues.apache.org/jira/browse/TEZ-2310 > Project: Apache Tez > Issue Type: Bug >Reporter: Daniel Dai >Assignee: Bikas Saha > Fix For: 0.7.0 > > Attachments: TEZ-2310-0.patch, TEZ-2310.1.patch > > > See the following deadlock in testing: > Thread#1: > {code} > Daemon Thread [App Shared Pool - #3] (Suspended) > owns: VertexManager$VertexManagerPluginContextImpl (id=327) > owns: ShuffleVertexManager (id=328) > owns: VertexManager (id=329) > waiting for: VertexManager$VertexManagerPluginContextImpl (id=326) > > VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) > line: 344 > > StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) > line: 138 > > StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, > VertexStateUpdate) line: 122 > StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) > line: 116 > StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: > 106 > VertexImpl.maybeSendConfiguredEvent() line: 3385 > VertexImpl.doneReconfiguringVertex() line: 1634 > VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() > line: 339 > ShuffleVertexManager.schedulePendingTasks(int) line: 561 > ShuffleVertexManager.schedulePendingTasks() line: 620 > ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: > 731 > 
ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744 > VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527 > VertexManager$VertexManagerEvent$1.run() line: 612 > VertexManager$VertexManagerEvent$1.run() line: 607 > AccessController.doPrivileged(PrivilegedExceptionAction, > AccessControlContext) line: not available [native method] > Subject.doAs(Subject, PrivilegedExceptionAction) line: 415 > UserGroupInformation.doAs(PrivilegedExceptionAction) line: 1548 > > VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() > line: 607 > > VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() > line: 596 > ListenableFutureTask(FutureTask).run() line: 262 > ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145 > ThreadPoolExecutor$Worker.run() line: 615 > Thread.run() line: 745 > {code} > Thread #2 > {code} > Daemon Thread [App Shared Pool - #2] (Suspended) > owns: VertexManager$VertexManagerPluginContextImpl (id=326) > owns: PigGraceShuffleVertexManager (id=344) > owns: VertexManager (id=345) > Unsafe.park(boolean, long) line: not available [native method] > LockSupport.park(Object) line: 186 > > ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() > line: 834 > > ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) > line: 964 > > ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) > line: 1282 > ReentrantReadWriteLock$ReadLock.lock() line: 731 > VertexImpl.getTotalTasks() line: 952 > VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) > line: 162 > > PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() > line: 435 > > PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map>) > line: 353 > VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541 > VertexManager$VertexManagerEvent$1.run() line: 
612 > VertexManager$VertexManagerEvent$1.run() line: 607 > AccessController.doPrivileged(PrivilegedExceptionAction, > AccessControlContext) line: not available [native method] > Subject.doAs(Subject, PrivilegedExceptionAction) line: 415 > UserGroupInformation.doAs(PrivilegedExceptionAction) line: 1548 >
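The fix Bikas describes for this deadlock (sending listener notifications on a separate, single dedicated thread instead of invoking callbacks while vertex locks are held) can be sketched as below. The listener and update types are illustrative stand-ins, not the actual StateChangeNotifier API.

```java
// Sketch of the single-thread notification fix: the caller enqueues the state
// update and returns immediately, so listener callbacks never run while the
// caller still holds vertex locks (breaking the lock-ordering cycle seen in
// the two thread dumps above). A single thread preserves notification order.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class AsyncStateNotifier {
    private final ExecutorService notifier = Executors.newSingleThreadExecutor();

    /** Enqueue the update for asynchronous delivery to all listeners. */
    public Future<?> sendStateUpdate(List<Consumer<String>> listeners, String update) {
        return notifier.submit(() -> listeners.forEach(l -> l.accept(update)));
    }

    public void shutdown() {
        notifier.shutdown();
    }
}
```

A thread pool could deliver notifications concurrently, as the comment notes, but a single thread keeps ordering guarantees simple, which matches the "sticking to a single thread for now" choice.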
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496599#comment-14496599 ] Hitesh Shah commented on TEZ-2322: -- The running task count seems fine as there may be cases where succeeded task attempts may not be recovered properly. > Succeeded count wrong for Pig on Tez job, decreased 380 => 181 > -- > > Key: TEZ-2322 > URL: https://issues.apache.org/jira/browse/TEZ-2322 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 > Environment: HDP 2.2 >Reporter: Hari Sekhon >Priority: Minor > Attachments: attempt1_syslog_dag_1427546104095_0146_1, > attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, > attempt2_syslog_dag_1427546104095_0146_1_post > > > During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 > as shown below: > {code} > 2015-04-15 15:09:56,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:36,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:56,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:16,992 [Timer-0] INFO > 
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:36,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:56,993 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:12:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: > 0 > {code} > Now this may be because the tasks failed, some certainly did due to space > exceptions having checked the logs, but surely once a task has finished > successfully and is marked as succeeded it cannot then later be removed from > the succeeded count? Perhaps the succeeded counter is incremented too early > before the task results are really saved? > KilledTaskAttempts jumped from 16 => 89 at the same time, but even this > doesn't account for the large drop in number of succeeded tasks. > There was also a noticeable jump in Running tasks from 58 => 724 at the same > time which is suspicious, I'm pretty sure there was no contending job to > finish and release so much more resource to this Tez job, so it's also > unclear how the running count count have jumped up to significantly given the > cluster hardware resources have been the same throughout. > Hari Sekhon > http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496598#comment-14496598 ] Hitesh Shah commented on TEZ-2322: -- Thanks [~harisekhon]. The command I gave has no relation to Ambari ( or the job history server ) and should work from the command-line if you try it. In any case, it seems like the failed attempt task count is not getting updated on recovery. \cc [~zjffdu] > Succeeded count wrong for Pig on Tez job, decreased 380 => 181 > -- > > Key: TEZ-2322 > URL: https://issues.apache.org/jira/browse/TEZ-2322 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 > Environment: HDP 2.2 >Reporter: Hari Sekhon >Priority: Minor > Attachments: attempt1_syslog_dag_1427546104095_0146_1, > attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, > attempt2_syslog_dag_1427546104095_0146_1_post > > > During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 > as shown below: > {code} > 2015-04-15 15:09:56,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:36,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:56,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: > 
0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:36,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:56,993 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:12:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: > 0 > {code} > Now this may be because the tasks failed, some certainly did due to space > exceptions having checked the logs, but surely once a task has finished > successfully and is marked as succeeded it cannot then later be removed from > the succeeded count? Perhaps the succeeded counter is incremented too early > before the task results are really saved? > KilledTaskAttempts jumped from 16 => 89 at the same time, but even this > doesn't account for the large drop in number of succeeded tasks. > There was also a noticeable jump in Running tasks from 58 => 724 at the same > time which is suspicious, I'm pretty sure there was no contending job to > finish and release so much more resource to this Tez job, so it's also > unclear how the running count count have jumped up to significantly given the > cluster hardware resources have been the same throughout. 
> Hari Sekhon > http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496553#comment-14496553 ] Hari Sekhon edited comment on TEZ-2322 at 4/15/15 5:25 PM: --- Iirc Ambari still doesn't support Job History server so that command fails, but I've copied the logs out via RM and attached to this ticket for you. was (Author: harisekhon): Iirc Ambari still doesn't support Job History server so that command fails, but I've copied the logs out via RM. > Succeeded count wrong for Pig on Tez job, decreased 380 => 181 > -- > > Key: TEZ-2322 > URL: https://issues.apache.org/jira/browse/TEZ-2322 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.5.2 > Environment: HDP 2.2 >Reporter: Hari Sekhon >Priority: Minor > Attachments: attempt1_syslog_dag_1427546104095_0146_1, > attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, > attempt2_syslog_dag_1427546104095_0146_1_post > > > During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 > as shown below: > {code} > 2015-04-15 15:09:56,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:36,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 > Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= > 2015-04-15 15:10:56,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, 
progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:36,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:11:56,993 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: > 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= > 2015-04-15 15:12:16,992 [Timer-0] INFO > org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: > status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: > 0 > {code} > Now this may be because the tasks failed, some certainly did due to space > exceptions having checked the logs, but surely once a task has finished > successfully and is marked as succeeded it cannot then later be removed from > the succeeded count? Perhaps the succeeded counter is incremented too early > before the task results are really saved? > KilledTaskAttempts jumped from 16 => 89 at the same time, but even this > doesn't account for the large drop in number of succeeded tasks. 
> There was also a noticeable jump in Running tasks from 58 => 724 at the same > time which is suspicious, I'm pretty sure there was no contending job to > finish and release so much more resource to this Tez job, so it's also > unclear how the running count count have jumped up to significantly given the > cluster hardware resources have been the same throughout. > Hari Sekhon > http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496555#comment-14496555 ] Hari Sekhon commented on TEZ-2322:
--
There was a point at which space ran out and Kerberos also broke as a result, but I fixed it and the job continued and eventually succeeded.

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Environment: HDP 2.2
> Reporter: Hari Sekhon
> Priority: Minor
> Attachments: attempt1_syslog_dag_1427546104095_0146_1, attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, attempt2_syslog_dag_1427546104095_0146_1_post
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
> {code}
> Now this may be because tasks failed (some certainly did, due to space exceptions, having checked the logs), but surely once a task has finished successfully and been marked as succeeded it cannot later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious. I'm pretty sure there was no contending job finishing and releasing that much extra resource to this Tez job, so it's also unclear how the running count could have jumped so significantly given that the cluster hardware resources have been the same throughout.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
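A quick way to confirm the anomaly the reporter describes is to parse the Succeeded counter out of each periodic "DAG Status" line and check that it never decreases. The sketch below is a hypothetical standalone helper (not part of Pig or Tez); `succeeded_drops` and the sample `log` list are names introduced here for illustration.

```python
import re

# Matches the Succeeded counter in Pig's periodic "DAG Status" log lines.
STATUS_RE = re.compile(r"Succeeded: (\d+)")

def succeeded_drops(log_lines):
    """Return (previous, current) pairs wherever the Succeeded count decreased."""
    drops, prev = [], None
    for line in log_lines:
        m = STATUS_RE.search(line)
        if not m:
            continue  # skip lines that are not DAG Status reports
        succeeded = int(m.group(1))
        if prev is not None and succeeded < prev:
            drops.append((prev, succeeded))
        prev = succeeded
    return drops

# Abbreviated sample of the log excerpt from the issue description.
log = [
    "DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58",
    "DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724",
    "DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723",
]
print(succeeded_drops(log))  # a healthy run would print an empty list
```

On the excerpt above this reports the single 380 => 181 drop; running it over the full attached syslogs would show whether the counter regressed more than once.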
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322:
--
Attachment: attempt2_syslog_dag_1427546104095_0146_1_post
            attempt2_syslog_dag_1427546104095_0146_1
            attempt2_syslog
            attempt1_syslog_dag_1427546104095_0146_1

IIRC Ambari still doesn't support the Job History Server, so that command fails, but I've copied the logs out via the RM.
[jira] [Commented] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events
[ https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496473#comment-14496473 ] Mit Desai commented on TEZ-2282:
--
[~hitesh], can you take a look at this patch?

> Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events
> ---
>
> Key: TEZ-2282
> URL: https://issues.apache.org/jira/browse/TEZ-2282
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Mit Desai
> Attachments: TEZ-2282.1.patch, TEZ-2282.2.patch, TEZ-2282.master.1.patch
>
> This could help with debugging in some cases where logging is task specific. For example, when the GC log goes to stdout it would be nice to see the task attempt start/stop times.
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496452#comment-14496452 ] Rohini Palaniswamy commented on TEZ-2322:
--
No. I have only seen:
- TotalTasks come down when a new vertex is starting and tasks are reduced due to auto parallelism with ShuffleVertexManager.
- Succeeded go to 0 and then increase as recovery kicks in, if the AM gets killed and a new one is launched.

I have not seen Succeeded reduce to a non-zero count. But I have only seen AM relaunch due to OOM or other issues with very big jobs (30K+ tasks), so it is worthwhile to check whether a second AM attempt was launched. Pig prints that status every 20 secs, and it is possible a new AM was launched and recovery had recovered 181 tasks by then.
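The AM-relaunch explanation above can be sketched as a toy model (not Tez code; all names here are hypothetical): a client that samples the DAG status every 20 seconds can observe Succeeded drop from 380 to a smaller non-zero number if the AM is killed between polls and recovery has only replayed part of the finished-task history by the next poll.

```python
# Toy model of a periodic status poller observing an AM restart.
# events: (time, succeeded_count) state changes of the (possibly restarted) AM.
def observed_succeeded(events, poll_times):
    """Return the Succeeded count a poller would see at each poll time."""
    samples = []
    for t in poll_times:
        current = 0
        for event_time, count in events:
            if event_time <= t:
                current = count  # last state change at or before the poll wins
        samples.append(current)
    return samples

# AM killed at t=65 (counter resets to 0); recovery has replayed 181
# task-finished events by t=70, between the t=60 and t=80 polls.
events = [(0, 380), (65, 0), (70, 181)]
print(observed_succeeded(events, [20, 40, 60, 80]))
```

The poller never sees the intermediate 0: it jumps straight from 380 to 181, exactly the shape of the reported log.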
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496448#comment-14496448 ] Hitesh Shah commented on TEZ-2322:
--
\cc [~daijy] [~rohini] in case either of you has seen this before.
[jira] [Comment Edited] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496439#comment-14496439 ] Rohini Palaniswamy edited comment on TEZ-2317 at 4/15/15 4:01 PM: -- Thanks [~bikassaha]. Issue is with PigProcessor calling canCommit. Fixing that in PIG-4508. was (Author: rohini): Ah. Thanks [~bikassaha]. Issue is with PigProcessor calling canCommit. Fixing that in PIG-4508. > Successful task attempts getting killed > --- > > Key: TEZ-2317 > URL: https://issues.apache.org/jira/browse/TEZ-2317 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy >Assignee: Bikas Saha > Fix For: 0.7.0 > > Attachments: AM-taskkill.log > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2317) Successful task attempts getting killed
[ https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496439#comment-14496439 ] Rohini Palaniswamy commented on TEZ-2317: - Ah. Thanks [~bikassaha]. Issue is with PigProcessor calling canCommit. Fixing that in PIG-4508. > Successful task attempts getting killed > --- > > Key: TEZ-2317 > URL: https://issues.apache.org/jira/browse/TEZ-2317 > Project: Apache Tez > Issue Type: Bug >Reporter: Rohini Palaniswamy >Assignee: Bikas Saha > Fix For: 0.7.0 > > Attachments: AM-taskkill.log > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
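The canCommit issue above can be illustrated with a toy sketch (not Tez's implementation; `CommitArbiter` is a hypothetical name): the AM grants commit permission to at most one attempt per task, so a processor that calls canCommit when it does not need commit arbitration can end up racing other attempts of the same task and having otherwise successful attempts refused or killed.

```python
# Toy illustration of per-task commit arbitration: the first attempt of a
# task to ask for permission wins; later attempts of the same task are refused.
class CommitArbiter:
    def __init__(self):
        self._granted = {}  # task_id -> attempt_id that holds commit permission

    def can_commit(self, task_id, attempt_id):
        # setdefault records the first caller and returns the existing
        # grant for every subsequent caller.
        winner = self._granted.setdefault(task_id, attempt_id)
        return winner == attempt_id

arbiter = CommitArbiter()
print(arbiter.can_commit("task_1", "attempt_0"))  # first caller is granted
print(arbiter.can_commit("task_1", "attempt_1"))  # later attempt is refused
```

Under this model a processor should only ask for commit permission when it actually writes committable output, which is the direction of the PIG-4508 fix referenced above.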
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496399#comment-14496399 ] Hitesh Shah commented on TEZ-2322:
--
Could you please attach the application logs to the jira (obtained via bin/yarn logs -applicationId )?
[jira] [Commented] (TEZ-2320) GroupByOrderByMRRTest not functional in branch 0.6
[ https://issues.apache.org/jira/browse/TEZ-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496389#comment-14496389 ] Hitesh Shah commented on TEZ-2320:
--
Thanks for the clarification. The test runs fine against master, so we will need to take another look at why it is failing in branch 0.6.

> GroupByOrderByMRRTest not functional in branch 0.6
> ---
>
> Key: TEZ-2320
> URL: https://issues.apache.org/jira/browse/TEZ-2320
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Hitesh Shah
>
> Reported by [~tiwari] in TEZ-1581.
[jira] [Commented] (TEZ-2320) GroupByOrderByMRRTest not functional in branch 0.6
[ https://issues.apache.org/jira/browse/TEZ-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496339#comment-14496339 ] Amit Tiwari commented on TEZ-2320:
--
Hello Hitesh, yes, our build contains the fixes for TEZ-2190. As advised, I will deprecate this test in our cluster. Thank you. --amit
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322:
--
Description: edited (the full issue description, including the DAG Status log excerpt, is quoted in the comments above).
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322:
--
Description: edited (the full issue description, including the DAG Status log excerpt, is quoted in the comments above).
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322:
-
Description: During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
{code}
Now this may be because some tasks failed (having checked the logs, some certainly did, due to space exceptions), but surely once a task has finished successfully and been marked as succeeded it cannot later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved? KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks. There was also a suspicious jump in Running tasks from 58 => 724 at the same time; I'm pretty sure there was no contending job that finished and released so much more resource to this Tez job, so it's also unclear how the running count could have jumped so significantly given that the cluster hardware resources have been the same throughout.
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322:
-
Description: During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
{code}
Now this may be because some tasks failed (having checked the logs, some certainly did, due to space exceptions), but surely once a task has finished successfully and been marked as succeeded it cannot later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved? KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks. There was also a suspicious jump in Running tasks from 58 => 724 at the same time; I'm pretty sure there was no contending job that finished and released so much more resource to this Tez job.
[jira] [Created] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
Hari Sekhon created TEZ-2322:
-
Summary: Succeeded count wrong for Pig on Tez job, decreased 380 => 181
Key: TEZ-2322
URL: https://issues.apache.org/jira/browse/TEZ-2322
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.2
Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
{code}
Now this may be because some tasks failed (some certainly did, due to space exceptions), but surely once a task has finished successfully and been marked as succeeded it cannot later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved? KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks. There was also a suspicious jump in Running tasks from 58 => 724 at the same time; I'm pretty sure there was no contending job that finished and released so much more resource to this Tez job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
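[Editorial note] The reporter's hypothesis, that the succeeded counter is bumped as soon as an attempt reports success and is only later reconciled against durably committed results, can be sketched as follows. This is a purely hypothetical illustration, not Tez source code; the class and method names (`ProgressTracker`, `task_finished`, `task_committed`, `node_lost`) are invented for the sketch:

```python
# Hypothetical sketch of the reported behavior: an optimistic succeeded
# counter that is incremented before task output is durably committed,
# then revised downward when counts are recomputed from durable state.

class ProgressTracker:
    def __init__(self):
        self.succeeded = 0      # optimistic counter, bumped on attempt success
        self.committed = set()  # tasks whose output is durably saved

    def task_finished(self, task_id):
        # Counter is incremented as soon as the attempt reports success...
        self.succeeded += 1

    def task_committed(self, task_id):
        # ...but the output only becomes durable here.
        self.committed.add(task_id)

    def node_lost(self):
        # Uncommitted "succeeded" work must be redone, so the count is
        # recomputed from the durably committed set only.
        self.succeeded = len(self.committed)


tracker = ProgressTracker()
for t in range(380):
    tracker.task_finished(t)   # 380 attempts report success
for t in range(181):
    tracker.task_committed(t)  # only 181 outputs are durable so far

print(tracker.succeeded)       # 380 (optimistic count)
tracker.node_lost()
print(tracker.succeeded)       # 181 (drops after the recount)
```

Under this model the 380 => 181 drop in the log above is not a corrupted counter but a recount after lost or discarded outputs, which would also explain the simultaneous jump in Running tasks (the redone work re-enters the running state).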
[jira] [Commented] (TEZ-1969) Stop the DAGAppMaster when a local mode client is stopped
[ https://issues.apache.org/jira/browse/TEZ-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495995#comment-14495995 ] TezQA commented on TEZ-1969: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725531/TEZ-1969.2.patch against master revision 11b5843. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/468//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/468//console This message is automatically generated. > Stop the DAGAppMaster when a local mode client is stopped > - > > Key: TEZ-1969 > URL: https://issues.apache.org/jira/browse/TEZ-1969 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Prakash Ramachandran > Attachments: TEZ-1969.1.patch, TEZ-1969.2.patch > > > https://issues.apache.org/jira/browse/TEZ-1661?focusedCommentId=14275366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14275366 > Running multiple local clients in a single JVM will leak DAGAppMaster and > related threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-1969 PreCommit Build #468
Jira: https://issues.apache.org/jira/browse/TEZ-1969 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/468/

###############################################################
## LAST 60 LINES OF THE CONSOLE
###############################################################
[...truncated 2765 lines...]
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12725531/TEZ-1969.2.patch against master revision 11b5843.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/468//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/468//console
This message is automatically generated.
==============================
Adding comment to Jira.
==============================
Comment added.
2365279340c09aa684be8afc1bec9d4750ecb855 logged out
==============================
Finished build.
==============================
Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #467
Archived 44 artifacts
Archive block size is 32768
Received 8 blocks and 2486175 bytes
Compression is 9.5%
Took 1.1 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

###############################################################
## FAILED TESTS (if any)
###############################################################
All tests passed
[jira] [Updated] (TEZ-1969) Stop the DAGAppMaster when a local mode client is stopped
[ https://issues.apache.org/jira/browse/TEZ-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-1969: -- Attachment: TEZ-1969.2.patch reattaching to trigger a pre-commit build > Stop the DAGAppMaster when a local mode client is stopped > - > > Key: TEZ-1969 > URL: https://issues.apache.org/jira/browse/TEZ-1969 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Prakash Ramachandran > Attachments: TEZ-1969.1.patch, TEZ-1969.2.patch > > > https://issues.apache.org/jira/browse/TEZ-1661?focusedCommentId=14275366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14275366 > Running multiple local clients in a single JVM will leak DAGAppMaster and > related threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
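[Editorial note] The leak that TEZ-1969 addresses (local-mode clients leaving DAGAppMaster threads alive in the shared JVM) follows a common lifecycle pattern: a client that embeds a long-running worker must stop that worker when the client itself is stopped. A minimal, hypothetical Python sketch of the pattern follows; the names (`LocalAppMaster`, `LocalClient`) are invented stand-ins, not Tez APIs:

```python
# Hypothetical sketch of the TEZ-1969 lifecycle pattern: a local-mode
# client owns an embedded app-master thread, and stopping the client
# must also stop that thread, or each client leaks a live thread.
import threading


class LocalAppMaster:
    """Stand-in for an embedded DAGAppMaster running in-process."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        self._stop.wait()  # stand-in for the AM's event-handling loop

    def shutdown(self):
        self._stop.set()
        self._thread.join()

    def is_alive(self):
        return self._thread.is_alive()


class LocalClient:
    """Stand-in for a local-mode client that embeds one app master."""

    def __init__(self):
        self.am = LocalAppMaster()

    def stop(self):
        # The fix described in the issue: stopping the client must also
        # stop its app master, so the AM thread cannot outlive the client.
        self.am.shutdown()


# Without the shutdown in stop(), each of these would leak a thread.
clients = [LocalClient() for _ in range(3)]
for c in clients:
    c.stop()
print(all(not c.am.is_alive() for c in clients))  # True
```

The same idea in Java would typically hang off the client's `stop()`/`close()` method, interrupting and joining the AM's threads so repeated client create/stop cycles in one JVM do not accumulate live threads.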