[jira] [Commented] (TEZ-2219) Should verify the input_name/output_name to be unique per vertex
[ https://issues.apache.org/jira/browse/TEZ-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375485#comment-14375485 ]

Jeff Zhang commented on TEZ-2219:
---------------------------------

Thanks [~hitesh]

Committed to master, branch-0.5, branch-0.6

> Should verify the input_name/output_name to be unique per vertex
> ----------------------------------------------------------------
>
> Key: TEZ-2219
> URL: https://issues.apache.org/jira/browse/TEZ-2219
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Fix For: 0.5.4
>
> Attachments: TEZ-2219-1.txt, TEZ-2219-2.patch, TEZ-2219-3.patch
>
>
> RuntimeTask tries to get the Input/Output using the input_name/output_name, so
> input_name/output_name should be unique per vertex

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
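The uniqueness check described in the issue can be sketched as follows. This is only an illustration of the idea, not the code from the attached patches; the class and method names (`VertexNameCheck`, `findDuplicate`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of the Tez API: detects a repeated
// input (or output) name within the names declared on one vertex.
class VertexNameCheck {

  // Returns the first name that appears more than once, or empty if all are unique.
  static Optional<String> findDuplicate(List<String> names) {
    Set<String> seen = new HashSet<>();
    for (String name : names) {
      if (!seen.add(name)) { // add() returns false when the name was already present
        return Optional.of(name);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> inputNames = Arrays.asList("input1", "input2", "input1");
    findDuplicate(inputNames).ifPresent(
        dup -> System.out.println("Duplicate input name in vertex: " + dup));
  }
}
```

A lookup by name can only return one Input/Output, so rejecting the duplicate at verification time is cheaper than debugging a task that silently picks the wrong one.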
[jira] [Updated] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated TEZ-714:
---------------------------
    Attachment: TEZ-714-2.patch

> OutputCommitters should not run in the main AM dispatcher thread
> ----------------------------------------------------------------
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there are multiple OutputCommitters on a Vertex, they can be run in
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375512#comment-14375512 ]

Jeff Zhang commented on TEZ-714:
--------------------------------

Uploaded a new patch. [~bikassaha] Please help review it.

* Wrap the commit in a CallableEvent in both DAG & Vertex, but still call abort inline. Making the abort async would complicate the patch, so it remains a sync call as before.
* Introduce a new state, COMMITTING, for Vertex & DAG.
** Vertex's COMMITTING means the vertex is in the middle of committing. If the vertex has no committers, or the option TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true, the vertex does not go to the COMMITTING state.
** DAG's COMMITTING has 2 cases: one is when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true and all the vertices are completed; the other is when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false and all the vertices are completed, but some vertex group committers are still running.
* Regarding the issue of "not sure why group-commit and non-group commit need to be differentiated in different transitions.", I renamed them to NonFinalCommitCompletedTransition and FinalCommitCompletetionTransition (maybe there are better names). The first covers commits when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false and the second covers commits when it is true. I differentiate them because for NonFinalCommitCompletedEvent we need to write the VertexGroupCommitCompletedEvent recovery log, which is not necessary for FinalCommitCompletedEvent.
* The unit tests are still not perfect. Currently in DAGImpl/VertexImpl we run the shared thread pool in the AsyncDispatcher thread (so the committer still runs in the AsyncDispatcher thread). This may hide some potential issues, and under this threading mode it is not possible to test cases such as killing the DAG while it is committing. I am trying to think of ways to simulate the shared thread pool in the unit tests.
* For some existing transitions, like RUNNING to ERROR due to INTERNAL ERROR, I am not sure why the state goes to ERROR directly rather than TERMINATING. Maybe it is to allow the client to get the final status as early as possible.
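The general pattern discussed in this comment (run commits on a separate pool and report completion back as an event) might look like the sketch below. It is not the TEZ-714 patch: `AsyncCommitSketch`, `commitOffDispatcher`, and the `COMMIT_COMPLETED` event string are hypothetical names, and a plain queue stands in for the dispatcher's event queue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names are Tez classes.
class AsyncCommitSketch {

  // Runs one commit per output on a pool thread. Each commit posts a
  // completion event to a queue, and the calling ("dispatcher") thread
  // only handles those events instead of doing the commit work itself.
  static List<String> commitOffDispatcher(List<String> outputs) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    BlockingQueue<String> events = new LinkedBlockingQueue<>();
    for (String output : outputs) {
      // A real committer would do its (possibly slow) commit work here.
      pool.execute(() -> events.add("COMMIT_COMPLETED:" + output));
    }
    List<String> handled = new ArrayList<>();
    try {
      for (int i = 0; i < outputs.size(); i++) {
        String event = events.poll(5, TimeUnit.SECONDS);
        if (event != null) {
          handled.add(event);
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    } finally {
      pool.shutdown();
    }
    Collections.sort(handled); // commits may finish in any order
    return handled;
  }

  public static void main(String[] args) {
    System.out.println(commitOffDispatcher(Arrays.asList("out1", "out2")));
  }
}
```

Because commits can complete in any order, the state machine has to tolerate events arriving while it is in an intermediate state, which is presumably what the COMMITTING state in the comment is for.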
[jira] [Comment Edited] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375512#comment-14375512 ]

Jeff Zhang edited comment on TEZ-714 at 3/23/15 8:04 AM:
---------------------------------------------------------

Uploaded a new patch. [~bikassaha] Please help review it.

* Wrap the commit in a CallableEvent in both DAG & Vertex, but still call abort inline. Making the abort async would complicate the patch, so it remains a sync call as before.
* Introduce a new state, COMMITTING, for Vertex & DAG.
** Vertex's COMMITTING means the vertex is in the middle of committing. If the vertex has no committers, or the option TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true, the vertex does not go to the COMMITTING state.
** DAG's COMMITTING has 2 cases: one is when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true and all the vertices are completed; the other is when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false and all the vertices are completed, but some vertex group committers are still running.
* Regarding the issue of "not sure why group-commit and non-group commit need to be differentiated in different transitions.", I renamed them to NonFinalCommitCompletedTransition and FinalCommitCompletetionTransition (maybe there are better names). The first covers commits when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false and the second covers commits when it is true. I differentiate them because for NonFinalCommitCompletedEvent we need to write the VertexGroupCommitCompletedEvent recovery log, which is not necessary for FinalCommitCompletedEvent.
* The unit tests are still not perfect. Currently in DAGImpl/VertexImpl we run the shared thread pool in the AsyncDispatcher thread (so the committer still runs in the AsyncDispatcher thread). This may hide some potential issues, and under this threading mode it is not possible to test cases such as killing the DAG while it is committing. I am trying to think of ways to simulate the shared thread pool in the unit tests.
* For some existing transitions, like RUNNING to ERROR due to INTERNAL ERROR, I am not sure why the state goes to ERROR directly rather than TERMINATING. Maybe it is to allow the client to get the final status as early as possible.
[jira] [Updated] (TEZ-2186) OOM with a simple scatter gather job with re-use
[ https://issues.apache.org/jira/browse/TEZ-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2186:
----------------------------------
    Attachment: TEZ-2186-branch-0.6.patch

Looks like I didn't upload the branch-0.6 patch in this earlier.

> OOM with a simple scatter gather job with re-use
> ------------------------------------------------
>
> Key: TEZ-2186
> URL: https://issues.apache.org/jira/browse/TEZ-2186
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Siddharth Seth
> Assignee: Rajesh Balamohan
> Fix For: 0.7.0
>
> Attachments: TEZ-2186-branch-0.6.patch, TEZ-2186.1.patch,
> TEZ-2186.2.patch, noopexample.txt
>
>
> With a no-op scatter gather job, 20K x 2K, on a 20 node cluster with 20 2GB
> containers per node - reducers end up failing with OOM errors. Haven't been
> able to generate a heap dump yet. Will add details as they're found.
[jira] [Updated] (TEZ-2196) Consider reusing UnorderedPartitionedKVWriter with single output in UnorderedKVOutput
[ https://issues.apache.org/jira/browse/TEZ-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2196:
----------------------------------
    Attachment: TEZ-2196.3.patch

Addressing review comments.

UnorderedKVOutput - instead of using the HashPartitioner - we should introduce a custom partitioner which always returns partition 0.
- Fixed. Created CustomParitioner in UnorderedKVOutput itself which would return 0. And marked it as @Private.

UnorderedKVOutput - confKeys needs additional properties like the BUFFER_SIZE used by the partitionedWriter, and any other config keys that it uses.
- Added TEZ_RUNTIME_UNORDERED_OUTPUT_BUFFER_SIZE_MB, TEZ_RUNTIME_UNORDERED_OUTPUT_MAX_PER_BUFFER_SIZE_BYTES

In the test - the HashPartitioner is being setup. Is this required - since the Output sets this up anyway.
- Removed it from test cases.

Regarding special case in UnorderedPartitionedKVWriter,
- In case there is only one partition and when pipelining is disabled, current patch directly appends to IFile. It completely skips the buffers and merge as well.

> Consider reusing UnorderedPartitionedKVWriter with single output in
> UnorderedKVOutput
> -------------------------------------------------------------------
>
> Key: TEZ-2196
> URL: https://issues.apache.org/jira/browse/TEZ-2196
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2196.1.patch, TEZ-2196.2.patch, TEZ-2196.3.patch
>
>
> Can possibly get rid of FileBasedKVWriter and reuse
> UnorderedPartitionedKVWriter with single partition in UnorderedKVOutput.
> This can also benefit from pipelined shuffle changes done in
> UnorderedPartitionedKVWriter.
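A partitioner that always returns partition 0, as described in the first review response, could look roughly like this. The `SimplePartitioner` interface is a local stand-in whose signature is an assumption made so the example compiles on its own; it is not copied from Tez, and `AlwaysZeroPartitioner` is a hypothetical name rather than the class in the patch.

```java
// Illustrative sketch only; see the assumptions noted above.
class SinglePartitionDemo {
  public static void main(String[] args) {
    SimplePartitioner partitioner = new AlwaysZeroPartitioner();
    for (String key : new String[] {"a", "b", "c"}) {
      // Every key lands in the same, single partition.
      System.out.println(key + " -> partition "
          + partitioner.getPartition(key, "value", 1));
    }
  }
}

// Stand-in partitioner interface (assumed signature, not the Tez API).
interface SimplePartitioner {
  int getPartition(Object key, Object value, int numPartitions);
}

// Ignores the key entirely, so no hashing work is done per record.
class AlwaysZeroPartitioner implements SimplePartitioner {
  @Override
  public int getPartition(Object key, Object value, int numPartitions) {
    return 0;
  }
}
```

With a single output partition a hash-based partitioner would also return 0 for every key, but only after computing a hash; returning the constant avoids that per-record cost.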
Success: TEZ-2196 PreCommit Build #330
Jira: https://issues.apache.org/jira/browse/TEZ-2196
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/330/

###
## LAST 60 LINES OF THE CONSOLE
###
[...truncated 2762 lines...]
[INFO] Final Memory: 79M/900M
[INFO]

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12706508/TEZ-2196.3.patch
  against master revision aa784be.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/330//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/330//console

This message is automatically generated.

== ==
Adding comment to Jira.
== ==

Comment added.
1e29b7fd5ef921eb35a85f522e7ef6d3951966d8 logged out

== ==
Finished build.
== ==

Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #329
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2526255 bytes
Compression is 7.2%
Took 1 sec
Description set: TEZ-2196
Recording test results
Email was triggered for: Success
Sending email for trigger: Success

###
## FAILED TESTS (if any)
##
All tests passed
[jira] [Commented] (TEZ-2196) Consider reusing UnorderedPartitionedKVWriter with single output in UnorderedKVOutput
[ https://issues.apache.org/jira/browse/TEZ-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375762#comment-14375762 ]

Hadoop QA commented on TEZ-2196:
--------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12706508/TEZ-2196.3.patch
  against master revision aa784be.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/330//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/330//console

This message is automatically generated.
[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376374#comment-14376374 ]

Hitesh Shah commented on TEZ-2214:
----------------------------------

[~rajesh.balamohan] question on the newly added invocation to "startMemToDiskMerge". What happens when startMemToDiskMerge() is called while a merge is in progress? It seems like startMemToDiskMerge() is a no-op when that happens.

> FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses
> memToDiskMerging
> -------------------------------------------------------------------------
>
> Key: TEZ-2214
> URL: https://issues.apache.org/jira/browse/TEZ-2214
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2214.1.patch
>
>
> Scenario:
> - commitMemory & usedMemory are beyond their allowed threshold.
> - InMemoryMerge kicks off and is in the process of flushing memory contents
> to disk
> - As it progresses, it releases memory segments as well (but not yet over).
> - Fetchers who need memory < maxSingleShuffleLimit, get scheduled.
> - If fetchers are fast, this quickly adds up to commitMemory & usedMemory.
> Since InMemoryMerge is already in progress, this wouldn't trigger another
> merge().
> - Pretty soon all fetchers would be stalled and get into the following state.
> {noformat}
> Thread 9351: (state = BLOCKED)
> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
> - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
> - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame)
> - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame)
> {noformat}
> - Even if InMemoryMerger completes, "commitedMem & usedMem" are beyond their
> threshold and no other fetcher threads (all are in stalled state) are there
> to release memory. This causes fetchers to wait indefinitely.
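The stall in the scenario can be illustrated with a toy, single-threaded model. It is a deliberate simplification under stated assumptions: it tracks only one memory counter, uses no real threads, and every name in it (`MergeStallModel`, `recheckBeforeWaiting`, and so on) is hypothetical rather than taken from MergeManager or the attached patch.

```java
// Toy model of the missed-merge scenario; all names are hypothetical.
class MergeStallModel {
  final long threshold;
  long commitMemory;
  boolean mergeInProgress;

  MergeStallModel(long threshold) {
    this.threshold = threshold;
  }

  // A fetcher commits fetched bytes. A merge is started only when the
  // threshold is reached AND no merge is already running.
  void commit(long bytes) {
    commitMemory += bytes;
    if (commitMemory >= threshold && !mergeInProgress) {
      mergeInProgress = true;
    }
  }

  // The running merge finishes and releases the bytes it flushed to disk.
  void finishMerge(long flushedBytes) {
    commitMemory -= flushedBytes;
    mergeInProgress = false;
  }

  // Fetchers are stuck when memory is still over the threshold
  // and no merge is running that could free it.
  boolean stalled() {
    return commitMemory >= threshold && !mergeInProgress;
  }

  // Re-checking on the wait path starts the merge that was missed earlier.
  void recheckBeforeWaiting() {
    if (commitMemory >= threshold && !mergeInProgress) {
      mergeInProgress = true;
    }
  }

  public static void main(String[] args) {
    MergeStallModel model = new MergeStallModel(100);
    model.commit(100);      // threshold reached, merge starts
    model.commit(150);      // fast fetchers add more; no second merge is triggered
    model.finishMerge(100); // merge frees only what it flushed; 150 bytes remain
    System.out.println("stalled after merge: " + model.stalled());
    model.recheckBeforeWaiting();
    System.out.println("stalled after re-check: " + model.stalled());
  }
}
```

In the model the second `commit` is the missed trigger: it crosses the threshold while a merge is running, so nothing schedules a follow-up merge once the first one ends.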
[jira] [Updated] (TEZ-2176) Move all logging to slf4j
[ https://issues.apache.org/jira/browse/TEZ-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-2176:
--------------------------------
    Attachment: TEZ-2176.2.1.txt

Rebased version of TEZ-2176.2.

> Move all logging to slf4j
> -------------------------
>
> Key: TEZ-2176
> URL: https://issues.apache.org/jira/browse/TEZ-2176
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Vasanth kumar RJ
> Attachments: TEZ-2176.1.patch, TEZ-2176.2.1.txt, TEZ-2176.2.patch,
> TEZ-2176.patch
>
>
> SLF4J supports a more comprehensive set of APIs - MDC, Formatted strings.
> Also drop commons-logging from the dependency set.
[jira] [Commented] (TEZ-2176) Move all logging to slf4j
[ https://issues.apache.org/jira/browse/TEZ-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376404#comment-14376404 ]

Siddharth Seth commented on TEZ-2176:
-------------------------------------

+1. Looks good. Attaching a rebased patch after the last few commits and committing, before this goes stale. Thanks [~vasanthkumar].
[jira] [Commented] (TEZ-2176) Move all logging to slf4j
[ https://issues.apache.org/jira/browse/TEZ-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376493#comment-14376493 ]

Bikas Saha commented on TEZ-2176:
---------------------------------

There should probably be a follow up jira to remove instances of LOG.isDebugEnabled() from the code based on http://www.slf4j.org/faq.html#logging_performance

[~vasanthkumar] Do you think you can take a crack at it?

> Move all logging to slf4j
> -------------------------
>
> Key: TEZ-2176
> URL: https://issues.apache.org/jira/browse/TEZ-2176
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Vasanth kumar RJ
> Fix For: 0.7.0
>
> Attachments: TEZ-2176.1.patch, TEZ-2176.2.1.txt, TEZ-2176.2.patch,
> TEZ-2176.patch
>
>
> SLF4J supports a more comprehensive set of APIs - MDC, Formatted strings.
> Also drop commons-logging from the dependency set.
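The reasoning behind dropping the LOG.isDebugEnabled() guards is that a parameterized message defers formatting until the logger knows the level is enabled. The sketch below demonstrates that idea with a tiny stand-in logger so it runs without any dependency; `MiniLogger` is hypothetical and only mimics the "{}" placeholder convention, it is not the SLF4J API.

```java
// Demonstrates deferred formatting with a stand-in logger (hypothetical names).
class DeferredFormatDemo {
  static int toStringCalls = 0;

  // An argument whose string conversion we want to avoid when debug is off.
  static class Expensive {
    @Override
    public String toString() {
      toStringCalls++;
      return "expensive-state";
    }
  }

  // Minimal stand-in for a logger that accepts a format and an argument.
  static class MiniLogger {
    final boolean debugEnabled;

    MiniLogger(boolean debugEnabled) {
      this.debugEnabled = debugEnabled;
    }

    void debug(String format, Object arg) {
      if (!debugEnabled) {
        return; // the argument is never converted to a string
      }
      System.out.println(format.replace("{}", String.valueOf(arg)));
    }
  }

  public static void main(String[] args) {
    MiniLogger log = new MiniLogger(false);
    // No explicit isDebugEnabled() guard: the logger checks the level itself.
    log.debug("state = {}", new Expensive());
    System.out.println("toString calls: " + toStringCalls);
  }
}
```

A guard can still be worthwhile when computing the argument itself is costly, since the argument expression is evaluated before the logger is called.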
[jira] [Updated] (TEZ-2149) Optimizations for the timed version of DAGClient.getStatus
[ https://issues.apache.org/jira/browse/TEZ-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-2149:
--------------------------------
    Attachment: TEZ-2149.1.txt

Patch adds a notify in the AM to return early instead of the sleep. Also changes the waitUntilCompletion methods to use this API instead of an explicit sleep.

[~bikassaha], [~hitesh], [~pramachandran] - please review.

> Optimizations for the timed version of DAGClient.getStatus
> ----------------------------------------------------------
>
> Key: TEZ-2149
> URL: https://issues.apache.org/jira/browse/TEZ-2149
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Attachments: TEZ-2149.1.txt
>
>
> From
> https://issues.apache.org/jira/browse/TEZ-1967?focusedCommentId=14325037&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325037
> - The sleep within the AM can be improved via monitors.
> - INITED state is returned when communicating with the AM, SUBMITTED state is
> returned when communicating with the RM. That could be used to optimize the
> flow.
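Replacing a fixed sleep with a monitor, as the patch description and the issue suggest, generally follows the timed wait/notify pattern below. This is a generic sketch rather than the TEZ-2149 patch; `DagStatusMonitor`, `markCompleted`, and `waitForCompletion` are hypothetical names.

```java
import java.util.concurrent.TimeUnit;

// Generic timed wait/notify sketch; names are hypothetical, not Tez APIs.
class DagStatusMonitor {
  private final Object lock = new Object();
  private boolean completed = false;

  // Called by the thread that observes completion; wakes all waiters.
  void markCompleted() {
    synchronized (lock) {
      completed = true;
      lock.notifyAll();
    }
  }

  // Waits up to timeoutMs, but returns as soon as completion is signalled.
  boolean waitForCompletion(long timeoutMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    synchronized (lock) {
      while (!completed) { // loop guards against spurious wakeups
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
        if (remainingMs <= 0) {
          break; // timed out
        }
        try {
          lock.wait(remainingMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
      return completed;
    }
  }

  public static void main(String[] args) {
    DagStatusMonitor monitor = new DagStatusMonitor();
    new Thread(() -> {
      try {
        Thread.sleep(100); // simulated work before completion
      } catch (InterruptedException e) {
        return;
      }
      monitor.markCompleted();
    }).start();
    long start = System.nanoTime();
    boolean done = monitor.waitForCompletion(5000);
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println("completed=" + done + " returnedEarly=" + (elapsedMs < 5000));
  }
}
```

Compared with a plain sleep, the caller still gets an upper bound on the wait, but a status change is reported as soon as it happens instead of at the end of the sleep interval.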
[jira] [Created] (TEZ-2222) Investigate moving to log4j2 for logging
Siddharth Seth created TEZ-2222:
-----------------------------------

Summary: Investigate moving to log4j2 for logging
Key: TEZ-2222
URL: https://issues.apache.org/jira/browse/TEZ-2222
Project: Apache Tez
Issue Type: Improvement
Reporter: Siddharth Seth

Via slf4j.
Some bits to keep in mind
- We have explicit code which rotates logs using direct log4j12 APIs. This should keep working. I believe the log4j2 APIs are different here
- API compatibility between log4j12 / log4j2 can be problematic - if both end up on the classpath (I believe the APIs are different)
- Hadoop dist includes a slf4j-log4j12 binding. Changing the default can result in slf4j-log4j12 and slf4j-log4j2 co-existing by default - which could be problematic.

Needs investigation. End of the day, we will likely need an option to use either of the two.
[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376763#comment-14376763 ]

Rajesh Balamohan commented on TEZ-2214:
---------------------------------------

[~hitesh] - In such cases, the next line "inMemoryMerger.waitForMerge()" acts as the barrier. It would wait until the existing merging completes (which internally releases memory for usedMemory & commitMemory).
[jira] [Comment Edited] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376763#comment-14376763 ]

Rajesh Balamohan edited comment on TEZ-2214 at 3/23/15 10:21 PM:
-----------------------------------------------------------------

[~hitesh] - In such cases, the next line "inMemoryMerger.waitForMerge()" acts as the barrier. It would wait until the existing merge completes (which internally releases memory for usedMemory & commitMemory).
Failed: TEZ-2149 PreCommit Build #331
Jira: https://issues.apache.org/jira/browse/TEZ-2149
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/331/

###
## LAST 60 LINES OF THE CONSOLE
###
[...truncated 2762 lines...]

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12706725/TEZ-2149.1.txt
  against master revision 6d0b10a.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:red}-1 javac{color}. The applied patch generated 186 javac compiler warnings (more than the master's current 180 warnings).
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/331//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/331//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/331//console

This message is automatically generated.

== ==
Adding comment to Jira.
== ==

Comment added.
51847fbdaa19c11add5148b625cd3be38588f1c8 logged out

== ==
Finished build.
== ==

Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #330
Archived 45 artifacts
Archive block size is 32768
Received 19 blocks and 2106290 bytes
Compression is 22.8%
Took 1.6 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

###
## FAILED TESTS (if any)
##
All tests passed
[jira] [Commented] (TEZ-2149) Optimizations for the timed version of DAGClient.getStatus
[ https://issues.apache.org/jira/browse/TEZ-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376778#comment-14376778 ]

Hadoop QA commented on TEZ-2149:
--------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12706725/TEZ-2149.1.txt
  against master revision 6d0b10a.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:red}-1 javac{color}. The applied patch generated 186 javac compiler warnings (more than the master's current 180 warnings).
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/331//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/331//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/331//console

This message is automatically generated.
[jira] [Resolved] (TEZ-1937) Reduce cost of merging ifiles in UnorderedPartitionedWriter
[ https://issues.apache.org/jira/browse/TEZ-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan resolved TEZ-1937. --- Resolution: Duplicate This is already taken care of as part of fixing TEZ-1094. Marking this as a duplicate. > Reduce cost of merging ifiles in UnorderedPartitionedWriter > --- > > Key: TEZ-1937 > URL: https://issues.apache.org/jira/browse/TEZ-1937 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-1937.1.patch, TEZ-1937.2.patch, TEZ-1937.WIP.patch > > > Currently we iterate through all spilled files for merging. This incurs > additional deserialization cost. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2076) Tez framework to extract/analyze data stored in ATS for specific dag
[ https://issues.apache.org/jira/browse/TEZ-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2076: -- Attachment: TEZ-2076.6.patch Fixed minor pom.xml issue. > Tez framework to extract/analyze data stored in ATS for specific dag > > > Key: TEZ-2076 > URL: https://issues.apache.org/jira/browse/TEZ-2076 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2076.1.patch, TEZ-2076.2.patch, TEZ-2076.3.patch, > TEZ-2076.4.patch, TEZ-2076.5.patch, TEZ-2076.6.patch, TEZ-2076.WIP.2.patch, > TEZ-2076.WIP.3.patch, TEZ-2076.WIP.patch > > > - Users should be able to download ATS data pertaining to a DAG from Tez-UI > (more like a zip file containing DAG/Vertex/Task/TaskAttempt info). > - This can be plugged to an analyzer which parses the data, adds semantics > and provides an in-memory representation for further analysis. > - This will make it possible to write different analyzer rules, which can be run on top > of this in-memory representation to come up with analysis on the DAG. > - Results of these analyzer rules can be rendered on a UI (standalone webapp) > at a later point in time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2076 PreCommit Build #332
Jira: https://issues.apache.org/jira/browse/TEZ-2076 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/332/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2767 lines...] [INFO] Final Memory: 82M/846M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706758/TEZ-2076.6.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/332//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/332//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 03d06ac089aa0174e7fd748b49510ec9b96dd930 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #330 Archived 53 artifacts Archive block size is 32768 Received 6 blocks and 7384023 bytes Compression is 2.6% Took 2.3 sec Description set: TEZ-2076 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2076) Tez framework to extract/analyze data stored in ATS for specific dag
[ https://issues.apache.org/jira/browse/TEZ-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376958#comment-14376958 ] Hadoop QA commented on TEZ-2076: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706758/TEZ-2076.6.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/332//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/332//console This message is automatically generated. > Tez framework to extract/analyze data stored in ATS for specific dag > > > Key: TEZ-2076 > URL: https://issues.apache.org/jira/browse/TEZ-2076 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2076.1.patch, TEZ-2076.2.patch, TEZ-2076.3.patch, > TEZ-2076.4.patch, TEZ-2076.5.patch, TEZ-2076.6.patch, TEZ-2076.WIP.2.patch, > TEZ-2076.WIP.3.patch, TEZ-2076.WIP.patch > > > - Users should be able to download ATS data pertaining to a DAG from Tez-UI > (more like a zip file containing DAG/Vertex/Task/TaskAttempt info). > - This can be plugged to an analyzer which parses the data, adds semantics > and provides an in-memory representation for further analysis. 
> - This will make it possible to write different analyzer rules, which can be run on top > of this in-memory representation to come up with analysis on the DAG. > - Results of these analyzer rules can be rendered on a UI (standalone webapp) > at a later point in time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377018#comment-14377018 ] Jeff Zhang commented on TEZ-2204: - Uploaded a new patch (excluding the findbugs warning). [~hitesh] [~bikassaha] Please help review it. > TestAMRecovery increasingly flaky on jenkins builds. > - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch > > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2204: Attachment: TEZ-2204-3.patch > TestAMRecovery increasingly flaky on jenkins builds. > - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch > > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2204: Attachment: TEZ-2204-4.patch Minor update on the patch > TestAMRecovery increasingly flaky on jenkins builds. > - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch, > TEZ-2204-4.patch > > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2217: Attachment: TEZ-2217.1.patch Attaching a fix that ensures that when there are no further pending container requests then new containers are not released if they have been added to the min held list. This should be safe because there are no pending requests. [~gopalv] Can you please try this out and see if this fixes your case? If so, then a review would be great :) The code change is minimal and explained above. The test was a pain to write :P > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
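The release rule described in the comment above (keep idle-expired containers that are on the min-held list while there are no pending requests) can be illustrated roughly as follows. This is only a sketch under assumptions, not the code in TEZ-2217.1.patch or in YarnTaskSchedulerService: the class MinHeldSketch, its fields, and the method shouldRelease are hypothetical names, and the sketch assumes that a container joins the min-held set at the moment its idle timeout is checked.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; not the actual Tez scheduler code. */
final class MinHeldSketch {
    private final int minHeld;                        // assumed: minimum number of containers to keep
    private final Set<String> held = new HashSet<>(); // assumed: ids currently on the min-held list

    MinHeldSketch(int minHeld) {
        this.minHeld = minHeld;
    }

    /** Returns true if the container should be released, false if it should be kept. */
    boolean shouldRelease(String containerId, boolean idleTimeoutExpired, int pendingRequests) {
        if (!idleTimeoutExpired) {
            return false; // not expired yet, nothing to decide
        }
        // With no pending requests, a container on (or admitted to) the min-held list is kept.
        if (pendingRequests == 0 && hold(containerId)) {
            return false;
        }
        // Otherwise the idle-expired container is released, as before.
        held.remove(containerId);
        return true;
    }

    /** Keeps the container if it is already held or if the min-held list still has room. */
    private boolean hold(String containerId) {
        if (held.contains(containerId)) {
            return true;
        }
        if (held.size() < minHeld) {
            held.add(containerId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MinHeldSketch scheduler = new MinHeldSketch(1);
        System.out.println(scheduler.shouldRelease("c1", true, 0));
        System.out.println(scheduler.shouldRelease("c2", true, 0));
    }
}
```

In this sketch a non-zero pendingRequests count always leads to release on expiry, so a scheduler that still reports pending task requests would keep releasing containers even with such a check in place.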
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377068#comment-14377068 ] Gopal V commented on TEZ-2217: -- I quickly cross-checked this; it still seems to be letting go of containers despite min-held being > queue size. The containers were observed being released during the getSplits() operation. {code} 2015-03-23 18:35:14,865 INFO [InputInitializer [Map 1] #0] io.HiveInputFormat: Generating splits 2015-03-23 18:35:14,870 INFO [InputInitializer [Map 1] #0] log.PerfLogger: 2015-03-23 18:35:14,889 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing container, containerId=container_1424502260528_1391_01_000310, containerExpiryTime=1427160914665, idleTimeoutMin=5000 2015-03-23 18:35:14,889 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Releasing unused container: container_1424502260528_1391_01_000310 2015-03-23 18:35:14,889 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1424502260528_1391_11][Event:CONTAINER_STOPPED]: containerId=container_1424502260528_1391_01_000310, stoppedTime=1427160914889, exitStatus=0 2015-03-23 18:35:14,889 INFO [Dispatcher thread: Central] container.AMContainerImpl: AMContainer container_1424502260528_1391_01_000310 transitioned from IDLE to STOP_REQUESTED via event C_STOP_REQUEST 2015-03-23 18:35:14,890 INFO [ContainerLauncher #25] launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_STOP_REQUEST {code} > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is
actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377072#comment-14377072 ] Gopal V commented on TEZ-2217: -- [~bikassaha]: any suggestions on more logging in the code to narrow down this? > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2204 PreCommit Build #333
Jira: https://issues.apache.org/jira/browse/TEZ-2204 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/333/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2753 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706785/TEZ-2204-4.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/333//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/333//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. e96c6d82358f4d860778235c1794bfe40a782908 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #332 Archived 44 artifacts Archive block size is 32768 Received 8 blocks and 2461464 bytes Compression is 9.6% Took 0.89 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377074#comment-14377074 ] Hadoop QA commented on TEZ-2204: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706785/TEZ-2204-4.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/333//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/333//console This message is automatically generated. > TestAMRecovery increasingly flaky on jenkins builds. > - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch, > TEZ-2204-4.patch > > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2223) TestMockDAGAppMaster fails due to TEZ-2210
Jeff Zhang created TEZ-2223: --- Summary: TestMockDAGAppMaster fails due to TEZ-2210 Key: TEZ-2223 URL: https://issues.apache.org/jira/browse/TEZ-2223 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang [~bikassaha] looks like TestMockDAGAppMaster fails due to TEZ-2210 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2223) TestMockDAGAppMaster fails due to TEZ-2210
[ https://issues.apache.org/jira/browse/TEZ-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2223: Description: [~bikassaha] looks like TestMockDAGAppMaster fails due to TEZ-2210 It would fail on mac because cpuPlugin is null was:[~bikassaha] looks like TestMockDAGAppMaster fails due to TEZ-2210 > TestMockDAGAppMaster fails due to TEZ-2210 > -- > > Key: TEZ-2223 > URL: https://issues.apache.org/jira/browse/TEZ-2223 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang > > [~bikassaha] looks like TestMockDAGAppMaster fails due to TEZ-2210 > It would fail on mac because cpuPlugin is null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2223) TestMockDAGAppMaster fails due to TEZ-2210 on mac
[ https://issues.apache.org/jira/browse/TEZ-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2223: Summary: TestMockDAGAppMaster fails due to TEZ-2210 on mac (was: TestMockDAGAppMaster fails due to TEZ-2210) > TestMockDAGAppMaster fails due to TEZ-2210 on mac > - > > Key: TEZ-2223 > URL: https://issues.apache.org/jira/browse/TEZ-2223 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang > > [~bikassaha] looks like TestMockDAGAppMaster fails due to TEZ-2210 > It would fail on mac because cpuPlugin is null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377077#comment-14377077 ] Bikas Saha commented on TEZ-2217: - Sorry. To be clear. This is with the patch attached? > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2221) VertexGroup name should be unique
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2221: Attachment: TEZ-2221-1.patch > VertexGroup name should be unique > - > > Key: TEZ-2221 > URL: https://issues.apache.org/jira/browse/TEZ-2221 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-2221-1.patch > > > VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use the vertex > group name to identify the vertex group commit, so vertex groups with the same name > will conflict. However, the current equals & hashCode of VertexGroup use both the vertex > group name and the member names. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377083#comment-14377083 ] Bikas Saha commented on TEZ-2217: - If it's with the patch, then it would mean that the scheduler has non-empty task requests at that time. With the fix, can you please attach the AM logs with debug logging enabled for the YarnTaskSchedulerService only? Otherwise they will have RPC junk in them. Thanks > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377086#comment-14377086 ] Gopal V commented on TEZ-2217: -- Yes, the LOG does not say "delay expired or is new.", which seems to be in the code path that this patch changed. That is why I asked about new logging. > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377105#comment-14377105 ] Bikas Saha commented on TEZ-714: I have not seen the patch yet because it may change if you agree with these comments. bq. Regarding the issue of "not sure why group-commit and non-group commit need to be differentiated in different transitions." Can this be fixed by having the events for both be different, but still handled in the same transition? The transition can check if it's a group commit event vs. a normal commit event (based on event type) and then log for group commit. Maybe the group commit event can derive from the normal commit event. Is the recovery log that is written relevant only in the non-commit-at-end case, where group commits can happen before the DAG finishes? bq. Unit test is still not perfect. Because currently in the DAGImpl/VertexImpl we run the shared thread pool in the AsynDispatcher For these tests we could choose to use the normal thread pool by overriding the setup. Since this is a new test, it can try to not depend on ordering like the existing tests do. If so, then it should be fine to use the real thread pool instead of the fake thread pool that delegates to the dispatcher. Maybe you can create a new TestCommit that starts from scratch without the hacks in TestVertexImpl. bq. For the some existing transition, like (RUNNING to ERROR due to INTERNAL ERROR) Is this for VertexImpl or DAGImpl? That sounds like a bug. Is that relevant to the commit operation though? > OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377105#comment-14377105 ] Bikas Saha edited comment on TEZ-714 at 3/24/15 2:05 AM: - I have not seen the patch yet because it may change if you agree with these comments. bq. Regarding the issue of "not sure why group-commit and non-group commit need to be differentiated in different transitions." Can this be fixed by having the events for both be different, but still handled in the same transition? The transition can check if it's a group commit event vs. a normal commit event (based on event type) and then log for group commit. Maybe the group commit event can derive from the normal commit event. IMO, having fewer transitions makes the code much simpler. Is the recovery log that is written relevant only in the non-commit-at-end case, where group commits can happen before the DAG finishes? bq. Unit test is still not perfect. Because currently in the DAGImpl/VertexImpl we run the shared thread pool in the AsynDispatcher For these tests we could choose to use the normal thread pool by overriding the setup. Since this is a new test, it can try to not depend on ordering like the existing tests do. If so, then it should be fine to use the real thread pool instead of the fake thread pool that delegates to the dispatcher. Maybe you can create a new TestCommit that starts from scratch without the hacks in TestVertexImpl. bq. For the some existing transition, like (RUNNING to ERROR due to INTERNAL ERROR) Is this for VertexImpl or DAGImpl? That sounds like a bug. Is that relevant to the commit operation though? was (Author: bikassaha): Not seen the patch yet because it may change if you agree with these comments bq. Regarding the issue of "not sure why group-commit and non-group commit need to be differentiated in different transitions. Can this be fixed by having the events for both be different? But still handled in the same transition.
The transition can check if its a group commit event vs normal commit event (based on event type) - and then log for group commit. Maybe group commit event can derive from normal commit event. Is this recovery log written relevant only in the non-commit-at-end case where group commits can happen before the DAG finishes? bq. Unit test is still not perfect. Because currently in the DAGImpl/VertexImpl we run the shared thread pool in the AsynDispatcher For these tests we could choose to use the normal thread pool by overriding the setup. Since this is a new test, it can try to not depend on ordering like the existing tests do. If so, then it should be fine to use the real threadpool instead of the fake thread pool that delegates to the dispatcher. Maybe you can create a new TestCommit that starts from scratch without the hacks in TestVertexImpl. bq. For the some existing transition, like (RUNNING to ERROR due to INTERNAL ERROR) Is this for VertexImpl or DAGImpl? That sounds like a bug. Is that relevant to the commit operation though? > OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
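The single-transition idea in the comment above (a group commit event that derives from the normal commit event, with the transition branching on the event type) might look roughly like the sketch below. Every class, method, and state name here is hypothetical and chosen only for illustration; none of them is taken from the TEZ-714 patch or from Tez itself, and the choice to log only on a successful group commit is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one transition handles both commit event types.
class CommitTransitionSketch {

  // Hypothetical base event: some commit has completed.
  static class CommitCompletedEvent {
    final String outputName;
    final boolean succeeded;

    CommitCompletedEvent(String outputName, boolean succeeded) {
      this.outputName = outputName;
      this.succeeded = succeeded;
    }
  }

  // Hypothetical subtype for a vertex group commit.
  static class GroupCommitCompletedEvent extends CommitCompletedEvent {
    final String groupName;

    GroupCommitCompletedEvent(String groupName, String outputName, boolean succeeded) {
      super(outputName, succeeded);
      this.groupName = groupName;
    }
  }

  // Single transition that branches on the event type.
  static class CommitCompletedTransition {
    // Stand-in for the recovery log; a real AM would write a history event.
    final List<String> recoveryLog = new ArrayList<>();

    // Returns the name of the next state (state names are illustrative).
    String transition(CommitCompletedEvent event) {
      if (!event.succeeded) {
        return "FAILED";
      }
      if (event instanceof GroupCommitCompletedEvent) {
        // Assumption: only a successful group commit needs a recovery log entry.
        GroupCommitCompletedEvent group = (GroupCommitCompletedEvent) event;
        recoveryLog.add("group-commit-finished:" + group.groupName);
      }
      return "SUCCEEDED";
    }
  }

  public static void main(String[] args) {
    CommitCompletedTransition t = new CommitCompletedTransition();
    System.out.println(t.transition(new CommitCompletedEvent("out1", true)));
    System.out.println(t.transition(new GroupCommitCompletedEvent("g1", "out2", true)));
    System.out.println(t.recoveryLog);
  }
}
```

The trade-off raised later in the thread still applies: folding both event types into one transition keeps the state machine smaller but moves the branching into the transition body.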
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377115#comment-14377115 ] Bikas Saha commented on TEZ-2217: - The existing debug logs should be enough, if enabled. What is intriguing is that at this point there are pending task requests that have not yet been matched to the containers, because I am guessing that the job already has all the containers it will ever get. If that were not the case, it would hit the changed code path (AM is idle or there are no pending requests). How does the min expiry time compare to the delays between node-rack-star matching? I am hoping that matching has been attempted for the containers all the way up to star before the min expiry elapses, so that all tasks have been matched to some containers, leaving no pending task requests. > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. 
Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
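The behavior reported in the issue above, idle containers being released without a check against the min-held count, can be illustrated with a small sketch of the kind of guard being asked for. The class, method, and parameter names below are hypothetical; this is not the actual YarnTaskSchedulerService logic and not the TEZ-2217 patch.

```java
// Hypothetical sketch of a release decision that respects a min-held count.
class MinHeldContainersSketch {

  /**
   * Decides whether an idle container may be released.
   * heldContainers is assumed to count all containers currently held,
   * including the candidate for release.
   */
  static boolean shouldRelease(long nowMs, long containerExpiryTimeMs,
                               int heldContainers, int minHeldContainers) {
    boolean idleTimeoutExpired = nowMs >= containerExpiryTimeMs;
    // Releasing is only allowed if it does not drop below the minimum.
    boolean aboveMinHeld = heldContainers > minHeldContainers;
    return idleTimeoutExpired && aboveMinHeld;
  }

  public static void main(String[] args) {
    // Expired, but already at the minimum: keep the container.
    System.out.println(shouldRelease(10_000L, 5_000L, 10, 10)); // false
    // Expired and above the minimum: release.
    System.out.println(shouldRelease(10_000L, 5_000L, 11, 10)); // true
    // Not yet expired: keep regardless of the count.
    System.out.println(shouldRelease(1_000L, 5_000L, 11, 10));  // false
  }
}
```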
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377118#comment-14377118 ] Bikas Saha commented on TEZ-2217: - This may help in enabling debug logs for only one class:
{noformat}
/**
 * Root Logging level passed to the Tez app master.
 *
 * Simple configuration: Set the log level for all loggers.
 *   e.g. INFO
 *   This sets the log level to INFO for all loggers.
 *
 * Advanced configuration: Set the log level for all classes, along with a different level for some.
 *   e.g. DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO
 *   This sets the log level for all loggers to DEBUG, except for the
 *   org.apache.hadoop.ipc and org.apache.hadoop.security, which are set to INFO
 *
 * Note: The global log level must always be the first parameter.
 *   DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO is valid
 *   org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO is not valid
 * */
@ConfigurationScope(Scope.AM)
public static final String TEZ_AM_LOG_LEVEL = TEZ_AM_PREFIX + "log.level";
public static final String TEZ_AM_LOG_LEVEL_DEFAULT = "INFO";
{noformat}
> The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. 
Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
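To make the log-level format in the comment above concrete, here is a minimal sketch of how a value such as DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO could be split into a global level and per-logger overrides. It only follows the rules stated in the quoted Javadoc (global level first, then logger=LEVEL pairs separated by semicolons); the class and method names are hypothetical, and this is not Tez's own parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parse "ROOT;logger=LEVEL;logger=LEVEL" as described above.
class TezLogLevelSketch {

  static final class LogLevels {
    final String rootLevel;
    final Map<String, String> perLogger;

    LogLevels(String rootLevel, Map<String, String> perLogger) {
      this.rootLevel = rootLevel;
      this.perLogger = perLogger;
    }
  }

  static LogLevels parse(String value) {
    String[] parts = value.trim().split(";");
    String root = parts.length > 0 ? parts[0].trim() : "";
    // The global level must come first and must not be a logger=LEVEL pair.
    if (root.isEmpty() || root.contains("=")) {
      throw new IllegalArgumentException(
          "The global log level must be the first parameter: " + value);
    }
    Map<String, String> perLogger = new LinkedHashMap<>();
    for (int i = 1; i < parts.length; i++) {
      String part = parts[i].trim();
      if (part.isEmpty()) {
        continue; // tolerate stray separators in the middle
      }
      int eq = part.indexOf('=');
      if (eq <= 0 || eq == part.length() - 1) {
        throw new IllegalArgumentException("Expected logger=LEVEL but got: " + part);
      }
      perLogger.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
    }
    return new LogLevels(root, perLogger);
  }

  public static void main(String[] args) {
    LogLevels levels =
        parse("DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO");
    System.out.println(levels.rootLevel);  // DEBUG
    System.out.println(levels.perLogger);
  }
}
```

Following the same rules, a value that enables DEBUG for a single class would keep INFO as the global level and add one logger=DEBUG pair after it.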
[jira] [Commented] (TEZ-2223) TestMockDAGAppMaster fails due to TEZ-2210 on mac
[ https://issues.apache.org/jira/browse/TEZ-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377121#comment-14377121 ] Bikas Saha commented on TEZ-2223: - Perhaps TezMxBeanResourceCalculator can be used as the default. > TestMockDAGAppMaster fails due to TEZ-2210 on mac > - > > Key: TEZ-2223 > URL: https://issues.apache.org/jira/browse/TEZ-2223 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang > > [~bikassaha] It looks like TestMockDAGAppMaster fails due to TEZ-2210. > It fails on Mac because cpuPlugin is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377128#comment-14377128 ] Hadoop QA commented on TEZ-2217: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706789/TEZ-2217.1.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/334//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/334//console This message is automatically generated. > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217.1.patch, TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. 
Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2217 PreCommit Build #334
Jira: https://issues.apache.org/jira/browse/TEZ-2217 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/334/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2752 lines...] [INFO] Final Memory: 67M/805M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706789/TEZ-2217.1.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/334//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/334//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. c119360121f6701025a88826b76dec1f3083c568 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #332 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2527096 bytes Compression is 7.2% Took 0.73 sec Description set: TEZ-2217 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-2217: - Attachment: TEZ-2217-debug.txt.bz2 Debug logs attached. {code} $ grep "Releasing unused" app-log.txt | wc -l 111 {code} I always use {{--hiveconf tez.am.log.level="INFO;=DEBUG"}}, that seems to have worked. > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217-debug.txt.bz2, TEZ-2217.1.patch, > TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2217 PreCommit Build #337
Jira: https://issues.apache.org/jira/browse/TEZ-2217 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/337/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 31 lines...] HEAD is now at 6d0b10a TEZ-2176. Move all logging to slf4j. Contributed by Vasanth kumar RJ. Previous HEAD position was 6d0b10a... TEZ-2176. Move all logging to slf4j. Contributed by Vasanth kumar RJ. Switched to branch 'master' Your branch is behind 'origin/master' by 30 commits, and can be fast-forwarded. (use "git pull" to update your local branch) First, rewinding head to replay your work on top of it... Fast-forwarded master to 6d0b10a8445d3c26b0958ce816c64b577a1608d9. TEZ-2217 patch is being downloaded at Tue Mar 24 02:28:13 UTC 2015 from http://issues.apache.org/jira/secure/attachment/12706809/TEZ-2217-debug.txt.bz2 patch: Only garbage was found in the patch input. patch: Only garbage was found in the patch input. patch: Only garbage was found in the patch input. The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706809/TEZ-2217-debug.txt.bz2 against master revision 6d0b10a. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/337//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 468b2e1ce34852fa777431321e7aaa5322b885d9 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #334 Archived 7 artifacts Archive block size is 32768 Received 0 blocks and 1408632 bytes Compression is 0.0% Took 0.36 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## No tests ran.
Failed: TEZ-714 PreCommit Build #336
Jira: https://issues.apache.org/jira/browse/TEZ-714 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/336/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 1229 lines...] Running tests /home/jenkins/tools/maven/latest/bin/mvn clean install -fn -DTezPatchProcess /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/build-tools/test-patch.sh: line 609: /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/../patchprocess/testrun.txt: No such file or directory cat: /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/../patchprocess/testrun.txt: No such file or directory awk: cannot open /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/../patchprocess/testrun.txt (No such file or directory) {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706466/TEZ-714-2.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to cause Findbugs (version 2.0.3) to fail. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/336//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/336//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 92eeff2a0bc0fe4afb6396a0f6663a6b640cf699 logged out == == Finished build. 
== == Build step 'Execute shell' marked build as failure Archiving artifacts [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## No tests ran.
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377148#comment-14377148 ] Hadoop QA commented on TEZ-2217: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706809/TEZ-2217-debug.txt.bz2 against master revision 6d0b10a. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/337//console This message is automatically generated. > The min-held-containers constraint is not enforced during query runtime > > > Key: TEZ-2217 > URL: https://issues.apache.org/jira/browse/TEZ-2217 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Gopal V >Assignee: Bikas Saha > Attachments: TEZ-2217-debug.txt.bz2, TEZ-2217.1.patch, > TEZ-2217.txt.bz2 > > > The min-held containers constraint is respected during query idle times, but > is not respected when a query is actually in motion. > The AM releases unused containers during dag execution without checking for > min-held containers. > {code} > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing > container, containerId=container_1424502260528_1348_01_13, > containerExpiryTime=1426891313264, idleTimeoutMin=5000 > 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] > rm.YarnTaskSchedulerService: Releasing unused container: > container_1424502260528_1348_01_13 > {code} > This is actually useful only after the AM has received a soft pre-emption > message, doing it on an idle cluster slows down one of the most common query > patterns in BI systems. > {code} > create temporary table smalltable as ...; > select ... bigtable JOIN smalltable ON ...; > {code} > The smaller query in the beginning throws away the pre-warmed capacity. 
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377150#comment-14377150 ] Hadoop QA commented on TEZ-714: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706466/TEZ-714-2.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to cause Findbugs (version 2.0.3) to fail. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/336//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/336//console This message is automatically generated. > OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377180#comment-14377180 ] Hadoop QA commented on TEZ-2221: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706800/TEZ-2221-1.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/335//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/335//console This message is automatically generated. > VertexGroup name should be unqiue > - > > Key: TEZ-2221 > URL: https://issues.apache.org/jira/browse/TEZ-2221 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-2221-1.patch > > > VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use vertex > group name to identify the vertex group commit, the same name of vertex group > will conflict. While in the current equals & hashCode of VertexGroup, vertex > group name and members name are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
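The conflict described in the issue above can be sketched as follows: two groups that share a name but have different members are distinct objects under an equals/hashCode that uses both name and members, yet they collide as soon as the name alone is used as the identifier. The Group class and the duplicateNames helper are simplified, hypothetical stand-ins, not the real Tez VertexGroup class or the TEZ-2221 patch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of why vertex group names need to be unique.
class VertexGroupNameSketch {

  // Simplified stand-in for a vertex group: equality uses name AND members.
  static final class Group {
    final String name;
    final Set<String> members;

    Group(String name, String... members) {
      this.name = name;
      this.members = new TreeSet<>(Arrays.asList(members));
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Group)) {
        return false;
      }
      Group other = (Group) o;
      return name.equals(other.name) && members.equals(other.members);
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, members);
    }
  }

  // Returns every group name that is used by more than one group.
  static Set<String> duplicateNames(Collection<Group> groups) {
    Set<String> seen = new HashSet<>();
    Set<String> duplicates = new TreeSet<>();
    for (Group g : groups) {
      if (!seen.add(g.name)) {
        duplicates.add(g.name);
      }
    }
    return duplicates;
  }

  public static void main(String[] args) {
    Group a = new Group("g", "v1", "v2");
    Group b = new Group("g", "v3", "v4");
    // Distinct by equals/hashCode, so both survive in a set.
    System.out.println(new HashSet<>(Arrays.asList(a, b)).size()); // 2
    // But a commit event keyed only by name cannot tell them apart.
    System.out.println(duplicateNames(Arrays.asList(a, b))); // [g]
  }
}
```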
Success: TEZ-2221 PreCommit Build #335
Jira: https://issues.apache.org/jira/browse/TEZ-2221 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/335/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2749 lines...] [INFO] Final Memory: 70M/973M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706800/TEZ-2221-1.patch against master revision 6d0b10a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/335//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/335//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. d5affd0ff69e0697d9b68ca07d5e206cb522faa6 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #334 Archived 44 artifacts Archive block size is 32768 Received 21 blocks and 2035862 bytes Compression is 25.3% Took 0.75 sec Description set: TEZ-2221 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377209#comment-14377209 ] Jeff Zhang commented on TEZ-714: 
bq. Can this be fixed by having the events for both be different? But still handled in the same transition.
It could, but this may make the transition complicated. Currently we need to differentiate these 2 kinds of commits. Besides, there are 2 possible states (RUNNING, COMMITTING) when the commit happens, and we also need to handle 2 different outcomes (commit success and failure), so there would be 8 different cases in total in one transition, which may be difficult to read.
bq. Is this recovery log written relevant only in the non-commit-at-end case where group commits can happen before the DAG finishes?
Yes.
bq. Maybe you can create a new TestCommit that starts from scratch without the hacks in TestVertexImpl.
Yes, this is what I plan to do.
bq. Is this for VertexImpl or DAGImpl? That sounds like a bug. Is that relevant to the commit operation though?
It is relevant to the abort. Currently, in the DAG's InternalErrorTransition (no matter what state it is in), the DAG aborts directly and goes to the ERROR state without waiting for the vertices to finish.
> OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
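The count of 8 cases in the reply above is simply 2 kinds of commit times 2 states times 2 outcomes. A small sketch that enumerates them makes the combinations explicit; the enum names are hypothetical labels for the concepts in the comment, not identifiers from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate the combinations a single transition would handle.
class CommitCaseCountSketch {
  enum CommitKind { GROUP_COMMIT, NON_GROUP_COMMIT }
  enum State { RUNNING, COMMITTING }
  enum Outcome { SUCCEEDED, FAILED }

  static List<String> enumerateCases() {
    List<String> cases = new ArrayList<>();
    for (CommitKind kind : CommitKind.values()) {
      for (State state : State.values()) {
        for (Outcome outcome : Outcome.values()) {
          cases.add(kind + "/" + state + "/" + outcome);
        }
      }
    }
    return cases;
  }

  public static void main(String[] args) {
    List<String> cases = enumerateCases();
    for (String c : cases) {
      System.out.println(c);
    }
    System.out.println(cases.size()); // 8
  }
}
```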
[jira] [Updated] (TEZ-2097) TEZ-UI Add dag logs
[ https://issues.apache.org/jira/browse/TEZ-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2097: - Priority: Critical (was: Blocker) > TEZ-UI Add dag logs > --- > > Key: TEZ-2097 > URL: https://issues.apache.org/jira/browse/TEZ-2097 > Project: Apache Tez > Issue Type: Bug > Components: UI >Reporter: Jeff Zhang >Priority: Critical > > If dag fails due to AM error, there's no way to check the dag logs on tez-ui. > Users have to grab the app logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2097) TEZ-UI Add dag logs
[ https://issues.apache.org/jira/browse/TEZ-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377263#comment-14377263 ] Hitesh Shah commented on TEZ-2097: -- Downgrading to critical. > TEZ-UI Add dag logs > --- > > Key: TEZ-2097 > URL: https://issues.apache.org/jira/browse/TEZ-2097 > Project: Apache Tez > Issue Type: Bug > Components: UI >Reporter: Jeff Zhang >Priority: Critical > > If dag fails due to AM error, there's no way to check the dag logs on tez-ui. > Users have to grab the app logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2097) TEZ-UI Add dag logs
[ https://issues.apache.org/jira/browse/TEZ-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2097: - Target Version/s: 0.6.2 (was: 0.6.1) > TEZ-UI Add dag logs > --- > > Key: TEZ-2097 > URL: https://issues.apache.org/jira/browse/TEZ-2097 > Project: Apache Tez > Issue Type: Bug > Components: UI >Reporter: Jeff Zhang >Priority: Critical > > If dag fails due to AM error, there's no way to check the dag logs on tez-ui. > Users have to grab the app logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377268#comment-14377268 ] Hitesh Shah commented on TEZ-2205: -- [~rohini] [~hagleitn] any comments/concerns on the approach that we plan to take? > Tez still tries to post to ATS when yarn.timeline-service.enabled=false > --- > > Key: TEZ-2205 > URL: https://issues.apache.org/jira/browse/TEZ-2205 > Project: Apache Tez > Issue Type: Sub-task >Affects Versions: 0.6.1 >Reporter: Chang Li >Assignee: Chang Li > Attachments: TEZ-2205.wip.patch > > > when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, > but hits error as token is not found. Does not fail the job because of the > fix to not fail job when there is error posting to ATS. But it should not be > trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
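The expectation stated in the issue above, that nothing should be posted to ATS when yarn.timeline-service.enabled is false, amounts to checking the flag before any publishing happens. The sketch below uses a plain Map as a stand-in for the Hadoop Configuration object, and the class and method names are hypothetical; it is not the approach taken in TEZ-2205.wip.patch. Treating a missing key as false is also an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: skip timeline publishing entirely when the flag is off.
class TimelineGuardSketch {
  static final String TIMELINE_SERVICE_ENABLED = "yarn.timeline-service.enabled";

  // Assumption: a missing key is treated as "false" in this sketch.
  static boolean timelineEnabled(Map<String, String> conf) {
    return Boolean.parseBoolean(conf.getOrDefault(TIMELINE_SERVICE_ENABLED, "false"));
  }

  // Returns how many events were posted; a real publisher would call ATS here.
  static int publish(Map<String, String> conf, List<String> events) {
    if (!timelineEnabled(conf)) {
      return 0; // do not even attempt to post
    }
    return events.size();
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(TIMELINE_SERVICE_ENABLED, "false");
    System.out.println(publish(conf, Arrays.asList("DAG_SUBMITTED", "DAG_FINISHED"))); // 0
    conf.put(TIMELINE_SERVICE_ENABLED, "true");
    System.out.println(publish(conf, Arrays.asList("DAG_SUBMITTED", "DAG_FINISHED"))); // 2
  }
}
```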
[jira] [Commented] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018
[ https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377270#comment-14377270 ] Hitesh Shah commented on TEZ-2047: -- [~pramachandran] Sorry for the delay in the review. Comments: The basic change looks fine, but I am not sure how we are enforcing HTTP-only (no SSL) mode with the current implementation. The WebApps code seems to eventually look into the config for the YARN policy. Should the WebUIService be setting that up correctly to enforce HTTP only? > Build fails against hadoop-2.2 post TEZ-2018 > > > Key: TEZ-2047 > URL: https://issues.apache.org/jira/browse/TEZ-2047 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Prakash Ramachandran >Priority: Blocker > Attachments: TEZ-2047.1.patch > > > Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) > on project tez-dag: Compilation failure: Compilation failure: > [ERROR] > /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13] > cannot find symbol > [ERROR] symbol : method > withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy) > [ERROR] location: class > org.apache.hadoop.yarn.webapp.WebApps.Builder > [ERROR] > /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45] > cannot find symbol > [ERROR] symbol : method getConnectorAddress(int) > [ERROR] location: class org.apache.hadoop.http.HttpServer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-986) Make conf set on DAG and vertex available in jobhistory
[ https://issues.apache.org/jira/browse/TEZ-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377271#comment-14377271 ] Hitesh Shah commented on TEZ-986: - Moving this out to 0.6.2. Not sure if [~Sreenath] has had a chance to look at this jira. > Make conf set on DAG and vertex available in jobhistory > --- > > Key: TEZ-986 > URL: https://issues.apache.org/jira/browse/TEZ-986 > Project: Apache Tez > Issue Type: Sub-task > Components: UI >Reporter: Rohini Palaniswamy >Priority: Blocker > > Would like to have the conf set on DAG and Vertex > 1) viewable in Tez UI after the job completes. This is very essential for > debugging jobs. > 2) We have processes, that parse jobconf.xml from job history (hdfs) and > load them into hive tables for analysis. Would like to have Tez also make all > the configuration (byte array) available in job history so that we can > similarly parse them. 1) mandates that you store it in hdfs. 2) is just to > say make the format stored as a contract others can rely on for parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-986) Make conf set on DAG and vertex available in jobhistory
[ https://issues.apache.org/jira/browse/TEZ-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-986: Target Version/s: 0.6.2 (was: 0.6.1) > Make conf set on DAG and vertex available in jobhistory > --- > > Key: TEZ-986 > URL: https://issues.apache.org/jira/browse/TEZ-986 > Project: Apache Tez > Issue Type: Sub-task > Components: UI >Reporter: Rohini Palaniswamy >Priority: Blocker > > Would like to have the conf set on DAG and Vertex > 1) viewable in Tez UI after the job completes. This is very essential for > debugging jobs. > 2) We have processes, that parse jobconf.xml from job history (hdfs) and > load them into hive tables for analysis. Would like to have Tez also make all > the configuration (byte array) available in job history so that we can > similarly parse them. 1) mandates that you store it in hdfs. 2) is just to > say make the format stored as a contract others can rely on for parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2192) Relocalization does not check for source
[ https://issues.apache.org/jira/browse/TEZ-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2192: - Target Version/s: 0.5.4, 0.6.1 (was: 0.5.4) > Relocalization does not check for source > > > Key: TEZ-2192 > URL: https://issues.apache.org/jira/browse/TEZ-2192 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0, 0.5.2 >Reporter: Rohini Palaniswamy >Priority: Blocker > > PIG-4443 spills the input splits to disk if serialized split size is greater > than some threshold. It faces issues with relocalization when more than one > vertex has job.split file. If a job.split file is already there on container > reuse, it is reused causing wrong data to be read. > Either need a way to turn off relocalization or check the source+timestamp > and redownload the file during relocalization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
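The second option in the description above (check the source and timestamp, and redownload on a mismatch) could be sketched roughly as follows. This is only an illustration of the comparison, not Tez or YARN code: `LocalizedFileCache`, `Source`, `needsDownload`, and `recordDownload` are hypothetical names introduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch, not Tez code: remembers which source each localized file came from.
final class LocalizedFileCache {
  // Identity of a remote resource: where it lives and when it was last modified.
  static final class Source {
    final String uri;
    final long timestamp;

    Source(String uri, long timestamp) {
      this.uri = uri;
      this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Source)) {
        return false;
      }
      Source s = (Source) o;
      return timestamp == s.timestamp && uri.equals(s.uri);
    }

    @Override
    public int hashCode() {
      return Objects.hash(uri, timestamp);
    }
  }

  // Local file name (for example "job.split") -> source it was downloaded from.
  private final Map<String, Source> localized = new HashMap<>();

  // True when there is no local copy, or the copy came from a different source or timestamp.
  boolean needsDownload(String localName, Source requested) {
    Source existing = localized.get(localName);
    return existing == null || !existing.equals(requested);
  }

  // Remember the source of the copy that was just downloaded.
  void recordDownload(String localName, Source source) {
    localized.put(localName, source);
  }
}
```

With a check like this, a reused container that already holds a `job.split` from another vertex would report a mismatch instead of silently reusing the wrong file.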
[jira] [Commented] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377280#comment-14377280 ] Hitesh Shah commented on TEZ-1421: -- [~ozawa] In that case ( given that the solution seems to be non-trivial ), I think we can move the target version to 0.7.0 given that not many other folks have reported this issue. Agree? > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > > I tested MapredWordCount against 70GB generated by RandowTextWriter. When a > Combiner runs, it throws NPE. It looks setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated TEZ-1421: Comment: was deleted (was: [~hitesh] Yes, I agree with you.) > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > > I tested MapredWordCount against 70GB generated by RandowTextWriter. When a > Combiner runs, it throws NPE. It looks setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377293#comment-14377293 ] Tsuyoshi Ozawa commented on TEZ-1421: - [~hitesh] Yes, I agree with you. > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > > I tested MapredWordCount against 70GB generated by RandowTextWriter. When a > Combiner runs, it throws NPE. It looks setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377292#comment-14377292 ] Tsuyoshi Ozawa commented on TEZ-1421: - [~hitesh] Yes, I agree with you. > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > > I tested MapredWordCount against 70GB generated by RandowTextWriter. When a > Combiner runs, it throws NPE. It looks setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377297#comment-14377297 ] Hitesh Shah commented on TEZ-1909: -- Comments: {code} LOG.warn("Other recovery files will be skipped due to error in the previous recovery file"); {code} - please add the file name to this line as well as its length For TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED, maybe change to TEZ_TEST_... and likewise change property value. No scope defined? It seems like the patch for this jira has been merged with fixes for a different jira? Can these be separated out? > Remove need to copy over all events from attempt 1 to attempt 2 dir > --- > > Key: TEZ-1909 > URL: https://issues.apache.org/jira/browse/TEZ-1909 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch > > > Use of file versions should prevent the need for copying over data into a > second attempt dir. Care needs to be taken to handle "last corrupt record" > handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377297#comment-14377297 ] Hitesh Shah edited comment on TEZ-1909 at 3/24/15 5:07 AM: --- Comments: {code} LOG.warn("Other recovery files will be skipped due to error in the previous recovery file"); {code} - please add the file name to this log line For TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED, maybe change to TEZ_TEST_... and likewise change property value. No scope defined? It seems like the patch for this jira has been merged with fixes for a different jira? Can these be separated out? was (Author: hitesh): Comments: {code} LOG.warn("Other recovery files will be skipped due to error in the previous recovery file"); {code} - please add the file name to this line as well as its length For TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED, maybe change to TEZ_TEST_... and likewise change property value. No scope defined? It seems like the patch for this jira has been merged with fixes for a different jira? Can these be separated out? > Remove need to copy over all events from attempt 1 to attempt 2 dir > --- > > Key: TEZ-1909 > URL: https://issues.apache.org/jira/browse/TEZ-1909 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch > > > Use of file versions should prevent the need for copying over data into a > second attempt dir. Care needs to be taken to handle "last corrupt record" > handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377301#comment-14377301 ] Hitesh Shah commented on TEZ-2221: -- what happens if someone does the following: {code} dag.createVertexGroup("group_1", v1,v2); dag.createVertexGroup("group_2", v1,v2); {code} This should also be disallowed. Correct? > VertexGroup name should be unqiue > - > > Key: TEZ-2221 > URL: https://issues.apache.org/jira/browse/TEZ-2221 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-2221-1.patch > > > VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use vertex > group name to identify the vertex group commit, the same name of vertex group > will conflict. While in the current equals & hashCode of VertexGroup, vertex > group name and members name are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1421: - Priority: Critical (was: Blocker) > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Critical > > I tested MapredWordCount against 70GB generated by RandowTextWriter. When a > Combiner runs, it throws NPE. It looks setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch
[ https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1421: - Target Version/s: 0.7.0 (was: 0.6.1) > MRCombiner throws NPE in MapredWordCount on master branch > - > > Key: TEZ-1421 > URL: https://issues.apache.org/jira/browse/TEZ-1421 > Project: Apache Tez > Issue Type: Bug >Reporter: Tsuyoshi Ozawa >Assignee: Tsuyoshi Ozawa >Priority: Blocker > > I tested MapredWordCount against 70GB generated by RandowTextWriter. When a > Combiner runs, it throws NPE. It looks setCombinerClass doesn't work > correctly. > {quote} > Caused by: java.lang.RuntimeException: java.lang.NullPointerException > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) > at > org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122) > at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605) > at > org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2204: - Target Version/s: 0.7.0 (was: 0.5.4) > TestAMRecovery increasingly flaky on jenkins builds. > - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch, > TEZ-2204-4.patch > > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377310#comment-14377310 ] Hitesh Shah commented on TEZ-2204: -- Comments:
{code}
// don't handle events if DAGAppMaster is in the state of STOPPED,
// otherwise there may be dead-lock happen. TEZ-2204
if (DAGAppMaster.this.getServiceState() == STATE.STOPPED) {
  return;
}
{code}
Can you add a log message to identify what events are being received after the AM is stopped? +1 after the above comment is addressed. > TestAMRecovery increasingly flaky on jenkins builds. > - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch, > TEZ-2204-4.patch > > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
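The review request above (log which events arrive after the AM has stopped) might look something like the standalone sketch below. It does not use the real DAGAppMaster or Hadoop service classes: `StoppableEventHandler` and its methods are hypothetical, events are plain strings, and `java.util.logging` stands in for whatever logger the patch uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch, not the DAGAppMaster code: drops events after stop and logs each one.
final class StoppableEventHandler {
  private static final Logger LOG = Logger.getLogger(StoppableEventHandler.class.getName());

  private volatile boolean stopped = false;
  private final List<String> handled = new ArrayList<>();
  private final List<String> dropped = new ArrayList<>();

  void stop() {
    stopped = true;
  }

  void handle(String eventType) {
    if (stopped) {
      // Name the late event so that post-stop traffic is visible in the logs.
      LOG.warning("Ignoring event received after stop, eventType=" + eventType);
      dropped.add(eventType);
      return;
    }
    handled.add(eventType);
  }

  List<String> handledEvents() {
    return handled;
  }

  List<String> droppedEvents() {
    return dropped;
  }
}
```

Keeping the early return and only adding the log line preserves the dead-lock guard while making it possible to see which component keeps sending events.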
[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377323#comment-14377323 ] Jeff Zhang commented on TEZ-2221: - bq. This should also be disallowed. Correct? Yes, it is not allowed. > VertexGroup name should be unqiue > - > > Key: TEZ-2221 > URL: https://issues.apache.org/jira/browse/TEZ-2221 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-2221-1.patch > > > VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use vertex > group name to identify the vertex group commit, the same name of vertex group > will conflict. While in the current equals & hashCode of VertexGroup, vertex > group name and members name are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377327#comment-14377327 ] Hitesh Shah commented on TEZ-2221: -- Sorry - should have clarified. The test is being changed to not test that condition. > VertexGroup name should be unqiue > - > > Key: TEZ-2221 > URL: https://issues.apache.org/jira/browse/TEZ-2221 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-2221-1.patch > > > VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use vertex > group name to identify the vertex group commit, the same name of vertex group > will conflict. While in the current equals & hashCode of VertexGroup, vertex > group name and members name are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377330#comment-14377330 ] Jeff Zhang commented on TEZ-2221: - In the previous test case we compared vertex groups using both group_name and members. I changed the test case to indicate that we now compare by group name only. > VertexGroup name should be unqiue > - > > Key: TEZ-2221 > URL: https://issues.apache.org/jira/browse/TEZ-2221 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: TEZ-2221-1.patch > > > VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use vertex > group name to identify the vertex group commit, the same name of vertex group > will conflict. While in the current equals & hashCode of VertexGroup, vertex > group name and members name are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
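The name-only comparison discussed in this thread can be illustrated with a small standalone sketch. It is not the Tez `VertexGroup` or `DAG` code: `NamedGroup` and `GroupRegistry` are hypothetical stand-ins, and members are plain strings rather than vertices.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Tez classes: a group's identity is its name alone.
final class NamedGroup {
  final String name;
  final List<String> members;

  NamedGroup(String name, String... members) {
    this.name = name;
    this.members = Arrays.asList(members);
  }

  // Members are deliberately ignored: two groups with the same name are the same group.
  @Override
  public boolean equals(Object o) {
    return o instanceof NamedGroup && name.equals(((NamedGroup) o).name);
  }

  @Override
  public int hashCode() {
    return name.hashCode();
  }
}

final class GroupRegistry {
  private final Set<NamedGroup> groups = new HashSet<>();

  // Rejects a second group with an already used name, whatever its members are.
  NamedGroup createGroup(String name, String... members) {
    NamedGroup g = new NamedGroup(name, members);
    if (!groups.add(g)) {
      throw new IllegalStateException("Group " + name + " already defined");
    }
    return g;
  }
}
```

Note that under name-only equality this registry still accepts `group_1` and `group_2` with identical members; rejecting that case, as asked about earlier in the thread, would need an additional check on the member set.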
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377401#comment-14377401 ] Bikas Saha commented on TEZ-714: bq. It could, but this may make the transition complicated. Currently we need to differentiate these 2 kinds of commits, besides there's 2 possible states (RUNNING, COMMITTING) when the commit happens and we also need check handle 2 different cases (commit succeeded & failure), so there would be totally 8 different cases in one transition which may be difficult to read. I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state transitions as inspiration. There are some standard things to do when a commit operation completes, e.g. decrement the outstanding commit counter. If the commit was a group commit then write the recovery entry for it. If the commit fails then set a flag to abort. This can be in a base transition, say CommitCompletedTransition. Then we can have CommitCompletedWhileRunningTransition that calls the base for common code and does running-specific stuff, e.g. trigger job failure upon commit failure. And another transition for CommitCompletedWhileCommitting that just waits for the commit counter to drop to 0. Next, CommitCompletedWhileTerminating, which waits for all commit operations to complete and then calls abort (this could be blocking for now). Perhaps all commit events need to have a shared boolean that they should check before invoking commit. This boolean could be set to false when the vertex/dag decides to abort. This would make any pending commit operations complete quickly instead of trying to commit unnecessarily. Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. Create custom committers that fail/pass as desired and check that the dag behaved as expected. 
> OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377401#comment-14377401 ] Bikas Saha edited comment on TEZ-714 at 3/24/15 6:54 AM: - bq. It could, but this may make the transition complicated. Currently we need to differentiate these 2 kinds of commits, besides there's 2 possible states (RUNNING, COMMITTING) when the commit happens and we also need check handle 2 different cases (commit succeeded & failure), so there would be totally 8 different cases in one transition which may be difficult to read. I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state transitions as inspiration. There are some standard things to do when a commit operation completes, e.g. decrement the outstanding commit counter. If the commit was a group commit then write the recovery entry for it. If the commit fails then set a flag to abort. This can be in a base transition, say CommitCompletedTransition. Then we can have CommitCompletedWhileRunningTransition that calls the base for common code and does running-specific stuff, e.g. trigger job failure upon commit failure. And another transition for CommitCompletedWhileCommitting that just waits for the commit counter to drop to 0. Next, CommitCompletedWhileTerminating, which waits for all commit operations to complete and then calls abort (this could be blocking for now). This way we can separate things while still keeping the transitions essentially linear, instead of multiplying the possibilities by (2 commit types x 3 states x 2 commit results). Perhaps all commit events need to have a shared boolean that they should check before invoking commit. This boolean could be set to false when the vertex/dag decides to abort. This would make any pending commit operations complete quickly instead of trying to commit unnecessarily. Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. 
Create custom committers that fail/pass as desired and check that the dag behaved as expected. was (Author: bikassaha): bq. It could, but this may make the transition complicated. Currently we need to differentiate these 2 kinds of commits, besides there's 2 possible states (RUNNING, COMMITTING) when the commit happens and we also need check handle 2 different cases (commit succeeded & failure), so there would be totally 8 different cases in one transition which may be difficult to read. I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state transitions as inspiration. There are some standard things to do when a commit operation completes. e.g. decrement the outstanding commit counter. If commit was a group commit then write the recovery entry for it. If the commit fails then set a flag to abort. This can be in a base transition say CommitCompletedTransition. Then we can have CommitCompletedWhileRunningTransition that calls the base for common code and does running specific stuff.e.g. trigger job failure upon commit failure. And another transition for CommitCompletedWhileCommitting that just waits for the commit counter to drop to 0. Next, CommitCompletedWhileTerminating which waits for all commit operations to complete and then calls abort (this could be blocking for now). Perhaps, all commit events need to have a shared boolean that they should check before invoking commit. This boolean could be set to false when the vertex/dag decides to abort. This would make and pending commit operations complete quickly instead of trying to commit unnecessarily. Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. Create custom committers that fail/pass as desired and check that the dag behaved as expected. 
> OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
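The layering proposed in the comment above (a base transition for the common bookkeeping, plus one thin transition per state) could be sketched as below. This is a simplified illustration under stated assumptions, not Tez code: `CommitContext`, the string-valued `state`, the recovery log format, and the choice of `TERMINATING`, `SUCCEEDED`, and `FAILED` as resulting states are all hypothetical, and the real state machine classes are not used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the proposed layering; none of these are Tez classes.
final class CommitContext {
  int outstandingCommits;
  boolean abortRequested;
  // Shared flag that queued commit operations would check before doing any work.
  final AtomicBoolean commitAllowed = new AtomicBoolean(true);
  final List<String> recoveryLog = new ArrayList<>();
  String state = "RUNNING";
}

// Common bookkeeping done whenever any commit operation completes.
class CommitCompletedTransition {
  void transition(CommitContext ctx, String groupName, boolean succeeded) {
    ctx.outstandingCommits--;
    if (groupName != null) {
      // Group commits get a recovery entry; plain commits do not.
      ctx.recoveryLog.add("group-commit-completed:" + groupName + ":" + succeeded);
    }
    if (!succeeded) {
      ctx.abortRequested = true;
      ctx.commitAllowed.set(false); // pending commits should now return quickly
    }
  }
}

// RUNNING: a failed commit triggers failure handling right away.
final class CommitCompletedWhileRunning extends CommitCompletedTransition {
  @Override
  void transition(CommitContext ctx, String groupName, boolean succeeded) {
    super.transition(ctx, groupName, succeeded);
    if (!succeeded) {
      ctx.state = "TERMINATING";
    }
  }
}

// COMMITTING: wait until the outstanding commit counter drops to zero.
final class CommitCompletedWhileCommitting extends CommitCompletedTransition {
  @Override
  void transition(CommitContext ctx, String groupName, boolean succeeded) {
    super.transition(ctx, groupName, succeeded);
    if (ctx.outstandingCommits == 0) {
      ctx.state = ctx.abortRequested ? "FAILED" : "SUCCEEDED";
    }
  }
}

// TERMINATING: wait for all commits to finish, then abort (inline in this sketch).
final class CommitCompletedWhileTerminating extends CommitCompletedTransition {
  @Override
  void transition(CommitContext ctx, String groupName, boolean succeeded) {
    super.transition(ctx, groupName, succeeded);
    if (ctx.outstandingCommits == 0) {
      ctx.recoveryLog.add("abort");
      ctx.state = "FAILED";
    }
  }
}
```

Each subclass handles one state and delegates the shared work to the base, so the number of code paths grows additively with states rather than as the product of commit type, state, and result.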