[jira] [Updated] (TEZ-2692) bugfixes enhancements related to job parser and analyzer
[ https://issues.apache.org/jira/browse/TEZ-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2692: -- Attachment: TEZ-2692.3.patch
- Fixed getTaskRuntime() in SlowTaskAnalyzer. It should be firstTaskToStart.
- Fixed the concurrency calculator logic (now using a sorted multiset of timestamps, since different tasks can start at the same time; the set is walked to determine concurrency, as suggested).
- Merged the test with TestATSFileParser and renamed TestATSFileParser to TestHistoryParser.
Will commit once the pre-commit passes.
bugfixes enhancements related to job parser and analyzer -- Key: TEZ-2692 URL: https://issues.apache.org/jira/browse/TEZ-2692 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2692.1.patch, TEZ-2692.2.patch, TEZ-2692.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2692) bugfixes enhancements related to job parser and analyzer
[ https://issues.apache.org/jira/browse/TEZ-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2692: -- Attachment: TEZ-2692.2.patch Attaching revised patch to address review comments. bugfixes enhancements related to job parser and analyzer -- Key: TEZ-2692 URL: https://issues.apache.org/jira/browse/TEZ-2692 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2692.1.patch, TEZ-2692.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2704) Fix version on tez job analyzer
Siddharth Seth created TEZ-2704: --- Summary: Fix version on tez job analyzer Key: TEZ-2704 URL: https://issues.apache.org/jira/browse/TEZ-2704 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2658) Create a CLI utility tool to track Tez DAG/Application Stats
[ https://issues.apache.org/jira/browse/TEZ-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680412#comment-14680412 ] TezQA commented on TEZ-2658: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749617/TEZ-2658.2.patch against master revision eadbfec. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/972//console This message is automatically generated. Create a CLI utility tool to track Tez DAG/Application Stats Key: TEZ-2658 URL: https://issues.apache.org/jira/browse/TEZ-2658 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Attachments: TEZ-2658.1.patch, TEZ-2658.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2618) In Ordered Fetcher, if Local Fetch fails, fallback and try http Fetch before returning a failure
[ https://issues.apache.org/jira/browse/TEZ-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikat updated TEZ-2618: Attachment: TEZ-2618.1.patch Rebased patch on top of TEZ-2172. In Ordered Fetcher, if Local Fetch fails, fallback and try http Fetch before returning a failure Key: TEZ-2618 URL: https://issues.apache.org/jira/browse/TEZ-2618 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Attachments: TEZ-2618.1.patch, TEZ-2618.patch
In the setupLocalDiskFetch() method (invoked when the fetcher is on the same host as the target map host), first check whether the target spill file can be opened using localDirAllocator.getLocalPathToRead(). The localDirAllocator searches through the list of configured dirs for the file. In disk-full scenarios, if the path is not found, the fetcher should try an http fetch.
Proposed solution: in local fetch mode, the fetcher should first try getLocalPathToRead() for all the pending maps. Local fetch then gets divided into 2 stages: first, fetch the maps for which a path was found via LocalDirAllocator; second, construct an http fallback fetch list for the maps which couldn't be found via LocalDirAllocator.getLocalPathToRead() and do an http fetch.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
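The two-stage split proposed in the TEZ-2618 description above could look roughly like the following. This is only an illustrative sketch, not the attached patch: the class `LocalFetchPlan`, its fields, and the `resolver` function are hypothetical names, and the resolver merely stands in for a lookup such as LocalDirAllocator.getLocalPathToRead(), returning an empty Optional when no local path is found.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: partitions pending maps into a local-read list and an http fallback list.
final class LocalFetchPlan {
    final List<String> localPaths = new ArrayList<>();   // stage 1: spill files found on local disk
    final List<String> httpFallback = new ArrayList<>(); // stage 2: map ids to fetch over http

    // resolver is an assumed stand-in for the local path lookup; empty means "not found".
    static LocalFetchPlan plan(List<String> pendingMaps,
                               Function<String, Optional<String>> resolver) {
        LocalFetchPlan plan = new LocalFetchPlan();
        for (String mapId : pendingMaps) {
            Optional<String> path = resolver.apply(mapId);
            if (path.isPresent()) {
                plan.localPaths.add(path.get());
            } else {
                plan.httpFallback.add(mapId);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        // Pretend only m1 and m3 have spill files on a local disk.
        Map<String, String> onDisk = new HashMap<>();
        onDisk.put("m1", "/disk1/m1.out");
        onDisk.put("m3", "/disk2/m3.out");

        List<String> pending = new ArrayList<>();
        pending.add("m1");
        pending.add("m2");
        pending.add("m3");

        LocalFetchPlan plan = plan(pending, id -> Optional.ofNullable(onDisk.get(id)));
        System.out.println("local: " + plan.localPaths);
        System.out.println("http fallback: " + plan.httpFallback);
    }
}
```

Resolving every pending map up front keeps the failure decision in one place: only maps that fail both the local read and the http fallback would need to be reported as fetch failures.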
[jira] [Updated] (TEZ-2658) Create a CLI utility tool to track Tez DAG/Application Stats
[ https://issues.apache.org/jira/browse/TEZ-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikat updated TEZ-2658: Attachment: TEZ-2658.3.patch Create a CLI utility tool to track Tez DAG/Application Stats Key: TEZ-2658 URL: https://issues.apache.org/jira/browse/TEZ-2658 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Attachments: TEZ-2658.1.patch, TEZ-2658.2.patch, TEZ-2658.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-2704) Fix version on tez job analyzer
[ https://issues.apache.org/jira/browse/TEZ-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-2704. - Resolution: Done Looks like this was already fixed by [~zjffdu] Fix version on tez job analyzer --- Key: TEZ-2704 URL: https://issues.apache.org/jira/browse/TEZ-2704 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680555#comment-14680555 ] TezQA commented on TEZ-2300: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749649/TEZ-2300.2.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.client.TestTezClient Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/974//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/974//artifact/patchprocess/newPatchFindbugsWarningstez-api.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/974//console This message is automatically generated. TezClient.stop() takes a lot of time or does not work sometimes --- Key: TEZ-2300 URL: https://issues.apache.org/jira/browse/TEZ-2300 Project: Apache Tez Issue Type: Bug Reporter: Rohini Palaniswamy Assignee: Jonathan Eagles Attachments: TEZ-2300.1.patch, TEZ-2300.2.patch, syslog_dag_1428329756093_325099_1_post Noticed this with a couple of pig scripts which were not behaving well (AM close to OOM, etc) and even with some that were running fine. 
Pig calls TezClient.stop() in a shutdown hook. On Ctrl+C, the pig script either exits immediately or hangs. In both cases it takes a long time for the yarn application to go to KILLED state. Many times I just end up calling yarn application -kill separately after waiting for 5 mins or more for it to get killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-2300: - Attachment: TEZ-2300.2.patch TezClient.stop() takes a lot of time or does not work sometimes --- Key: TEZ-2300 URL: https://issues.apache.org/jira/browse/TEZ-2300 Project: Apache Tez Issue Type: Bug Reporter: Rohini Palaniswamy Assignee: Jonathan Eagles Attachments: TEZ-2300.1.patch, TEZ-2300.2.patch, syslog_dag_1428329756093_325099_1_post Noticed this with a couple of pig scripts which were not behaving well (AM close to OOM, etc) and even with some that were running fine. Pig calls TezClient.stop() in a shutdown hook. On Ctrl+C, the pig script either exits immediately or hangs. In both cases it takes a long time for the yarn application to go to KILLED state. Many times I just end up calling yarn application -kill separately after waiting for 5 mins or more for it to get killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2658) Create a CLI utility tool to track Tez DAG/Application Stats
[ https://issues.apache.org/jira/browse/TEZ-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680676#comment-14680676 ] TezQA commented on TEZ-2658: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749655/TEZ-2658.3.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/975//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/975//artifact/patchprocess/newPatchFindbugsWarningstez-cli-tools.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/975//console This message is automatically generated. Create a CLI utility tool to track Tez DAG/Application Stats Key: TEZ-2658 URL: https://issues.apache.org/jira/browse/TEZ-2658 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Attachments: TEZ-2658.1.patch, TEZ-2658.2.patch, TEZ-2658.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2658) Create a CLI utility tool to track Tez DAG/Application Stats
[ https://issues.apache.org/jira/browse/TEZ-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikat updated TEZ-2658: Attachment: TEZ-2658.4.patch fixed findbug warning for VA_FORMAT_STRING_USES_NEWLINE. Create a CLI utility tool to track Tez DAG/Application Stats Key: TEZ-2658 URL: https://issues.apache.org/jira/browse/TEZ-2658 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Attachments: TEZ-2658.1.patch, TEZ-2658.2.patch, TEZ-2658.3.patch, TEZ-2658.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services
[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680721#comment-14680721 ] Siddharth Seth edited comment on TEZ-2003 at 8/10/15 9:09 PM: -- bq. logErrorIngored, hearbeats, getCurretnDagName bq. - remove “*” e.g. import org.apache.tez.common.asterisk; Captured in TEZ-2678 bq. abortTask vs close/cleanup Will check the code. abortTask should try cleaning up in both of them. bq. TezTaskRunner2 killTask isn't used yet within Tez, which is why it's not informing the AM. When task preemption comes in - the flow is likely to be a killTask invoked as a result of an RPC, at which point the AM already knows that the task is killed since it took the decision. On the various atomic gets - there are separate variables to track what states have been set, and these are used in the return result. Atomicity of the entire operation is handled via synchronization blocks. TaskRunner handling containerStop is a result of containerStop coming over a shared Task/Container protocol - which is linked to the running task. It could be separated, but I think that'll need the protocols to be separated as well. canCommit during a shutdown - will change this. I'll also verify what the TaskRunner behaviour was. TEZ-2678 bq. TaskReporter I don't think shutdown needs synchronization. It modifies a final variable. Whether it's implemented correctly needs more investigation. It's the same as what exists on master. bq. ShuffleHandler This is essentially the shuffle handler that is used in regular clusters. It's not meant as a benchmark tool. Using the current shuffle mechanics seems like the simplest mechanism to have jobs work with the standard set of Inputs/Outputs which write to disk. bq. ext-service-tests Agree with making this a reference for ext services. It would need to implement the APIs better, and be documented a lot better to serve this purpose. Creating a new jira to track this - TEZ-2705. Post merge ? bq. 
JoinValidate The changes are for private use, to be able to re-use the example in testing. Will add docs to mention this. bq. TezTaskCommunicatorImpl Using payloads wherever possible - including internal plugins. Avoided in LocalContainerLauncher only at the moment, where a lot of runtime AM information is used. Will fix isKnownContainer and containerAlive to be based on the specific communicator. Renaming methods in TaskComm - tracked in the TaskComm enhancements jira. getDagName null - will try improving this. getVertexName - I'm not sure there's a lot that can be done. TezException instead of NPE ? Eventually this will lead to an error in the plugin, which needs to be handled better. There's a jira to track such error handling. onStateUpdated - is the AM telling the TaskCommunicator plugin that a vertex has changed state. Similar to what is done elsewhere - like the InputInitializers. dagCompleteStart - couldn't find this. Maybe I removed it at some point for the same reason - it is a very confusing name. bq. Is there a need for the framework to make updates into the Context object? If yes, should the Context implement 2 interfaces? Should the internal objects just bind to the internal Impl objects or are they bound to the public plugin interfaces to catch compat errors? Binding to Impls directly may mean a smaller public API interface. Need more clarification on this comment. bq. ctor.setAccessible(true); Will do. was (Author: sseth): bq. logErrorIngored, hearbeats, getCurretnDagName bq. - remove “*” e.g. import org.apache.tez.common.asterisk; Captured in TEZ-2678 bq. abortTask vs close/cleanup Will check the code. abortTask should try cleaning up in both of them. bq. TezTaskRunner2 killTask isn't used yet within Tez, which is why it's not informing the AM. When task preemption comes in - the flow is likely to be a killTask invoked as a result of an RPC, at which point the AM already knows that the task is killed since it took the decision. 
On the various atomic gets - there are separate variables to track what states have been set, and these are used in the return result. Atomicity of the entire operation is handled via synchronization blocks. TaskRunner handling containerStop is a result of containerStop coming over a shared Task/Container protocol - which is linked to the running task. It could be separated, but I think that'll need the protocols to be separated as well. canCommit during a shutdown - will change this. I'll also verify what the TaskRunner behaviour was. TEZ-2678 bq. TaskReporter I don't think shutdown needs synchronization. It modifies a final variable. Whether it's implemented correctly needs more investigation. It's the same as what exists on master. bq. ShuffleHandler This is essentially the shuffle handler that is used in regular clusters. It's not meant as a benchmark tool. Using the current shuffle
[jira] [Commented] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services
[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680784#comment-14680784 ] Bikas Saha commented on TEZ-2003: - Some initial comments on the modified existing code. Not yet seen the newly added code. One repeated item in the comments is the special casing of uber/yarn mode in different places. I would expect plugins to come in from the user when the AM is created and either the client/AppMaster would create default plugins for uber/yarn. Thereafter dag/vertex/routers should not need to have special casing for any plugin (like the uber/yarn special casing that exists in all these places in the branch). e.g. the vertexmanager only uses VertexManagerPluginDescriptor - even for the built-in plugins. Similar, I would expect the communicator/scheduler/launcher wrappers to work only with plugin descriptors. Also creating a ServicePlugin class will help in reducing code duplication and make maintenance easier instead of having scheduler id, launcherId and commId everywhere. ContainerSignatureMatcher - ExecutorSignatureMatcher ? {code} +public interface ContainerSignatureMatcher { {code} Why is this here? ServicePluginLifecyle etc. in tez-runtime-api like Inputs/Output/InputInitializer etc. ? Typically we say start() - stop() instead of shutdown {code}+public interface ServicePluginLifecycle { + void start() throws Exception; + void shutdown() throws Exception; {code} Why are executedInAm and executeInContainers there? {code} + public static class VertexExecutionContext { +final boolean executeInAm; +final boolean executeInContainers; +final String taskSchedulerName;{code} Rename to ExecutorEndReason ? Also, how can An error in the AM be caused by a container running a task? 
{code}
+public enum ContainerEndReason {
+  NODE_FAILED, // Completed because the node running the container was marked as dead
+  APPLICATION_ERROR, // An error in the AM caused by user code
+  FRAMEWORK_ERROR, // An error in the AM - likely a bug.
+  LAUNCH_FAILED, // Failure to launch the container
+}
{code}
Why does this have schedulerName and taskCommName ?
{code}
+public class ContainerLaunchRequest extends ContainerLauncherOperationBase {
+
+  private final ContainerLaunchContext clc;
+  private final Container container;
+}
{code}
Why enableContainers and enableUber?
{code}
+public class ServicePluginsDescriptor {
+
+  private final boolean enableContainers;
+  private final boolean enableUber;
+
{code}
Has the internal one been replaced by this?
{code}
+public enum TaskAttemptEndReason {
+  NODE_FAILED, // Completed because the node running the container was marked as dead
+}
{code}
Rename to ExecutorBusy ?
{code}
+  COMMUNICATION_ERROR, // Equivalent to a launch failure
+  SERVICE_BUSY, // Service rejected the task
+  INTERRUPTED_BY_SYSTEM, // Interrupted by the system. e.g. Pre-emption
+  INTERRUPTED_BY_USER, // Interrupted by the user
+  }
{code}
Why does the isLocal flag need to be passed to the Scheduler/Launcher/Communicator routers, instead of using a service plugin for local?
Is it ensured that the integer for a service plugin will turn out to be the same after AM restart?
Why is the yarn scheduler special cased? Launcher/Communicator don't have the special casing ?
{code}
+  static void processSchedulerDescriptors(List<NamedEntityDescriptor> descriptors, boolean isLocal,
+      UserPayload defaultPayload,
+      BiMap<String, Integer> schedulerPluginMap) {
.
+    if (!foundYarn) {
+      NamedEntityDescriptor yarnDescriptor =
+          new NamedEntityDescriptor(TezConstants.getTezYarnServicePluginName(), null)
+              .setUserPayload(defaultPayload);
+      addDescriptor(descriptors, schedulerPluginMap, yarnDescriptor);
+    }
{code}
Why use a different code path for uber/default?
They should just work when instantiated the same way as a custom plugin. {code} + TaskCommunicator createTaskCommunicator(NamedEntityDescriptor taskCommDescriptor, + int taskCommIndex) { +if (taskCommDescriptor.getEntityName().equals(TezConstants.getTezYarnServicePluginName())) { + return createDefaultTaskCommunicator(taskCommunicatorContexts[taskCommIndex]); +} else if (taskCommDescriptor.getEntityName() +.equals(TezConstants.getTezUberServicePluginName())) { + return createUberTaskCommunicator(taskCommunicatorContexts[taskCommIndex]); +} else { + return createCustomTaskCommunicator(taskCommunicatorContexts[taskCommIndex], + taskCommDescriptor); }{code} Are this and other methods threadsafe wrt callback from multiple plugins? {code} + public TaskHeartbeatResponse heartbeat(TaskHeartbeatRequest request) + throws IOException, TezException { {code} Also in heartbeat(), the following code has been lost during
Success: TEZ-2658 PreCommit Build #976
Jira: https://issues.apache.org/jira/browse/TEZ-2658 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/976/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 3340 lines...] [INFO] Final Memory: 87M/1386M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749680/TEZ-2658.4.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/976//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/976//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 11df6f4d94223ca30700d446da3bf500189ebab2 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #973 Archived 53 artifacts Archive block size is 32768 Received 4 blocks and 2983826 bytes Compression is 4.2% Took 0.68 sec Description set: TEZ-2658 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2658) Create a CLI utility tool to track Tez DAG/Application Stats
[ https://issues.apache.org/jira/browse/TEZ-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680858#comment-14680858 ] TezQA commented on TEZ-2658: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749680/TEZ-2658.4.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/976//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/976//console This message is automatically generated. Create a CLI utility tool to track Tez DAG/Application Stats Key: TEZ-2658 URL: https://issues.apache.org/jira/browse/TEZ-2658 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Attachments: TEZ-2658.1.patch, TEZ-2658.2.patch, TEZ-2658.3.patch, TEZ-2658.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services
[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680781#comment-14680781 ] Siddharth Seth commented on TEZ-2003: - Haven't thought much about work preserving restart. That'll need to be considered at some point when we start supporting this. It'll depend on the mechanism that is used for running tasks to reconnect to the AM. That's where the TaskCommunicator comes in - and may need to provide additional information for recovery. A push based mechanism to communicate with executors will make work preserving recovery a lot simpler. The communicator protocol - which is now plugin code - would need to handle recovery - with appropriate timeouts and retry policies in place from the task side, as well as some re-discovery and reconnection mechanics. The framework can help by providing this sub-system with relevant information after a restart occurs. [Umbrella] Allow Tez to co-ordinate execution to external services -- Key: TEZ-2003 URL: https://issues.apache.org/jira/browse/TEZ-2003 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Attachments: 2003_20150728.1.txt, 2003_20150807.1.txt, 2003_20150807.2.txt, Tez With External Services.pdf The Tez engine itself takes care of co-ordinating execution - controlling how data gets routed (different connection patterns), fault tolerance, scheduling of work, etc. This is currently tied to TaskSpecs defined within Tez and to containers launched by Tez itself (TezChild). The proposal is to allow Tez to work with external services instead of just containers launched by Tez. This involves several more pluggable layers to work with alternate Task Specifications, custom launch and task allocation mechanics, as well as custom scheduling sources. A simple example would be a process with the capability to execute multiple Tez TaskSpecs as threads. In such a case, a container launch isn't really needed and can be mocked. 
Sourcing / scheduling containers would need to be pluggable. A more advanced example would be LLAP (HIVE-7926; https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf). This works with custom interfaces - which would need to be supported by Tez, along with a custom event model which would need translation hooks. Tez should be able to work with a combination of certain vertices running in external services and others running in regular Tez containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2692) bugfixes enhancements related to job parser and analyzer
[ https://issues.apache.org/jira/browse/TEZ-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680817#comment-14680817 ] Bikas Saha commented on TEZ-2692: -
Do we need firstTaskToFinish or firstTaskToStart? If the latter, then should we be using dag.getStartTime() or vertex.getStartTime() ?
{code}
+  private long getTaskRuntime(VertexInfo vertexInfo) {
+    TaskInfo firstTaskToFinish = vertexInfo.getFirstTaskToStart();
+    TaskInfo lastTaskToFinish = vertexInfo.getLastTaskToFinish();
+
+    DagInfo dagInfo = vertexInfo.getDagInfo();
+    long totalTime = ((lastTaskToFinish == null) ?
+        dagInfo.getFinishTime() : lastTaskToFinish.getFinishTime()) -
+        ((firstTaskToFinish == null) ? dagInfo.getFinishTime() : firstTaskToFinish.getFinishTime());
+    return totalTime;
   }
{code}
The concurrency calculator logic could be improved a bit. E.g., if we arrange all start and end timestamps in sorted order as St1, St2, Et3, Et4, then we can walk this list to produce concurrency as (t1, 1), (t2, 2), (t3, 1), (t4, 0). If this logic is correct, we could do it here or in a follow up jira.
If possible, can the new test be merged into the existing ATS parser test? This would reuse code and also reduce test run time by reusing the same mini cluster.
Rest looks good!
bugfixes enhancements related to job parser and analyzer -- Key: TEZ-2692 URL: https://issues.apache.org/jira/browse/TEZ-2692 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2692.1.patch, TEZ-2692.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
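The timestamp walk suggested in the TEZ-2692 review comment above can be sketched as follows. This is a minimal illustration under assumptions, not the code from the patch: the class name `ConcurrencySweep` and the `timeline` method are hypothetical, and tasks are represented simply as {start, end} pairs. A TreeMap of per-timestamp deltas is used here in place of a sorted multiset so that tasks starting or ending at the same instant are merged into one point.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: derives (time, concurrency) points from task start/end times.
final class ConcurrencySweep {

    // Each interval is {startTime, endTime}. Returns time -> number of tasks running from that time on.
    static TreeMap<Long, Integer> timeline(long[][] intervals) {
        // +1 at every start and -1 at every end; equal timestamps accumulate in one entry.
        TreeMap<Long, Integer> delta = new TreeMap<>();
        for (long[] interval : intervals) {
            delta.merge(interval[0], 1, Integer::sum);
            delta.merge(interval[1], -1, Integer::sum);
        }
        // Walk the timestamps in sorted order, keeping a running count.
        TreeMap<Long, Integer> result = new TreeMap<>();
        int running = 0;
        for (Map.Entry<Long, Integer> entry : delta.entrySet()) {
            running += entry.getValue();
            result.put(entry.getKey(), running);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two tasks: one runs from t=1 to t=3, the other from t=2 to t=4.
        long[][] tasks = {{1, 3}, {2, 4}};
        System.out.println(timeline(tasks)); // prints {1=1, 2=2, 3=1, 4=0}
    }
}
```

The example input mirrors the St1, St2, Et3, Et4 ordering from the comment and yields the same (t1, 1), (t2, 2), (t3, 1), (t4, 0) sequence. The peak concurrency is then just the maximum value in the returned map.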
[jira] [Updated] (TEZ-2687) ATS History shutdown happens before the min-held containers are released
[ https://issues.apache.org/jira/browse/TEZ-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-2687: - Assignee: (was: Gopal V) ATS History shutdown happens before the min-held containers are released Key: TEZ-2687 URL: https://issues.apache.org/jira/browse/TEZ-2687 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.2, 0.8.0, 0.7.1 Reporter: Gopal V Attachments: TEZ-2687.1.patch When ATS goes into a GC pause under heavy loads and while it recovers, each Tez AM holds onto a few containers even though it is shutting down and will never accept any more DAGs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2687) ATS History shutdown happens before the min-held containers are released
[ https://issues.apache.org/jira/browse/TEZ-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680881#comment-14680881 ] Gopal V commented on TEZ-2687: -- Deleting the bad patch attached to the JIRA and leaving the issue as unresolved. ATS History shutdown happens before the min-held containers are released Key: TEZ-2687 URL: https://issues.apache.org/jira/browse/TEZ-2687 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.2, 0.8.0, 0.7.1 Reporter: Gopal V When ATS goes into a GC pause under heavy loads and while it recovers, each Tez AM holds onto a few containers even though it is shutting down and will never accept any more DAGs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2687) ATS History shutdown happens before the min-held containers are released
[ https://issues.apache.org/jira/browse/TEZ-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-2687: - Attachment: (was: TEZ-2687.1.patch) ATS History shutdown happens before the min-held containers are released Key: TEZ-2687 URL: https://issues.apache.org/jira/browse/TEZ-2687 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.2, 0.8.0, 0.7.1 Reporter: Gopal V When ATS goes into a GC pause under heavy loads and while it recovers, each Tez AM holds onto a few containers even though it is shutting down and will never accept any more DAGs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2692) bugfixes enhancements related to job parser and analyzer
[ https://issues.apache.org/jira/browse/TEZ-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680574#comment-14680574 ] Bikas Saha commented on TEZ-2692: - bq. fixing it here might not be helpful for older releases I did not mean fixing it here. I mean fixing them separately so that downstream clients (whether ATS parser or something else) can read identical information from both. Since you have identified what the differences are, could you please open jiras to track them. It may be that we end up creating a Tee that always does simply history logging. So having the correct information in both of them may be essential. For now, we are working around in the ATS parser, which is ok for now. bugfixes enhancements related to job parser and analyzer -- Key: TEZ-2692 URL: https://issues.apache.org/jira/browse/TEZ-2692 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2692.1.patch, TEZ-2692.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2300 PreCommit Build #974
Jira: https://issues.apache.org/jira/browse/TEZ-2300 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/974/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2125 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749649/TEZ-2300.2.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.client.TestTezClient Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/974//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/974//artifact/patchprocess/newPatchFindbugsWarningstez-api.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/974//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. b25e046d6e739ede0628b2529fe75016ec99a1bc logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #965 Archived 50 artifacts Archive block size is 32768 Received 6 blocks and 2806437 bytes Compression is 6.5% Took 2.7 sec [description-setter] Could not determine description. 
Recording test results Publish JUnit test result report is waiting for a checkpoint on PreCommit-TEZ-Build #973 Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 4 tests failed. REGRESSION: org.apache.tez.client.TestTezClient.testTezclientSession Error Message: test timed out after 5000 milliseconds Stack Trace: java.lang.Exception: test timed out after 5000 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.tez.client.TezClient.stop(TezClient.java:518) at org.apache.tez.client.TestTezClient.testTezClient(TestTezClient.java:240) at org.apache.tez.client.TestTezClient.testTezclientSession(TestTezClient.java:135) REGRESSION: org.apache.tez.client.TestTezClient.testWaitTillReady_Interrupt Error Message: test timed out after 5000 milliseconds Stack Trace: java.lang.Exception: test timed out after 5000 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.tez.client.TezClient.stop(TezClient.java:518) at org.apache.tez.client.TestTezClient.testWaitTillReady_Interrupt(TestTezClient.java:334) REGRESSION: org.apache.tez.client.TestTezClient.testPreWarm Error Message: test timed out after 5000 milliseconds Stack Trace: java.lang.Exception: test timed out after 5000 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.tez.client.TezClient.stop(TezClient.java:518) at org.apache.tez.client.TestTezClient.testPreWarm(TestTezClient.java:268) REGRESSION: org.apache.tez.client.TestTezClient.testMultipleSubmissions Error Message: test timed out after 1 milliseconds Stack Trace: java.lang.Exception: test timed out after 1 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.tez.client.TezClient.stop(TezClient.java:518) at org.apache.tez.client.TestTezClient.testMultipleSubmissionsJob(TestTezClient.java:308) at org.apache.tez.client.TestTezClient.testMultipleSubmissions(TestTezClient.java:273)
Success: TEZ-2618 PreCommit Build #973
Jira: https://issues.apache.org/jira/browse/TEZ-2618 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/973/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 3198 lines...] [INFO] Final Memory: 86M/1465M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749643/TEZ-2618.1.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/973//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/973//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. de8404d740241e1a641049c7a51412a893e2b675 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #965 Archived 50 artifacts Archive block size is 32768 Received 4 blocks and 2946162 bytes Compression is 4.3% Took 0.67 sec Description set: TEZ-2618 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2618) In Ordered Fetcher, if Local Fetch fails, fallback and try http Fetch before returning a failure
[ https://issues.apache.org/jira/browse/TEZ-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680621#comment-14680621 ] TezQA commented on TEZ-2618: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749643/TEZ-2618.1.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/973//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/973//console This message is automatically generated. In Ordered Fetcher, if Local Fetch fails, fallback and try http Fetch before returning a failure Key: TEZ-2618 URL: https://issues.apache.org/jira/browse/TEZ-2618 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Attachments: TEZ-2618.1.patch, TEZ-2618.patch In the setupLocalDiskFetch() method (invoked when the fetcher is on the same host as the target map host), first check whether the target spill file can be opened using localDirAllocator.getLocalPathToRead(). The localDirAllocator searches through the list of configured dirs for the file. In disk-full scenarios, if the path is not found, the fetcher should try an http fetch. 
Proposed solution: in local fetch mode, the fetcher should first try getLocalPathToRead() for all the pending maps. Local fetch is thus divided into 2 stages: first, fetch the maps for which a path was found via LocalDirAllocator; second, construct an http fallback fetch list for the maps which couldn't be found via LocalDirAllocator.getLocalPathToRead() and do an http fetch for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
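The two-stage split proposed in TEZ-2618 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the attached patch: LocalFetchPlanner, Plan and the resolver function are hypothetical names, and the resolver only stands in for LocalDirAllocator.getLocalPathToRead(), returning an empty Optional where the real lookup would fail to find the spill file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from the Tez code base. */
final class LocalFetchPlanner {

  /** Outcome of the lookup pass: maps readable from local disk, and maps that need http. */
  static final class Plan {
    final Map<String, String> localPaths = new LinkedHashMap<>();
    final List<String> httpFallback = new ArrayList<>();
  }

  /**
   * Runs the path lookup for every pending map before fetching anything.
   * The resolver is an assumed stand-in for the local directory lookup and
   * returns Optional.empty() when no configured directory holds the file.
   */
  static Plan plan(List<String> pendingMaps,
                   Function<String, Optional<String>> resolver) {
    Plan plan = new Plan();
    for (String mapId : pendingMaps) {
      Optional<String> path = resolver.apply(mapId);
      if (path.isPresent()) {
        // Stage 1 candidate: the spill file was found locally.
        plan.localPaths.put(mapId, path.get());
      } else {
        // Stage 2 candidate: not found locally, so queue it for an http fetch.
        plan.httpFallback.add(mapId);
      }
    }
    return plan;
  }
}
```

A caller would read the entries in localPaths from disk first and then hand httpFallback to the regular http fetch path, so a missing local file leads to a retry over http before any failure is reported.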
[jira] [Created] (TEZ-2705) Add a reference implementation for ext services
Siddharth Seth created TEZ-2705: --- Summary: Add a reference implementation for ext services Key: TEZ-2705 URL: https://issues.apache.org/jira/browse/TEZ-2705 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth Potentially convert tez-ext-service-tests into this reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2692) bugfixes enhancements related to job parser and analyzer
[ https://issues.apache.org/jira/browse/TEZ-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681179#comment-14681179 ] Bikas Saha commented on TEZ-2692: - Please commit the next patch with fix, if needed, for private long getTaskRuntime(VertexInfo vertexInfo). bugfixes enhancements related to job parser and analyzer -- Key: TEZ-2692 URL: https://issues.apache.org/jira/browse/TEZ-2692 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2692.1.patch, TEZ-2692.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2658 PreCommit Build #975
Jira: https://issues.apache.org/jira/browse/TEZ-2658 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/975/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 3342 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12749655/TEZ-2658.3.patch against master revision eadbfec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/975//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/975//artifact/patchprocess/newPatchFindbugsWarningstez-cli-tools.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/975//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. c8c6a894d833ded5bf1a55689b69e09833fffe7b logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #973 Archived 53 artifacts Archive block size is 32768 Received 4 blocks and 2998631 bytes Compression is 4.2% Took 0.67 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services
[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680721#comment-14680721 ] Siddharth Seth commented on TEZ-2003: - bq. logErrorIngored, hearbeats, getCurretnDagName bq. - remove “*” e,g, import org.apache.tez.common.asterisk; Captured in TEZ-2678 bq. abortTask vs close/cleanup Will check the code. abortTask should try cleaning up in both of them. bq. TezTaskRunner2 killTask isn't used yet within Tez, which is why it's not informing the AM. When task preemption comes in - the flow is likely to be a killTask invoked as a result of an RPC, at which point the AM already knows that the task is killed since it took the decision. On the various atomic gets - there are separate variables to track what states have been set, and these are used in the return result. Atomicity of the entire operation is handled via synchronization blocks. TaskRunner handling containerStop is a result of containerStop coming over a shared Task/Container protocol - which is linked to the running task. It could be separated, but I think that'll need the protocols to be separated as well. canCommit during a shutdown - will change this. I'll also verify what the TaskRunner behaviour was. TEZ-2678 bq. TaskReporter I don't think shutdown needs synchronization. It modifies a final variable. Whether it's implemented correctly needs more investigation. It's the same as what exists on master. bq. ShuffleHandler This is essentially the shuffle handler that is used in regular clusters. It's not meant as a benchmark tool. Using the current shuffle mechanics seems like the simplest mechanism to have jobs work with the standard set of Inputs/Outputs which write to disk. bq. ext-service-tests Agree with making this a reference for ext services. It would need to implement the APIs better, and be documented a lot better to serve this purpose. Creating a new jira to track this - TEZ-2705. Post merge ? bq. 
JoinValidate The changes are for private use, to be able to re-use the example in testing. Will add docs to mention this. bq. TezTaskCommunicatorImpl Using payloads wherever possible - including internal plugins. Avoided in LocalContainerLauncher only at the moment, where a lot of runtime AM information is used. Will fix isKnownContainer and containerAlive to be based on specific communicator. Renaming methods in TaskComm - tracked in the TaskComm enhancements jira getDagName null - will try improving this. getVertexName - I'm not sure there's a lot that can be done. TezException instead of NPE ? Eventually this will lead to an error in the plugin, which needs to be handled better. There's a jira to track such error handling. onStateUpdated - is the AM telling the TaskCommunicator plugin that a vertex has changed state. Similar to what is done elsewhere - like the InputInitializers. dagCompleteStart - couldn't find this. Maybe I removed it at some point for the same reason - it is a very confusing name. bq. Is there a need for the framework to make updates into the Context object? If yes, should the Context implement 2 interfaces? Should the internal objects just bind to the internal Impl objects or are they bound to the public plugin interfaces to catch compat errors? Binding to Impls directly may mean a smaller public API interface. Need more clarification on this comment. bq. Is there a need for the framework to make updates into the Context object? If yes, should the Context implement 2 interfaces? Should the internal objects just bind to the internal Impl objects or are they bound to the public plugin interfaces to catch compat errors? Binding to Impls directly may mean a smaller public API interface. Will do. 
[Umbrella] Allow Tez to co-ordinate execution to external services -- Key: TEZ-2003 URL: https://issues.apache.org/jira/browse/TEZ-2003 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Attachments: 2003_20150728.1.txt, 2003_20150807.1.txt, 2003_20150807.2.txt, Tez With External Services.pdf The Tez engine itself takes care of co-ordinating execution - controlling how data gets routed (different connection patterns), fault tolerance, scheduling of work, etc. This is currently tied to TaskSpecs defined within Tez and on containers launched by Tez itself (TezChild). The proposal is to allow Tez to work with external services instead of just containers launched by Tez. This involves several more pluggable layers to work with alternate Task Specifications, custom launch and task allocation mechanics, as well as custom scheduling sources. A simple example would be a process with the capability to execute multiple Tez TaskSpecs as threads. In such a case, a container launch isn't really
[jira] [Commented] (TEZ-2678) Fix comments from reviews - part 1
[ https://issues.apache.org/jira/browse/TEZ-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680725#comment-14680725 ] Siddharth Seth commented on TEZ-2678: - Verify abortTask/cleanup are used correctly in TezTaskRunner why would a task call canCommit while shutting down? Shouldn’t we throw an exception anyway as it is not meant to be called during shutdown? Add docs to joinValidate explaining extensions are for private use. TaskCommunicatorContextImpl: Shouldn’t each plugin manage its own containers? Or at least shouldn’t this query be done based on which launcher plugin was being used for the given container? Likewise for containerAlive(). | Try fixing this to be specific to the communicator. setAccessible not required during construction of plugins. remove “*” e,g, import org.apache.tez.common.asterisk; typos: logErrorIngored, hearbeats, getCurretnDagName Fix comments from reviews - part 1 -- Key: TEZ-2678 URL: https://issues.apache.org/jira/browse/TEZ-2678 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth Typos in API - Curretn, localicty, others Add diagnostic string wherever ContainerEndReason is used. TODO in ContainerLauncherContext - TEZ-2676 TaskEndReason lossy compared to YARN. Cache the context in DAGImpl.getDefaultExecutionContext TaskAttempt. TA_KILLED moves to KILL_IN_PROGRESS instead of KILLED TaskAttempt - add scheduleTime to history event Exception propagation in ContainerLauncherRouter AMNodeTracker calls super(AMNodeMap); ContainerLauncherOperationBase - token abstraction -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services
[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681128#comment-14681128 ] Siddharth Seth commented on TEZ-2003: - bq. should not need to have special casing for any plugin Special casing is in place primarily for LocalContainerExecutor, which requires a bunch of information at runtime - which isn't needed in the context otherwise. There's a jira to provide such information via runtime binding in the payload. For the other cases, it's mainly used to make it simpler to write tests - where the default executor can be easily overwritten for the tests. The construction, along with the payload, remains the same - except it's direct instead of using reflection. bq. Also creating a ServicePlugin class will help in reducing code duplication and make maintenance easier instead of having scheduler id, launcherId and commId everywhere. The 3 constructs are not used together everywhere. There are multiple events / other classes which only use a subset of these. A single class won't really help there. bq. ContainerSignatureMatcher - ExecutorSignatureMatcher ? Tracked in 2708. bq. ServicePluginLifecyle etc. in tez-runtime-api like Inputs/Output/InputInitializer etc shutdown would make more sense for a service. bq. Why are executedInAm and executeInContainers there executeInAm and executeInContainers are in Contexts to specify whether a task runs in a service or in the AM. It's possible to set a DAG level default to run everything in an external service, and some vertices either in containers or in the AM. Similarly for the ServiceDescriptor - decide whether the AM runs containers or uber-mode during setup. bq. Rename to ExecutorEndReason ? Given that the abstraction that exists is containers (an executor could be confused for a service daemon), ContainerEndReason seems fine. This can change when Tez introduces its own version of 'Containers' instead of relying on the YARN abstraction. bq. 
Also, how can An error in the AM be caused by a container running a task? Not sure what you mean by this. An error in the AM caused by user code - implies an error which occurred in the AM process as a result of a plugin. bq. Why does this have schedulerName and taskCommName ? It's used for the startRequest. bq. Has the internal one been replaced by this? No but there's a jira open to consolidate the two. bq. Rename to ExecutorBusy ? tracked in TEZ-2707 bq. Why isLocal flag needs to be passed to Scheduler/Launcher/Communicator routers? Instead of a service plugin for local There's certain operations which are performed differently for local mode. Also used to indicate to internal plugins whether they're running in local / uber mode. bq. Is is ensured that the integer for a service plugin will turn out to be the same after AM restart? Yes bq. Why is yarn scheduler special cased? Launcher/Communicator dont have the special casing ? To always run the YARNScheduler (i.e. register with YARN) if running in non-local mode. If we were to support alternate frameworks, this could be removed. bq. Why use different code path for uber/default. They should just work when instantiated the same way as a custom plugin. Primarily for testing. First part of this comment. bq. Are this and other methods threadsafe wrt callback from multiple plugins? They should be. I'll scan through them. Would appreciate if you do the same to identify issues. bq. Also in heartbeat(), the following code has been lost during merge. Tracked in TEZ-2707 bq. Why are the contextImpls not directing invoking/handling the plugins instead of going through the router? They don't need to. ContextImpls are primarily for communication from the plugins to the framework. The routers should handle framework to plugins. bq. Why are the contextImpls not directing invoking/handling the plugins instead of going through the router? This avoids some race between dag transitions. bq. Why has the synchronization been removed. 
I remember this being a subtle race condition. sync on containerInfo is no longer required since there's a new entry inserted into the structure each time. bq. The dagCompleteStart/End logic is either broken or unnecessary because the correct dag seems to be always received from appContext.getCurrentDAG(). This is again for transitions between DAGs. A new dag is received when a dag is submitted - the context update needs to be factored out. dagComplete is sent to a plugin - which can take an arbitrary time to process. During this time, any lookups it does will be from the last dag - instead of a possible new dag, which could be submitted anytime. bq. Why not keep a cached copy instead of converting each time? Fixed in TEZ-2678 bq. There is a scheduledTime on master that this is duplicating. Will create a nice conflict when i rebase the branch next. Will resolve it then. bq. What is the
[jira] [Commented] (TEZ-2708) renames for tez-2003 changes
[ https://issues.apache.org/jira/browse/TEZ-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681127#comment-14681127 ] Siddharth Seth commented on TEZ-2708: - ContainerSignatureMatcher - ExecutorSignatureMatcher renames for tez-2003 changes Key: TEZ-2708 URL: https://issues.apache.org/jira/browse/TEZ-2708 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth This jira is to track some class renames which are required. TBD just before merging or right after the merge. - ContainerLauncherImpl to TezContainerLauncherImpl ? Make all the default implementation with prefix Tez. - TaskAttemptListenerImpTezDag to TaskCommunicatorManager - Likewise for tests. - Remove TezTaskRunner - Rename TaskSchedulerEventHandler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2678) Fix comments from reviews - part 1
[ https://issues.apache.org/jira/browse/TEZ-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2678: Attachment: TEZ-2678.1.txt Fix comments from reviews - part 1 -- Key: TEZ-2678 URL: https://issues.apache.org/jira/browse/TEZ-2678 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-2678.1.txt Typos in API - Curretn, localicty, others Add diagnostic string wherever ContainerEndReason is used. TODO in ContainerLauncherContext - TEZ-2676 TaskEndReason lossy compared to YARN. Cache the context in DAGImpl.getDefaultExecutionContext TaskAttempt. TA_KILLED moves to KILL_IN_PROGRESS instead of KILLED TaskAttempt - add scheduleTime to history event Exception propagation in ContainerLauncherRouter AMNodeTracker calls super(AMNodeMap); ContainerLauncherOperationBase - token abstraction -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2707) Fix comments from reviews - part 2
[ https://issues.apache.org/jira/browse/TEZ-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681126#comment-14681126 ] Siddharth Seth commented on TEZ-2707: - Rename endReason.SERVICE_BUSY to EXECUTOR_BUSY Also in heartbeat(), the following code has been lost during merge. TaskAttempt.scheduleTime recently added to master Base class for schedulerEvents with schedulerId Similarly for AMContainerEvents NodeId ref sourceId - rename to schedulerId Remove TODO TEZ-2124 from AMNodeImpl Remove commented code in MocKDAGAppMaster TestTaskAttempt - taskComm setup into a method ExecutionContextTestInfoHolder - try re-using logic from AM Fix comments from reviews - part 2 -- Key: TEZ-2707 URL: https://issues.apache.org/jira/browse/TEZ-2707 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2707) Fix comments from reviews - part 2
Siddharth Seth created TEZ-2707: --- Summary: Fix comments from reviews - part 2 Key: TEZ-2707 URL: https://issues.apache.org/jira/browse/TEZ-2707 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2708) renames for tez-2003 changes
Siddharth Seth created TEZ-2708: --- Summary: renames for tez-2003 changes Key: TEZ-2708 URL: https://issues.apache.org/jira/browse/TEZ-2708 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth This jira is to track some class renames which are required. TBD just before merging or right after the merge. - ContainerLauncherImpl to TezContainerLauncherImpl ? Make all the default implementation with prefix Tez. - TaskAttemptListenerImpTezDag to TaskCommunicatorManager - Likewise for tests. - Remove TezTaskRunner - Rename TaskSchedulerEventHandler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2709) Enhancement to history events for external services
Siddharth Seth created TEZ-2709: --- Summary: Enhancement to history events for external services Key: TEZ-2709 URL: https://issues.apache.org/jira/browse/TEZ-2709 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth - Log the scheduler, launcher and task comm for an attempt. Also for containers where relevant. - scheduleTime in TaskAttempt needs to be logged. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2678) Fix comments from reviews - part 1
[ https://issues.apache.org/jira/browse/TEZ-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681069#comment-14681069 ] Siddharth Seth commented on TEZ-2678: - bq. Typos in API - Curretn, localicty, others Fixed bq. Add diagnostic string wherever ContainerEndReason is used. Done. Also for TaskEndReason bq. TODO in ContainerLauncherContext - TEZ-2676 Likely done elsewhere. Consolidation of TODOs and jiras bq. TaskEndReason lossy compared to YARN. [~hitesh] - could you please elaborate on this. (Was this meant to be YARN or some internal error reporting?) bq. Cache the context in DAGImpl.getDefaultExecutionContext Done bq. TaskAttempt. TA_KILLED moves to KILL_IN_PROGRESS instead of KILLED Moving to KILLED now bq. TaskAttempt - add scheduleTime to history event Deferring to jira which will include additional history fixes like where a task is executing. TEZ-2709 bq. Exception propagation in ContainerLauncherRouter Converted UnknownHost to an unchecked exception as well. bq. AMNodeTracker calls super(AMNodeMap); Fixed bq. ContainerLauncherOperationBase - token abstraction Tracked in TEZ-2702 bq. TaskCommunicator.java Rename unregisterRunningTaskAttempt to registerTaskAttemptEnd (make it more consistent) Tracked in TEZ-2678 bq. Post merge / just before merge: Rename ContainerLauncherImpl to TezContainerLauncherImpl ? Make all the default implementation with prefix Tez. Tracked in TEZ-2708 bq. Typo DagTypeConverters.convertServicePluginDescriptoToProto -- DagTypeConverters.convertServicePluginDescriptorToProto (miss r) Fixed bq. Verify VertexExecutionContext matches against the ServicePluginDescriptor setup for the TezClient Fixed bq. Verify abortTask/cleanup are used correctly in TezTaskRunner cleanup is always called - which is the correct thing to do. (In both TezTaskRunner and TezTaskRunner2) bq. why would a task call canCommit while shutting down? 
Shouldn’t we throw an exception anyway as it is not meant to be called during shutdown? Looked at this some more. The shutdown could be a result of anything including preemption. The task doesn't necessarily know that it has been asked to die (race with canCommit invocations or whatever the task is doing). Sending back an exception results in an unnecessary exception from the task. A false seems much safer - and has been the approach we've used for TaskRunner as well. bq. TaskCommunicatorContextImpl: Shouldn’t each plugin manage its own containers? Or at least shouldn’t this query be done based on which launcher plugin was being used for the given container? Likewise for containerAlive(). | Try fixing this to be specific to the communicator. Fixed bq. setAccessible not required during construction of plugins. Fixed bq. remove “*” e,g, import org.apache.tez.common.asterisk; Fixed bq. typos: logErrorIngored, hearbeats, getCurretnDagName Fixed Fix comments from reviews - part 1 -- Key: TEZ-2678 URL: https://issues.apache.org/jira/browse/TEZ-2678 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth Typos in API - Curretn, localicty, others Add diagnostic string wherever ContainerEndReason is used. TODO in ContainerLauncherContext - TEZ-2676 TaskEndReason lossy compared to YARN. Cache the context in DAGImpl.getDefaultExecutionContext TaskAttempt. TA_KILLED moves to KILL_IN_PROGRESS instead of KILLED TaskAttempt - add scheduleTime to history event Exception propagation in ContainerLauncherRouter AMNodeTracker calls super(AMNodeMap); ContainerLauncherOperationBase - token abstraction -- This message was sent by Atlassian JIRA (v6.3.4#6332)
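The canCommit reasoning in the TEZ-2678 comment above (answer false during shutdown rather than throwing) can be illustrated with a small sketch. It is only an assumption-laden illustration, not the TezTaskRunner code: CommitGate, shutdown() and the askAm supplier are hypothetical names, with askAm standing in for whatever call normally asks the AM for permission to commit.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; these names are illustrative, not the Tez API. */
final class CommitGate {
  private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);
  // Assumed stand-in for the normal "may I commit?" request to the AM.
  private final BooleanSupplier askAm;

  CommitGate(BooleanSupplier askAm) {
    this.askAm = askAm;
  }

  /** Called when the runner is asked to stop, for example because of preemption. */
  void shutdown() {
    shutdownRequested.set(true);
  }

  /**
   * The task may race with shutdown and not know it has been asked to die,
   * so a shutdown in progress yields false instead of an exception.
   */
  boolean canCommit() {
    if (shutdownRequested.get()) {
      return false;
    }
    return askAm.getAsBoolean();
  }
}
```

Returning false keeps a task that is already being torn down from surfacing a spurious exception, which is the trade-off the comment describes.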
[jira] [Resolved] (TEZ-2703) TEZ-2003 build fails
[ https://issues.apache.org/jira/browse/TEZ-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang resolved TEZ-2703. - Resolution: Fixed Assignee: Jeff Zhang Fix Version/s: TEZ-2003 TEZ-2003 build fails Key: TEZ-2703 URL: https://issues.apache.org/jira/browse/TEZ-2703 Project: Apache Tez Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: TEZ-2003 Attachments: TEZ-2703-1.patch {code} [ERROR] [ERROR] The project org.apache.tez:tez-job-analyzer:0.8.0-SNAPSHOT (/Users/jzhang/github/tez/tez-tools/analyzers/job-analyzer/pom.xml) has 1 error [ERROR] 'dependencies.dependency.version' for io.dropwizard.metrics:metrics-core:jar is missing. @ line 28, column 17 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2703) TEZ-2003 build fails
[ https://issues.apache.org/jira/browse/TEZ-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679624#comment-14679624 ]

Jeff Zhang commented on TEZ-2703:
-
Committed to TEZ-2003

TEZ-2003 build fails

Key: TEZ-2703
URL: https://issues.apache.org/jira/browse/TEZ-2703
Project: Apache Tez
Issue Type: Sub-task
Reporter: Jeff Zhang
Attachments: TEZ-2703-1.patch

{code}
[ERROR]
[ERROR] The project org.apache.tez:tez-job-analyzer:0.8.0-SNAPSHOT (/Users/jzhang/github/tez/tez-tools/analyzers/job-analyzer/pom.xml) has 1 error
[ERROR] 'dependencies.dependency.version' for io.dropwizard.metrics:metrics-core:jar is missing. @ line 28, column 17
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2004) Define basic interface for pluggable ContainerLaunchers
[ https://issues.apache.org/jira/browse/TEZ-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659692#comment-14659692 ]

Jeff Zhang edited comment on TEZ-2004 at 8/10/15 10:39 AM:
---
Comments:
* Is ContainerOp necessary? It seems ContainerLauncherOperationBase can be used instead; we would just need to move OPType into ContainerLauncherOperationBase. How about renaming it to ContainerOperationBase?
* TaskCommunicator.java: rename unregisterRunningTaskAttempt to registerTaskAttemptEnd (to make it more consistent).
* Rename ContainerLauncherImpl to TezContainerLauncherImpl? Give all the default implementations the prefix Tez?
* Typo: DagTypeConverters.convertServicePluginDescriptoToProto -> DagTypeConverters.convertServicePluginDescriptorToProto (missing r)
* Need to verify that the DAG's defaultExecutionContext and each Vertex's ExecutionContext exist in TezClient.servicePluginsDescriptor when submitting the DAG.
* Need to verify that VertexExecutionContext's executeInAm / executeInContainers is supported in TezClient's ServicePluginsDescriptor.
* It seems there is currently only a programmatic way to specify TaskScheduler, ContainerLauncher and TaskCommunicator, through TezClient's ServicePluginsDescriptor. Is that expected? It would mean that if a third party wants to introduce a new external service to Hive, they not only need to implement a new TaskScheduler, ContainerLauncher and TaskCommunicator but also need to change Hive code and rebuild.
* ServicePluginsDescriptor: as I understand it, TaskScheduler/TaskCommunicator/ContainerLauncher are used together and cannot be combined arbitrarily. So should we use an ExecutionContextDescriptor to replace the 3 separate descriptors?
* TaskAttempt#scheduleTime may need to be put into the history event TaskAttemptStartedEvent so that it can be used by Tez-UI.

was (Author: zjffdu):
Comments:
* Is ContainerOp necessary? It seems ContainerLauncherOperationBase can be used instead; we would just need to move OPType into ContainerLauncherOperationBase. How about renaming it to ContainerOperationBase?
* TaskCommunicator.java: rename unregisterRunningTaskAttempt to registerTaskAttemptEnd (to make it more consistent).
* Rename ContainerLauncherImpl to TezContainerLauncherImpl? Give all the default implementations the prefix Tez?
* Typo: DagTypeConverters.convertServicePluginDescriptoToProto -> DagTypeConverters.convertServicePluginDescriptorToProto (missing r)
* Need to verify that the DAG's defaultExecutionContext and each Vertex's ExecutionContext exist in TezClient.servicePluginsDescriptor when submitting the DAG.
* Need to verify that VertexExecutionContext's executeInAm / executeInContainers is supported in TezClient's ServicePluginsDescriptor.
* It seems there is currently only a programmatic way to specify TaskScheduler, ContainerLauncher and TaskCommunicator, through TezClient's ServicePluginsDescriptor. Is that expected? It would mean that if a third party wants to introduce a new external service to Hive, they not only need to implement a new TaskScheduler, ContainerLauncher and TaskCommunicator but also need to change Hive code and rebuild.

Define basic interface for pluggable ContainerLaunchers
---
Key: TEZ-2004
URL: https://issues.apache.org/jira/browse/TEZ-2004
Project: Apache Tez
Issue Type: Sub-task
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Fix For: TEZ-2003
Attachments: TEZ-2004.1.txt

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
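The submit-time verification requested in the TEZ-2004 comment above could look roughly like the sketch below. It assumes plugins are identified by name; RegisteredPlugins, ExecContext and validate are hypothetical illustrations and do not reflect the real ServicePluginsDescriptor or VertexExecutionContext APIs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the check; none of these are real Tez classes. */
final class ExecutionContextCheck {

    /** Stand-in for the plugin names registered through the client's descriptor. */
    static final class RegisteredPlugins {
        final Set<String> schedulers;
        final Set<String> launchers;
        final Set<String> communicators;

        RegisteredPlugins(String[] schedulers, String[] launchers, String[] communicators) {
            this.schedulers = new HashSet<>(Arrays.asList(schedulers));
            this.launchers = new HashSet<>(Arrays.asList(launchers));
            this.communicators = new HashSet<>(Arrays.asList(communicators));
        }
    }

    /** Stand-in for a DAG-default or per-vertex execution context. */
    static final class ExecContext {
        final String scheduler;
        final String launcher;
        final String communicator;

        ExecContext(String scheduler, String launcher, String communicator) {
            this.scheduler = scheduler;
            this.launcher = launcher;
            this.communicator = communicator;
        }
    }

    /** Throws IllegalArgumentException if ctx names a plugin that was never registered. */
    static void validate(String owner, ExecContext ctx, RegisteredPlugins plugins) {
        require(plugins.schedulers, ctx.scheduler, "TaskScheduler", owner);
        require(plugins.launchers, ctx.launcher, "ContainerLauncher", owner);
        require(plugins.communicators, ctx.communicator, "TaskCommunicator", owner);
    }

    private static void require(Set<String> known, String name, String kind, String owner) {
        if (!known.contains(name)) {
            throw new IllegalArgumentException(
                owner + " references unknown " + kind + " '" + name + "'");
        }
    }
}
```

Running validate once for the DAG default and once per vertex at submission would reject a bad reference on the client side, before the DAG reaches the AM.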