[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661142#comment-14661142 ]
Hitesh Shah commented on TEZ-2003: ---------------------------------- Spell check: - logErrorIngored, hearbeats, getCurretnDagName Fix imports: - remove “*” e,g, import org.apache.tez.common.*; abortTask vs close/cleanup - are both close and cleanup being invoked in both TezTaskRunner and Runner2 even when a task is aborted? ( need to re-check code ) TezTaskRunner2 {code} * If a kill request is made to the task, it will not communicate this information to * the AM - since a task KILL is an external event, and whoever invoked it should * be able to track it. {code} - shouldn’t the AM be informed that the task has completed after the abort related steps have been done? {code} private boolean isRunningState() { return !taskComplete.get() && !killTaskRequested.get() && !stopContainerRequested.get() && !errorSeen.get(); } {code} - any possibility of simplifying this? Each get is atomic but the combined AND expression is not atomic. - any reason why the task runner needs to know that the container stop is requested? For a task runner, it should only know about managing a task and not the container. TezChild should handle the container stop handling instead? {code} // Don't throw an error since the task is already in the process of shutting down. LOG.info("returning canCommit=false since task is not in a running state"); return false; {code} - why would a task call canCommit while shutting down? Shouldn’t we throw an exception anyway as it is not meant to be called during shutdown? TaskReporter - some functions such as shutdown not synchronized? tez-ext-service-tests/ShuffleHandler - Not sure why we need a complex impl of a shuffle handler for a regression test? If this is meant for a benchmark tool, maybe it should categorized as such and become more maintstream? tez-ext-service-tests - if the ext-service layer is meant to be a full fledged plugin, it would be good if this became the defacto guide on how to implement the necessary plugin layers to talk to a simple 3rd party service. Current approach seems quite complex. JoinValidate: - how does one actually use this example with configured external execution engines? The change is added but no real docs or information on how it can actually be run to demonstrate the feature. TezTaskCommunicatorImpl - Assuming this is the internal communicator used by Tez between the AM and YARN containers, why does it needs a user payload and initialization from the user payload? TaskCommunicator - has an initialize() but user payload seems to be parsed in the constructor in TezTaskCommunicatorImpl TaskCommunicatorContextImpl {code} public boolean isKnownContainer(ContainerId containerId) { return context.getAllContainers().get(containerId) != null; } {code} - Shouldn’t each plugin manage its own containers? Or at least shouldn’t this query be done based on which launcher plugin was being used for the given container? Likewise for containerAlive(). APIs in TaskCommunicatorContextImpl probably some minor renaming - say something like notifyTaskFailure() maybe? getCurretnDagName() - dag can be null. Not handled. Likewise for getInputVertexNames() and other apis where both dag, vertex ( even task ) could be null. Needs more defensive programming against bad plugins. {code} public void onStateUpdated(VertexStateUpdate event) {code} - in this scenario, is the TaskCommunicator using the context to tell the AM that a vertex has completed? dagCompleteStart() - not sure what this API is meant to signify. It seems like the context is being notified that the DAG has started? Is there a need for the framework to make updates into the Context object? If yes, should the Context implement 2 interfaces? Should the internal objects just bind to the internal Impl objects or are they bound to the public plugin interfaces to catch compat errors? Binding to Impls directly may mean a smaller public API interface. TaskAttemptListenerImpTezDag ctor.setAccessible(true); - why is this needed? Why not throw an error? General design question: - work preserving mode - not for this jira but any general thoughts on how work preserving restarts will work? For that matter, considering an LLAP service will not go down if the AM crashes, how can work/tasks be recovered without needing to re-run them? > [Umbrella] Allow Tez to co-ordinate execution to external services > ------------------------------------------------------------------ > > Key: TEZ-2003 > URL: https://issues.apache.org/jira/browse/TEZ-2003 > Project: Apache Tez > Issue Type: Improvement > Reporter: Siddharth Seth > Attachments: 2003_20150728.1.txt, Tez With External Services.pdf > > > The Tez engine itself takes care of co-ordinating execution - controlling how > data gets routed (different connection patterns), fault tolerance, scheduling > of work, etc. > This is currently tied to TaskSpecs defined within Tez and on containers > launched by Tez itself (TezChild). > The proposal is to allow Tez to work with external services instead of just > containers launched by Tez. This involves several more pluggable layers to > work with alternate Task Specifications, custom launch and task allocation > mechanics, as well as custom scheduling sources. > A simple example would be a simple a process with the capability to execute > multiple Tez TaskSpecs as threads. In such a case, a container launch isn't > really need and can be mocked. Sourcing / scheduling containers would need to > be pluggable. > A more advanced example would be LLAP (HIVE-7926; > https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf). > This works with custom interfaces - which would need to be supported by Tez, > along with a custom event model which would need translation hooks. > Tez should be able to work with a combination of certain vertices running in > external services and others running in regular Tez containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)