[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681128#comment-14681128 ]
Siddharth Seth commented on TEZ-2003: ------------------------------------- bq. should not need to have special casing for any plugin Special casing is in place primarily for LocalContainerExecutor, which requires a bunch of information at runtime - which isn't needed in the context otherwise. There's a jira to provide such information via runtime binding in the payload. For the other cases, it's mainly used to make it simpler to write tests - where the default executor can be easily overwritten for the tests. The construction, along with the payload, remains the same - except it's direct instead of using reflection. bq. Also creating a ServicePlugin class will help in reducing code duplication and make maintenance easier instead of having scheduler id, launcherId and commId everywhere. The 3 constructs are not used together everywhere. There's multiple events / other classes which only use a subset of these. A single class won't really help there. bq. ContainerSignatureMatcher -> ExecutorSignatureMatcher ? Tracked in 2708. bq. ServicePluginLifecyle etc. in tez-runtime-api like Inputs/Output/InputInitializer etc shutdown would make more sense for a service. bq. Why are executedInAm and executeInContainers there executeInAm and executeInContainers are in Contexts to specify whether a task runs in a service or in the AM. It's possible to set a DAG level default to run everything in an external service, and some vertices either in containers or in the AM. Similarly for the ServiceDescriptor - decide whether the AM runs containers or uber-mode during setup. bq. Rename to ExecutorEndReason ? Give the abstraction that exists is containers (an executor could be confused for a service daemon), ContainerEndReason seems fine. This can change when Tez introduces it's own version of 'Contaienrs' instead of relying on the YARN abstraction. bq. Also, how can "An error in the AM" be caused by a container running a task? Not sure what you mean by this. " An error in the AM caused by user code" - implies an error which occurred in the AM process as a result of a plugin. bq. Why does this have schedulerName and taskCommName ? It's used for the startRequest. bq. Has the internal one been replaced by this? No but there's a jira open to consolidate the two. bq. Rename to ExecutorBusy ? tracked in TEZ-2707 bq. Why isLocal flag needs to be passed to Scheduler/Launcher/Communicator routers? Instead of a service plugin for local There's certain operations which are performed differently for local mode. Also used to indicate to internal plugins whether they're running in local / uber mode. bq. Is is ensured that the integer for a service plugin will turn out to be the same after AM restart? Yes bq. Why is yarn scheduler special cased? Launcher/Communicator dont have the special casing ? To always run the YARNScheduler (i.e. register with YARN) if running in non-local mode. If we were to support alternate frameworks, this could be removed. bq. Why use different code path for uber/default. They should just work when instantiated the same way as a custom plugin. Primarily for testing. First part of this comment. bq. Are this and other methods threadsafe wrt callback from multiple plugins? They should be. I'll scan through them. Would appreciate if you do the same to identify issues. bq. Also in heartbeat(), the following code has been lost during merge. Tracked in TEZ-2707 bq. Why are the contextImpls not directing invoking/handling the plugins instead of going through the router? They don't need to. ContextImpls are primarily for communication from the plugins to the framework. The routers should handle framework to plugins. bq. Why are the contextImpls not directing invoking/handling the plugins instead of going through the router? This avoids some race between dag transitions. bq. Why has the synchronization been removed. I remember this being a subtle race condition. sync on containerInfo is no longer required since there's a new entry inserted into the structure each time. bq. The dagCompleteStart/End logic is either broken or unnecessary because the correct dag seems to be always received from appContext.getCurrentDAG(). This is again for transitions between DAGs. A new dag is received when a dag is submitted - the context update needs to be factored out. dagComplete is sent to a plugin - which can take an arbitrary time to process. During this time, any lookups it does will be from the last dag - instead of a possible new dag, which could be submitted anytime. bq. Why not keep a cached copy instead of converting each time? Fixed in TEZ-2678 bq. There is a scheduledTime on master that this is duplicating. Will create a nice conflict when i rebase the branch next. Will resolve it then. bq. What is the use of this? Can be used for scheduling decisions around aging of a task. bq. Why is the vertex code concerned with checking local mode/uber etc.? It should simply take the vertex or default execution context and use it. It has to take a decision about running in local mode / non-local mode - contexts can be null. They're not always setup and sent over the wire. bq. Create base class that takes care of the repeated schedulerId code in all scheduler events? I believe I tried doing this and there were some complications, or it was done for some other events. IAC, will look. Tracked in TEZ-2707. bq. Why is the end reason here. Dont see any use anywhere in the patch. It's not used by the Tez schedulers, but can be used by external schedulers to blacklist nodes - for instance when an external service is busy. bq. Have a base class that prevents code duplication for launcherId etc. for all AMContainerEvents Same as for the scheduler events. bq. The pending changes to AMNode (in general node handling per scheduler) really need to be prioritized. They look quite incomplete because nodes are closely tied to execution environment and the current state of the code with some bits handled and some bits not, can be error prone. This should be in place already. TEZ-2124 is done. bq. What is a source? Can we call it scheduler or execution environment? Typically a scheduler. What gets interesting here is when the same source provides to multiple executors. At that point we'd need to differentiate between a node going down as a result of the 'real' source reporting it, vs Tez making an inference based on failed tasks. bq. Remove commented code Tracked in 2707 bq. New code misses the ++cData.numUpdates numUpdate handling no longer done via TaskAttemptListener, so this isn't needed. bq. Why do we need the new session token related code here? Separation between TaskAttemptListener and the plugin it creates. A token is needed. bq. Create shared method instead of duplicating code? 2707 bq. ExecutionContextTestInfoHolder 2707 bq. Why is this mock return of RUNNING state missing. Allocated containers checks for running state. Is that handled by the new code in TestTaskSchedulerHelpers? That's served and checked via the context. > [Umbrella] Allow Tez to co-ordinate execution to external services > ------------------------------------------------------------------ > > Key: TEZ-2003 > URL: https://issues.apache.org/jira/browse/TEZ-2003 > Project: Apache Tez > Issue Type: Improvement > Reporter: Siddharth Seth > Attachments: 2003_20150728.1.txt, 2003_20150807.1.txt, > 2003_20150807.2.txt, Tez With External Services.pdf > > > The Tez engine itself takes care of co-ordinating execution - controlling how > data gets routed (different connection patterns), fault tolerance, scheduling > of work, etc. > This is currently tied to TaskSpecs defined within Tez and on containers > launched by Tez itself (TezChild). > The proposal is to allow Tez to work with external services instead of just > containers launched by Tez. This involves several more pluggable layers to > work with alternate Task Specifications, custom launch and task allocation > mechanics, as well as custom scheduling sources. > A simple example would be a simple a process with the capability to execute > multiple Tez TaskSpecs as threads. In such a case, a container launch isn't > really need and can be mocked. Sourcing / scheduling containers would need to > be pluggable. > A more advanced example would be LLAP (HIVE-7926; > https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf). > This works with custom interfaces - which would need to be supported by Tez, > along with a custom event model which would need translation hooks. > Tez should be able to work with a combination of certain vertices running in > external services and others running in regular Tez containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)