[ https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648250#comment-14648250 ]
Hitesh Shah commented on TEZ-2003: ---------------------------------- First round of commments ( mostly specific to files and not really overarching design comments - those will be in round 3/4 ): General comments: - there is not really a clear ownership doc on what the framework owns/controls as compared to what is handed off to the plugins. Improved docs on this front would help. - Plugins seem to be hardwired at the vertex level and cannot be changed for different tasks. What happens if an external service is unavailable? Does everything fail? - All classes in serviceplugins/api need lots of docs - All TODOs need a jira especially ones that will block a final release - Code in some files exceeds 100 chars - Spell checker needed on various API typos - Curretn, localicty - Unused imports, order of imports at times is not alphabetical TezClient: - Do we really need a TezClientBuilder in addition to the constructor APIs? One more api to maintain? {code} AMPluginDescriptorProto pluginDescriptorProto = createAMServicePluginDescriptorProto(servicePluginsDescriptor); builder.setAmPluginDescriptor(pluginDescriptorProto); {code} - no check for null before calling the function? - seems like implicit functionality such as uber and containers is being driven by the presence/absence of any service provider ContainerEndReason: ContainerEndReason instead of TerminationReason/Cause? Inconsistent compared to current naming convention which uses Termination or Completion? Also, it does not seem like sufficient enough info to debug what happened. No diagnostics/stack trace? How does FRAMEWORK_ERROR translate to telling a user that there is a bug in the AM? How much of this is meant to be user visible for debugging purposes? ContainerLauncherContext - what does getNumNodes represent? - Why is there a need to distinguish a launch failure from a stop failure? Also, this api has diagnostics but no reason enum or exit status? - why does containerCompleted have a TaskAttemptEndReason arg? There is a TODO - please file a bug for cleaning this up. - why is getApplicationAttemptId() needed? What does it represent to a container launcher? - Why would the plugin need to tell the framework that a container should be stopped? Does the containerLauncher manage the “containers”? ( w.r.t. containerStopRequested() api ) ContainerLauncherOperationBase - is a token always needed? ContainerLaunchRequest/ContainerStopRequest - who is responsible for tracking nodeId/containerId mappings? - why does the container launcher plugin need to be aware of the scheduler and task communicator plugins? - Are they all independent entities that can work with other types of plugins or is there more or less a implicit 1:1:1 mapping? ServicePluginsDescriptor - why does this need to have enableContainers and uber mode? - there seems to be implicit setting of defaults - I think uber mode should not even exist unless uber mode means allow “some” tasks to run within the AM. Even then, it should be outside of the plugins layer as there should be a simple toggle which says whether the AM supports running containers within it or not and the tasks/VertexManager can query for this functionality to decide whether they can run the task in the AM container or not. TaskAttemptEndReason - too many different enums being introduced. - please file a jira to consolidate current vs new. - this task attempt enum is lossy compared to the current failure conditions available from YARN. - There should be a single unified view for potential failures ( with a way for a plugin to extend the information available ) ServicePluginLifecycle - why does this need an explicit initialize()? Why not have the plugin do whatever it needs in the constructor? TaskScheduler - what does this getAvailableResources() mean? Who uses it? And how is it used? Likewise for getTotalResources() ? - same question on the following APIs on what they mean when not running on YARN: - dagComplete() - getClusterNodeCount() - deallocateContainer() - is this meant to be invoked by the container launcher? TaskSchedulerContext - why does AppFinalStatus belong here? The - If the task scheduler is managing allocations, why does it need to call taskAllocated() - all the other APIs in TaskSchedulerContext seem to be completely YARN specific. Why would a plugin writer who implements a different scheduler need these? TaskCommunicatorContext - not sure why a task communicator needs to know the FirstAttemptStartTime? - If “getInitialUserPayload” implies an initial payload, this means there are updates. How does it affect recovery? - Why does the TaskCommunicator need to ask the framework whether a task is alive? Or are the “taskAlive”/“containerAlive” apis being used to inform the framework that the task/container are alive? Even then who is responsible for tracking liveliness - the plugin or the framework? - Not sure why getDagStartTime() is needed by the plugin TaskHeartbeatRequest - what is preRoutedStartIndex? TaskAttemptEventAttemptFailed - New javadoc comments need clarifying. Not sure why only 2 particular failure types are accepted. DAGImpl: {code} public org.apache.tez.dag.api.Vertex.VertexExecutionContext getDefaultExecutionContext() { if (jobPlan.hasDefaultExecutionContext()) { return DagTypeConverters.convertFromProto(jobPlan.getDefaultExecutionContext()); } else { return null; } } {code} - why not convert in INIT and cache? TaskAttemptImpl: {code} .addTransition(TaskAttemptStateInternal.RUNNING, TaskAttemptStateInternal.KILL_IN_PROGRESS, TaskAttemptEventType.TA_KILLED, new TerminatedWhileRunningTransition(KILLED_HELPER)) {code} - Not sure why a TA moves to KILL_IN_PROGRESS on a TA_KILLED? Shouldn’t it move to KILLED? TaskAttempt scheduleTime introduced but not used in any history event. TaskImpl: - getFirstAttemptStartTime() does not account for task attempts not being scheduled yet VertexImpl: - Will unknown plugins be caught correctly and the correct reason be notified back to the user? Don’t believe so based on looking at the code given that error is thrown from the VertexImpl constructor. ContainerLauncherRouter: - I see use of isLocal, isPureLocalMode and uber being used interchangeably in multiple places. What is the difference? - Do all configured launchers need to initialize correctly? Does it matter if one does not? - Why wrap all exceptions into an unchecked exception for createContainerLauncher()? Also, why is the UnknownHostException not being wrapped? Given that the LauncherRouter eventually is called from the DAGAppMaster, it can just let the exception pass through as serviceInit declares a throws Exception AMContainerEventLaunchRequest - launcherId unused? SchedulerId/TaskLauncherId/TaskCommId - are these always the same across AM attempts? - does any ordering need to be enforced? - Probably needs some documentation on where the data is being read/objects initialized ContainerMatcher - does not compare ids of the launcher, etc. Is that being done elsewhere to ensure that containers managed by different launchers are not re-used? Or is there a scenario where they could be re-used? AMNodeEvent - what is sourceId? Why is it not schedulerId? - what does source signify for AMNode? AMNodeTracker - calls super(AMNodeMap) - Why not encapsulate the tracker with the scheduler instead of having a global tracker with sourceId being passed along everywhere? - Is there a case for managing blacklisting or any other tracking policies that apply across all schedulers/sources combined? - getAndCreateIfNeededPerSourceTracker() thread safe? > [Umbrella] Allow Tez to co-ordinate execution to external services > ------------------------------------------------------------------ > > Key: TEZ-2003 > URL: https://issues.apache.org/jira/browse/TEZ-2003 > Project: Apache Tez > Issue Type: Improvement > Reporter: Siddharth Seth > Attachments: 2003_20150728.1.txt, Tez With External Services.pdf > > > The Tez engine itself takes care of co-ordinating execution - controlling how > data gets routed (different connection patterns), fault tolerance, scheduling > of work, etc. > This is currently tied to TaskSpecs defined within Tez and on containers > launched by Tez itself (TezChild). > The proposal is to allow Tez to work with external services instead of just > containers launched by Tez. This involves several more pluggable layers to > work with alternate Task Specifications, custom launch and task allocation > mechanics, as well as custom scheduling sources. > A simple example would be a simple a process with the capability to execute > multiple Tez TaskSpecs as threads. In such a case, a container launch isn't > really need and can be mocked. Sourcing / scheduling containers would need to > be pluggable. > A more advanced example would be LLAP (HIVE-7926; > https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf). > This works with custom interfaces - which would need to be supported by Tez, > along with a custom event model which would need translation hooks. > Tez should be able to work with a combination of certain vertices running in > external services and others running in regular Tez containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)