[ 
https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648250#comment-14648250
 ] 

Hitesh Shah commented on TEZ-2003:
----------------------------------

First round of commments ( mostly specific to files and not really overarching 
design comments - those will be in round 3/4 ):

General comments: 
   - there is not really a clear ownership doc on what the framework 
owns/controls as compared to what is handed off to the plugins. Improved docs 
on this front would help. 
   - Plugins seem to be hardwired at the vertex level and cannot be changed for 
different tasks. What happens if an external service is unavailable? Does 
everything fail? 
   - All classes in serviceplugins/api need lots of docs
   - All TODOs need a jira especially ones that will block a final release
   - Code in some files exceeds 100 chars
   - Spell checker needed on various API typos 
      - Curretn, localicty
   - Unused imports, order of imports at times is not alphabetical

TezClient:
   - Do we really need a TezClientBuilder in addition to the constructor APIs? 
One more api to maintain?


{code}
AMPluginDescriptorProto pluginDescriptorProto =
    
createAMServicePluginDescriptorProto(servicePluginsDescriptor);

builder.setAmPluginDescriptor(pluginDescriptorProto);
{code}
  - no check for null before calling the function? 
  - seems like implicit functionality such as uber and containers is being 
driven by the presence/absence of any service provider


ContainerEndReason:

ContainerEndReason instead of TerminationReason/Cause? Inconsistent compared to 
current naming convention which uses Termination or Completion? 
Also, it does not seem like sufficient enough info to debug what happened. No 
diagnostics/stack trace? 
How does FRAMEWORK_ERROR translate to telling a user that there is a bug in the 
AM? How much of this is meant to be user visible for debugging purposes?

ContainerLauncherContext
   - what does getNumNodes represent?
   - Why is there a need to distinguish a launch failure from a stop failure? 
Also, this api has diagnostics but no reason enum or exit status?
   - why does containerCompleted have a TaskAttemptEndReason arg? There is a 
TODO - please file a bug for cleaning this up.
   - why is getApplicationAttemptId() needed? What does it represent to a 
container launcher?
   - Why would the plugin need to tell the framework that a container should be 
stopped? Does the containerLauncher manage the “containers”? ( w.r.t. 
containerStopRequested() api )
   

ContainerLauncherOperationBase
   - is a token always needed? 

ContainerLaunchRequest/ContainerStopRequest
   - who is responsible for tracking nodeId/containerId mappings?
   - why does the container launcher plugin need to be aware of the scheduler 
and task communicator plugins?
   - Are they all independent entities that can work with other types of 
plugins or is there more or less a implicit 1:1:1 mapping?


ServicePluginsDescriptor
   - why does this need to have enableContainers and uber mode? 
   - there seems to be implicit setting of defaults
   - I think uber mode should not even exist unless uber mode means allow 
“some” tasks to run within the AM. Even then, it should be outside of the 
plugins layer as there should be a simple toggle which says whether the AM 
supports running containers within it or not and the tasks/VertexManager can 
query for this functionality to decide whether they can run the task in the AM 
container or not.

TaskAttemptEndReason
   - too many different enums being introduced. 
   - please file a jira to consolidate current vs new.
   - this task attempt enum is lossy compared to the current failure conditions 
available from YARN. 
   - There should be a single unified view for potential failures ( with a way 
for a plugin to extend the information available )

ServicePluginLifecycle
   - why does this need an explicit initialize()? Why not have the plugin do 
whatever it needs in the constructor? 

TaskScheduler
   - what does this getAvailableResources() mean? Who uses it? And how is it 
used? Likewise for getTotalResources() ? 
   - same question on the following APIs on what they mean when not running on 
YARN:
      - dagComplete()
      - getClusterNodeCount()
      - deallocateContainer() - is this meant to be invoked by the container 
launcher? 

TaskSchedulerContext
   - why does AppFinalStatus belong here? The
   - If the task scheduler is managing allocations, why does it need to call 
taskAllocated() 
   - all the other APIs in TaskSchedulerContext seem to be completely YARN 
specific. Why would a plugin writer who implements a different scheduler need 
these?

TaskCommunicatorContext
   - not sure why a task communicator needs to know the FirstAttemptStartTime?
   - If “getInitialUserPayload” implies an initial payload, this means there 
are updates. How does it affect recovery? 
   - Why does the TaskCommunicator need to ask the framework whether a task is 
alive? Or are the “taskAlive”/“containerAlive” apis being used to inform the 
framework that the task/container are alive? Even then who is responsible for 
tracking liveliness - the plugin or the framework? 
   - Not sure why getDagStartTime() is needed by the plugin

TaskHeartbeatRequest
   - what is preRoutedStartIndex?

TaskAttemptEventAttemptFailed
   - New javadoc comments need clarifying. Not sure why only 2 particular 
failure types are accepted. 

DAGImpl:
{code}
public org.apache.tez.dag.api.Vertex.VertexExecutionContext 
getDefaultExecutionContext() {
  if (jobPlan.hasDefaultExecutionContext()) {
   
 return 
DagTypeConverters.convertFromProto(jobPlan.getDefaultExecutionContext());
  } 
else {
    return null;
  }
}
{code}
  - why not convert in INIT and cache? 

TaskAttemptImpl: 
{code}
.addTransition(TaskAttemptStateInternal.RUNNING,
    
TaskAttemptStateInternal.KILL_IN_PROGRESS,
    TaskAttemptEventType.TA_KILLED,
 
   new TerminatedWhileRunningTransition(KILLED_HELPER))
{code}
  - Not sure why a TA moves to KILL_IN_PROGRESS on a TA_KILLED? Shouldn’t it 
move to KILLED?
  
TaskAttempt scheduleTime introduced but not used in any history event. 

TaskImpl:
   - getFirstAttemptStartTime() does not account for task attempts not being 
scheduled yet


VertexImpl:
   - Will unknown plugins be caught correctly and the correct reason be 
notified back to the user? Don’t believe so based on looking at the code given 
that error is thrown from the VertexImpl constructor.


ContainerLauncherRouter:
   - I see use of isLocal, isPureLocalMode and uber being used interchangeably 
in multiple places. What is the difference?
   - Do all configured launchers need to initialize correctly? Does it matter 
if one does not?
   - Why wrap all exceptions into an unchecked exception for 
createContainerLauncher()? Also, why is the UnknownHostException not being 
wrapped? Given that the LauncherRouter eventually is called from the 
DAGAppMaster, it can just let the exception pass through as serviceInit 
declares a throws Exception

AMContainerEventLaunchRequest
   - launcherId unused?

SchedulerId/TaskLauncherId/TaskCommId
   - are these always the same across AM attempts?
   - does any ordering need to be enforced?
   - Probably needs some documentation on where the data is being read/objects 
initialized

ContainerMatcher
    - does not compare ids of the launcher, etc. Is that being done elsewhere 
to ensure that containers managed by different launchers are not re-used? Or is 
there a scenario where they could be re-used? 

AMNodeEvent
    - what is sourceId? Why is it not schedulerId?
    - what does source signify for AMNode? 

AMNodeTracker
    - calls super(AMNodeMap)
    - Why not encapsulate the tracker with the scheduler instead of having a 
global tracker with sourceId being passed along everywhere? 
    - Is there a case for managing blacklisting or any other tracking policies 
that apply across all schedulers/sources combined?
    - getAndCreateIfNeededPerSourceTracker() thread safe?














   
  



 












 







   
   

 
      






   
   



 




> [Umbrella] Allow Tez to co-ordinate execution to external services
> ------------------------------------------------------------------
>
>                 Key: TEZ-2003
>                 URL: https://issues.apache.org/jira/browse/TEZ-2003
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>         Attachments: 2003_20150728.1.txt, Tez With External Services.pdf
>
>
> The Tez engine itself takes care of co-ordinating execution - controlling how 
> data gets routed (different connection patterns), fault tolerance, scheduling 
> of work, etc.
> This is currently tied to TaskSpecs defined within Tez and on containers 
> launched by Tez itself (TezChild).
> The proposal is to allow Tez to work with external services instead of just 
> containers launched by Tez. This involves several more pluggable layers to 
> work with alternate Task Specifications, custom launch and task allocation 
> mechanics, as well as custom scheduling sources.
> A simple example would be a simple a process with the capability to execute 
> multiple Tez TaskSpecs as threads. In such a case, a container launch isn't 
> really need and can be mocked. Sourcing / scheduling containers would need to 
> be pluggable.
> A more advanced example would be LLAP (HIVE-7926; 
> https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf).
> This works with custom interfaces - which would need to be supported by Tez, 
> along with a custom event model which would need translation hooks.
> Tez should be able to work with a combination of certain vertices running in 
> external services and others running in regular Tez containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to