[
https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194989#comment-14194989
]
Siddharth Seth commented on TEZ-1547:
-------------------------------------
bq. If anything nudged you towards that then let me know and I can add
clarifying comments.
I guess what gives this impression is the event going out after the vertex has
started. More on this later. Also, the fact that there's two issues that were
fixed via TEZ-1494, TEZ-1597, TEZ-1749 - one being the hang, and the second
being out of order scheduling which would cause tasks belonging to consumers
eating up the cluster before producers start - thus causing unnecessary
slowdowns. The need to remove startAfterOneTaskComplete.
Thanks for the clarification. Will change the title to reflect the same - it's
fairly vague at the moment.
bq. As you can see, the Shuffle VM first notifies of configuration complete,
and then proceeds to schedule tasks based on the existing min/max logic.
Wouldn't this lead to a situation where the notified vertex - ends up
scheduling it's tasks before the ShuffleVertex can schedule any tasks ? Even
though at a different priority - any task on the downstream vertex which starts
up will end up taking up cluster capacity unnecessarily. That effectively
undoes the changes from 1494, etc for ShuffleEdges.
On the event going out after the Vertex has started.
There's 3 cases
1) A pre-configured vertex which intends to re-configure. This has to go
through the vertexReconfigurationPlanned API. The CONFIGURED event can go out
the moment doneReconfiguringVertex is invoked, without the requirement for the
start check.
2) An unconfigured vertex, making use of the vertexReconfigurationPlanned API.
The same applies - send event as soon as doneReconfiguringVertex is invoked
(without the startTime check - looks like this check went out in the .9 patch)
3) An unconfigured vertex, not making use of the vertexReconfigurationPlanned
API. This won't reach the StartVertex transition until it;s configured
(parallelism != -1, null edges etc). Even if the RECONFIGURED event is sent out
in startVertex() - it's safe to send it out before invoking onVertexStarted.
For 1, we should also look at when vertexReconfigurationPlanned can be invoked.
Constructor / initialize() seems to be the only place where this method should
be called, and should be disallowed after this.
Now if we move the maybeSendConfiguredEvent() before onVertexStarted - which is
correct and safe for the intent of the event - it's very likely that downstream
vertices will end up scheduling tasks before the srcVertex. Take the case of
ImmediateStartVertexManager. Sending the event after onVertexStarted happens to
be giving us the behaviour that we want, in terms of correct scheduling order.
So, essentially, I'm trying to understand why the event goes out after
onVertexStarted, even though we can send it out much earlier (even from within
startVertex). If it's for the correct scheduling behaviour right now - lets
just get the patch in and fix the scheduling later.
If we're going with this - I think the startTime check in
doneReconfiguringVertex is important, also ShuffleVertexManager should be
scheduling tasks before indicating that it's doneReconfiguring (primarily to
avoid regressing on the scheduling order). Also documenting when
vertexReconfigurationPlanned can be invoked.
> Make use of state change notifier in VertexManagerPlugins
> ---------------------------------------------------------
>
> Key: TEZ-1547
> URL: https://issues.apache.org/jira/browse/TEZ-1547
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Bikas Saha
> Attachments: TEZ-1547.1.patch, TEZ-1547.10.patch, TEZ-1547.3.patch,
> TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch,
> TEZ-1547.8.patch, TEZ-1547.9.patch
>
>
> Instead of the various APIs like onVertexStarted, simple notifications could
> be sent.
> Some existing APIs could end up being deprecated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)