[
https://issues.apache.org/jira/browse/TEZ-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037031#comment-14037031
]
Siddharth Seth commented on TEZ-1206:
-------------------------------------
Some comments on the changes to DAGAppMaster (some of these could be issues
which may have existed before applying the patch). Haven't really looked at the
rest.
- Invocation of shutdownHandler.shutdown() from within
"handle(DAGAppMasterEvent event)". 1) This will likely try to shut down the
services multiple times, since it explicitly invokes stopServices, which in
turn invokes DAGAppMaster.shutdown, which in turn invokes stop() which would
try stopping the services again. 2) There's no sysexit anywhere on this path.
If a service running a non daemon thread were to fail during shutdown, we could
end up seeing orphaned DAGAppMaster processes.
- Minor: serviceStartupEntry.getValue().get(10, TimeUnit.SECONDS) - is a
timeout required on this call. Wouldn't the serviceStartLatch only terminate
after all threads are complete.
- Similar to the first point - stopServices invokes shutdown, which in turn
invokes stopServices via the stop method. Depending on the call sequence
(whether stop was used or a direct method call), this can lead to multiple stop
attempts. It looks like we have way to many methods to stop the AppMaster -
serviceStop (required), shutdown(), stopServices, shutdownhandler. It's tough
to figure out what the call sequence is and when which method should be
invoked. Do you have additional details ?
- MasterStartupTask - noone monitors for errors from this. It's submitted to
the TaskExecutor and then forgotten. There's error checking around the
serviceStart code - for shutdown, but not around the rest of the code executed
in serviceStart(). Also, does this have to be executed from the Threadpool -
why not just use the currently executing Thread for this ?
- Minor: Calls like super.serviceStart() and super.serviceStop() should not be
required.
- The system.exit in main - should this also be in place for the
InterruptedException ? Possibly for throwable (even though catching those is
not the best thing to do)
- In MasterStartupTask (and elsewhere when executing within a ugi.doAs) -
exceptions other than Interrupted and IOException will not be handled
correctly. In the specific case of the MasterStartupTask - it's definitely
possible for a TezException to be thrown, which will make it out of the ugi
block as an UndeclaredThrowable. That also needs to be handled.
- Similar to what Hitesh mentioned, I'm not sure why the environment parameter
parsing has been moved out of main and into DAGAppMaster. For local mode, the
DAGAppMaster will be used with an argument list to instantiate it in process.
Moving environment variables parsing into the DAGAppMaster just complicates
this.
- Likely not a big issue (at least in the v3 patch), but the CachedThreadPool
keeps threads along for longer than required (60 seconds by default). The v2
patch would have kept them around forever. Does it make sense to shutdown the
threadpool once all services are started ?
Between the second and third patch, the changes to TaskScheduler have gone
missing. Is that not required ?
Haven't seen the code for the InMemoryContainerExecutor - but is it possible
for it to define it's own environment parameter, which could be used by the
DAGAppMaster to figure out which mode it's running in ? This could then be used
to make decisions on whether to exit or not.
> Lifecycle issues with DAGAppMaster
> ----------------------------------
>
> Key: TEZ-1206
> URL: https://issues.apache.org/jira/browse/TEZ-1206
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.4.0
> Reporter: Oleg Zhurakousky
> Assignee: Oleg Zhurakousky
> Attachments: TEZ-1206.patch, TEZ-1206.patch.2, TEZ-1206.patch.3.patch
>
>
> This is an umbrella issue to document and address issues with DAGAppMaster
> lifecycle
--
This message was sent by Atlassian JIRA
(v6.2#6252)