[
https://issues.apache.org/jira/browse/SAMZA-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097543#comment-14097543
]
Chris Riccomini commented on SAMZA-376:
---------------------------------------
[~zjshen], does this sound accurate to you? It seems possible to me. We make a
blocking call to Kafka in SamzaAppMaster before we call amClient.start. If the
Kafka calls take a long time (say, several minutes), would it lead to this
behavior?
> ApplicationMaster Timeout after LeaderNotAvailableException
> -----------------------------------------------------------
>
> Key: SAMZA-376
> URL: https://issues.apache.org/jira/browse/SAMZA-376
> Project: Samza
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Nicolas Bär
> Priority: Minor
>
> The application master does not send a heartbeat to the resource manager if
> the leader of the topic is not available. It will retry until the leader is
> available and then send the heartbeat. If the Kafka cluster is busy during
> this time, the leader election might take a moment and the timeout is reached
> resulting in a shutdown of the application master.
> I hit this issue on our testbed and received a few follow-up error messages
> after the application master was restarted:
> {quote}
> ERROR security.UserGroupInformation: PriviledgedActionException as:baer
> (auth:SIMPLE)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Password not found for ApplicationAttempt
> appattempt_1407522131931_0001_000001
> {quote}
> I will investigate this further, but I assume it is better suited to the
> YARN mailing list.
> Here is the relevant part from our discussion on IRC (criccomini):
> {quote}
> SamzaAppMaster
> you'll see: amClient.start
> and later, amClient.stop
> the start is starting the YARN AMClient's heartbeat
> now
> SamzaAppMasterTaskManager
> calls assignContainerToSSPTaskNames
> in Util
> which calls Util.getInputStreamPartitions(config)
> and THAT is where Kafka is called
> so basically
> before amClient.start is called
> that getInputStreamPartitions method is invoked
> which will block on metadata timeouts
> until it can get the data it needs
> so SamzaAppMaster is constructing SamzaAppMasterTaskManager before it calls
> amClient.start
> {quote}
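
To make the ordering problem from the IRC discussion concrete, here is a minimal Java sketch. It is an illustration under stated assumptions, not Samza or YARN code: FakeClock, FakeAmClient, fetchPartitionMetadata and the timing values are all hypothetical stand-ins. It assumes heartbeats begin when start() is called and continue independently of the calling thread, and it models the Kafka metadata call as simply advancing a simulated clock. The second ordering is only one possible alternative, not necessarily how the issue was resolved.

```java
// Hypothetical stand-ins only; none of these are Samza, YARN, or Kafka classes.
class HeartbeatOrderingSketch {

    /** Simulated clock so the example is deterministic and runs instantly. */
    static final class FakeClock {
        long nowMs = 0;
    }

    /** Stand-in for the AM client: records when heartbeating begins. */
    static final class FakeAmClient {
        private final FakeClock clock;
        long firstHeartbeatMs = -1;

        FakeAmClient(FakeClock clock) {
            this.clock = clock;
        }

        void start() {
            if (firstHeartbeatMs < 0) {
                firstHeartbeatMs = clock.nowMs;
            }
        }
    }

    /** Stand-in for a metadata lookup that blocks until a leader is available. */
    static void fetchPartitionMetadata(FakeClock clock, long blockMs) {
        clock.nowMs += blockMs; // simulate time spent blocked
    }

    /** Ordering described in the discussion: blocking call first, then start. */
    static long blockThenStart(long blockMs) {
        FakeClock clock = new FakeClock();
        FakeAmClient am = new FakeAmClient(clock);
        fetchPartitionMetadata(clock, blockMs);
        am.start();
        return am.firstHeartbeatMs;
    }

    /** Alternative ordering: start heartbeating, then make the blocking call. */
    static long startThenBlock(long blockMs) {
        FakeClock clock = new FakeClock();
        FakeAmClient am = new FakeAmClient(clock);
        am.start();
        fetchPartitionMetadata(clock, blockMs);
        return am.firstHeartbeatMs;
    }

    /** True if the first heartbeat would arrive after the expiry interval. */
    static boolean timesOut(long firstHeartbeatMs, long expiryMs) {
        return firstHeartbeatMs > expiryMs;
    }

    public static void main(String[] args) {
        long blockMs = 180_000L; // assumed: metadata call blocks for three minutes
        long expiryMs = 60_000L; // assumed expiry interval, not a real YARN default

        long late = blockThenStart(blockMs);
        long early = startThenBlock(blockMs);

        System.out.println("block-then-start: first heartbeat at " + late
                + " ms, expired=" + timesOut(late, expiryMs));
        System.out.println("start-then-block: first heartbeat at " + early
                + " ms, expired=" + timesOut(early, expiryMs));
    }
}
```

In the first ordering the first heartbeat is delayed by the full duration of the blocking call, so any block longer than the expiry interval would be enough to get the application master killed, which matches the behavior reported above.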