[
https://issues.apache.org/jira/browse/TWILL-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600471#comment-14600471
]
ASF GitHub Bot commented on TWILL-139:
--------------------------------------
Github user ghelmling commented on a diff in the pull request:
https://github.com/apache/incubator-twill/pull/51#discussion_r33214142
--- Diff: twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java ---
@@ -29,20 +34,71 @@
*/
public final class EmbeddedKafkaServer extends AbstractIdleService {
- private final KafkaServerStartable server;
+ public static final String START_TIMEOUT_RETRIES = "twill.kafka.start.timeout.retries";
+
+ private static final Logger LOG = LoggerFactory.getLogger(EmbeddedKafkaServer.class);
+ private static final String DEFAULT_START_TIMEOUT_RETRIES = "5";
+
+ private final int startTimeoutRetries;
+ private final KafkaConfig kafkaConfig;
+ private KafkaServer server;
public EmbeddedKafkaServer(Properties properties) {
- server = new KafkaServerStartable(new KafkaConfig(properties));
+ this.startTimeoutRetries = Integer.parseInt(properties.getProperty(START_TIMEOUT_RETRIES,
+ DEFAULT_START_TIMEOUT_RETRIES));
+ this.kafkaConfig = new KafkaConfig(properties);
}
@Override
protected void startUp() throws Exception {
- server.startup();
+ int tries = 0;
+ do {
+ KafkaServer kafkaServer = createKafkaServer(kafkaConfig);
+ try {
+ kafkaServer.startup();
+ server = kafkaServer;
+ } catch (Exception e) {
+ kafkaServer.shutdown();
+ kafkaServer.awaitShutdown();
+
+ Throwable rootCause = Throwables.getRootCause(e);
+ if (rootCause instanceof ZkTimeoutException) {
+ // Potentially caused by race condition bug described in TWILL-139.
+ LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. Attempt number {}.", tries, rootCause);
+ }
+ }
+ } while (server == null && ++tries < startTimeoutRetries);
}
--- End diff --
Seems like we should throw an exception here if server is still null once
the retries are exhausted, saying that we were unable to initialize the
Kafka server. Otherwise we'll just throw an NPE later on, right?
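For illustration, here is a minimal stand-alone sketch of that idea. It is not the code from the pull request: RetryingStarter and startWithRetries are hypothetical names, and a plain Callable stands in for creating and starting the KafkaServer, so it runs without Kafka on the classpath. It only shows the shape of a retry loop that fails loudly once the attempts are used up, rather than leaving a null field behind.

```java
import java.util.concurrent.Callable;

/**
 * Illustrative only. RetryingStarter and startWithRetries are hypothetical
 * names; the Callable stands in for "create a KafkaServer and start it".
 */
public final class RetryingStarter {

  private RetryingStarter() {
    // Static helper, not meant to be instantiated.
  }

  /**
   * Runs the attempt until it succeeds or maxTries attempts have failed.
   * Like the do/while loop in the diff, it always makes at least one attempt.
   *
   * @throws IllegalStateException if every attempt failed; the last failure
   *         is attached as the cause
   */
  public static <T> T startWithRetries(Callable<T> attempt, int maxTries) {
    Exception lastFailure = null;
    int tries = 0;
    do {
      try {
        // Success: hand the started instance back to the caller.
        return attempt.call();
      } catch (Exception e) {
        // Remember the failure so it can be reported if we give up.
        lastFailure = e;
      }
    } while (++tries < maxTries);

    // Retries exhausted: fail here instead of hitting an NPE later on.
    throw new IllegalStateException(
        "Unable to start after " + tries + " attempt(s)", lastFailure);
  }
}
```

In startUp() the equivalent change would presumably be a check after the loop that throws when server is still null; which exception type and message fit best there is a judgment call for the patch author.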
> ApplicationMaster hangs during start when ZooKeeper SASL authentication is
> turned on
> ------------------------------------------------------------------------------------
>
> Key: TWILL-139
> URL: https://issues.apache.org/jira/browse/TWILL-139
> Project: Apache Twill
> Issue Type: Bug
> Components: core, yarn
> Affects Versions: 0.5.0-incubating, 0.4.1-incubating
> Reporter: Terence Yim
> Assignee: Terence Yim
> Priority: Blocker
> Fix For: 0.6.0-incubating
>
>
> The hang is caused by a race condition: one {{ZKClient}} instance is performing
> authentication while the {{EmbeddedKafkaServer}} is trying to start and
> connect to ZooKeeper.
> Here is the main method to reproduce the issue:
> {noformat}
> public static void main(String[] args) throws Exception {
> String zkStr = args[0];
> ZKClientService zkClient = ZKClientService.Builder.of(zkStr).build();
> EmbeddedKafkaServer kafka = new EmbeddedKafkaServer(generateKafkaConfig(zkStr));
> zkClient.startAndWait(); // <-- This returns when SyncConnected
> kafka.startAndWait(); // <-- This call hangs and never returns
> }
> {noformat}