[
https://issues.apache.org/jira/browse/TWILL-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600481#comment-14600481
]
ASF GitHub Bot commented on TWILL-139:
--------------------------------------
Github user ghelmling commented on a diff in the pull request:
https://github.com/apache/incubator-twill/pull/51#discussion_r33214384
--- Diff:
twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java
---
@@ -29,20 +34,71 @@
*/
public final class EmbeddedKafkaServer extends AbstractIdleService {
- private final KafkaServerStartable server;
+ public static final String START_TIMEOUT_RETRIES =
"twill.kafka.start.timeout.retries";
+
+ private static final Logger LOG =
LoggerFactory.getLogger(EmbeddedKafkaServer.class);
+ private static final String DEFAULT_START_TIMEOUT_RETRIES = "5";
+
+ private final int startTimeoutRetries;
+ private final KafkaConfig kafkaConfig;
+ private KafkaServer server;
public EmbeddedKafkaServer(Properties properties) {
- server = new KafkaServerStartable(new KafkaConfig(properties));
+ this.startTimeoutRetries =
Integer.parseInt(properties.getProperty(START_TIMEOUT_RETRIES,
+
DEFAULT_START_TIMEOUT_RETRIES));
+ this.kafkaConfig = new KafkaConfig(properties);
}
@Override
protected void startUp() throws Exception {
- server.startup();
+ int tries = 0;
+ do {
+ KafkaServer kafkaServer = createKafkaServer(kafkaConfig);
+ try {
+ kafkaServer.startup();
+ server = kafkaServer;
+ } catch (Exception e) {
+ kafkaServer.shutdown();
+ kafkaServer.awaitShutdown();
+
+ Throwable rootCause = Throwables.getRootCause(e);
+ if (rootCause instanceof ZkTimeoutException) {
+ // Potentially caused by race condition bug described in
TWILL-139.
+ LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer.
Attempt number {}.", tries, rootCause);
+ }
--- End diff --
Should we propagate any exceptions that are _not_ ZkTimeoutExceptions? Or
should we continue to retry those too? Seems like we should let those out
since we don't really know how to handle them.
> ApplicationMaster hangs during start when ZooKeeper SASL authentication is
> turned on
> ------------------------------------------------------------------------------------
>
> Key: TWILL-139
> URL: https://issues.apache.org/jira/browse/TWILL-139
> Project: Apache Twill
> Issue Type: Bug
> Components: core, yarn
> Affects Versions: 0.5.0-incubating, 0.4.1-incubating
> Reporter: Terence Yim
> Assignee: Terence Yim
> Priority: Blocker
> Fix For: 0.6.0-incubating
>
>
> It is caused by a race condition when one {{ZKClient}} instance is performing
> the authentication while the {{EmbeddedKafkaServer}} is trying to start and
> connect to zookeeper.
> Here is the main method to reproduce the issue:
> {noformat}
> public static void main(String[] args) throws Exception {
> String zkStr = args[0];
> ZKClientService zkClient = ZKClientService.Builder.of(zkStr).build();
> EmbeddedKafkaServer kafka = new
> EmbeddedKafkaServer(generateKafkaConfig(zkStr));
> zkClient.startAndWait(); // <-- This returns when SyncConnected
> kafka.startAndWait(); // <-- This call hangs and never return
> }
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)