[ 
https://issues.apache.org/jira/browse/TWILL-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600488#comment-14600488
 ] 

ASF GitHub Bot commented on TWILL-139:
--------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/incubator-twill/pull/51#discussion_r33214503
  
    --- Diff: 
twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java
 ---
    @@ -29,20 +34,71 @@
      */
     public final class EmbeddedKafkaServer extends AbstractIdleService {
     
    -  private final KafkaServerStartable server;
    +  public static final String START_TIMEOUT_RETRIES = 
"twill.kafka.start.timeout.retries";
    +
    +  private static final Logger LOG = 
LoggerFactory.getLogger(EmbeddedKafkaServer.class);
    +  private static final String DEFAULT_START_TIMEOUT_RETRIES = "5";
    +
    +  private final int startTimeoutRetries;
    +  private final KafkaConfig kafkaConfig;
    +  private KafkaServer server;
     
       public EmbeddedKafkaServer(Properties properties) {
    -    server = new KafkaServerStartable(new KafkaConfig(properties));
    +    this.startTimeoutRetries = 
Integer.parseInt(properties.getProperty(START_TIMEOUT_RETRIES,
    +                                                                       
DEFAULT_START_TIMEOUT_RETRIES));
    +    this.kafkaConfig = new KafkaConfig(properties);
       }
     
       @Override
       protected void startUp() throws Exception {
    -    server.startup();
    +    int tries = 0;
    +    do {
    +      KafkaServer kafkaServer = createKafkaServer(kafkaConfig);
    +      try {
    +        kafkaServer.startup();
    +        server = kafkaServer;
    +      } catch (Exception e) {
    +        kafkaServer.shutdown();
    +        kafkaServer.awaitShutdown();
    +
    +        Throwable rootCause = Throwables.getRootCause(e);
    +        if (rootCause instanceof ZkTimeoutException) {
    +          // Potentially caused by race condition bug described in 
TWILL-139.
    +          LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. 
Attempt number {}.", tries, rootCause);
    +        }
    --- End diff --
    
    +1 to let them just go out


> ApplicationMaster hangs during start when ZooKeeper SASL authentication is 
> turned on
> ------------------------------------------------------------------------------------
>
>                 Key: TWILL-139
>                 URL: https://issues.apache.org/jira/browse/TWILL-139
>             Project: Apache Twill
>          Issue Type: Bug
>          Components: core, yarn
>    Affects Versions: 0.5.0-incubating, 0.4.1-incubating
>            Reporter: Terence Yim
>            Assignee: Terence Yim
>            Priority: Blocker
>             Fix For: 0.6.0-incubating
>
>
> It is caused by a race condition when one {{ZKClient}} instance is performing 
> the authentication while the {{EmbeddedKafkaServer}} is trying to start and 
> connect to zookeeper.
> Here is the main method to reproduce the issue:
> {noformat}
> public static void main(String[] args) throws Exception {
>   String zkStr = args[0];
>   ZKClientService zkClient = ZKClientService.Builder.of(zkStr).build();
>   EmbeddedKafkaServer kafka = new 
> EmbeddedKafkaServer(generateKafkaConfig(zkStr));
>   zkClient.startAndWait();    // <-- This returns when SyncConnected
>   kafka.startAndWait();        // <-- This call hangs and never return
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to