Github user mashengchen commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1427#discussion_r166504107
  
    --- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
    @@ -111,11 +104,59 @@ public DcsMaster(String[] args) {
             trafodionHome = System.getProperty(Constants.DCS_TRAFODION_HOME);
             jvmShutdownHook = new JVMShutdownHook();
             Runtime.getRuntime().addShutdownHook(jvmShutdownHook);
    -        thrd = new Thread(this);
    -        thrd.start();
    +
    +        ExecutorService executorService = Executors.newFixedThreadPool(1);
    +        CompletionService<Integer> completionService = new ExecutorCompletionService<Integer>(executorService);
    +
    +        while (true) {
    +            completionService.submit(this);
    +            Future<Integer> f = null;
    +            try {
    +                f = completionService.take();
    +                if (f != null) {
    +                    Integer status = f.get();
    +                    if (status <= 0) {
    +                        System.exit(status);
    +                    } else {
    +                        // 35000 * 15mins ~= 1 years
    +                        RetryCounter retryCounter = RetryCounterFactory.create(35000, 15, TimeUnit.MINUTES);
    +                        while (true) {
    +                            try {
    +                                ZkClient tmpZkc = new ZkClient();
    +                                tmpZkc.connect();
    +                                tmpZkc.close();
    +                                tmpZkc = null;
    +                                LOG.info("Connected to ZooKeeper successful, restart DCS Master.");
    +                                // reset lock
    +                                isLeader = new CountDownLatch(1);
    +                                break;
    --- End diff --
    
    This logic handles the case where DcsMaster returns because of a network
    error.
    In that case it first tries to connect to ZooKeeper.
    If it cannot connect ( tmpZkc.connect(); ), control enters the catch block
    and the connection is retried.
    If it does connect, DcsMaster runs its call() method again. By the time this
    master resumes, a backup master must already be working: one DCS master must
    always be active, so when the current master lost its network connection the
    backup master took over the role. The resumed master therefore sets its
    value in zk under /rootpath/dcs/leader/ and then waits on the lock
    "isLeader = new CountDownLatch(1);".
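
To make the flow described above concrete, here is a minimal, self-contained sketch of the same supervise-and-restart loop. It is not the code from this pull request: `RestartLoopSketch`, `supervise`, and the `probe` parameter are hypothetical names, the `BooleanSupplier` probe stands in for the `ZkClient` connect/close check, the method returns the status instead of calling `System.exit`, and giving up once the retries are exhausted is an assumption, since the diff is cut off before that point.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class RestartLoopSketch {

    /**
     * Runs the task repeatedly on a single worker thread.
     * A status <= 0 ends the loop and is returned (the diff calls System.exit here).
     * A status > 0 is treated as a network error: the probe is polled until it
     * succeeds, and then the task is submitted again.
     */
    public static int supervise(Callable<Integer> task, BooleanSupplier probe,
                                int maxRetries, long sleepMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CompletionService<Integer> completion = new ExecutorCompletionService<>(pool);
        try {
            while (true) {
                completion.submit(task);
                // Blocks until the submitted task has returned its status.
                int status = completion.take().get();
                if (status <= 0) {
                    return status;
                }
                int attempts = 0;
                while (!probe.getAsBoolean()) {
                    if (++attempts >= maxRetries) {
                        // Assumption: give up and report the error status.
                        return status;
                    }
                    Thread.sleep(sleepMillis);
                }
                // The probe succeeded, so the outer loop resubmits the task.
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger runs = new AtomicInteger();
        AtomicInteger probes = new AtomicInteger();
        // First run reports a network error (1), second run finishes cleanly (0).
        Callable<Integer> task = () -> runs.incrementAndGet() == 1 ? 1 : 0;
        // The probe fails twice before the (simulated) ZooKeeper is reachable.
        BooleanSupplier probe = () -> probes.incrementAndGet() > 2;
        int status = supervise(task, probe, 10, 10L);
        System.out.println("status=" + status + " runs=" + runs.get()
                + " probes=" + probes.get());
    }
}
```

The sketch leaves out the leader latch. A `CountDownLatch` cannot be reset once it has counted down, which is why the diff assigns a fresh `new CountDownLatch(1)` to `isLeader` before the master runs again.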

