[ 
https://issues.apache.org/jira/browse/HBASE-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151996#comment-13151996
 ] 

nkeywal commented on HBASE-4798:
--------------------------------

Ok, I've got it for TestRegionServerCoprocessorExceptionWithAbort as well.

It's because of my change in HRegionServer#stop:
{noformat}
  public void stop(final String msg) {
    this.stopped = true;
    LOG.info("STOPPED: " + msg);
    // Wakes run() if it is sleeping
    sleeper.skipSleepCycle();   // <================= NEW
  }
{noformat}


This notification makes the region server stop *immediately* instead of 
waiting for the current sleep to end.
So immediately, in fact, that the poor client sees its region server disappear 
in the middle of its put and enters its usual retry logic. The retries are 
slow, and the test timeout does not let them finish.
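
To illustrate the mechanism, here is a minimal sketch of a sleeper whose sleep 
can be cut short by a notification. SimpleSleeper and its fields are 
hypothetical names chosen for this example; it is not HBase's actual sleeper 
class, only an assumption about the general shape of such a helper.

```java
/** Hypothetical, simplified sleeper for illustration; not HBase's real class. */
final class SimpleSleeper {
    private final long periodMs;
    private final Object lock = new Object();
    private boolean skip = false; // guarded by lock

    SimpleSleeper(long periodMs) {
        this.periodMs = periodMs;
    }

    /** Sleeps for up to periodMs, returning early if skipSleepCycle() is called. */
    void sleep() throws InterruptedException {
        long deadline = System.nanoTime() + periodMs * 1_000_000L;
        synchronized (lock) {
            while (!skip) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    break; // full period elapsed without a wake-up request
                }
                lock.wait(remainingMs);
            }
            skip = false; // consume the request so the next cycle sleeps normally
        }
    }

    /** Asks the sleeping thread (if any) to return from sleep() right away. */
    void skipSleepCycle() {
        synchronized (lock) {
            skip = true;
            lock.notifyAll();
        }
    }
}
```

Because the request is recorded in a flag under the same lock that the 
sleeping thread waits on, a call to skipSleepCycle() that arrives just before 
sleep() is not lost in this sketch.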

This bug was already there before my change; it's just less random now. Now 
you hit it 90% of the time, where before it was 0.1% :-)
Note as well that the current implementation of HRegionServer#stop actually 
intends to stop immediately but fails because it calls notify on the wrong 
object. The comment is clear about the intention, though. So one bug was 
hiding another. Usual stuff :-).
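
As a small illustration of why a notify on the wrong object has no effect, 
the sketch below waits on one monitor while another thread notifies a monitor 
of the caller's choosing. WrongMonitorDemo and timedWait are hypothetical 
names for this example and do not come from the HBase code base.

```java
/** Hypothetical demo: a waiter is only woken by a notify on the monitor it waits on. */
final class WrongMonitorDemo {

    /**
     * Waits on waitLock for at most timeoutMs while a helper thread calls
     * notifyAll() on notifyLock after about 50 ms. Returns the elapsed time in ms.
     */
    static long timedWait(Object waitLock, Object notifyLock, long timeoutMs)
            throws InterruptedException {
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                return;
            }
            synchronized (notifyLock) {
                notifyLock.notifyAll();
            }
        });
        long start = System.nanoTime();
        synchronized (waitLock) {
            notifier.start();
            // Returns early only if waitLock itself is notified
            // (or, rarely, on a spurious wake-up).
            waitLock.wait(timeoutMs);
        }
        notifier.join();
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Calling timedWait(a, a, 1000) should return after roughly 50 ms, while 
timedWait(a, b, 1000) with two distinct objects should run for the full 
timeout, which is the behaviour described above for the old stop().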


I am not sure how to fix this cleanly. We could launch a thread that waits 
before aborting in handleCoprocessorThrowable, but that is more a workaround 
than anything else. @stack, @eugene, what do you think?
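
For discussion, the workaround could look roughly like the sketch below. 
DelayedAbort, schedule and the grace period are hypothetical; how this would 
be wired into handleCoprocessorThrowable is left open, and the abort itself is 
represented by a plain Runnable.

```java
/** Hypothetical sketch of the "wait before aborting" workaround. */
final class DelayedAbort {

    /**
     * Runs abortAction on a daemon thread after graceMs, giving in-flight
     * client calls a chance to complete first. Returns the thread so that
     * callers (for example tests) can join it.
     */
    static Thread schedule(Runnable abortAction, long graceMs) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(graceMs);
            } catch (InterruptedException e) {
                // Interrupted during the grace period: restore the flag and abort now.
                Thread.currentThread().interrupt();
            }
            abortAction.run();
        }, "delayed-abort");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

This only delays the problem by a fixed amount of time, which is why it is a 
workaround rather than a fix.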

> Sleeps and synchronisation improvements for tests
> -------------------------------------------------
>
>                 Key: HBASE-4798
>                 URL: https://issues.apache.org/jira/browse/HBASE-4798
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver, test
>    Affects Versions: 0.94.0
>         Environment: all
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 4798_trunk_all.v2.patch
>
>
> Multiple small changes:
> @committers: Removing some sleeps exposed a bug in 
> JVMClusterUtil#HMaster#waitForServerOnline, so I had to add a 
> synchronization point. You may want to review this.
> JVMClusterUtil#HMaster#waitForServerOnline: removed, the condition was never 
> met (test on "!c && !!c"). Added a new synchronization point.
> AssignmentManager#waitForAssignment: add a timeout on the wait => not stuck 
> if the notification is received before the wait.
> HMaster#loop: use a notification instead of a 1s sleep
> HRegionServer#waitForServerOnline: new method used by 
> JVMClusterUtil#waitForServerOnline() to replace a 1s sleep by a notification
> HRegionServer#getMaster() 1s sleeps replaced by one 0.1s sleep and one 0.2s 
> sleep
> HRegionServer#stop: use a notification on sleeper to shorten shutdown by 0.5s
> ZooKeeperNodeTracker#start: replace a recursive call by a loop
> ZooKeeperNodeTracker#blockUntilAvailable: add a timeout on the wait => not 
> stuck if the notification is received before the wait.
> HBaseTestingUtility#expireSession: use a timeout of 1s instead of 5s
> TestZooKeeper#testClientSessionExpired: use a timeout of 1s instead of 5s, 
> with the change on HBaseTestingUtility we are 60s faster
> TestRegionRebalancing#waitForAllRegionsAssigned: use a sleep of 0.2s instead 
> of 1s
> TestRestartCluster#testClusterRestart: send all the table creation together, 
> then check creation, should be faster
> TestHLog: shut down the whole cluster instead of DFS only (more standard) 
> JVMClusterUtil#startup: lower the sleep from 1s to 0.1s
> HConnectionManager#close: Zookeeper name in debug message from 
> HConnectionManager after connection close was always null because it was set 
> to null in the delete.
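
The "timeout on the wait" items in the description above can be sketched as 
follows. AvailabilityTracker, setData and this blockUntilAvailable signature 
are hypothetical and only illustrate the pattern; they are not the actual 
ZooKeeperNodeTracker or AssignmentManager code. The sketch also re-checks the 
condition under the lock, which is an assumption about how one would make the 
wait robust, not a statement about what the patch does.

```java
/** Hypothetical tracker showing a bounded wait that tolerates an early notification. */
final class AvailabilityTracker {
    private final Object lock = new Object();
    private byte[] data; // guarded by lock; null means "not available yet"

    /** Publishes the data and wakes up any waiting thread. */
    void setData(byte[] newData) {
        synchronized (lock) {
            data = newData;
            lock.notifyAll();
        }
    }

    /** Blocks until data is available or timeoutMs elapses; returns null on timeout. */
    byte[] blockUntilAvailable(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (lock) {
            // Checking the state first means a notification sent before this
            // call is not lost; the deadline bounds the wait in every other case.
            while (data == null) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return null;
                }
                lock.wait(remainingMs);
            }
            return data;
        }
    }
}
```

With an untimed wait() and no state check, a notification that arrives before 
the wait starts would leave the caller blocked forever; the timeout turns that 
into a bounded delay.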

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
