I believe this timer does in fact test the pooled client connections. In my experience the "all connections are bad" exception usually occurs when a shard server is not responding in a timely manner. It could be GCing, blocking on HDFS, or hitting some other unknown problem.

Timer:
https://github.com/apache/incubator-blur/blob/master/blur-thrift/src/main/java/org/apache/blur/thrift/ClientPool.java#L98

Also there is a test method that will test connections before their use.
https://github.com/apache/incubator-blur/blob/master/blur-thrift/src/main/java/org/apache/blur/thrift/ClientPool.java#L299

Hope this helps.

Aaron

On Sat, Dec 10, 2016 at 5:56 AM, Ravikumar Govindarajan <[email protected]> wrote:

> Just now tried to understand the logic...
>
> Whenever an IOException/TTransportException is thrown, we mark a Connection
> as bad. Slowly when all Connections are greeted by this, we get "All
> Connections Bad..."
>
> Is it a good idea to write a reaper thread to proactively try & replenish
> the bad Connection, instead of waiting for search to hit it at the wrong
> moment?
>
> Also, I just found that "staleness" check is eagerly performed. It should
> be possible to return a live connection & refresh stale ones in background?
> [*ClientPool.getConnection(Connection conn)*]
>
> --
> Ravi
>
> On Sat, Dec 10, 2016 at 3:44 PM, Ravikumar Govindarajan <[email protected]> wrote:
>
> > Often, I find myself bang in the middle of a query, when BlurClientManager
> > comes up with this error. Happens both ways. When my app-server talks to
> > controller-server as well as controller-server talks to shard-server. This
> > is affecting search experience quite a bit nowadays in production!!
> >
> > BlurException(message:Unknown error during remote call to node
> > [AAA.BB.CCC.DD:40020], stackTraceStr:org.apache.blur.thrift.BadConnectionException:
> > Could not connect to controller/shard server. All connections are bad.
> > at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:243)
> > at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:314)
> > at org.apache.blur.thrift.BlurControllerServer$BlurClientRemote$1.call(BlurControllerServer.java:132)
> > at org.apache.blur.thrift.BlurControllerServer$BlurClientRemote.execute(BlurControllerServer.java:139)
> >
> > When do we get such an Exception? In-correct timeout settings or
> > shard-server restarts etc...
> >
> > Any help is much appreciated
> >
> > --
> > Ravi
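
For reference, the reaper-thread idea raised in the quoted message could look roughly like the sketch below. This is not Blur code and does not reflect how ClientPool is actually implemented. The class name BadConnectionReaper, its methods, and the caller-supplied probe are all hypothetical, and nodes are identified by plain "host:port" strings for simplicity. It only illustrates the pattern: keep a set of nodes marked bad, re-probe them on a background schedule, and put them back in service once a probe succeeds.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch of a background reaper for bad connections.
// None of these names come from Blur; the probe stands in for whatever
// cheap liveness check the real client would perform against a node.
final class BadConnectionReaper implements AutoCloseable {
  // Nodes ("host:port") currently considered bad.
  private final Set<String> bad = ConcurrentHashMap.newKeySet();
  // Returns true when the node answers a liveness check.
  private final Predicate<String> probe;
  private final ScheduledExecutorService scheduler;

  BadConnectionReaper(Predicate<String> probe) {
    this.probe = probe;
    this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "bad-connection-reaper");
      t.setDaemon(true); // do not keep the JVM alive just for the reaper
      return t;
    });
  }

  // Called by request code when a call fails with an I/O or transport error.
  void markBad(String hostPort) {
    bad.add(hostPort);
  }

  boolean isBad(String hostPort) {
    return bad.contains(hostPort);
  }

  // One pass: re-probe every bad node and restore those that respond.
  // Returns the number of nodes restored in this pass.
  int reapOnce() {
    int restored = 0;
    for (String node : bad) {
      boolean alive;
      try {
        alive = probe.test(node);
      } catch (RuntimeException e) {
        alive = false; // a failing probe means the node stays bad
      }
      if (alive && bad.remove(node)) {
        restored++;
      }
    }
    return restored;
  }

  // Run reapOnce repeatedly in the background with a fixed delay.
  void start(long period, TimeUnit unit) {
    scheduler.scheduleWithFixedDelay(this::reapOnce, period, period, unit);
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
```

A caller would construct the reaper with its own probe, call start(...) once with whatever interval suits it, and call markBad(...) from the error path. Whether this helps in practice depends on why the shard server stopped responding in the first place: if it is in a long GC pause, the probe will simply keep failing until the pause ends.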
