I expect (without double-checking the path in the code ;-) that the code in HConnectionManager will retry.
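In case it helps, those higher-level retries are tunable from the client side. A minimal, untested sketch of the programmatic form ("RetryTunedClient" and "usertable" are placeholder names; the property names and defaults are the standard 0.90 client keys):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class RetryTunedClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // HConnectionManager-level retry count for failed region
        // operations (0.90 default: 10).
        conf.setInt("hbase.client.retries.number", 10);
        // Pause between those retries, in ms (0.90 default: 1000).
        conf.setInt("hbase.client.pause", 1000);
        // Note: the low-level connect retry discussed in the quoted
        // thread below, hbase.ipc.client.connect.max.retries, defaults to 0.
        HTable table = new HTable(conf, "usertable");
        table.close();
      }
    }
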
On Tue, Jul 10, 2012 at 7:22 PM, Suraj Varma <svarma...@gmail.com> wrote:
> Yes.
>
> On the maxRetries, though ... I saw the code
> (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.2/org/apache/hadoop/hbase/ipc/HBaseClient.java#677)
> shows
> this.maxRetries = conf.getInt("hbase.ipc.client.connect.max.retries", 0);
>
> So it looks like maxRetries is set to 0 by default? So ... there is
> effectively no retry (i.e. it is fail-fast).
> --Suraj
>
> On Tue, Jul 10, 2012 at 10:12 AM, N Keywal <nkey...@gmail.com> wrote:
>> Thanks for the jira.
>> The client can be connected to multiple RSs, depending on the rows it
>> is working on. So yes, it's initial, but it's a dynamic initial :-).
>> That said, there is a retry on error ...
>>
>> On Tue, Jul 10, 2012 at 6:46 PM, Suraj Varma <svarma...@gmail.com> wrote:
>>> I will create a JIRA ticket ...
>>>
>>> The only side effect I could think of is: if an RS is having a GC
>>> pause of a few seconds, any _new_ client trying to connect would get
>>> connect failures. So the _initial_ connection to the RS is what would
>>> suffer from a super-low setting of ipc.socket.timeout. That was my
>>> read of the code.
>>>
>>> So I was hoping to get confirmation that this is the only side
>>> effect. Again, this is on the client side - I wouldn't risk doing
>>> this on the cluster side ...
>>> --Suraj
>>>
>>> On Mon, Jul 9, 2012 at 9:44 AM, N Keywal <nkey...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> What you're describing - the 35-minute recovery time - seems to
>>>> match the code. And it's a bug (still there on trunk). Could you
>>>> please create a jira for it? If you have the logs, even better.
>>>>
>>>> Lowering ipc.socket.timeout seems to be an acceptable partial
>>>> workaround. Setting it to 10s seems ok to me. Lower than that ... I
>>>> don't know.
>>>>
>>>> N.
>>>>
>>>> On Mon, Jul 9, 2012 at 6:16 PM, Suraj Varma <svarma...@gmail.com> wrote:
>>>>> Hello:
>>>>> I'd like to get advice on the below strategy of decreasing the
>>>>> "ipc.socket.timeout" configuration on the HBase client side ... has
>>>>> anyone tried this? Has anyone had any issues with configuring it
>>>>> lower than the default of 20s?
>>>>>
>>>>> Thanks,
>>>>> --Suraj
>>>>>
>>>>> On Mon, Jul 2, 2012 at 5:51 PM, Suraj Varma <svarma...@gmail.com> wrote:
>>>>>> By "power down" below, I mean powering down the host with the RS
>>>>>> that holds the .META. table. (So, essentially, the host IP is
>>>>>> unreachable and the RS/DN is gone.)
>>>>>>
>>>>>> Just wanted to clarify my steps below ...
>>>>>> --S
>>>>>>
>>>>>> On Mon, Jul 2, 2012 at 5:36 PM, Suraj Varma <svarma...@gmail.com> wrote:
>>>>>>> Hello:
>>>>>>> We've been doing some failure-scenario tests by powering down the
>>>>>>> host of the region server holding .META., and while the HBase
>>>>>>> cluster itself recovers and reassigns the META region and other
>>>>>>> regions (after we tweaked down the default timeouts), our client
>>>>>>> apps using HBaseClient take a long time to recover.
>>>>>>>
>>>>>>> hbase-0.90.6 / cdh3u4 / JDK 1.6.0_23
>>>>>>>
>>>>>>> Process:
>>>>>>> 1) Apply load via the client app on the HBase cluster for several
>>>>>>> minutes
>>>>>>> 2) Power down the region server holding the .META. region
>>>>>>> 3) Measure how long it takes for the cluster to reassign the META
>>>>>>> table and for client threads to re-look-up and re-orient to the
>>>>>>> smaller cluster (minus the RS and DN on that host)
>>>>>>>
>>>>>>> What we see:
>>>>>>> 1) Client threads spike up to maxThread size and take over 35
>>>>>>> minutes to recover (i.e. for the thread count to go back to
>>>>>>> normal) - no calls are being serviced; they are all just backed up
>>>>>>> on a synchronized method ...
>>>>>>>
>>>>>>> 2) Essentially, all the client app threads queue up behind the
>>>>>>> HBaseClient.setupIOStreams method in oahh.ipc.HBaseClient
>>>>>>> (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.2/org/apache/hadoop/hbase/ipc/HBaseClient.java#312).
>>>>>>> http://tinyurl.com/7js53dj
>>>>>>>
>>>>>>> After taking several thread dumps, we found that the thread inside
>>>>>>> this synchronized method was blocked on
>>>>>>> NetUtils.connect(this.socket, remoteId.getAddress(),
>>>>>>> getSocketTimeout(conf));
>>>>>>>
>>>>>>> Essentially, the thread that got the lock would try to connect to
>>>>>>> the dead RS (until the socket timed out), retrying, and then the
>>>>>>> next thread would get in, and so forth.
>>>>>>>
>>>>>>> Solution tested:
>>>>>>> -------------------
>>>>>>> The ipc.HBaseClient code shows that the ipc.socket.timeout default
>>>>>>> is 20s. We dropped it down to a low number (1000 ms, 100 ms, etc.)
>>>>>>> and recovery was much faster (within a couple of minutes).
>>>>>>>
>>>>>>> So we're thinking of setting the HBase client-side hbase-site.xml
>>>>>>> with an ipc.socket.timeout of 100 ms. Looking at the code, it
>>>>>>> appears that this is only ever used during the initial
>>>>>>> "HConnection" setup via NetUtils.connect, and should only come
>>>>>>> into play when connectivity to a region server is lost and needs
>>>>>>> to be re-established; i.e. it does not affect normal "RPC"
>>>>>>> activity, as it is just the connect timeout.
>>>>>>>
>>>>>>> Am I reading the code right? Any thoughts on whether this is too
>>>>>>> low for comfort? (Our internal tests did not show any
>>>>>>> timeout-related errors during normal operation ... but I just
>>>>>>> wanted to run this by the experts.)
>>>>>>>
>>>>>>> Note that this timeout tweak is only on the HBase client side.
>>>>>>> Thanks,
>>>>>>> --Suraj
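
To spell out why the queue-up described above takes so long: the connect attempt runs inside a synchronized method, so waiting threads serialize behind one blocking connect at a time. If the pool's maxThreads is around 100 and each thread in turn eats the full 20s connect timeout, that is roughly 100 x 20s, about 33 minutes of serialized blocking, which would line up with the observed ~35-minute recovery. A stripped-down illustration of the locking shape (not the actual HBase code, just the same pattern):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    class ConnectionStub {
      private Socket socket;

      // Mirrors the shape of HBaseClient.setupIOStreams: a single
      // monitor guards connection setup, so callers block one by one.
      synchronized void setupIOStreams(InetSocketAddress addr,
                                       int connectTimeoutMs) throws IOException {
        if (socket != null && socket.isConnected()) {
          return; // already connected; nothing to do
        }
        socket = new Socket();
        // Blocks for up to connectTimeoutMs when the host is unreachable;
        // every other caller waits on the monitor for that whole time.
        socket.connect(addr, connectTimeoutMs);
      }
    }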
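
And for anyone who wants to try the workaround, the override is a plain client-side Configuration (or hbase-site.xml) setting. An untested sketch of the programmatic form ("LowConnectTimeoutClient" is a placeholder name; 10000 ms is the value deemed acceptable upthread, 100 ms the aggressive value from the test):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class LowConnectTimeoutClient {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Read by HBaseClient via getSocketTimeout(conf) and passed to
        // NetUtils.connect(): it bounds connection establishment only,
        // not in-flight RPCs. The default is 20000 ms.
        conf.setInt("ipc.socket.timeout", 10000);
        // ... create HTable instances from this conf as usual.
      }
    }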