Will do some deeper testing on this to try to narrow it down, and will update here for sure.
On Mon, Mar 23, 2015 at 5:42 PM Nicolas Liochon <nkey...@gmail.com> wrote:

Ok, so hopefully there is some info in the namenode & datanode logs.

On Mon, Mar 23, 2015 at 5:32 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

...and I also made sure that it's applied, with hdfs getconf -confKey...

On Mon, Mar 23, 2015 at 5:31 PM Dejan Menges <dejan.men...@gmail.com> wrote:

It was true all the time, together with dfs.namenode.avoid.read.stale.datanode.

On Mon, Mar 23, 2015 at 5:29 PM Nicolas Liochon <nkey...@gmail.com> wrote:

Actually, double-checking the final patch in HDFS-4721, the stale mode is taken into account. Bryan is right, it's worth checking the namenode's config. Especially, dfs.namenode.avoid.write.stale.datanode must be set to true on the namenode.

On Mon, Mar 23, 2015 at 5:08 PM, Nicolas Liochon <nkey...@gmail.com> wrote:

Stale should not help for recoverLease: it helps for reads, but that's the step after lease recovery. It's not needed in recoverLease because recoverLease in hdfs just sorts the datanodes by heartbeat time, so the stale datanode will usually be the last one in the list.

On Mon, Mar 23, 2015 at 4:38 PM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

@Nicholas, I see, thanks. I'll keep the settings at default. So really, if everything else is configured properly you should never reach the lease recovery timeout in any failure scenario. It seems that the staleness check would be the thing that prevents this, correct? I'm surprised it didn't help Dejan.

On Mon, Mar 23, 2015 at 11:20 AM, Nicolas Liochon <nkey...@gmail.com> wrote:

@bryan: yes, you can change hbase.lease.recovery.timeout if you changed the hdfs settings. But this setting is really for desperate cases. The recoverLease should have succeeded before. As well, if you depend on hbase.lease.recovery.timeout, it means you're wasting recovery time: the lease should be recovered in a few seconds.

On Mon, Mar 23, 2015 at 3:59 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

Interesting discussion I also found, gives me some more light on what Nicolas is talking about - https://issues.apache.org/jira/browse/HDFS-3703

On Mon, Mar 23, 2015 at 3:53 PM Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

So it is safe to set hbase.lease.recovery.timeout lower if you also set heartbeat.recheck.interval lower (lowering that 10.5 min dead node timer)? Or is it recommended not to touch either of those?

Reading the above with interest, thanks for digging in here guys.
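The stale-datanode settings verified near the top of this thread live in hdfs-site.xml on the NameNode. As a minimal sketch (the values are the ones reported in this thread, not a recommendation):

<!-- hdfs-site.xml (NameNode side): stale-datanode settings discussed in this thread -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <!-- a DataNode with no heartbeat for this long (ms) is treated as stale -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>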
On Mon, Mar 23, 2015 at 10:13 AM, Nicolas Liochon <nkey...@gmail.com> wrote:

If the node is actually down it's fine. But the node may not be that down (CAP theorem here); and then it's looking for trouble. HDFS, by default, declares a node as dead after 10:30. 15 minutes is an extra security margin. It seems your hdfs settings are different (or there is a bug...). There should be some info in the hdfs logs.

On Mon, Mar 23, 2015 at 3:05 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

Will take a look.

Actually, if a node is down (someone unplugged the network cable, it just died, whatever), what's the chance it's going to come back to life so the write can continue? On the other side, HBase is not starting recovery, trying to contact a node which is not there anymore, and which is even elected as dead on the Namenode side (another thing I didn't understand quite well).

So what I was expecting is that as soon as the Namenode decided the node is dead, that would be enough for the RegionServer to stop trying to recover from the dead node, but it wasn't the case. Also, this whole MTTR article in the HBase book doesn't work at all with this parameter set to its default value (15 minutes).

So I'm still struggling to figure out what exactly the drawback of lowering this is.

On Mon, Mar 23, 2015 at 2:50 PM Nicolas Liochon <nkey...@gmail.com> wrote:

Thanks for the explanation. There is an issue if you modify this setting however. hbase tries to recover the lease (i.e. be sure that nobody is writing). If you change hbase.lease.recovery.timeout hbase will start the recovery (i.e. start to read) even if it's not sure that nobody's writing. That means there is a dataloss risk. Basically, you must not see this warning: WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for[]

The recoverLease must succeed. The fact that it does not after X tries is strange. There may be a mismatch between the hbase parameters and the hdfs ones.
You may need to have a look at the comments in FSHDFSUtils.java.

On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

I found the issue and fixed it, and will try to explain here what exactly happened in our case, in case someone else finds this interesting too.

So initially, we had (a couple of times) some network and hardware issues in our datacenters. When one server would die (literally die, we had some issues with PSUs) we saw issues with overall cluster performance on the HBase side. As the cluster is quite big and live, it was also quite hard to figure out the exact root cause and how to fix the stuff we wanted to fix.

So I set up another cluster: four nodes (with DataNode and RegionServer) and two other nodes with HMaster and Namenode in HA, using the same stuff we use in production. We pumped some data into it, and I was able to reproduce the same issue on it last week. What I tried to do is to cut off one server (shut down its interface) when all is good with the cluster, when we have load, and see what's going to happen.

On Friday, after Nicolas mentioned it, I started taking a look at the HBase logs on the node which was mentioned in the HMaster log as the one taking over regions for the dead server. Basically what I was able to observe was 15 minutes (+- a couple of seconds, which was also interesting, and I will get to that later) between the moment HMaster figures out that one of its RegionServers is dead and the moment one of the mentioned nodes starts taking over those regions and they start appearing in HMaster's Web UI.

I then set up everything as described in http://hbase.apache.org/book.html#mttr - but still had exactly the same issues. Went over (again and again) all currently configured stuff, but still had the same issue.
Then I started looking into HDFS. Opened the NameNode UI, saw all was good, took one node down, and was watching the RegionServer logs at the same time - and I also saw that it took ~15 minutes for the Namenode to mark the dead node as dead. Somehow at the same moment regions started getting back to life. I remembered that some older versions had dfs timeout checks and check retries. Looked into the defaults for our Hadoop version - http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml - and saw that there's no recheck parameter there anymore. Strange, as on StackOverflow I found a post from a month ago, for a newer version than we use (we use 2.4.1, the guy was using 2.6 - dfs.namenode.heartbeat.recheck-interval). I set it to 10 seconds as he mentioned, with heartbeat checks every three seconds (default), and got the DataNode marked as dead in ~45 seconds, as he mentioned. Not sure why this parameter is not documented, but obviously it works.
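For reference, a sketch of the heartbeat settings used in the experiment above (values are the ones from this thread; the dead-node delay follows the usual HDFS heuristic of 2 x recheck interval + 10 x heartbeat interval, which is also where the 10:30 / "10.5 min dead node timer" mentioned earlier comes from):

<!-- hdfs-site.xml: how long before the NameNode declares a DataNode dead.
     With the defaults (recheck 5 min, heartbeat 3 s): 2 * 5 min + 10 * 3 s = 10.5 minutes.
     With the values below: 2 * 10 s + 10 * 3 s = 50 s, i.e. roughly the ~45 seconds observed. -->
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>10000</value> <!-- milliseconds -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- seconds (the default) -->
</property>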
Then I figured out it still didn't fix our HBase failover issue. I was looking into the HBase book again and again, and then saw this part:

"How much time we allow elapse between calls to recover lease. Should be larger than the dfs timeout."

This is the description for hbase.lease.recovery.dfs.timeout. I wasn't sure from the comment which of all the timeouts that can be set in Hadoop/HBase and that have something to do with DFS this was about. But I picked hbase.lease.recovery.timeout, and that was the catch.

By default, hbase.lease.recovery.timeout is set to 15 minutes. Not sure why, and I wasn't able to find out yet, but getting this down to one minute (which is more than OK for us) I was able to get rid of our issue. Also not sure why this is not mentioned in the MTTR section of the HBase book, as obviously MTTR doesn't work at all with this default timeout, at least not the way we expected it to work.

So thanks again to everyone being spammed with this, and special thanks to Nicolas for pointing me in the right direction.
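For reference, a minimal hbase-site.xml sketch of the change described above. 60000 ms is the value used here (the default is 900000 ms, i.e. the 15 minutes); hbase.lease.recovery.dfs.timeout is shown only because its description says it should stay larger than the dfs timeout, and its value below is illustrative. Keep Nicolas' warning in mind: giving up on recoverLease early carries a data-loss risk.

<!-- hbase-site.xml: lease recovery settings discussed in this thread -->
<property>
  <!-- total time to keep trying recoverLease before giving up (default 900000 ms) -->
  <name>hbase.lease.recovery.timeout</name>
  <value>60000</value>
</property>
<property>
  <!-- time allowed to elapse between recoverLease calls; should be larger than the dfs timeout -->
  <name>hbase.lease.recovery.dfs.timeout</name>
  <value>64000</value> <!-- illustrative value -->
</property>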
On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <nkey...@gmail.com> wrote:

The attachments are rejected by the mailing list, can you put them on pastebin?

Stale is mandatory (so it's good), but the issue here is just before. The region server needs to read the file. In order to be sure that there is no data loss, it needs to "recover the lease", which means ensuring that nobody is writing the file. The regionserver calls the namenode to do this recoverLease. So there should be some info in the namenode logs. Do you have HDFS-4721 on your hdfs? The hbase book details (more or less...) this recoverLease stuff.

On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <dejan.men...@gmail.com> wrote:

And also, just checked - dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both true, and dfs.namenode.stale.datanode.interval is set to the default 30000.

On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <dejan.men...@gmail.com> wrote:

Hi Nicolas,

Please find the log attached.

As I see it now more clearly, it was trying to recover HDFS WALs from the node that's dead:

2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508 after 908210ms

And as you can see from the log, it tried 40 times, which took exactly 15 minutes.

There's probably some parameter to tune to cut it down from 40 times / 15 minutes to something more useful, as for 15 minutes we don't have our regions available, and HDFS has a replication factor of 3 anyway.

Googling; if I figure out what it is, I will post it here. Will also appreciate it if someone knows how to cut this down.

Thanks,

Dejan

On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <nkey...@gmail.com> wrote:

The split is done by the region servers (the master coordinates). Is there some interesting stuff in their logs?

On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

With the client, the issue was that it was retrying connecting to the same region servers even when the regions were reassigned.
Lowering it down helped in this specific use case, but yes, we still want servers to reallocate quickly.

What got me here: http://hbase.apache.org/book.html#mttr

I basically set the configuration exactly the same way as it's explained there. zookeeper.session.timeout is (and was before) 60000 (one minute).

So basically what happens is:

- I have a test cluster consisting of four nodes, every node being a DataNode and RegionServer.
- I simulate a network partition on one (connect to it through the console and take the network interface down - simulating the network issues we had recently).
- After a short time I see in HBase that my RegionServer is dead, and as the total number of regions I see the previous total minus the number of regions that were hosted on the node hosting the RegionServer that just 'disappeared'.
- At this point I want my cluster to recover as soon as possible, to start serving the missing regions.
- First thing I see in the HMaster logs:

2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]

2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]

2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]

2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623

2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
The same "total tasks = 1 unassigned = 0 ... status = in_progress" line then repeats, unchanged, at 14:17:31,183, 14:17:36,184, 14:17:42,185 and 14:17:48,184.
In the meantime, in the hbase...out log I got this:

==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==

java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
        at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
        at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)
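The rpcTimeout=60000 in the trace above corresponds to the HBase client RPC timeout. If it needs tuning, a sketch (60000 ms is the usual default):

<!-- hbase-site.xml: RPC timeout matching the waitTime/rpcTimeout seen in the stack trace -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>60000</value> <!-- milliseconds -->
</property>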
Besides this same issue, please note that the first message was at 2015-03-20 14:17:26,015. And then (when it got to the point of starting the transition):

2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623

2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905

2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms

2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}

2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}

2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261

At this point, note that it finished the SplitLogManager task at 14:32:35 and started transitioning regions just after that. So this is the 15 minutes I'm talking about.

What am I missing?

On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <nkey...@gmail.com> wrote:

You've changed the value of hbase.zookeeper.timeout to 15 minutes? A very reasonable target is 1 minute before relocating the regions. That's the default iirc. You can push it to 20s, but then gc-stopping-the-world becomes more of an issue. 15 minutes is really a lot. The hdfs stale mode must always be used, with a lower timeout than the hbase one.

Client side there should be nothing to do (except for advanced stuff); at each retry the client checks the location of the regions. If you lower the number of retries the client will fail sooner, but usually you don't want the client to fail, you want the servers to reallocate quickly.
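The session timeout Nicolas refers to is zookeeper.session.timeout in hbase-site.xml; a sketch with the value used in this thread (pushing it lower detects dead region servers faster, at the cost of false positives during long GC pauses):

<!-- hbase-site.xml: how long ZooKeeper waits before expiring a region server's session,
     i.e. how quickly the master notices a dead region server -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value> <!-- milliseconds; ~1 minute is the commonly suggested target -->
</property>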
On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

Hi,

Sorry for the slightly late update, but I managed to narrow it down a little bit.

We didn't update yet, as we are using the Hortonworks distribution right now, and even if we update we will get 0.98.4. However, it looks like the issue here was in our use case and configuration (still looking into it).

Basically, initially I saw that when one server goes down we start having performance issues in general, but that turned out to be on our client side, due to caching: clients were trying to reconnect to nodes that were offline and later trying to get regions from those nodes too. This is basically why on the server side I didn't manage to see anything in the logs that would be at least a little bit interesting and point me in the desired direction.
Another question that popped up for me is - in case a server is down (and with it the DataNode and HRegionServer it was hosting) - what's the optimal time to set for HMaster to consider the server dead and reassign its regions somewhere else, as this is another performance bottleneck we hit while regions are inaccessible? In our case it's configured to 15 minutes, and simple logic tells me that if you want it to happen earlier you configure a lower number of retries, but the issue is, as always, in the details, so not sure if anyone knows some better math for this?

And a last question - is it possible to manually force HBase to reassign regions? In this case, while HMaster is retrying to contact the node that's dead, it's impossible to force it using the 'balancer' command.

Thanks a lot!

Dejan