Strange issue when DataNode goes down

2015-03-16 Thread Dejan Menges
Hi All, We have a strange issue with HBase performance (overall cluster performance) in case one of datanodes in the cluster unexpectedly goes down. So scenario is like follows: - Cluster works fine, it's stable. - One DataNode unexpectedly goes down (PSU issue, network issue, anything) - Whole H

Re: Strange issue when DataNode goes down

2015-03-16 Thread Ted Yu
Have you examined region server logs (for the servers with bad performance) to see if there was some clue? Taking a few jstacks may also help reveal something. BTW 0.98.11 has been released. You may want to consider upgrading. Cheers On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges wrote: > Hi

Re: Strange issue when DataNode goes down

2015-03-16 Thread Sean Busbey
Can you post some redacted log files from the period after the data node failed, up to the restart? -- Sean On Mar 16, 2015 8:41 AM, "Dejan Menges" wrote: > Hi All, > > We have a strange issue with HBase performance (overall cluster > performance) in case one of datanodes in the cluster unexpec

Re: Strange issue when DataNode goes down

2015-03-16 Thread Andrew Purtell
Is there a particular reason why you are using HBase 0.98.0? The latest 0.98 release is 0.98.11. There's a known performance issue with 0.98.0 pertaining to RPC that was fixed in later releases, you should move up from 0.98.0. In addition hundreds of improvements and bug fixes have gone into the te

Re: Strange issue when DataNode goes down

2015-03-17 Thread Dejan Menges
Hi, To be very honest - there's no particular reason why we stick to this one, beside just lack of time currently to go through upgrade process, but looks to me that's going to be next step. Had a crazy day, didn't have time to go through all logs again, plus one of the machines (last one where w

Re: Strange issue when DataNode goes down

2015-03-20 Thread Dejan Menges
Hi, Sorry for little bit late update, but managed to narrow it little bit down. We didn't update yet, as we are using Hortonworks distribution right now, and even if we update we will get 0.98.4. However, looks that issue here was in our use case and configuration (still looking into it). Basica

Re: Strange issue when DataNode goes down

2015-03-20 Thread Nicolas Liochon
You've changed the value of hbase.zookeeper.timeout to 15 minutes? A very reasonable target is 1 minute before relocating the regions. That's the default iirc. You can push it to 20s, but then gc-stopping-the-world becomes more of an issue. 15 minutes is really a lot. The hdfs stale mode must alway

Re: Strange issue when DataNode goes down

2015-03-20 Thread Dejan Menges
With client issue was that it was retrying connecting to the same region servers even when they were reassigned. Lowering it down helped in this specific use case, but yes, we still want servers to reallocate quickly. What got me here: http://hbase.apache.org/book.html#mttr I basically set confi

Re: Strange issue when DataNode goes down

2015-03-20 Thread Nicolas Liochon
The split is done by the region servers (the master coordinates). Is there some interesting stuff in their logs? On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges wrote: > With client issue was that it was retrying connecting to the same region > servers even when they were reassigned. Lowering it d

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Hi Nicolas, Please find log attached. As I see it now more clearly, it was trying to recover HDFS WALs from node that's dead: 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 90ms (hbase.lease.recovery.timeout); continuing, but may b

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
And also, just checked - dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both true, and dfs.namenode.stale.datanode.interval is set to default 3. On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges wrote: > Hi Nicolas, > > Please find log attached. > > As I s

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
the attachments are rejected by the mailing list, can you put them on pastebin? stale is mandatory (so it's good), but the issue here is just before. The region server needs to read the file. In order to be sure that there is no data loss, it needs to "recover the lease", that means ensuring that

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
I found the issue and fixed it, and will try to explain here what exactly happened in our case, in case someone else finds this interesting too. So initially, we had (couple of times) some network and hardware issues in our datacenters. When one server would die (literally die, we had some issue with PSU

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
Thanks for the explanation. There is an issue if you modify this setting however. hbase tries to recover the lease (i.e. be sure that nobody is writing) If you change hbase.lease.recovery.timeout hbase will start the recovery (i.e. start to read) even if it's not sure that nobody's writing. That me

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Will take a look. Actually, if node is down (someone unplugged network cable, it just died, whatever) what's chance it's going to become live so write can continue? On the other side, HBase is not starting recovery trying to contact node which is not there anymore, and even elected as dead on Name

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Sorry, forgot to paste the log part: 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 90ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentio

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
If the node is actually down it's fine. But the node may not be that down (CAP theorem here); and then it's looking for trouble. HDFS, by default, declares a node as dead after 10:30. 15 minutes is an extra security. It seems your hdfs settings are different (or there is a bug...). There should be so

Re: Strange issue when DataNode goes down

2015-03-23 Thread Bryan Beaudreault
So it is safe to set hbase.lease.recovery.timeout lower if you also set heartbeat.recheck.interval lower (lowering that 10.5 min dead node timer)? Or is it recommended to not touch either of those? Reading the above with interest, thanks for digging in here guys. On Mon, Mar 23, 2015 at 10:13 AM
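The 10:30 (10.5 minute) figure mentioned in this thread can be reproduced with a little arithmetic. The sketch below assumes the Hadoop 2.x rule that a DataNode is declared dead after 2 x the heartbeat recheck interval plus 10 x the heartbeat interval, with defaults of 300000 ms (dfs.namenode.heartbeat.recheck-interval, referred to above as heartbeat.recheck.interval) and 3 s (dfs.heartbeat.interval). The function names are made up for illustration; check the property names and defaults against your own distribution.

```python
# Sketch of how the HDFS NameNode derives its dead-node timeout.
# Assumption: Hadoop 2.x formula and defaults; verify for your version.

def dead_node_timeout_ms(recheck_interval_ms: int = 300_000,
                         heartbeat_interval_s: int = 3) -> int:
    """Return 2 * recheck interval + 10 * heartbeat interval, in milliseconds."""
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


def format_min_sec(ms: int) -> str:
    """Render a millisecond duration as M:SS."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


if __name__ == "__main__":
    # Defaults: 2 * 300 s + 10 * 3 s = 630 s
    print(format_min_sec(dead_node_timeout_ms()))
    # A hypothetical lowered recheck interval of 45 s
    print(format_min_sec(dead_node_timeout_ms(recheck_interval_ms=45_000)))
```

Lowering the recheck interval shortens the dead-node timer, which is the precondition Bryan asks about before hbase.lease.recovery.timeout is touched.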

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Interesting discussion I also found, gives me some more light on what Nicolas is talking about - https://issues.apache.org/jira/browse/HDFS-3703 On Mon, Mar 23, 2015 at 3:53 PM Bryan Beaudreault wrote: > So it is safe to set hbase.lease.recovery.timeout lower if you also > set heartbeat.recheck.

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
@bryan: yes, you can change hbase.lease.recovery.timeout if you changed the hdfs settings. But this setting is really for desperate cases. The recoverLease should have succeeded before. As well, if you depend on hbase.lease.recovery.timeout, it means that you're wasting recovery time: the lease sho

Re: Strange issue when DataNode goes down

2015-03-23 Thread Bryan Beaudreault
@Nicolas, I see, thanks. I'll keep the settings at default. So really if everything else is configured properly you should never reach the lease recovery timeout in any failure scenarios. It seems that the staleness check would be the thing that prevents this, correct? I'm surprised it didn't

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
I'm surprised by this as well. Staleness was configured also, MTTR from HBase book as described, and in this specific case - when machine really dies - even when NN makes DataNode dead, HBase was trying to replay WALs from dead node until timeout reached. Still reading this HDFS-3703, trying to ge

Re: Strange issue when DataNode goes down

2015-03-23 Thread Bryan Beaudreault
@Dejan, I've had staleness configured on my cluster for a while, but haven't needed it. Looking more closely at it thanks to this thread, I noticed though that I was missing two critical parameters. Considering you just now set this up, I'll guess that you probably didn't miss this (the docs used

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
stale should not help for recoverLease: it helps for reads, but it's the step after lease recovery. It's not needed in recoverLease because the recoverLease in hdfs just sorts the datanodes by the heartbeat time, so, usually the stale datanode will be the last one of the list. On Mon, Mar 23, 201

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
Actually, double checking the final patch in HDFS-4721, the stale mode is taken into account. Bryan is right, it's worth checking the namenodes config. Especially, dfs.namenode.avoid.write.stale.datanode must be set to true on the namenode. On Mon, Mar 23, 2015 at 5:08 PM, Nicolas Liochon wrote: >

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
It was true all the time, together with dfs.namenode.avoid.read.stale.datanode. On Mon, Mar 23, 2015 at 5:29 PM Nicolas Liochon wrote: > Actually, double checking the final patch in HDFS-4721, the stale mode is > taken in account. Bryan is right, it's worth checking the namenodes config. > Espec

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
...and I also got sure that it's applied with hdfs getconf -confKey... On Mon, Mar 23, 2015 at 5:31 PM Dejan Menges wrote: > It was true all the time, together with dfs.namenode.avoid.read.stale. > datanode. > > On Mon, Mar 23, 2015 at 5:29 PM Nicolas Liochon wrote: > >> Actually, double checki
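For anyone who wants to repeat the check with hdfs getconf -confKey, here is a minimal sketch that wraps the command for the three staleness keys named in this thread. The helper names are hypothetical. As far as I know, getconf reports the configuration visible on the host where it runs, so it is worth running on the NameNode itself.

```python
import shutil
import subprocess

# Staleness-related keys discussed in this thread.
STALE_KEYS = [
    "dfs.namenode.avoid.read.stale.datanode",
    "dfs.namenode.avoid.write.stale.datanode",
    "dfs.namenode.stale.datanode.interval",
]


def getconf_command(key):
    """Build the 'hdfs getconf -confKey <key>' command line."""
    return ["hdfs", "getconf", "-confKey", key]


def read_effective_value(key):
    """Return the value printed by hdfs getconf, or None if unavailable."""
    if shutil.which("hdfs") is None:
        return None  # hdfs CLI is not on PATH on this host
    result = subprocess.run(getconf_command(key),
                            capture_output=True, text=True, check=False)
    return result.stdout.strip() if result.returncode == 0 else None


if __name__ == "__main__":
    for key in STALE_KEYS:
        print(f"{key} = {read_effective_value(key)}")
```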

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
Ok, so hopefully there is some info in the namenode & datanode logs. On Mon, Mar 23, 2015 at 5:32 PM, Dejan Menges wrote: > ...and I also got sure that it's applied with hdfs getconf -confKey... > > On Mon, Mar 23, 2015 at 5:31 PM Dejan Menges > wrote: > > > It was true all the time, together

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Will do some deeper testing on this to try to narrow it down and will update then here for sure. On Mon, Mar 23, 2015 at 5:42 PM Nicolas Liochon wrote: > Ok, so hopefully there are some info in the namenode & datanode logs. > > On Mon, Mar 23, 2015 at 5:32 PM, Dejan Menges > wrote: > > > ...and