Re: Slow region server recoveries

2013-04-22 Thread Nicolas Liochon
The recovery starts at 27:40 (master log: 00:27:40,011), so before the datanode is known as stale. But the first attempt is cancelled, and a new one starts at 28:10, before being cancelled again at 28:35 (that's HBASE-6738). These two attempts should see the datanode as stale. It seems it's not the

Re: Slow region server recoveries

2013-04-21 Thread Varun Sharma
Hi Ted, Nicolas, Thanks for the comments. We found some issues with lease recovery and I patched HBASE-8354 to ensure we don't see data loss. Could you please look at HDFS-4721 and HBASE-8389? Thanks Varun On Sat, Apr 20, 2013 at 10:52 AM, Varun Sharma va...@pinterest.com wrote: The

Re: Slow region server recoveries

2013-04-21 Thread Ted Yu
Varun: Thanks for trying out HBASE-8354. Can you move the text in the Environment section of HBASE-8389 to Description? If you have a patch for HBASE-8389, can you upload it? Cheers On Sun, Apr 21, 2013 at 10:38 AM, Varun Sharma va...@pinterest.com wrote: Hi Ted, Nicolas, Thanks for the

Re: Slow region server recoveries

2013-04-20 Thread Nicolas Liochon
Hi, I looked at it again with fresh eyes. As Varun was saying, the root cause is the wrong order of the block locations. The root cause of the root cause is actually simple: HBase started the recovery while the node was not yet stale from an HDFS point of view. Varun mentioned this timing: Lost Beat:

Re: Slow region server recoveries

2013-04-20 Thread Varun Sharma
Hi Nicolas, Regarding the following, I think this is not a recovery - the file below is an HFile and is being accessed on a get request. On this cluster, I don't have block locality. I see these exceptions for a while and then they are gone, which means the stale node thing kicks in. 2013-04-19

Re: Slow region server recoveries

2013-04-20 Thread Varun Sharma
The important thing to note is that the block for this rogue WAL is in UNDER_RECOVERY state. I have repeatedly asked HDFS dev whether the stale node thing kicks in correctly for UNDER_RECOVERY blocks, but without success. On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma va...@pinterest.com wrote: Hi Nicolas,

Re: Slow region server recoveries

2013-04-19 Thread Nicolas Liochon
Thanks for the detailed scenario and analysis. I'm going to have a look. I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com times out), could you please send them directly to me? Thanks, Nicolas On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma va...@pinterest.com wrote: Hi

Re: Slow region server recoveries

2013-04-19 Thread Ted Yu
Can you show the snippet from the DN log that mentions UNDER_RECOVERY? Here are the criteria for stale node checking to kick in (from https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch ): + * Check if the datanode is in stale state. Here if + * the namenode

Re: Slow region server recoveries

2013-04-19 Thread Varun Sharma
Here is the snippet: 2013-04-19 00:27:38,337 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-696828882-10.168.7.226-1364886167971:blk_40107897639761277_174072 2013-04-19 00:27:38,337 INFO

Re: Slow region server recoveries

2013-04-19 Thread Varun Sharma
Hi Ted, I had a long offline discussion with Nicolas on this. Looks like the last block, which was still being written to, took an enormous time to recover. Here's what happened. a) Master split tasks and region servers process them b) Region server tries to recover lease for each WAL log - most

Re: Slow region server recoveries

2013-04-19 Thread Varun Sharma
This is 0.94.3 hbase... On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma va...@pinterest.com wrote: Hi Ted, I had a long offline discussion with Nicolas on this. Looks like the last block, which was still being written to, took an enormous time to recover. Here's what happened. a) Master

Slow region server recoveries due to lease recovery going to stale data node

2013-04-19 Thread Ted Yu
I think the issue would be more appropriate for the hdfs-dev@ mailing list. Putting user@hbase as Bcc. -- Forwarded message -- From: Varun Sharma va...@pinterest.com Date: Fri, Apr 19, 2013 at 1:10 PM Subject: Re: Slow region server recoveries To: user@hbase.apache.org

Slow region server recoveries

2013-04-18 Thread Varun Sharma
Hi, We are facing problems with really slow HBase region server recoveries, about 20 minutes. Version is hbase 0.94.3 compiled with hadoop.profile=2.0. Hadoop version is CDH 4.2 with HDFS-3703 and HDFS-3912 patched and stale node timeouts configured correctly. Time for dead node detection is still 10

Re: Slow region server recoveries

2013-04-18 Thread Ted Yu
Copying the CDH Users mailing list. On Thu, Apr 18, 2013 at 6:37 PM, Varun Sharma va...@pinterest.com wrote: I am wondering if DFSClient caches the data node for a long period of time? Varun On Thu, Apr 18, 2013 at 6:01 PM, Varun Sharma va...@pinterest.com wrote: Hi, We are facing