The recovery starts at 27:40 (master log: 00:27:40,011), so before the
datanode is known as stale.
But the first attempt is cancelled, and a new one starts at 28:10, before
being cancelled again at 28:35 (that's HBASE-6738).
These two attempts should see the datanode as stale. It seems it's not the
Hi Ted, Nicholas,
Thanks for the comments. We found some issues with lease recovery and I
patched HBASE-8354 to ensure we don't see data loss. Could you please look
at HDFS-4721 and HBASE-8389?
Thanks
Varun
On Sat, Apr 20, 2013 at 10:52 AM, Varun Sharma va...@pinterest.com wrote:
The
Varun:
Thanks for trying out HBASE-8354.
Can you move the text in the Environment section of HBASE-8389 to Description?
If you have a patch for HBASE-8389, can you upload it?
Cheers
On Sun, Apr 21, 2013 at 10:38 AM, Varun Sharma va...@pinterest.com wrote:
Hi Ted, Nicholas,
Thanks for the
Hi,
I looked at it again with a fresh eye. As Varun was saying, the root cause
is the wrong order of the block locations.
The root cause of the root cause is actually simple: HBase started the
recovery while the node was not yet stale from an HDFS point of view.
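To make "wrong order" concrete: with stale reads enabled, the NameNode is
supposed to sort each block's locations so that stale datanodes come last,
and the client then tries live replicas first. A rough Java sketch of that
ordering (illustrative only, not the actual HDFS code):

  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;
  import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
  import org.apache.hadoop.util.Time;

  class StalenessOrderSketch {
    // Fresh nodes sort ahead of stale ones, so a reader tries a live
    // replica before the one whose heartbeats stopped.
    static void sortByStaleness(List<DatanodeInfo> locs,
                                final long staleIntervalMs) {
      Collections.sort(locs, new Comparator<DatanodeInfo>() {
        public int compare(DatanodeInfo a, DatanodeInfo b) {
          boolean aStale = Time.now() - a.getLastUpdate() >= staleIntervalMs;
          boolean bStale = Time.now() - b.getLastUpdate() >= staleIntervalMs;
          return (aStale == bStale) ? 0 : (aStale ? 1 : -1);
        }
      });
    }
  }

But that ordering only helps once the node actually counts as stale, which
is exactly what the timing below shows it wasn't.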
Varun mentioned this timing:
Lost Beat:
Hi Nicholas,
Regarding the following, I think this is not a recovery - the file below is
an HFile and is being accessed on a get request. On this cluster, I don't
have block locality. I see these exceptions for a while and then they are
gone, which means the stale node thing kicks in.
2013-04-19
The important thing to note is that the block for this rogue WAL is in the
UNDER_RECOVERY state. I have repeatedly asked the HDFS devs whether the
stale node check kicks in correctly for UNDER_RECOVERY blocks, but have not
gotten an answer.
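In code terms, what HDFS-4721 asks for is that the NameNode skip stale
replicas when it picks a datanode to run recovery for an UNDER_RECOVERY
block. A rough paraphrase of that intent (not the actual patch):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
  import org.apache.hadoop.util.Time;

  class RecoveryTargetSketch {
    // Only hand recovery work to replicas whose datanode heartbeated
    // recently; otherwise recovery can be assigned to the dead node.
    static List<DatanodeInfo> recoveryCandidates(List<DatanodeInfo> replicas,
                                                 long staleIntervalMs) {
      List<DatanodeInfo> fresh = new ArrayList<DatanodeInfo>();
      for (DatanodeInfo dn : replicas) {
        if (Time.now() - dn.getLastUpdate() < staleIntervalMs) {
          fresh.add(dn);
        }
      }
      // If every replica looks stale we have to try them all anyway.
      return fresh.isEmpty() ? replicas : fresh;
    }
  }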
On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma va...@pinterest.com wrote:
Hi Nicholas,
Thanks for the detailed scenario and analysis. I'm going to have a look.
I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
times out), could you please send them directly to me?
Thanks,
Nicolas
On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma va...@pinterest.com wrote:
Hi
Can you show a snippet from the DN log which mentions UNDER_RECOVERY?
Here is the criterion for stale node checking to kick in (from
https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
):
+ * Check if the datanode is in stale state. Here if
+ * the namenode
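The quoted javadoc is cut off above; as I understand HDFS-3703, the check
itself is just a heartbeat-age comparison. A minimal sketch of that
predicate (my reading of the patch, not a verbatim quote):

  import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
  import org.apache.hadoop.util.Time;

  class StaleCheckSketch {
    // A datanode counts as stale once its last heartbeat is older than
    // the configured dfs.namenode.stale.datanode.interval.
    static boolean isStale(DatanodeInfo node, long staleIntervalMs) {
      return Time.now() - node.getLastUpdate() >= staleIntervalMs;
    }
  }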
Here is the snippet:
2013-04-19 00:27:38,337 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Recover RBW replica
BP-696828882-10.168.7.226-1364886167971:blk_40107897639761277_174072
2013-04-19 00:27:38,337 INFO
Hi Ted,
I had a long offline discussion with Nicholas on this. Looks like the last
block, which was still being written to, took an enormous time to recover.
Here's what happened.
a) Master creates split tasks and region servers process them
b) Region server tries to recover lease for each WAL log - most
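As a sketch of what the lease recovery in (b) amounts to, here is an
illustration of the DistributedFileSystem.recoverLease() polling loop (not
the actual 0.94.3 HBase code):

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  class LeaseRecoverySketch {
    // recoverLease() asks the NameNode to start recovery on the file's
    // last block and returns true once the file is closed; while the
    // replica on a dead-but-not-stale node is still preferred, this
    // loop can spin for a very long time.
    static void recoverWal(DistributedFileSystem dfs, Path wal)
        throws IOException, InterruptedException {
      while (!dfs.recoverLease(wal)) {
        Thread.sleep(1000);  // poll interval is illustrative
      }
    }
  }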
This is 0.94.3 HBase...
On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma va...@pinterest.com wrote:
Hi Ted,
I had a long offline discussion with Nicholas on this. Looks like the last
block, which was still being written to, took an enormous time to recover.
Here's what happened.
a) Master
I think the issue would be more appropriate for the hdfs-dev@ mailing list.
Putting user@hbase as Bcc.
---------- Forwarded message ----------
From: Varun Sharma va...@pinterest.com
Date: Fri, Apr 19, 2013 at 1:10 PM
Subject: Re: Slow region server recoveries
To: user@hbase.apache.org
Hi,
We are facing problems with really slow HBase region server recoveries of
~20 minutes. Version is HBase 0.94.3 compiled with hadoop.profile=2.0.
Hadoop version is CDH 4.2 with HDFS-3703 and HDFS-3912 patched and stale
node timeouts configured correctly. Time for dead node detection is still
10
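For concreteness, these are the stale-node settings referred to above,
shown through the Configuration API (the keys come from HDFS-3703 and
HDFS-3912; the values are illustrative, not this cluster's actual ones):

  import org.apache.hadoop.conf.Configuration;

  class StaleConfSketch {
    static Configuration staleNodeConf() {
      Configuration conf = new Configuration();
      // HDFS-3703: order stale datanodes last when serving read locations
      conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
      // HDFS-3912: avoid stale datanodes for writes as well
      conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
      // Heartbeat age after which a node counts as stale (example: 30s)
      conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);
      return conf;
    }
  }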
Copying CDH Users mailing list.
On Thu, Apr 18, 2013 at 6:37 PM, Varun Sharma va...@pinterest.com wrote:
I am wondering if DFSClient caches the data node for a long period of time?
Varun
On Thu, Apr 18, 2013 at 6:01 PM, Varun Sharma va...@pinterest.com wrote:
Hi,
We are facing