Re: manual merge

2015-03-23 Thread Abe Weinograd
Cool Michael, thanks for the heads up. I will follow that JIRA. We are pre-splitting based on how we know the data is distributed across those 20 regions. We stayed with sequential keys so that the consumers could easily access the data (the reason you highlighted above and in the JIRA). Thanks
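
For illustration, a minimal sketch (HBase 1.x client API) of pre-splitting a table at creation time in the spirit of what is described above; the table name, column family, and split points are hypothetical and would come from the known key distribution:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events")); // hypothetical table
      desc.addFamily(new HColumnDescriptor("d"));                                // hypothetical family
      // 19 split points produce 20 regions covering the expected key range.
      byte[][] splits = new byte[19][];
      for (int i = 0; i < splits.length; i++) {
        // Hypothetical: evenly spaced numeric prefixes; adjust to the real key space.
        splits[i] = Bytes.toBytes(String.format("%02d", (i + 1) * 5));
      }
      admin.createTable(desc, splits);
    }
  }
}
```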

Re: HBase read/write statistics per table basis

2015-03-23 Thread Otis Gospodnetic
Hi, I *think* these metrics are not available on a per-table basis; otherwise we'd have them in SPM for HBase, and I know we don't. If these are available somewhere, I'm all eyeballs! Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Ela

HBase read/write statistics per table basis

2015-03-23 Thread Igotux
I’m planning to develop a dashboard which shows reads/writes on a per-table basis. Just wondering, do we already have any application which supports giving that information? If not, what should I explore to get that information? I know the region server provides total reads/writes. But that is n
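
For anyone exploring this, one possible approach (a sketch, not an existing tool) is to aggregate the per-region request counters that the region servers already report into per-table totals via the HBase 1.x client API. These counters are cumulative since region open, so a dashboard would sample them periodically and plot the deltas:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class PerTableRequestCounts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      Map<String, long[]> perTable = new HashMap<>();  // table -> {reads, writes}
      ClusterStatus status = admin.getClusterStatus();
      for (ServerName server : status.getServers()) {
        ServerLoad load = status.getLoad(server);
        for (RegionLoad region : load.getRegionsLoad().values()) {
          // Region names look like "table,startkey,timestamp.encodedname.";
          // everything before the first comma is the (namespace-qualified) table name.
          String regionName = region.getNameAsString();
          String table = regionName.substring(0, regionName.indexOf(','));
          long[] counts = perTable.computeIfAbsent(table, t -> new long[2]);
          counts[0] += region.getReadRequestsCount();
          counts[1] += region.getWriteRequestsCount();
        }
      }
      perTable.forEach((table, counts) ->
          System.out.printf("%s reads=%d writes=%d%n", table, counts[0], counts[1]));
    }
  }
}
```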

Re: manual merge

2015-03-23 Thread Michael Segel
Well, with sequential data you end up with your data always being added to the left of a region. So you’ll end up with your regions only 1/2 full after a split, and then static. When you say you’re creating 20 new regions… is that from the volume of data, or are you still ‘pre-splitting’ the tab

Re: Poll: HBase usage by HBase version

2015-03-23 Thread Otis Gospodnetic
Hi, I promised the results. Here they are, along with a short commentary: http://blog.sematext.com/2015/03/23/poll-results-hbase-version-distribution/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Thu, Mar

Re: manual merge

2015-03-23 Thread Abe Weinograd
Hi Michael/Nick, We have a table with a sequential column (I know, very bad :) ) and we are constantly inserting to the end. We pre-split the range where we are inserting into 20 regions. When we started with 1, the balancer would pick up on that and would balance the load as we started to insert. Each l

Re: manual merge

2015-03-23 Thread Nick Dimiduk
Hi Abe, I believe at this time we only have the APIs for merging two regions in a single merge operation. Adding the ability to merge all regions within a row key range was a feature discussed recently. A compaction should not be necessary following a merge; however, it is likely that the number of
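
A rough sketch of what pairwise merging through that API could look like (HBase 1.x Admin; the table name is hypothetical). Merge requests are handled asynchronously by the master, so a real tool would verify each merge completed before issuing the next pass:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class PairwiseMerge {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      List<HRegionInfo> regions = admin.getTableRegions(TableName.valueOf("events"));
      for (int i = 0; i + 1 < regions.size(); i += 2) {
        // forcible=false: only merge regions that are actually adjacent in the key space.
        admin.mergeRegions(regions.get(i).getEncodedNameAsBytes(),
                           regions.get(i + 1).getEncodedNameAsBytes(),
                           false);
      }
    }
  }
}
```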

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Will do some deeper testing on this to try to narrow it down and will then update here for sure. On Mon, Mar 23, 2015 at 5:42 PM Nicolas Liochon wrote: > Ok, so hopefully there is some info in the namenode & datanode logs. > > On Mon, Mar 23, 2015 at 5:32 PM, Dejan Menges > wrote: > > > ...and

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
Ok, so hopefully there is some info in the namenode & datanode logs. On Mon, Mar 23, 2015 at 5:32 PM, Dejan Menges wrote: > ...and I also made sure that it's applied with hdfs getconf -confKey... > > On Mon, Mar 23, 2015 at 5:31 PM Dejan Menges > wrote: > > > It was true all the time, together

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
...and I also made sure that it's applied with hdfs getconf -confKey... On Mon, Mar 23, 2015 at 5:31 PM Dejan Menges wrote: > It was true all the time, together with dfs.namenode.avoid.read.stale. > datanode. > > On Mon, Mar 23, 2015 at 5:29 PM Nicolas Liochon wrote: > >> Actually, double checki

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
It was true all the time, together with dfs.namenode.avoid.read.stale.datanode. On Mon, Mar 23, 2015 at 5:29 PM Nicolas Liochon wrote: > Actually, double checking the final patch in HDFS-4721, the stale mode is > taken into account. Bryan is right, it's worth checking the namenode's config. > Espec

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
Actually, double checking the final patch in HDFS-4721, the stale mode is taken into account. Bryan is right, it's worth checking the namenode's config. Especially, dfs.namenode.avoid.write.stale.datanode must be set to true on the namenode. On Mon, Mar 23, 2015 at 5:08 PM, Nicolas Liochon wrote: >
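
For reference, a small sketch that reads the three stale-DataNode settings discussed in this thread from the client-side HDFS configuration (similar in spirit to the hdfs getconf -confKey checks mentioned elsewhere in the thread). The values that actually matter are the ones the NameNode process was started with:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class StaleSettingsCheck {
  public static void main(String[] args) {
    // HdfsConfiguration loads hdfs-default.xml plus the hdfs-site.xml on the classpath.
    Configuration conf = new HdfsConfiguration();
    System.out.println("dfs.namenode.avoid.read.stale.datanode  = "
        + conf.getBoolean("dfs.namenode.avoid.read.stale.datanode", false));
    System.out.println("dfs.namenode.avoid.write.stale.datanode = "
        + conf.getBoolean("dfs.namenode.avoid.write.stale.datanode", false));
    System.out.println("dfs.namenode.stale.datanode.interval    = "
        + conf.getLong("dfs.namenode.stale.datanode.interval", 30000L) + " ms");
  }
}
```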

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
Stale should not help for recoverLease: it helps for reads, but that's the step after lease recovery. It's not needed in recoverLease because recoverLease in hdfs just sorts the datanodes by heartbeat time, so usually the stale datanode will be the last one in the list. On Mon, Mar 23, 201

Re: Strange issue when DataNode goes down

2015-03-23 Thread Bryan Beaudreault
@Dejan, I've had staleness configured on my cluster for a while, but haven't needed it. Looking more closely at it thanks to this thread, I noticed, though, that I was missing two critical parameters. Considering you just now set this up, I'll guess that you probably didn't miss this (the docs used

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
I'm surprised by this as well. Staleness was configured too, along with MTTR as described in the HBase book, and in this specific case - when the machine really dies - even when the NN marks the DataNode dead, HBase was trying to replay WALs from the dead node until the timeout was reached. Still reading HDFS-3703, trying to ge

Re: Strange issue when DataNode goes down

2015-03-23 Thread Bryan Beaudreault
@Nicolas, I see, thanks. I'll keep the settings at default. So really, if everything else is configured properly, you should never reach the lease recovery timeout in any failure scenario. It seems that the staleness check would be the thing that prevents this, correct? I'm surprised it didn't

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
@bryan: yes, you can change hbase.lease.recovery.timeout if you changed the hdfs settings. But this setting is really for desperate cases. The recoverLease should have succeeded before. As well, if you depend on hbase.lease.recovery.timeout, it means that you're wasting recovery time: the lease sho

Re: Recovering from corrupt blocks in HFile

2015-03-23 Thread Michael Segel
Ok, I’m still a bit slow this morning … coffee is not helping…. ;-) Are we talking about the whole HFile or just a single block in the HFile? While it may be too late for Mike Dillon, here’s the question that the HBase Devs are going to have to think about… How and when do you check on the correctness of the hdf

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
I also found an interesting discussion that sheds some more light on what Nicolas is talking about - https://issues.apache.org/jira/browse/HDFS-3703 On Mon, Mar 23, 2015 at 3:53 PM Bryan Beaudreault wrote: > So it is safe to set hbase.lease.recovery.timeout lower if you also > set heartbeat.recheck.

Re: Strange issue when DataNode goes down

2015-03-23 Thread Bryan Beaudreault
So it is safe to set hbase.lease.recovery.timeout lower if you also set heartbeat.recheck.interval lower (lowering that 10.5 min dead node timer)? Or is it recommended to not touch either of those? Reading the above with interest, thanks for digging in here guys. On Mon, Mar 23, 2015 at 10:13 AM
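
For context on where the 10.5-minute figure comes from: the NameNode's dead-node timeout is commonly described as 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval (the heartbeat.recheck.interval mentioned above refers to that first key). A quick check with the stock defaults:

```java
public class DeadNodeTimeout {
  public static void main(String[] args) {
    long recheckMs = 300_000L;   // dfs.namenode.heartbeat.recheck-interval default (5 min)
    long heartbeatSec = 3L;      // dfs.heartbeat.interval default (3 s)
    long timeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
    // Prints 630000 ms = 10.5 min; lowering the recheck interval shrinks this timer.
    System.out.println(timeoutMs + " ms = " + (timeoutMs / 1000.0 / 60.0) + " min");
  }
}
```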

Re: introducing nodes w/ more storage

2015-03-23 Thread Michael Segel
@lars, How does the HDFS load balancer impact the load balancing of HBase? Of course there are two loads… one is the number of regions managed by a region server - that’s HBase’s load, right? And then there’s the data distribution of HBase files, which is really managed by the HDFS load balancer, ri

Re: manual merge

2015-03-23 Thread Michael Segel
Hi, I’m trying to understand your problem. You pre-split your regions to help with load balancing. Ok. So how did you calculate the number of regions to pre-split? You said that the number of regions has grown. How were the initial regions sized? Did you increase the size of new

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
If the node is actually down, it's fine. But the node may not be that down (CAP theorem here), and then it's looking for trouble. HDFS, by default, declares a node as dead after 10:30. 15 minutes is an extra security margin. It seems your hdfs settings are different (or there is a bug...). There should be so

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Sorry, forgot to paste the log part: 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 90ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentio

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Will take a look. Actually, if a node is down (someone unplugged the network cable, it just died, whatever), what's the chance it's going to become live again so the write can continue? On the other side, HBase is not starting recovery, trying to contact a node which is not there anymore and even elected as dead on Name

manual merge

2015-03-23 Thread Abe Weinograd
Hello, We bulk load our table and, during that process, pre-split regions to optimize load across servers. The number of regions builds up and we are manually merging them back. Any merge of two regions causes a compaction, which slows down our merge process. We are merging two regions at a ti

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
Thanks for the explanation. There is an issue if you modify this setting, however. hbase tries to recover the lease (i.e. be sure that nobody is writing). If you change hbase.lease.recovery.timeout, hbase will start the recovery (i.e. start to read) even if it's not sure that nobody's writing. That me
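
To make the knob being discussed concrete, a small sketch that just reads its current value; as the reply above warns, lowering it only changes when HBase gives up waiting, not whether the lease was actually recovered:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LeaseRecoveryTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Default is 900000 ms (15 min); treat any lower value as a last-resort setting,
    // since expiring before real lease recovery is where the DATALOSS warning comes from.
    long current = conf.getLong("hbase.lease.recovery.timeout", 900_000L);
    System.out.println("hbase.lease.recovery.timeout = " + current + " ms");
  }
}
```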

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
I found the issue and fixed it, and will try to explain here what exactly happened in our case, in case someone else finds this interesting too. So initially, we had (a couple of times) some network and hardware issues in our datacenters. When one server would die (literally die, we had some issue with PSU

Re: Strange issue when DataNode goes down

2015-03-23 Thread Nicolas Liochon
The attachments are rejected by the mailing list - can you put them on pastebin? Stale is mandatory (so it's good), but the issue here is just before that. The region server needs to read the file. In order to be sure that there is no data loss, it needs to "recover the lease", which means ensuring that
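
A simplified sketch of that lease-recovery step (roughly what HBase's FSHDFSUtils does in spirit, not the actual implementation): keep asking the NameNode to recover the lease on the WAL until it reports the file closed, or until a give-up timeout passes. The method shape and retry pause here are illustrative, and it assumes fs.defaultFS points at HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverLeaseSketch {
  public static boolean recoverLease(Configuration conf, Path wal, long timeoutMs)
      throws Exception {
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      // Returns true once the file is closed and no other writer holds the lease.
      if (dfs.recoverLease(wal)) {
        return true;
      }
      Thread.sleep(4000);  // pause before asking again; HBase uses configurable pauses
    }
    return false;  // caller decides whether to continue anyway (the DATALOSS warning above)
  }
}
```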

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
And also, just checked - dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both true, and dfs.namenode.stale.datanode.interval is set to default 3. On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges wrote: > Hi Nicolas, > > Please find log attached. > > As I s

Re: Strange issue when DataNode goes down

2015-03-23 Thread Dejan Menges
Hi Nicolas, Please find the log attached. As I see it more clearly now, it was trying to recover HDFS WALs from a node that's dead: 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 90ms (hbase.lease.recovery.timeout); continuing, but may b