Re: Frequent Region Server Failures with namenode.LeaseExpiredException

2018-02-08 Thread Ted Yu
Do you use Phoenix functionality?

If not, you can try disabling the Phoenix side altogether (removing Phoenix
coprocessors).
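
If you go that route, the coprocessor attributes can be inspected and unset
per table from the HBase shell, roughly like this (the table name is just an
example, and the coprocessor slot number varies, so check the describe
output first):

  describe 'EXAMPLE_TABLE'
  alter 'EXAMPLE_TABLE', METHOD => 'table_att_unset', NAME => 'coprocessor$1'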

HDP 2.3.4 is really old - please upgrade to 2.6.3.

You should consider asking on the vendor's community forum.

Cheers

On Thu, Feb 8, 2018 at 3:06 PM, anil gupta  wrote:

> Hi Folks,
>
> We are running a 60-node MapReduce/HBase HDP cluster (HBase 1.1.2, HDP
> 2.3.4.0-3485). Phoenix is enabled on this cluster.
> Each slave has ~120 GB RAM; each RS has a 20 GB heap, 12 disks of 2 TB
> each, and 24 cores. This cluster had been running OK for the last 2 years,
> but after a few recent disk failures (we unmounted those disks) it hasn't
> been running well. I have run hbck and hdfs fsck; both report no
> inconsistencies.
>
> Some of our RegionServers keep aborting with the following errors:
> 1 ==>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /apps/hbase/data/data/default/DE.TABLE_NAME/35aa0de96715c33e1f0664aa4d9292ba/recovered.edits/03948161445.temp
> (inode 420864666): File does not exist. [Lease.  Holder:
> DFSClient_NONMAPREDUCE_-64710857_1, pendingcreates: 1]
>
> 2 ==> 2018-02-08 03:09:51,653 ERROR [regionserver/hdpslave26.bigdataprod1.com/1.16.6.56:16020] regionserver.HRegionServer:
> Shutdown / close of WAL failed:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
> /apps/hbase/data/oldWALs/hdpslave26.bigdataprod1.com%2C16020%2C1518027416930.default.1518085177903
> (inode 420996935): File is not open for writing. Holder
> DFSClient_NONMAPREDUCE_649736540_1 does not have any open files.
>
> All of the LeaseExpiredExceptions are happening for the recovered.edits and
> oldWALs files.
>
> HDFS is around 48% full, and most of the DNs have 30-40% space left. NN
> heap is at 60% usage. I have tried googling around but can't find anything
> concrete to fix this problem. 15 of our 60 nodes have gone down in the
> last 2 days.
> Can someone please point out what might be causing these RegionServer
> failures?
>
>
> --
> Thanks & Regards,
> Anil Gupta
>


Frequent Region Server Failures with namenode.LeaseExpiredException

2018-02-08 Thread anil gupta
Hi Folks,

We are running a 60-node MapReduce/HBase HDP cluster (HBase 1.1.2, HDP
2.3.4.0-3485). Phoenix is enabled on this cluster.
Each slave has ~120 GB RAM; each RS has a 20 GB heap, 12 disks of 2 TB
each, and 24 cores. This cluster had been running OK for the last 2 years,
but after a few recent disk failures (we unmounted those disks) it hasn't
been running well. I have run hbck and hdfs fsck; both report no
inconsistencies.

Some of our RegionServers keep aborting with the following errors:
1 ==>
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/apps/hbase/data/data/default/DE.TABLE_NAME/35aa0de96715c33e1f0664aa4d9292ba/recovered.edits/03948161445.temp
(inode 420864666): File does not exist. [Lease.  Holder:
DFSClient_NONMAPREDUCE_-64710857_1, pendingcreates: 1]

2 ==> 2018-02-08 03:09:51,653 ERROR [regionserver/hdpslave26.bigdataprod1.com/1.16.6.56:16020] regionserver.HRegionServer:
Shutdown / close of WAL failed:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/apps/hbase/data/oldWALs/hdpslave26.bigdataprod1.com%2C16020%2C1518027416930.default.1518085177903
(inode 420996935): File is not open for writing. Holder
DFSClient_NONMAPREDUCE_649736540_1 does not have any open files.

All of the LeaseExpiredExceptions are happening for the recovered.edits and
oldWALs files.
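
In case it's relevant: I know the lease state on one of these files can be
checked, and recovery forced, with something like the following (the path is
copied from the log above; the retry count is arbitrary), but I'm not sure
that addresses the root cause:

  hdfs fsck /apps/hbase/data/oldWALs/hdpslave26.bigdataprod1.com%2C16020%2C1518027416930.default.1518085177903 -files -blocks -openforwrite
  hdfs debug recoverLease -path /apps/hbase/data/oldWALs/hdpslave26.bigdataprod1.com%2C16020%2C1518027416930.default.1518085177903 -retries 3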

HDFS is around 48% full, and most of the DNs have 30-40% space left. NN
heap is at 60% usage. I have tried googling around but can't find anything
concrete to fix this problem. 15 of our 60 nodes have gone down in the
last 2 days.
Can someone please point out what might be causing these RegionServer
failures?


-- 
Thanks & Regards,
Anil Gupta


Inconsistent rows exported/counted when looking at a fixed, unchanged past time frame.

2018-02-08 Thread Andrew Kettmann
First the version details:

Running HBase/YARN/HDFS using Cloudera Manager 5.12.1.
HBase: Version 1.2.0-cdh5.8.0
HDFS/YARN: Hadoop 2.6.0-cdh5.8.0
hbck and hdfs fsck both report healthy

15 nodes, recently sized down from 30 (requirements for other services,
such as Solr, were reduced).


The simplest example of the inconsistency is rowcounter: if I run the same
MapReduce job twice in a row, I get different counts:

hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter 
-Dmapreduce.map.speculative=false TABLENAME --starttime=148590720 
--endtime=148605840

Looking at 
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters:
Run 1: 4876683
Run 2: 4866351

Similarly with exports over the same date/time range: consecutive runs of
the export get different results:
hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.map.tasks.speculative.execution=false \
-Dmapred.reduce.tasks.speculative.execution=false \
TABLENAME \
HDFSPATH 1 148590720 148605840

From Map Input/output records:
Run 1: 4296778
Run 2: 4297307

None of the results show anything for spilled records, and there are no
failed maps. Sometimes the row count increases, sometimes it decreases. We
aren't using any row filter queries; we just want to export chunks of the
data for a specific time range. The table is actively being read and
written, but I am asking about a date range in early 2017 in this case, so
I would have thought that had no impact. Another point is that the rowcount
job and the export return wildly different numbers. There should be no
older versions of rows involved, as we are set to keep only the newest, and
I can confirm that there are rows that are consistently missing from the
exports. The table definition is below.

hbase(main):001:0> describe 'TABLENAME'
Table TABLENAME is ENABLED
TABLENAME
COLUMN FAMILIES DESCRIPTION
{NAME => 'text', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1',
MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.2800 seconds
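
For what it's worth, I believe a raw scan over the same window (same table
name and start/end timestamps as in the commands above) would surface any
delete markers or shadowed row versions that a normal scan hides, e.g. from
the HBase shell:

  scan 'TABLENAME', {TIMERANGE => [148590720, 148605840], RAW => true, VERSIONS => 10, LIMIT => 10}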

Any advice/suggestions would be greatly appreciated. Are some of my
assumptions about count/export wrong, i.e., that results should be
consistent given consistent date/time ranges?


Andrew Kettmann
Platform Services Group



Re: Anonymous survey: Apache HBase 1.x Usage

2018-02-08 Thread Yu Li
Great to know. There seems to be quite some adoption of 1.3, enough to
support moving our stable pointer.

Mind sharing the number of votes, boss? Please forgive my ignorance if it's
shown in the inline image, since I cannot read it (the picture failed to open).

Best Regards,
Yu

On 8 February 2018 at 09:07, Andrew Purtell  wrote:

> Responses to the survey so far. I think they confirm our expectations.
> Multiple choices were allowed, so the percentages will not add up to 100%.
>
> 1.0: 8%
> 1.1: 21%
> 1.2: 47%
> 1.3: 24%
> 1.4: 8%
> 1.5: 5%
>
> [image: Inline image 1]
>
> On Fri, Feb 2, 2018 at 3:40 PM, Andrew Purtell 
> wrote:
>
>> Please take this anonymous survey to let us know what version of Apache
>> HBase 1.x you are using in production now or are planning to use in
>> production in the next year or so.
>>
>> Multiple choices are allowed.
>>
>> There is no "I'm not using 1.x" choice. Consider upgrading! (smile)
>>
>> https://www.surveymonkey.com/r/8WQ8QY6
>>
>