Guys,
thank you very much for the help.
Yesterday I spent 14 hours trying to tune the whole cluster.
The cluster is not ready yet and needs a lot of tuning, but at least it is
working.
My first big problem was namenode + datanode GC. They were not using
CMS and thus GC times kept growing: it started at 0.01 ms and within
20 minutes was taking 150 secs.
After enabling the CMS GC this time is much smaller, with a maximum of 70 secs,
which is VERY HIGH, but for now does not stop HBase.
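For reference, a rough sketch of the kind of CMS flags I mean, assuming they
go into HADOOP_NAMENODE_OPTS / HADOOP_DATANODE_OPTS in hadoop-env.sh (the
occupancy value below is only an illustration, not necessarily what I will keep):

  # hadoop-env.sh (sketch; values are illustrative)
  export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled"
  export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS \
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC"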
With this issue solved, it was clear that the RS was hitting a long GC pause,
taking up to 220 secs. Zookeeper expired the RS and it shut down.
I tried a lot of different flag configurations (MORE than 20) and could not
get small GCs. Eventually a pause would exceed 150 secs (the zookeeper timeout)
and the RS would shut down.
Finally I tried a config that so far (12 hours) is working with a maximum GC
time of 90 secs. That is of course still a terrible problem, since HBase is a
database, but at least the cluster is stable while I tune it a little more.
In my opinion, my biggest problem is having a few "monster" machines in the
cluster instead of a bunch of commodity machines. I don't know if many
companies use this kind of machine inside a hadoop cluster, but a quick search
on Google did not turn up much about GC tuning for big heaps.
I guess my next step will be to research big-heap GC tuning.
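Before tuning further I will probably also turn on full GC logging on the RS,
so I can see exactly where the time goes; something along these lines (the log
path is just an example):

  # hbase-env.sh (sketch)
  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/hbase/rs-gc.log"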
Back to some questions ;)
> You have ganglia or tsdb running?
I use zabbix for now, and no, there is nothing going on when the big
pause happens.
> When you see the big pause above, can you see anything going on on the
> machine? (swap, iowait, concurrent fat mapreduce job?)
> What are you doing when the long GC happens? Reads or writes? If reading,
> what is the block cache size?
The CPU for the RS process goes to 100% and the logs "pause" until it
gets out.
Ex: [NewPar
IO and swap are normal. There is no MR job running, just normal database load,
which is very low. I am probably doing reads AND writes to the database, with
the default block cache size.
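By "default block cache size" I mean I have not touched hfile.block.cache.size
in hbase-site.xml; if I wanted to change it, it would be something like the
following (the 0.25 fraction is just the usual default, quoted from memory):

  <property>
    <name>hfile.block.cache.size</name>
    <value>0.25</value>
  </property>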
One problem at the moment might be the large number of regions (1252), since
I am using only one RS in order to track the problem.
The links and ideas were very helpful. Thank you very much, guys.
I will post my future findings as I work toward a solution ;)
If you have more ideas or info (links, flag suggestions, etc.), please
post them :)
Regards,
Pablo
On 03/10/2013 11:24 PM, Andrew Purtell wrote:
Be careful with GC tuning; throwing changes at an application without
analysis of what is going on with the heap is shooting in the dark. One
particularly good treatment of the subject is here:
http://java.dzone.com/articles/how-tame-java-gc-pauses
If you have made custom changes to blockcache or memstore configurations,
back them out until you're sure everything else is ok.
Watch carefully for swapping. Set the vm.swappiness sysctl to 0. Monitor
for spikes in page scanning or any swap activity. Nothing brings on
"Juliette" pauses better than a JVM partially swapped out. The Java GC
starts collection by examining the oldest pages, and those are the first
pages the OS swaps out...
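A minimal sketch of how that sysctl is typically set (assuming a
sysctl.conf-based setup):

  # apply immediately
  sysctl -w vm.swappiness=0
  # persist across reboots
  echo "vm.swappiness = 0" >> /etc/sysctl.conf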
On Mon, Mar 11, 2013 at 10:13 AM, Azuryy Yu <azury...@gmail.com> wrote:
Hi Pablo,
It's terrible to have such a long minor GC. I don't think there is swapping,
judging from your vmstat log, but I suggest you:
1) Add the following JVM options:
-XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods
2) -Xmn is too small; your total Mem is 74GB, so just set -Xmn2g.
3) What are you doing when the long GC happens? Reads or writes? If reading,
what is the block cache size?
On Mon, Mar 11, 2013 at 6:41 AM, Stack <st...@duboce.net> wrote:
You could increase your zookeeper session timeout to 5 minutes while you
are figuring out why these long pauses happen.
http://hbase.apache.org/book.html#zookeeper.session.timeout
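In hbase-site.xml that would look roughly like this (a sketch; note that if
HBase does not manage the ZK quorum, the ZK server's own maxSessionTimeout
must also allow a value this large):

  <property>
    <name>zookeeper.session.timeout</name>
    <value>300000</value>
  </property>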
Above, there is an outage for almost 5 minutes:
We slept 225100ms instead of 3000ms, this is likely due to a long garbage
collecting pause...
You have ganglia or tsdb running? When you see the big pause above, can
you see anything going on on the machine? (swap, iowait, concurrent fat
mapreduce job?)
St.Ack
On Sun, Mar 10, 2013 at 3:29 PM, Pablo Musa <pa...@psafe.com> wrote:
Hi Sreepathi,
they say in the book (or on the site) that we could try it to see if it is
really a timeout error or there is something more. But it is not recommended
for production environments.
I could give it a try, if five minutes would assure us that the problem is
the GC or elsewhere!! Anyway, I think it is hard to believe a GC is taking
2:30 minutes.
Regards,
Pablo
On 03/10/2013 04:06 PM, Sreepathi wrote:
Hi Stack/Ted/Pablo,
Should we increase the hbase.rpc.timeout property to 5 minutes (300000 ms)?
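That is, roughly something like this in hbase-site.xml (sketch):

  <property>
    <name>hbase.rpc.timeout</name>
    <value>300000</value>
  </property>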
Regards,
- Sreepathi
On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa <pa...@psafe.com> wrote:
That combo should be fine.
Great!!
If JVM is full GC'ing, the application is stopped.
The below does not look like a full GC but that is a long pause in system
time, enough to kill your zk session.
Exactly. This pause is really making zk expire the RS, which shuts down
(logs at the end of the email).
But the question is: what is causing this pause??!!
You swapping?
I don't think so (stats below).
Hardware is good?
Yes, it is a 16-processor machine with 74GB of RAM and plenty of disk space.
Below are some metrics I have heard about. Hope it helps.
** I am having some problems with the datanodes [1], which are having trouble
writing. I really think the issues are related, but I cannot solve either of
them :(
Thanks again,
Pablo
[1] http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCrqfv...@mail.gmail.com%3E
top - 15:38:04 up 297 days, 21:03, 2 users, load average: 4.34, 2.55, 1.28
Tasks: 528 total, 1 running, 527 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.2%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 74187256k total, 29493992k used, 44693264k free, 5836576k buffers
Swap: 51609592k total, 128312k used, 51481280k free, 1353400k cached
]$ vmstat -w
procs -----------------memory---------------- ---swap-- -----io---- --system-- -----cpu-------
 r  b   swpd      free     buff    cache   si   so   bi   bo   in   cs  us sy id wa st
 2  0 128312  32416928  5838288  5043560    0    0  202   53    0    0   2  1 96  1  0
]$ sar
02:20:01 PM  all  26.18  0.00  2.90  0.63  0.00  70.29
02:30:01 PM  all   1.66  0.00  1.25  1.05  0.00  96.04
02:40:01 PM  all  10.01  0.00  2.14  0.75  0.00  87.11
02:50:01 PM  all   0.76  0.00  0.80  1.03  0.00  97.40
03:00:01 PM  all   0.23  0.00  0.30  0.71  0.00  98.76
03:10:01 PM  all   0.22  0.00  0.30  0.66  0.00  98.82
03:20:01 PM  all   0.22  0.00  0.31  0.76  0.00  98.71
03:30:01 PM  all   0.24  0.00  0.31  0.64  0.00  98.81
03:40:01 PM  all   1.13  0.00  2.97  1.18  0.00  94.73
Average:     all   3.86  0.00  1.38  0.88  0.00  93.87
]$ iostat
Linux 2.6.32-220.7.1.el6.x86_64 (PSLBHDN002)  03/10/2013  _x86_64_  (16 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.86   0.00     0.96     0.78    0.00  96.41
Device:  tps   Blk_read/s  Blk_wrtn/s     Blk_read     Blk_wrtn
sda      1.23       20.26       23.53    521533196    605566924
sdb      6.51      921.55      241.90  23717850730   6225863488
sdc      6.22      921.83      236.41  23725181162   6084471192
sdd      6.25      925.13      237.26  23810004970   6106357880
sde      6.19      913.90      235.60  23521108818   6063722504
sdh      6.26      933.08      237.77  24014594546   6119511376
sdg      6.18      914.36      235.31  23532747378   6056257016
sdf      6.24      923.66      235.33  23772251810   6056604008
Some more logging which reinforces that the RS crash is happening because of
the timeout. However, this time the GC log does not show a long pause.
#####RS LOG#####
2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 257739ms for sessionid 0x13d3c4bcba6014a, closing socket connection and attempting reconnect
2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 226785ms for sessionid 0x13d3c4bcba60149, closing socket connection and attempting reconnect
2013-03-10 15:37:46,712 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=61.91 MB, free=1.94 GB, max=2 GB, blocks=1254, accesses=60087, hits=58811, hitRatio=97.87%, , cachingAccesses=60069, cachingHits=58811, cachingHitsRatio=97.90%, , evictions=0, evicted=0, evictedPerRun=NaN
2013-03-10 15:37:46,712 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 225100ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-03-10 15:37:46,714 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-43236042-172.17.2.10-1362490844340:blk_-6834190810033122569_25150229
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:670)
2013-03-10 15:37:46,716 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call get([B@7caf69ed, {"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"details":["ALL"]},"maxVersions":1,"row":"\\x00\\x00\\x00\\x00\\x00\\x12\\x93@"}), rpc version=1, client version=29, methodsFingerPrint=1891768260 from 172.17.1.71:51294 after 0 ms, since caller disconnected
#####GC LOG#####
2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530 secs] 354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
2789.478: [GC 2789.478: [ParNew: 53226K->1192K(59008K), 0.0041370 secs] 353371K->301337K(1152704K), 0.0042220 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570 secs] 353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2920.309: [GC 2920.309: [ParNew: 53220K->6202K(59008K), 0.0058440 secs] 353366K->310443K(1152704K), 0.0059410 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2984.239: [GC 2984.239: [ParNew: 58680K->5282K(59008K), 0.0048070 secs] 362921K->310977K(1152704K), 0.0049180 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
3050.620: [GC 3050.620: [ParNew: 57524K->5292K(59008K), 0.0055510 secs] 363219K->313384K(1152704K), 0.0056520 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
3111.331: [GC 3111.331: [ParNew: 57542K->5637K(59008K), 0.0056900 secs] 365634K->318079K(1152704K), 0.0057950 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
3163.510: [GC 3163.510: [ParNew: 58117K->1138K(59008K), 0.0046540 secs] 370559K->316086K(1152704K), 0.0047470 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
3217.642: [GC 3217.643: [ParNew: 53618K->1001K(59008K), 0.0038530 secs] 368566K->315949K(1152704K), 0.0039450 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
3464.067: [GC 3464.067: [ParNew: 53481K->5348K(59008K), 0.0068370 secs] 368429K->322343K(1152704K), 0.0069170 secs] [Times: user=0.07 sys=0.01, real=0.00 secs]
3465.648: [GC 3465.648: [ParNew: 57828K->1609K(59008K), 0.0111520 secs] 374823K->321496K(1152704K), 0.0112170 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
3466.291: [GC 3466.291: [ParNew: 54089K->2186K(59008K), 0.0093190 secs] 373976K->322073K(1152704K), 0.0093800 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
Heap
 par new generation   total 59008K, used 15497K [0x00000005fae00000, 0x00000005fee00000, 0x00000005fee00000)
  eden space 52480K,  25% used [0x00000005fae00000, 0x00000005fbaff8c8, 0x00000005fe140000)
  from space  6528K,  33% used [0x00000005fe140000, 0x00000005fe362b70, 0x00000005fe7a0000)
  to   space  6528K,   0% used [0x00000005fe7a0000, 0x00000005fe7a0000, 0x00000005fee00000)
 concurrent mark-sweep generation total 1093696K, used 319887K [0x00000005fee00000, 0x0000000641a10000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 30464K, used 30427K [0x00000007fae00000, 0x00000007fcbc0000, 0x0000000800000000)
On 03/08/2013 07:02 PM, Stack wrote:
On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <pa...@psafe.com>
wrote:
0.94 currently doesn't support hadoop 2.0
Can you deploy hadoop 1.1.1 instead ?
I am using cdh4.2.0, which uses this version as the default installation.
I think it would be a problem for me to deploy 1.1.1, because I would need to
"upgrade" the whole cluster with 70TB of data (back up everything, go
offline, etc.).
Is there a problem with using cdh4.2.0?
Should I send my email to the cdh list?
That combo should be fine.
You Full GC'ing around this time?
The GC log shows it took a long time. However, it does not make sense for the
GC itself to be the cause, since the same amount of data was cleaned before
and AFTER in just 0.01 secs!
If JVM is full GC'ing, the application is stopped.
[Times: user=0.08 sys=137.62, real=137.62 secs]
Besides, the whole time was spent in system. That is what is bugging me.
The below does not look like a full GC but that is a long pause in system
time, enough to kill your zk session.
You swapping?
Hardware is good?
St.Ack
...
1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01, real=0.00 secs]
1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08 sys=137.62, real=137.62 secs]
1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
I really appreciate you guys helping me to find out what is wrong.
Thanks,
Pablo
On 03/08/2013 02:11 PM, Stack wrote:
What RAM says.
2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 159348ms for sessionid 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
You Full GC'ing around this time?
Put up your configs in a place where we can take a look?
St.Ack
On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
I think it is your GC config. What is your heap size? What is the data that
you pump in, and how much is the block cache size?
Regards
Ram
On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <yuzhih...@gmail.com>
wrote:
0.94 currently doesn't support hadoop 2.0
Can you deploy hadoop 1.1.1 instead ?
Are you using 0.94.5 ?
Thanks
On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pa...@psafe.com>
wrote:
Hey guys,
as I sent in an email a long time ago, the RSs in my cluster did not get along
and crashed 3 times a day. I tried a lot of the options we discussed in those
emails, but none of them solved the problem. As I was using an old version of
hadoop, I thought that was the problem.
So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
Unfortunately the RSs did not stop crashing, and worse! Now they crash every
hour, and sometimes when the RS that holds the .ROOT. crashes the whole
cluster gets stuck in transition and everything stops working.
In this case I need to clean the zookeeper znodes and restart the master and
the RSs.
To avoid this case I am running in production with only ONE RS and a
monitoring script that checks every minute whether the RS is ok. If not, it
restarts it.
* This case does not get the cluster stuck.
This is driving me crazy, but I really can't find a solution for the cluster.
I tracked all logs from the start time 16:49 on all interesting nodes (zoo,
namenode, master, rs, dn2, dn9, dn10) and copied here what I think is useful.
There are some strange errors in DATANODE2, such as an error copying a block
to itself.
The gc log points to a GC timeout. However, it is very weird that the RS
spends so much time in GC when in the other cases it takes 0.001 sec.
Besides, the time spent is in sys, which makes me think the problem might be
somewhere else.
I know that it is a bunch of logs, and that it is very difficult to find the
problem without much context. But I REALLY need some help. If not the
solution, then at least what I should read, where I should look, or which
cases I should monitor.
Thank you very much,
Pablo Musa
--
Regards,
- Sreepathi
--
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)