Ok so here is the latest.

http://imagebin.org/240392

I took a jmap histogram at startup and another one an hour later.

http://pastebin.com/xEkWid4f
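
For reference, something along these lines is how the histograms are taken; the
PID and file names below are placeholders, not my exact invocations:

# plain class histogram, does not force a GC
jmap -histo <namenode-pid> > histo.start
# an hour later; -histo:live would instead count only live objects and force a full GC first
jmap -histo <namenode-pid> > histo.1hr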

I think the biggest consumer is [B (byte arrays), which may not be very helpful.

num     #instances         #bytes  class name
----------------------------------------------
   1:      25094067     2319943656  [B
   2:      23720125     1518088000  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:      24460244     1174091712  org.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo
   4:      25671649     1134707328  [Ljava.lang.Object;
   5:      31106937      995421984  java.util.HashMap$Entry
   6:      23725233      570829968  [Lorg.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo;
   7:          2934      322685952  [Ljava.util.HashMap$Entry;



 num     #instances         #bytes  class name
----------------------------------------------
   1:      24739690     3727511000  [B
   2:      23280668     1489962752  org.apache.hadoop.hdfs.server.namenode.INodeFile
   3:      24850044     1192802112  org.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo
   4:      26124258     1157073272  [Ljava.lang.Object;
   5:      32142057     1028545824  java.util.HashMap$Entry
   6:      23307473      560625432  [Lorg.apache.hadoop.hdfs.server.namenode.BlocksMap$BlockInfo;
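
To see which classes are actually growing between the two snapshots, a quick awk
comparison like this works (histo.start and histo.1hr are the hypothetical file
names from above):

# per-class byte growth between the two dumps, biggest movers first
awk '$1 !~ /^[0-9]+:$/ { next }
     NR == FNR        { start[$4] = $3; next }
                      { printf "%14d  %s\n", $3 - start[$4], $4 }' \
    histo.start histo.1hr | sort -rn | head -20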

GC starts like this:

3.204: [GC 102656K->9625K(372032K), 0.0150300 secs]
3.519: [GC 112281K->21180K(372032K), 0.0741210 secs]
3.883: [GC 123836K->30729K(372032K), 0.0208900 secs]
4.194: [GC 132724K->45785K(372032K), 0.0293860 secs]
4.522: [GC 148441K->59282K(372032K), 0.0341330 secs]
4.844: [GC 161938K->70071K(372032K), 0.0284850 secs]
5.139: [GC 172727K->80624K(372032K), 0.0171910 secs]
5.338: [GC 183280K->90661K(372032K), 0.0184200 secs]
5.549: [GC 193317K->103126K(372032K), 0.0430080 secs]
5.775: [GC 205782K->113534K(372032K), 0.0359480 secs]
5.995: [GC 216190K->122832K(372032K), 0.0192900 secs]
6.192: [GC 225488K->131777K(372032K), 0.0183870 secs]

Then it steadily increases:

453.808: [GC 7482139K->7384396K(11240624K), 0.0208170 secs]
455.605: [GC 7487052K->7384177K(11240624K), 0.0206360 secs]
457.942: [GC 7486831K->7384131K(11240624K), 0.0189600 secs]
459.924: [GC 7486787K->7384141K(11240624K), 0.0193560 secs]
462.887: [GC 7486797K->7384151K(11240624K), 0.0189290 secs]


Until I triggered this full GC a moment ago:

6266.988: [GC 11255823K->10373641K(17194656K), 0.0331910 secs]
6280.083: [GC 11259721K->10373499K(17194656K), 0.0324870 secs]
6293.706: [GC 11259579K->10376656K(17194656K), 0.0324330 secs]
6309.781: [GC 11262736K->10376110K(17194656K), 0.0310330 secs]
6333.790: [GC 11262190K->10374348K(17194656K), 0.0297670 secs]
6333.934: [Full GC 10391746K->9722532K(17194656K), 63.9812940 secs]
6418.466: [GC 10608612K->9725743K(17201024K), 0.0339610 secs]
6421.420: [GC 10611823K->9760611K(17201024K), 0.1501610 secs]
6428.221: [GC 10646691K->9767236K(17201024K), 0.1503170 secs]
6437.431: [GC 10653316K->9734750K(17201024K), 0.0344960 secs]

Essentially, GC sometimes clears some memory but not all of it, and then the line
keeps rising. The delta is about 10-17 hours until the heap is exhausted.
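
In case it matters, the GC lines above are the kind of output you get from flags
along these lines (the log path is a placeholder; add them wherever your NameNode
JVM opts live, e.g. hadoop-env.sh):

export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -verbose:gc -XX:+PrintGCTimeStamps -Xloggc:/var/log/hadoop/nn-gc.log"

The full GC itself can be forced either through the java.lang:type=Memory MBean's
gc() operation in jconsole, or with jmap, since -histo:live triggers a full GC
before it dumps:

jmap -histo:live <namenode-pid> > histo.after-full-gc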

On Sat, Dec 22, 2012 at 7:03 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Blocks is ~26,000,000; files is a bit higher, ~27,000,000.
>
> Currently running:
> [root@hnn217 ~]# java -version
> java version "1.7.0_09"
> Was running 1.6.0_23
>
> export JVM_OPTIONS="-XX:+UseCompressedOops -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly"
>
> I will grab the gc logs and the heap dump in a follow up.
>
>
>
> On Sat, Dec 22, 2012 at 1:32 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>
>> Please take a histo live dump when the memory is full. Note that this
>> causes full gc.
>> http://docs.oracle.com/javase/6/docs/technotes/tools/share/jmap.html
>>
>> What is the number of blocks you have on the system?
>>
>> Send the JVM options you are using. Compared to earlier Java versions, which
>> used 1/8 of the total heap for young gen, it has gone up to 1/3 of the total
>> heap. This could also be the reason.
>>
>> Do you collect gc logs? Send that as well.
>>
>> Sent from a mobile device
>>
>> On Dec 22, 2012, at 9:51 AM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>> > Newer 1.6 releases are getting close to 1.7, so I am not going to fear a
>> > number and fight the future.
>> >
>> > I have been at around 27 million files for a while, and been as high as 30
>> > million; I do not think that is related.
>> >
>> > I do not think it is related to checkpoints but I am considering
>> > raising/lowering the checkpoint triggers.
>> >
>> > On Saturday, December 22, 2012, Joep Rottinghuis <
>> jrottingh...@gmail.com>
>> > wrote:
>> >> Do your OOMs correlate with the secondary checkpointing?
>> >>
>> >> Joep
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On Dec 22, 2012, at 7:42 AM, Michael Segel <michael_se...@hotmail.com>
>> > wrote:
>> >>
>> >>> Hey Silly question...
>> >>>
>> >>> How long have you had 27 million files?
>> >>>
>> >>> I mean, can you correlate the number of files to the spate of OOMs?
>> >>>
>> >>> Even without problems... I'd say it would be a good idea to upgrade
>> due
>> > to the probability of a lot of code fixes...
>> >>>
>> >>> If you're running anything pre 1.x, going to 1.7 java wouldn't be a
>> good
>> > idea.  Having said that... outside of MapR, have any of the distros
>> > certified themselves on 1.7 yet?
>> >>>
>> >>> On Dec 22, 2012, at 6:54 AM, Edward Capriolo <edlinuxg...@gmail.com>
>> > wrote:
>> >>>
>> >>>> I will give this a go. I have actually gone into JMX and manually
>> >>>> triggered GC; no memory is returned. So I assumed something was leaking.
>> >>>>
>> >>>> On Fri, Dec 21, 2012 at 11:59 PM, Adam Faris <afa...@linkedin.com>
>> > wrote:
>> >>>>
>> >>>>> I know this will sound odd, but try reducing your heap size.   We
>> had
>> > an
>> >>>>> issue like this where GC kept falling behind and we either ran out
>> of
>> > heap
>> >>>>> or would be in full gc.  By reducing heap, we were forcing
>> concurrent
>> > mark
>> >>>>> sweep to occur and avoided both full GC and running out of heap
>> space
>> > as
>> >>>>> the JVM would collect objects more frequently.
>> >>>>>
>> >>>>> On Dec 21, 2012, at 8:24 PM, Edward Capriolo <edlinuxg...@gmail.com
>> >
>> >>>>> wrote:
>> >>>>>
>> >>>>>> I have an old hadoop 0.20.2 cluster. Have not had any issues for a
>> >>>>>> while (which is why I never bothered with an upgrade).
>> >>>>>>
>> >>>>>> Suddenly it OOMed last week. Now the OOMs happen periodically. We
>> > have a
>> >>>>>> fairly large NameNode heap Xmx 17GB. It is a fairly large FS about
>> >>>>>> 27,000,000 files.
>> >>>>>>
>> >>>>>> So the strangest thing is that every hour and a half the NN memory
>> >>>>>> usage increases until the heap is full.
>> >>>>>>
>> >>>>>> http://imagebin.org/240287
>> >>>>>>
>> >>>>>> We tried failing over the NN to another machine. We changed the Java
>> >>>>>> version from 1.6_23 -> 1.7.0.
>> >>>>>>
>> >>>>>> I have set the NameNode logs to DEBUG and ALL, and I have done the
>> >>>>>> same with the data nodes.
>> >>>>>> Secondary NN is running and shipping edits and making new images.
>> >>>>>>
>> >>>>>> I am thinking something has corrupted the NN metadata and after
>> >>>>>> enough time it becomes a time bomb, but this is just a total shot in
>> >>>>>> the dark. Does anyone have any interesting troubleshooting ideas?
>> >>
>>
>
>
