On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <[email protected]> wrote:

>> 0.94 currently doesn't support hadoop 2.0
>> Can you deploy hadoop 1.1.1 instead ?
>
> I am using cdh4.2.0, which uses this version as the default installation.
> I think it will be a problem for me to deploy 1.1.1 because I would need
> to "upgrade" the whole cluster with 70TB of data (backup everything, go
> offline, etc.).
>
> Is there a problem with using cdh4.2.0?
> Should I send my email to the cdh list?

That combo should be fine.
>> You Full GC'ing around this time?
>
> The GC shows it took a long time. However, it does not make any sense
> for it to be the cause, since the same amount of data was cleaned before
> and AFTER in just 0.01 secs!
>
> If the JVM is full GC'ing, the application is stopped.
>
> [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> Besides, the whole time was spent in system. That is what is bugging me.

The below does not look like a full GC, but that is a long pause in system
time, enough to kill your zk session. You swapping? Hardware is good?
St.Ack

> ...
>
> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>
> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01, real=0.00 secs]
>
> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
>
> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>
> I really appreciate you guys helping me to find out what is wrong.
>
> Thanks,
> Pablo
>
>
> On 03/08/2013 02:11 PM, Stack wrote:
>
>> What RAM says.
>>
>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
>> session timed out, have not heard from server in 159348ms for sessionid
>> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>>
>> You Full GC'ing around this time?
>>
>> Put up your configs in a place where we can take a look?
>>
>> St.Ack
>>
>>
>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan
>> <[email protected]> wrote:
>>
>>> I think it is with your GC config. What is your heap size? What is the
>>> data that you pump in and how much is the block cache size?
>>>
>>> Regards
>>> Ram
>>>
>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> 0.94 currently doesn't support hadoop 2.0
>>>>
>>>> Can you deploy hadoop 1.1.1 instead ?
>>>>
>>>> Are you using 0.94.5 ?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[email protected]> wrote:
>>>>
>>>>> Hey guys,
>>>>> as I sent in an email a long time ago, the RSs in my cluster did not
>>>>> get along and crashed 3 times a day. I tried a lot of options we
>>>>> discussed in the emails, but it did not solve the problem. As I used
>>>>> an old version of hadoop, I thought this was the problem.
>>>>>
>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
>>>>> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>>>>>
>>>>> Unfortunately the RSs did not stop crashing, and worse! Now they
>>>>> crash every hour, and sometimes when the RS that holds the .ROOT.
>>>>> crashes the whole cluster gets stuck in transition and everything
>>>>> stops working.
>>>>> In this case I need to clean the zookeeper znodes and restart the
>>>>> master and the RSs.
>>>>> To avoid this case I am running in production with only ONE RS and a
>>>>> monitoring script that checks every minute if the RS is ok. If not,
>>>>> it restarts it.
>>>>> * This case does not get the cluster stuck.
>>>>>
>>>>> This is driving me crazy, but I really can't find a solution for the
>>>>> cluster.
>>>>> I tracked all logs from the start time 16:49 from all interesting
>>>>> nodes (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here
>>>>> what I think is useful.
>>>>>
>>>>> There are some strange errors in DATANODE2, such as an error copying
>>>>> a block to itself.
>>>>>
>>>>> The gc log points to a GC timeout. However, it is very weird that the
>>>>> RS spends so much time in GC while in the other cases it takes
>>>>> 0.001 sec. Besides, the time spent is in sys, which makes me think
>>>>> the problem might be in another place.
>>>>>
>>>>> I know that it is a bunch of logs, and that it is very difficult to
>>>>> find the problem without much context. But I REALLY need some help.
>>>>> If it is not the solution, at least what I should read, where I
>>>>> should look, or which cases I should monitor.
>>>>>
>>>>> Thank you very much,
>>>>> Pablo Musa
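
For what it's worth, pauses like the 137-second collection quoted above are
easy to miss among thousands of sub-millisecond ones. One rough way to pull
them out is to scan the GC log for [Times: ...] entries whose "real" value
is large and compare user vs sys. The sketch below is hypothetical (it is
not part of HBase or the JDK); the 60-second default threshold and the log
path are assumptions, so substitute your own zookeeper.session.timeout and
file location.

#!/usr/bin/env python
# gc_pause_scan.py -- hypothetical helper, not shipped with HBase.
# Flags GC events whose wall-clock ("real") pause exceeds a threshold and
# prints user/sys so you can see whether the time went to the OS
# (swapping, slow disk) rather than to the collector itself.
import re
import sys

TIMES_RE = re.compile(r'user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs')
STAMP_RE = re.compile(r'^([\d.]+):')

def scan(path, threshold=60.0):          # 60s threshold is an assumption
    with open(path) as log:
        for line in log:
            m = TIMES_RE.search(line)
            if not m:
                continue
            user, sys_t, real = (float(x) for x in m.groups())
            if real >= threshold:
                stamp = STAMP_RE.match(line)
                print('%.2fs pause at %s (user=%.2f sys=%.2f)' % (
                    real, stamp.group(1) if stamp else '?', user, sys_t))

if __name__ == '__main__':
    scan(sys.argv[1], float(sys.argv[2]) if len(sys.argv) > 2 else 60.0)

Run it against the regionserver GC log (python gc_pause_scan.py
gc-regionserver.log 60); a hit where sys is close to real, like the 137.62s
event above, points at the operating system rather than the heap.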

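On the "You swapping?" question: when sys dominates a ParNew pause, one
quick check is whether the regionserver JVM has pages sitting in swap.
Below is a hypothetical Linux-only sketch; it relies on the VmSwap line in
/proc/<pid>/status, which is present on kernels 2.6.34 and newer, and you
pass the RegionServer pid yourself (e.g. from jps or the hbase pid file).

#!/usr/bin/env python
# swap_check.py -- hypothetical helper; Linux only.
# Reports how much of a process's memory is currently swapped out by
# reading the VmSwap line from /proc/<pid>/status.
import sys

def vm_swap_kb(pid):
    with open('/proc/%s/status' % pid) as status:
        for line in status:
            if line.startswith('VmSwap:'):
                return int(line.split()[1])   # value is reported in kB
    return None                               # kernel does not expose VmSwap

if __name__ == '__main__':
    pid = sys.argv[1]                         # RegionServer pid
    swapped = vm_swap_kb(pid)
    if swapped is None:
        print('VmSwap not reported by this kernel; check vmstat or sar instead')
    elif swapped > 0:
        print('pid %s has %d kB swapped out; that would explain huge sys-time GC pauses' % (pid, swapped))
    else:
        print('pid %s has nothing in swap' % pid)

If it does report swapped pages, the usual remedies are lowering
vm.swappiness and making sure the combined heaps of the DataNode,
RegionServer and any MapReduce tasks fit comfortably in physical RAM.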