So your region servers had their ZooKeeper sessions expire, but I don't see
any sign of GC activity. Usually that points to a case where ZooKeeper
isn't able to answer requests fast enough because it is IO-starved. Are
the ZooKeeper quorum members on the same nodes as the region servers?
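
If they are, and the disks are that busy, one thing that can buy some headroom
while you sort out the IO contention is a larger session timeout. A minimal
sketch, assuming the property name zookeeper.session.timeout (in practice you
would set it in hbase-site.xml on the region servers rather than in code, and
the ZooKeeper quorum has to be configured to allow a session that long):

  // Sketch only: shows the knob, not a recommended value.
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class SessionTimeoutSketch {
    public static void main(String[] args) {
      HBaseConfiguration conf = new HBaseConfiguration();
      // Give the region server's ZooKeeper session more time to survive slow responses.
      conf.setInt("zookeeper.session.timeout", 120000); // 120 seconds
      System.out.println(conf.get("zookeeper.session.timeout"));
    }
  }

That only masks the symptom, though: if ZooKeeper shares spindles with the
datanode and region server, it will keep falling behind under load.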

J-D

On Wed, Jul 14, 2010 at 10:16 AM, Jinsong Hu <jinsong...@hotmail.com> wrote:
> I have uploaded the log files from the crash to
>
> http://somewhere
> http://somewhere
>
> It includes the GC logs and config files too. Both of these regionservers
> crashed while I was using only 3 mappers to insert records into hbase, with
> each task limited to a data rate of 80 records/second.
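
A per-task throttle like the one described above is usually done on the client
side. Here is a rough sketch against the 0.20-era client API, with made-up
table/family/qualifier names (an illustration, not the original job's code):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ThrottledInserter {
    public static void main(String[] args) throws Exception {
      // Hypothetical table name; in the real job this loop would run inside each mapper.
      HTable table = new HTable(new HBaseConfiguration(), "TestTable");
      int recordsPerSecond = 80;                // the per-task cap described above
      long pauseMs = 1000L / recordsPerSecond;  // ~12 ms between puts

      for (long i = 0; i < 10000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
        table.put(put);         // autoflush is on by default, so each put goes to the server
        Thread.sleep(pauseMs);  // crude client-side rate limiting
      }
      table.flushCommits();
    }
  }

A sleep per put is the bluntest possible throttle, but it matches the "80
records/second" cap described here; a real job would measure and adjust the
pause instead of hard-coding it.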
>
> The machines I use are relatively old, having only 4G of RAM, a 4-core CPU,
> and a 250G disk. I run a tasktracker, datanode, and regionserver on each of
> them. I have 3 machines, and only 1 regionserver is still up after the
> continuous insertion overnight. Before the test I created the hbase tables,
> so I started with an empty table. I saw it running fine for several hours,
> but when I checked again this morning, 2 regionservers had died. I backed up
> the logs and restarted the regionservers, and then I found the regionserver
> process is up but not listening on any port.
>
> the log doesn't show any error:
> 2010-07-14 16:23:45,769 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
> 2010-07-14 16:23:45,951 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got 10.110.24.48:60000
> 2010-07-14 16:23:45,957 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.110.24.48:60000 that we are up
> 2010-07-14 16:23:46,283 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
>
>
> I am using the June 10 release of the cloudera distribution for both hadoop
> and hbase.
>
> Jimmy
>
>
>
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <jdcry...@apache.org>
> Sent: Tuesday, July 13, 2010 4:24 PM
> To: <user@hbase.apache.org>
> Subject: Re: regionserver crash under heavy load
>
>> Your region server doesn't look very loaded from the metrics point of view.
>> But I specifically asked for the lines around that, not just the dump, since
>> they will contain the reason for the shutdown.
>>
>>> I do notice that the disk usage is pretty high. I am just thinking that our
>>> problem is probably a hardware limit, but the server should not crash when
>>> the hardware limit is reached.
>>
>> We still don't know why it crashed, and it may not even be related to
>> HW limits; we need those bigger log traces. Also, use pastebin.com or
>> anything like that.
>>
>>>
>>> do you have any idea when CDH3 official release will be out ?
>>
>> I don't work for Cloudera, but IIRC the next beta for CDH3 is due in
>> September.
>>
>>>
>>> Jimmy
>>>
>>> --------------------------------------------------
>>> From: "Jean-Daniel Cryans" <jdcry...@apache.org>
>>> Sent: Tuesday, July 13, 2010 2:55 PM
>>> To: <user@hbase.apache.org>
>>> Subject: Re: regionserver crash under heavy load
>>>
>>>> Please use a pasting service for the log traces. I personally use
>>>> pastebin.com
>>>>
>>>> You probably had a GC pause that lasted too long; this is something outside
>>>> the control of the application (apart from trying to keep as little data in
>>>> memory as possible, but you are inserting, so...). Your log doesn't contain
>>>> enough information for us to tell; please look for a "Dump of metrics" line
>>>> and paste the lines around it.
>>>>
>>>> J-D
>>>>
>>>> On Tue, Jul 13, 2010 at 2:49 PM, Jinsong Hu <jinsong...@hotmail.com>
>>>> wrote:
>>>>>
>>>>> Hi, Todd:
>>>>>  I downloaded hadoop-0.20.2+320 and hbase-0.89.20100621+17 from CDH3 and
>>>>> inserted data at full load; after a while the hbase regionserver crashed.
>>>>> I checked the system with "iostat -x 5" and noticed the disk is pretty busy.
>>>>> Then I modified my client code and reduced the insertion rate by a factor
>>>>> of 6, and the test runs fine. Is there any way the regionserver could be
>>>>> modified so that at least it doesn't crash under heavy load? I used the
>>>>> apache hbase 0.20.5 distribution and the same problem happens. I am
>>>>> thinking that when the regionserver is too busy, it should throttle the
>>>>> incoming data rate to protect the server. Could this be done?
>>>>>  Do you also know when the CDH3 official release will come out? The one I
>>>>> downloaded is a beta version.
>>>>>
>>>>> Jimmy
>>>>>
>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Spam_MsgEventTable,56-2010-05-19 10:09:02\x099a420f4f31748828fd24aeea1d06b294,1278973678315.01dd22f517dabf53ddd135709b68ba6c.
>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: m0002029.ppops.net,60020,1278969481450
>>>>> 2010-07-13 02:24:34,389 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper; /hbase/root-region-server
>>>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>>>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
>>>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>>>>> 2010-07-13 02:24:34,608 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/m0002029.ppops.net,60020,1278969481450/10.110.24.79%3A60020.1278987220794 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed  because recovery from primary datanode 10.110.24.80:50010 failed 6 times.  Pipeline was 10.110.24.80:50010. Aborting...
>>>>> java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed  because recovery from primary datanode 10.110.24.80:50010 failed 6 times.  Pipeline was 10.110.24.80:50010. Aborting...
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
>>>>>     at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>>>>>     at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>>>>>     at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:124)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.hflush(HLog.java:826)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1004)
>>>>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:817)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1531)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
>>>>>     at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>>>>> 2010-07-13 02:24:34,610 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/Spam_MsgEventTable/079c7de876422e57e5f09fef5d997e06/.tmp/6773658134549268273 : java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>>>> java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2603)
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
>>>>> 2010-07-13 02:24:34,729 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>>>>>
>>>>
>>>
>>
>
