Everything runs on 3 nodes:

node1: namenode, datanode, jobtracker, tasktracker, hmaster, regionserver
node2: datanode, tasktracker, regionserver
node3: datanode, tasktracker, regionserver
----
>> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DataStreamer Exception: java.io.IOException: Unable to create new
>> > block.
>> >         at

namenode logs:
----
2009-01-14 13:03:54,452 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas, still in need of 1
2009-01-14 13:03:54,452 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/DenseMatrix_randllnma/compaction.dir/89428128/block/mapfiles/1969914437577830056/data. blk_8609709792065065878_14543
2009-01-14 13:03:55,781 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 61.247.201.163:50010 is added to blk_8609709792065065878_14543 size 67108864
2009-01-14 13:03:55,782 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas, still in need of 1
2009-01-14 13:03:55,782 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/DenseMatrix_randllnma/compaction.dir/89428128/block/mapfiles/1969914437577830056/data. blk_7500745129745458361_14543
2009-01-14 13:03:56,057 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas, still in need of 1
2009-01-14 13:03:56,057 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas, still in need of 1
2009-01-14 13:03:56,058 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 61.247.201.165:50010 to delete blk_-3959166394378699051_13308 blk_502886823558166676_9198 [... roughly 100 more block IDs snipped ...]
2009-01-14 13:03:56,457 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 61.247.201.163:50010 is added to blk_7500745129745458361_14543 size 41025370
2009-01-14 13:03:56,458 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 5731 Total time for transactions(ms): 46 Number of syncs: 2923 SyncTimes(ms): 131124
2009-01-14 13:03:56,628 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root,bin,daemon,sys,adm,disk,wheel ip=/61.247.201.165 cmd=listStatus src=/hbase/DenseMatrix_randllnma/1262088429/attribute/mapfiles dst=null perm=null
2009-01-14 13:03:56,629 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root,bin,daemon,sys,adm,disk,wheel ip=/61.247.201.165 cmd=listStatus src=/hbase/DenseMatrix_randllnma/1262088429/block/mapfiles dst=null perm=null
2009-01-14 13:03:56,631 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root,bin,daemon,sys,adm,disk,wheel ip=/61.247.201.165

On Wed, Jan 14, 2009 at 6:33 PM, Samuel Guo <[email protected]> wrote:
> A 3-node cluster of Hadoop?
> A 3-node cluster of HBase?
>
> Can you attach the logs of the hadoop namenode, datanodes, hbase master,
> and hbase regionservers? Thanks in advance.
> I suspect that too many open files caused the datanode to use up all of
> its xceivers, so the DFSClient couldn't create a new block.
>
> On Wed, Jan 14, 2009 at 5:20 PM, Edward J. Yoon <[email protected]> wrote:
>
>> I tried a 10,000 by 10,000 matrix-matrix multiplication on 3 nodes.
>>
>> - The random matrices were successfully generated.
>> - The collecting jobs completed successfully.
>> - The multiplication in the map phase succeeded.
>>
>> Then, during the reduce job (the sum and data-insert operations), the
>> following happened.
>>
>> ---------- Forwarded message ----------
>> From: stack <[email protected]>
>> Date: Wed, Jan 14, 2009 at 3:50 PM
>> Subject: Re: Frequent downs of region server
>> To: [email protected]
>>
>>
>> Edward J. Yoon wrote:
>> > During the write operation in the reduce phase, region servers get
>> > killed. (64,000 rows with 10,000 columns, 3 nodes)
>>
>> 10k columns is probably over what hbase is currently able to do
>> (hbase-867).
>>
>> You've seen the notes at the end of the
>> http://wiki.apache.org/hadoop/Hbase/Troubleshooting page?
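(A note on Samuel's xceivers hypothesis and the troubleshooting page stack
points at: on the Hadoop of this era, the relevant knob is the datanode's
cap on concurrent DataXceiver threads, which defaulted to 256 and is easily
exhausted by HBase, since open store files and in-flight client writes each
hold xceivers. A minimal sketch of the setting, assuming a Hadoop version
where the cap is configurable; it goes in the datanodes' hadoop-site.xml,
the value below is illustrative rather than tuned, and note the property
name's historical misspelling "xcievers":

  <!-- Cap on concurrent DataXceiver threads per datanode. The old
       default of 256 is too low for HBase workloads. -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>1023</value>
  </property>

Raising it only helps if the OS open-file limit for the datanode user,
ulimit -n, is raised alongside, since each xceiver holds sockets and file
descriptors; the wiki page of the time covered this same pairing.)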
>>
>> See other notes below:
>>
>> > ----
>> > 09/01/14 13:07:59 INFO mapred.JobClient:  map 100% reduce 36%
>> > 09/01/14 13:11:38 INFO mapred.JobClient:  map 100% reduce 33%
>> > 09/01/14 13:11:38 INFO mapred.JobClient: Task Id :
>> > attempt_200901140952_0010_r_000017_1, Status : FAILED
>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>> > contact region server 61.247.201.163:60020 for region
>> > DenseMatrix_randgnegu,,1231905480938, row '000000000000287', but
>> > failed after 10 attempts.
>> > Exceptions:
>> > java.io.IOException: java.io.IOException: Server not running, aborting
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2103)
>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1611)
>> > ----
>>
>> You upped the hbase client timeouts?
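(For reference, "upping the hbase client timeouts" in this era usually
means the client-side retry knobs; the "failed after 10 attempts" above is
the default retry count being exhausted. A minimal sketch for the
hbase-site.xml visible to the MapReduce job's client; both values are
illustrative, not recommendations:

  <!-- How many times the client retries a failed region server call;
       the default of 10 matches the RetriesExhaustedException above. -->
  <property>
    <name>hbase.client.retries.number</name>
    <value>20</value>
  </property>
  <!-- Milliseconds the client sleeps between retries. -->
  <property>
    <name>hbase.client.pause</name>
    <value>10000</value>
  </property>

Note that retries only paper over slow servers; here the server answers
"Server not running, aborting", so no retry count will succeed until the
region server is back up.)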
>>
>> > And, I can't stop the hbase.
>> >
>> > [d8g053:/root]# hbase-trunk/bin/stop-hbase.sh
>> > stopping master...........................................................
>> >
>> > Can it be recovered?
>>
>> What does master log say?  Why ain't it going down?  On the tail of the
>> log it'll usually say why it's staying up.  Probably a particular
>> HRegionServer?
>>
>> > ----
>> > Region server log:
>> >
>> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DataStreamer Exception: java.io.IOException: Unable to create new
>> > block.
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>>
>> These look like the issue that the configuration on the troubleshooting
>> page might address (check your datanode logs).  You are using 0.18.0
>> hbase?
>>
>> St.Ack
>>
>>
>> On Tue, Jan 13, 2009 at 8:42 PM, Edward J. Yoon <[email protected]> wrote:
>>
>> > [... snip: same report, JobClient output, and stop-hbase.sh output as
>> > quoted above ...]
>> >
>> > ----
>> > Region server log:
>> >
>> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DataStreamer Exception: java.io.IOException: Unable to create new
>> > block.
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-4005955194083205373_14543 bad datanode[0]
>> > nodes == null
>> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient: Could
>> > not get block locations. Aborting...
>> > 2009-01-14 13:03:56,629 ERROR
>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> > Compaction/Split failed for region
>> > DenseMatrix_randllnma,000000000000,18,7-29116,1231898419257
>> > java.io.IOException: Could not read from stream
>> >         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
>> >         at java.io.DataInputStream.readByte(DataInputStream.java:248)
>> >         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
>> >         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
>> >         at org.apache.hadoop.io.Text.readString(Text.java:400)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>> > 2009-01-14 13:03:56,631 INFO
>> > org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on
>> > region DenseMatrix_randllnma,00000000000,16,19-26373,1231898311583
>> > 2009-01-14 13:03:56,692 INFO org.apache.hadoop.io.compress.CodecPool:
>> > Got brand-new decompressor
>> > 2009-01-14 13:03:56,692 INFO org.apache.hadoop.io.compress.CodecPool:
>> > Got brand-new decompressor
>> > 2009-01-14 13:03:56,693 INFO org.apache.hadoop.io.compress.CodecPool:
>> > Got brand-new decompressor
>> > 2009-01-14 13:03:56,693 INFO org.apache.hadoop.io.compress.CodecPool:
>> > Got brand-new decompressor
>> > 2009-01-14 13:03:57,521 INFO org.apache.hadoop.io.compress.CodecPool:
>> > Got brand-new compressor
>> > 2009-01-14 13:03:57,810 INFO org.apache.hadoop.hdfs.DFSClient:
>> > Exception in createBlockOutputStream java.io.IOException: Could not
>> > read from stream
>> > 2009-01-14 13:03:57,810 INFO org.apache.hadoop.hdfs.DFSClient:
>> > Abandoning block blk_-2612702056484946948_14554
>> > 2009-01-14 13:03:59,343 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DataStreamer Exception: java.io.IOException: Unable to create new
>> > block.
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>> >
>> > 2009-01-14 13:03:59,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-5255885897790790367_14543 bad datanode[0]
>> > nodes == null
>> > 2009-01-14 13:03:59,344 WARN org.apache.hadoop.hdfs.DFSClient: Could
>> > not get block locations. Aborting...
>> > 2009-01-14 13:03:59,344 FATAL
>> > org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog
>> > required. Forcing server shutdown
>> > org.apache.hadoop.hbase.DroppedSnapshotException: region:
>> > DenseMatrix_randgnegu,,1231905480938
>> >         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:896)
>> >         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:789)
>> >         at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:227)
>> >         at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.run(MemcacheFlusher.java:137)
>> > Caused by: java.io.IOException: Could not read from stream
>> >         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
>> >         at java.io.DataInputStream.readByte(DataInputStream.java:248)
>> >         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
>> >         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
>> >         at org.apache.hadoop.io.Text.readString(Text.java:400)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>> > 2009-01-14 13:03:59,359 INFO
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>> > request=15, regions=48, stores=192, storefiles=756,
>> > storefileIndexSize=6, memcacheSize=338, usedHeap=395, maxHeap=971
>> > 2009-01-14 13:03:59,359 INFO
>> > org.apache.hadoop.hbase.regionserver.MemcacheFlusher:
>> > regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher exiting
>> > 2009-01-14 13:03:59,368 INFO
>> > org.apache.hadoop.hbase.regionserver.HLog: Closed
>> > hdfs://dev3.nm2.naver.com:9000/hbase/log_61.247.201.165_1231894400437_60020/hlog.dat.1231905813472,
>> > entries=896500. New log writer:
>> > /hbase/log_61.247.201.165_1231894400437_60020/hlog.dat.1231905839367
>> >
>> > 2009-01-14 13:03:59,368 INFO
>> > org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
>> >
>> >
>> > --
>> > Best Regards, Edward J. Yoon @ NHN, corp.
>> > [email protected]
>> > http://blog.udanax.org
>>
>>
>> --
>> Best Regards, Edward J. Yoon @ NHN, corp.
>> [email protected]
>> http://blog.udanax.org


--
Best Regards, Edward J. Yoon @ NHN, corp.
[email protected]
http://blog.udanax.org
