Hi Jean Daniel,

I want to try HBase 0.19.1. Since MapReduce in Hadoop 0.19.1 is buggy, I want to use the stable Hadoop 0.18.3. Can HBase 0.19.x work on Hadoop 0.18.3?
Schubert

On Wed, Mar 18, 2009 at 12:26 AM, schubert zhang <[email protected]> wrote:
> Jean Daniel,
> Thank you very much. I will re-setup my cluster with HBase 0.19.1
> immediately, and I will also change my swappiness configuration.
>
> For the RetriesExhaustedException, I double-checked the logs of each node.
>
> For example, one such exception occurred at 09/03/17 16:29:10 for region
> 13975557...@2009-03-17 13:42:36.412:
>
> 09/03/17 16:29:10 INFO mapred.JobClient:  map 69% reduce 0%
> 09/03/17 16:29:10 INFO mapred.JobClient: Task Id : attempt_200903171247_0387_m_000009_0, Status : FAILED
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server for region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007, row '13975559...@2009-03-17 16:25:42.772', but failed after 4 attempts.
> Exceptions:
>
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>         at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:123)
>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> HMaster log:
> 2009-03-17 16:29:16,501 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_CLOSE: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.20:60020
> 2009-03-17 16:29:18,707 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 to server 10.24.1.14:60020
> 2009-03-17 16:29:19,795 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {regionname: .META.,,1, startKey: <>, server: 10.24.1.12:60020}
> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 34 row(s) of meta region {regionname: .META.,,1, startKey: <>, server: 10.24.1.12:60020} complete
> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.14:60020
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.14:60020
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 open on 10.24.1.14:60020
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 in region .META.,,1 with startcode 1237274685627 and server 10.24.1.14:60020
>
> One node:
> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> 2009-03-17 16:28:44,604 INFO org.apache.hadoop.hbase.regionserver.HRegion: region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007/168627948 available
> 2009-03-17 16:29:01,682 INFO org.apache.hadoop.hbase.regionserver.HLog: Closed hdfs://nd0-rack0-cloud:9000/hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278496283, entries=100008. New log writer: /hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278541680
> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region TABA,13675519...@2009-03-17 13:39:35.123,1237270466558 in 28sec
> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714
> 2009-03-17 16:29:04,927 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714 in 0sec
>
> Another node:
> 2009-03-17 16:28:56,462 INFO org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on region CDR,13975557...@2009-03-17 13:42:36.412,1237270434007
> 2009-03-17 16:28:58,304 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegion: region CDR,13775569...@2009-03-17 13:13:57.898,1237278535949/1626944070 available
> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,387 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,388 INFO org.apache.hadoop.hbase.regionserver.HRegion: region CDR,13775545...@2009-03-17 13:03:35.830,1237278535949/1656930599 available
> 2009-03-17 16:29:01,325 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE: CDR,13975557...@2009-03-17 13:42:36.412,1237270434007: Overloaded
>
> On Tue, Mar 17, 2009 at 8:02 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> Schubert,
>>
>> Yeah, latency has always been an issue in HBase, since reliability was
>> our focus. This is going to be fixed in 0.20 with a new file format and
>> caching of hot cells.
>>
>> Regarding your problem, please enable DEBUG and look in your
>> regionserver logs to see what's happening. Usually, if you get a
>> RetriesExhaustedException, it's because something is taking too long to
>> get a region reassigned.
>>
>> I also suggest that you upgrade to HBase 0.19.1 release candidate 2,
>> which has two fixes for massive uploads like yours.
>>
>> I also see that there was some swapping; please try setting the
>> swappiness to something like 20. It helped me a lot!
>> http://www.sollers.ca/blog/2008/swappiness/
>>
>> J-D
>>
>> On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]> wrote:
>> > This is the "top" output of a regionserver/datanode/tasktracker node. We
>> > can see that the HRegionServer process is very heavily loaded.
>> >
>> > top - 15:44:17 up 23:46,  3 users,  load average: 1.72, 0.55, 0.23
>> > Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
>> > Cpu(s): 56.2%us,  3.1%sy,  0.0%ni, 26.6%id, 12.7%wa,  0.2%hi,  1.2%si,  0.0%st
>> > Mem:   4043484k total,  2816028k used,  1227456k free,    18308k buffers
>> > Swap:  2097144k total,    22944k used,  2074200k free,  1601760k cached
>> >
>> >   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+   COMMAND
>> > 21797 schubert 24   0 3414m 796m 9848 S  218 20.2   2:55.21 java (HRegionServer)
>> > 20545 schubert 24   0 3461m  98m 9356 S   15  2.5  10:49.64 java (DataNode)
>> > 22677 schubert 24   0 1417m 107m 9224 S    9  2.7   0:08.86 java (MapReduce Child)
>> >
>> > On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I am running a MapReduce job to read files and insert rows into an HBase
>> >> table. It is like an ETL procedure in the database world.
>> >>
>> >> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master; DELL 2950 servers
>> >> with 4GB memory and a 1TB disk on each node.
>> >>
>> >> Issue 1:
>> >>
>> >> Each run of the MapReduce job consumes a batch of files, and I want to
>> >> keep this pipeline balanced (I have defined the HBase table with a TTL),
>> >> since many new incoming files still need to be consumed. But sometimes
>> >> many of the following RetriesExhaustedExceptions happen and the job is
>> >> blocked for a long time.
>> >> Then the MapReduce job falls behind and cannot catch up with the rate
>> >> of incoming files.
>> >>
>> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server for region TABA,113976305...@2009-03-17 13:11:04.135,1237268430402, row '113976399...@2009-03-17 13:40:12.213', but failed after 10 attempts.
>> >>
>> >> Exceptions:
>> >>
>> >>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>> >>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>> >>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>> >>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>> >>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>> >>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>> >>         at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>> >>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
>> >>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>> >>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>> >>
>> >> I checked the code and found that the retry policy behind this exception
>> >> is like the BSD TCP SYN backoff policy:
>> >>
>> >>   /**
>> >>    * This is a retry backoff multiplier table similar to the BSD TCP syn
>> >>    * backoff table, a bit more aggressive than simple exponential backoff.
>> >>    */
>> >>   public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
>> >>
>> >> So 10 retries will take a long time: the multipliers sum to 71, and with
>> >> the 2-second client pause that is 71 * 2 = 142 seconds. So I think the
>> >> following two parameters can be changed, and I reduced them:
>> >>   hbase.client.pause = 1 (second)
>> >>   hbase.client.retries.number = 4
>> >> Now the blocking time should be (1 + 1 + 1 + 2) * 1 = 5 seconds.
>> >>
>> >> In fact, I have also sometimes seen this exception in:
>> >>   getRegionServerWithRetries,
>> >>   getRegionLocationForRowWithRetries, and
>> >>   processBatchOfRows
>> >>
>> >> I want to know what causes the above exception. Region splitting?
>> >> Compaction? Regions moving to another region server?
>> >> This exception happens very frequently in my cluster.
>> >>
>> >> Issue 2: In the same workflow, I found the region server nodes are very
>> >> busy, under very heavy load. The HRegionServer process accounts for
>> >> almost all of the CPU usage. In my previous test round I ran 4 map tasks
>> >> per node to insert data; now I run just 2 map tasks per node. I think the
>> >> performance of HRegionServer is not good enough for some of my
>> >> applications, because besides the data-insert procedure, we also query
>> >> rows from the same table for online applications, which need low latency.
>> >> In my tests, the latency of an online query (a scan of about 50 rows with
>> >> startRow and endRow) is 2-6 seconds for the first query and about 500 ms
>> >> for a repeated query. When the cluster is busy inserting data into the
>> >> table, the query latency can be as long as 10-30 seconds, and when the
>> >> table is splitting/compacting, maybe longer.
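
To make the backoff arithmetic above concrete, here is a minimal standalone sketch. It is not HBase source code: the class and method names are hypothetical, and it assumes retry i sleeps hbase.client.pause multiplied by RETRY_BACKOFF[i], which is what the 71 * 2 = 142 figure in the thread implies.

    /**
     * Hypothetical sketch (not HBase source) of the worst-case client
     * blocking time implied by the RETRY_BACKOFF table quoted above.
     * Assumes retry i sleeps pauseSeconds * RETRY_BACKOFF[i].
     */
    public class RetryBackoffMath {

        // Multiplier table copied from the HBase client code quoted above.
        static final int[] RETRY_BACKOFF = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

        // Total seconds slept across 'retries' attempts at the given
        // hbase.client.pause (expressed in seconds).
        static int totalWaitSeconds(int pauseSeconds, int retries) {
            int total = 0;
            for (int i = 0; i < retries; i++) {
                // Beyond the table, reuse the last multiplier (an assumption;
                // the settings discussed here never exceed 10 retries).
                int multiplier = RETRY_BACKOFF[Math.min(i, RETRY_BACKOFF.length - 1)];
                total += pauseSeconds * multiplier;
            }
            return total;
        }

        public static void main(String[] args) {
            System.out.println(totalWaitSeconds(2, 10)); // defaults: 71 * 2 = 142
            System.out.println(totalWaitSeconds(1, 4));  // reduced: (1+1+1+2) * 1 = 5
        }
    }

Whether the reduced settings are safe depends on the cluster: a 5-second retry budget can expire before a region split or reassignment finishes, which is exactly the situation J-D describes as the usual cause of RetriesExhaustedException.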
