Hi Jean Daniel,

I want to try HBase 0.19.1. Since MapReduce in Hadoop 0.19.1 is buggy, I want to use the stable Hadoop 0.18.3. Can HBase 0.19.x work on Hadoop 0.18.3?
Schubert

On Wed, Mar 18, 2009 at 12:26 AM, schubert zhang <[email protected]> wrote:
> Jean Daniel,
> Thank you very much. I will re-setup my cluster with HBase 0.19.1
> immediately, and I will also change my swappiness configuration.
>
> For the RetriesExhaustedException, I double-checked the logs of each node.
>
> For example, one such exception occurred at 09/03/17 16:29:10 for region
> 13975557...@2009-03-17 13:42:36.412:
>
> 09/03/17 16:29:10 INFO mapred.JobClient:  map 69% reduce 0%
> 09/03/17 16:29:10 INFO mapred.JobClient: Task Id : attempt_200903171247_0387_m_000009_0, Status : FAILED
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server for region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007, row '13975559...@2009-03-17 16:25:42.772', but failed after 4 attempts.
> Exceptions:
>
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>         at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:123)
>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> HMaster log:
> 2009-03-17 16:29:16,501 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_CLOSE: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.20:60020
> 2009-03-17 16:29:18,707 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 to server 10.24.1.14:60020
> 2009-03-17 16:29:19,795 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {regionname: .META.,,1, startKey: <>, server: 10.24.1.12:60020}
> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 34 row(s) of meta region {regionname: .META.,,1, startKey: <>, server: 10.24.1.12:60020} complete
> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.14:60020
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from 10.24.1.14:60020
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 open on 10.24.1.14:60020
> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 in region .META.,,1 with startcode 1237274685627 and server 10.24.1.14:60020
>
> One node:
> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> 2009-03-17 16:28:44,604 INFO org.apache.hadoop.hbase.regionserver.HRegion: region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007/168627948 available
> 2009-03-17 16:29:01,682 INFO org.apache.hadoop.hbase.regionserver.HLog: Closed hdfs://nd0-rack0-cloud:9000/hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278496283, entries=100008. New log writer: /hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278541680
> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region TABA,13675519...@2009-03-17 13:39:35.123,1237270466558 in 28sec
> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714
> 2009-03-17 16:29:04,927 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714 in 0sec
>
> Another node:
> 2009-03-17 16:28:56,462 INFO org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on region CDR,13975557...@2009-03-17 13:42:36.412,1237270434007
> 2009-03-17 16:28:58,304 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegion: region CDR,13775569...@2009-03-17 13:13:57.898,1237278535949/1626944070 available
> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,387 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2009-03-17 16:28:58,388 INFO org.apache.hadoop.hbase.regionserver.HRegion: region CDR,13775545...@2009-03-17 13:03:35.830,1237278535949/1656930599 available
> 2009-03-17 16:29:01,325 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE: CDR,13975557...@2009-03-17 13:42:36.412,1237270434007: Overloaded
>
> On Tue, Mar 17, 2009 at 8:02 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> Schubert,
>>
>> Yeah, latency has always been an issue in HBase, since reliability was
>> our focus. This is going to be fixed in 0.20 with a new file format and
>> caching of hot cells.
>>
>> Regarding your problem, please enable DEBUG and look in your
>> regionserver logs to see what's happening. Usually, if you get a
>> RetriesExhaustedException, it's because something is taking too long to
>> get a region reassigned.
>>
>> I also suggest that you upgrade to HBase 0.19.1 release candidate 2,
>> which has two fixes for massive uploads like yours.
>>
>> I also see that there was some swapping; please try setting the
>> swappiness to something like 20. It helped me a lot!
>> http://www.sollers.ca/blog/2008/swappiness/
>>
>> J-D
>>
>> On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]> wrote:
>> > This is the "top" output of a regionserver/datanode/tasktracker node. We
>> > can see that the HRegionServer process is very heavily loaded.
>> >
>> > top - 15:44:17 up 23:46,  3 users,  load average: 1.72, 0.55, 0.23
>> > Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
>> > Cpu(s): 56.2%us,  3.1%sy,  0.0%ni, 26.6%id, 12.7%wa,  0.2%hi,  1.2%si,  0.0%st
>> > Mem:   4043484k total,  2816028k used,  1227456k free,    18308k buffers
>> > Swap:  2097144k total,    22944k used,  2074200k free,  1601760k cached
>> >
>> >   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+   COMMAND
>> > 21797 schubert 24   0 3414m 796m 9848 S  218 20.2   2:55.21 java (HRegionServer)
>> > 20545 schubert 24   0 3461m  98m 9356 S   15  2.5  10:49.64 java (DataNode)
>> > 22677 schubert 24   0 1417m 107m 9224 S    9  2.7   0:08.86 java (MapReduce Child)
>> >
>> > On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I am running a MapReduce job to read files and insert rows into an HBase
>> >> table. It is like an ETL procedure in the database world.
>> >>
>> >> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master; DELL 2950 servers
>> >> with 4GB memory and a 1TB disk on each node.
>> >>
>> >> Issue 1:
>> >>
>> >> Each run of the MapReduce job consumes a batch of files, and I want to
>> >> keep this pipeline balanced (I have defined the HBase table with a TTL),
>> >> since many new incoming files still need to be consumed. But sometimes
>> >> many of the following RetriesExhaustedExceptions happen and the job is
>> >> blocked for a long time.
>> >> Then the MapReduce job falls behind and cannot catch up with the rate
>> >> of incoming files.
>> >>
>> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server for region TABA,113976305...@2009-03-17 13:11:04.135,1237268430402, row '113976399...@2009-03-17 13:40:12.213', but failed after 10 attempts.
>> >>
>> >> Exceptions:
>> >>
>> >>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>> >>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>> >>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>> >>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>> >>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>> >>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>> >>         at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>> >>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
>> >>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>> >>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>> >>
>> >> I checked the code and found that the retry policy behind this exception
>> >> is like the BSD TCP SYN backoff policy:
>> >>
>> >>   /**
>> >>    * This is a retry backoff multiplier table similar to the BSD TCP syn
>> >>    * backoff table, a bit more aggressive than simple exponential backoff.
>> >>    */
>> >>   public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
>> >>
>> >> So 10 retries will take a long time: the multipliers sum to 71, and with
>> >> the 2-second client pause that is 71 * 2 = 142 seconds. So I think the
>> >> following two parameters can be changed, and I reduced them:
>> >>   hbase.client.pause = 1 (second)
>> >>   hbase.client.retries.number = 4
>> >> Now the blocking time should be (1 + 1 + 1 + 2) * 1 = 5 seconds.
>> >>
>> >> In fact, I have also sometimes seen this exception in:
>> >>   getRegionServerWithRetries,
>> >>   getRegionLocationForRowWithRetries, and
>> >>   processBatchOfRows
>> >>
>> >> I want to know what causes the above exception. Region splitting?
>> >> Compaction? Regions moving to another region server?
>> >> This exception happens very frequently in my cluster.
>> >>
>> >> Issue 2: In the same workflow, I found the region server nodes are very
>> >> busy, under very heavy load. The HRegionServer process accounts for
>> >> almost all of the CPU usage. In my previous test round I ran 4 map tasks
>> >> per node to insert data; now I run just 2 map tasks per node. I think the
>> >> performance of HRegionServer is not good enough for some of my
>> >> applications, because besides the data-insert procedure, we also query
>> >> rows from the same table for online applications, which need low latency.
>> >> In my tests, the latency of an online query (a scan of about 50 rows with
>> >> startRow and endRow) is 2-6 seconds for the first query and about 500 ms
>> >> for a repeated query. When the cluster is busy inserting data into the
>> >> table, the query latency can be as long as 10-30 seconds, and when the
>> >> table is splitting/compacting, maybe longer.
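
To make the backoff arithmetic above concrete, here is a minimal standalone sketch. It is not HBase source code: the class and method names are hypothetical, and it assumes retry i sleeps hbase.client.pause multiplied by RETRY_BACKOFF[i], which is what the 71 * 2 = 142 figure in the thread implies.

    /**
     * Hypothetical sketch (not HBase source) of the worst-case client
     * blocking time implied by the RETRY_BACKOFF table quoted above.
     * Assumes retry i sleeps pauseSeconds * RETRY_BACKOFF[i].
     */
    public class RetryBackoffMath {

        // Multiplier table copied from the HBase client code quoted above.
        static final int[] RETRY_BACKOFF = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

        // Total seconds slept across 'retries' attempts at the given
        // hbase.client.pause (expressed in seconds).
        static int totalWaitSeconds(int pauseSeconds, int retries) {
            int total = 0;
            for (int i = 0; i < retries; i++) {
                // Beyond the table, reuse the last multiplier (an assumption;
                // the settings discussed here never exceed 10 retries).
                int multiplier = RETRY_BACKOFF[Math.min(i, RETRY_BACKOFF.length - 1)];
                total += pauseSeconds * multiplier;
            }
            return total;
        }

        public static void main(String[] args) {
            System.out.println(totalWaitSeconds(2, 10)); // defaults: 71 * 2 = 142
            System.out.println(totalWaitSeconds(1, 4));  // reduced: (1+1+1+2) * 1 = 5
        }
    }

Whether the reduced settings are safe depends on the cluster: a 5-second retry budget can expire before a region split or reassignment finishes, which is exactly the situation J-D describes as the usual cause of RetriesExhaustedException.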
