Schubert,

Regarding your previous email, I missed something obvious: you should
keep the retries at 10, since under heavy load you can easily have to
wait more than 10 seconds.
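
For reference, that is the hbase.client.retries.number property in
hbase-site.xml; leaving it at the default of 10 looks like:

  <property>
    <name>hbase.client.retries.number</name>
    <value>10</value>
  </property>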

How many CPUs do you have on your Dell 2950?

Regarding your last email: HBase 0.19 isn't compatible with Hadoop 0.18.
Why do you say Hadoop 0.19.1's MapReduce is buggy?

Thx,

J-D

On Tue, Mar 17, 2009 at 9:33 PM, schubert zhang <[email protected]> wrote:
> Hi Jean Daniel,
> I want to try HBase 0.19.1.
> Since the MapReduce in Hadoop 0.19.1 is buggy, I want to use the stable
> Hadoop 0.18.3. Can HBase 0.19.x work on Hadoop 0.18.3?
>
> Schubert
>
> On Wed, Mar 18, 2009 at 12:26 AM, schubert zhang <[email protected]> wrote:
>
>> Jean Daniel,
>> Thank you very much. I will set up my cluster again with HBase 0.19.1
>> immediately, and I will also change my swappiness configuration.
>>
>> For the RetriesExhaustedException, I double-checked the logs of each
>> node.
>>
>> For example, one such exception occurred at 09/03/17 16:29:10 for region
>> 13975557...@2009-03-17 13:42:36.412:
>>
>> 09/03/17 16:29:10 INFO mapred.JobClient:  map 69% reduce 0%
>> 09/03/17 16:29:10 INFO mapred.JobClient: Task Id :
>> attempt_200903171247_0387_m_000009_0, Status : FAILED
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
>> region server Some server for region 
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007, row
>> '13975559...@2009-03-17 16:25:42.772', but failed after 4 attempts.
>> Exceptions:
>>
>>         at
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>>         at
>> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>>         at
>> net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>>         at
>> net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>>         at
>> org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:123)
>>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> HMaster log:
>> 2009-03-17 16:29:16,501 INFO org.apache.hadoop.hbase.master.ServerManager:
>> Received MSG_REPORT_CLOSE: 
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from
>> 10.24.1.20:60020
>> 2009-03-17 16:29:18,707 INFO org.apache.hadoop.hbase.master.RegionManager:
>> assigning region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 to
>> server 10.24.1.14:60020
>> 2009-03-17 16:29:19,795 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> RegionManager.metaScanner scanning meta region {regionname: .META.,,1,
>> startKey: <>, server: 10.24.1.12:60020}
>> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> RegionManager.metaScanner scan of 34 row(s) of meta region {regionname:
>> .META.,,1, startKey: <>, server: 10.24.1.12:60020} complete
>> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> All 1 .META. region(s) scanned
>> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager:
>> Received MSG_REPORT_PROCESS_OPEN: 
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from
>> 10.24.1.14:60020
>> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager:
>> Received MSG_REPORT_OPEN: 
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 from
>> 10.24.1.14:60020
>> 2009-03-17 16:29:21,723 INFO
>> org.apache.hadoop.hbase.master.ProcessRegionOpen$1:
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 open on
>> 10.24.1.14:60020
>> 2009-03-17 16:29:21,723 INFO
>> org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 in region .META.,,1
>> with startcode 1237274685627 and server 10.24.1.14:60020
>>
>> One node:
>> 2009-03-17 16:28:43,617 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
>> 2009-03-17 16:28:43,617 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
>> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
>> 2009-03-17 16:28:44,604 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007/168627948
>> available
>> 2009-03-17 16:29:01,682 INFO org.apache.hadoop.hbase.regionserver.HLog:
>> Closed
>> hdfs://nd0-rack0-cloud:9000/hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278496283,
>> entries=100008. New log writer:
>> /hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278541680
>> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> compaction completed on region
>> TABA,13675519...@2009-03-17 13:39:35.123,1237270466558 in 28sec
>> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> starting compaction on region
>> TABA,13702099...@2009-03-17 14:00:04.791,1237270570714
>> 2009-03-17 16:29:04,927 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> compaction completed on region
>> TABA,13702099...@2009-03-17 14:00:04.791,1237270570714 in 0sec
>>
>> Another node:
>> 2009-03-17 16:28:56,462 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> starting compaction on region
>> CDR,13975557...@2009-03-17 13:42:36.412,1237270434007
>> 2009-03-17 16:28:58,304 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
>> 2009-03-17 16:28:58,305 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
>> 2009-03-17 16:28:58,305 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
>> CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
>> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> region CDR,13775569...@2009-03-17 13:13:57.898,1237278535949/1626944070
>> available
>> 2009-03-17 16:28:58,343 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
>> CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
>> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got
>> brand-new decompressor
>> [the same CodecPool line repeats 12 times, 16:28:58,380 through 16:28:58,388]
>> 2009-03-17 16:28:58,388 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> region CDR,13775545...@2009-03-17 13:03:35.830,1237278535949/1656930599
>> available
>> 2009-03-17 16:29:01,325 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
>> CDR,13975557...@2009-03-17 13:42:36.412,1237270434007: Overloaded
>>
>> On Tue, Mar 17, 2009 at 8:02 PM, Jean-Daniel Cryans 
>> <[email protected]>wrote:
>>
>>> Schubert,
>>>
>>> Yeah, latency has always been an issue in HBase, since reliability was
>>> our focus. This is going to be addressed in 0.20 with a new file format
>>> and caching of hot cells.
>>>
>>> Regarding your problem, please enable DEBUG and look in your
>>> regionserver logs to see what's happening. Usually, if you get a
>>> RetriesExhaustedException, it's because something is taking too long to
>>> get a region reassigned.
>>>
>>> I also suggest that you upgrade to HBase 0.19.1 candidate 2, which has
>>> two fixes for massive uploads like yours.
>>>
>>> I also see that there was some swapping; please try setting the
>>> swappiness to something like 20. It helped me a lot!
>>> http://www.sollers.ca/blog/2008/swappiness/
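>>>
>>> On most Linux distributions that is just (run as root):
>>>
>>>   sysctl -w vm.swappiness=20
>>>
>>> or a "vm.swappiness = 20" line in /etc/sysctl.conf to persist it across
>>> reboots.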
>>>
>>> J-D
>>>
>>> On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]>
>>> wrote:
>>> > This is the "top" output of a regionserver/datanode/tasktracker node.
>>> > We can see the HRegionServer process is very heavily loaded.
>>> >
>>> > top - 15:44:17 up 23:46,  3 users,  load average: 1.72, 0.55, 0.23
>>> > Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
>>> > Cpu(s): 56.2%us,  3.1%sy,  0.0%ni, 26.6%id, 12.7%wa,  0.2%hi,  1.2%si,
>>> >  0.0%st
>>> > Mem:   4043484k total,  2816028k used,  1227456k free,    18308k buffers
>>> > Swap:  2097144k total,    22944k used,  2074200k free,  1601760k cached
>>> >
>>> >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> >
>>> > 21797 schubert  24   0 3414m 796m 9848 S  218 20.2   2:55.21 java
>>> > (HRegionServer)
>>> > 20545 schubert  24   0 3461m  98m 9356 S   15  2.5  10:49.64 java
>>> > (DataNode)
>>> >
>>> > 22677 schubert  24   0 1417m 107m 9224 S    9  2.7   0:08.86 java
>>> > (MapReduce Child)
>>> >
>>> > On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]>
>>> wrote:
>>> >
>>> >> Hi all,
>>> >>
>>> >> I am running a MapReduce job that reads files and inserts rows into an
>>> >> HBase table, like an ETL procedure in the database world.
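>>> >>
>>> >> The write path in my map task is essentially the following sketch (the
>>> >> real logic is in net.sandmill.control.TableRowRecordWriter; the column
>>> >> name, row key, and value are illustrative):
>>> >>
>>> >>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>> >>   import org.apache.hadoop.hbase.client.HTable;
>>> >>   import org.apache.hadoop.hbase.io.BatchUpdate;
>>> >>   import org.apache.hadoop.hbase.util.Bytes;
>>> >>
>>> >>   // One table handle per task; buffer commits client-side.
>>> >>   HTable table = new HTable(new HBaseConfiguration(), "TABA");
>>> >>   table.setAutoFlush(false);
>>> >>   // For each parsed input record:
>>> >>   BatchUpdate update = new BatchUpdate("exampleRow");        // illustrative key
>>> >>   update.put("data:record", Bytes.toBytes("exampleValue")); // illustrative cell
>>> >>   table.commit(update);     // buffered
>>> >>   table.flushCommits();     // flush the batch to the regionservers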
>>> >>
>>> >> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master; Dell 2950 servers
>>> >> with 4GB memory and 1TB disk on each node.
>>> >>
>>> >> Issue 1.
>>> >>
>>> >> Each run of the MapReduce job ingests some files, and I want to keep
>>> >> this pipeline balanced (I have defined the HBase table with a TTL)
>>> >> since many new incoming files need to be ingested. But sometimes many
>>> >> RetriesExhaustedExceptions like the following happen, the job is
>>> >> blocked for a long time, and the MapReduce job falls behind and cannot
>>> >> catch up with the incoming files.
>>> >>
>>> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>> contact region server Some server for region 
>>> TABA,113976305...@2009-03-17 13:11:04.135,1237268430402, row
>>> '113976399...@2009-03-17 13:40:12.213', but failed after 10 attempts.
>>> >>
>>> >> Exceptions:
>>> >>
>>> >>       at
>>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
>>> >>       at
>>> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
>>> >>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
>>> >>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
>>> >>       at
>>> net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
>>> >>       at
>>> net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
>>> >>       at
>>> org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
>>> >>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
>>> >>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
>>> >>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>> >>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>>> >>       at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>> >>
>>> >> I checked the code and found that the retry backoff policy for this
>>> >> exception is like the BSD TCP syn backoff policy:
>>> >>
>>> >>   /**
>>> >>    * This is a retry backoff multiplier table similar to the BSD TCP syn
>>> >>    * backoff table, a bit more aggressive than simple exponential backoff.
>>> >>    */
>>> >>   public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
>>> >>
>>> >> So 10 retries take a long time: the multipliers sum to 71, and with the
>>> >> 2-second default pause that is 71 * 2 = 142 seconds. So I think the
>>> >> following two parameters can be changed, and I reduced them:
>>> >> hbase.client.pause = 1 (second)
>>> >> hbase.client.retries.number = 4
>>> >> Now the blocked time should be 1 + 1 + 1 + 2 = 5 seconds.
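>>> >>
>>> >> In hbase-site.xml the override is something like this (assuming
>>> >> hbase.client.pause is in milliseconds, as I believe it is):
>>> >>
>>> >>   <property>
>>> >>     <name>hbase.client.pause</name>
>>> >>     <value>1000</value>
>>> >>   </property>
>>> >>   <property>
>>> >>     <name>hbase.client.retries.number</name>
>>> >>     <value>4</value>
>>> >>   </property>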
>>> >>
>>> >> In fact, I have sometimes also hit this exception in:
>>> >> getRegionServerWithRetries,
>>> >> getRegionLocationForRowWithRetries, and
>>> >> processBatchOfRows.
>>> >>
>>> >> I want to know what causes the above exception: region splitting?
>>> >> Compaction? A region moving to another regionserver? This exception
>>> >> happens very frequently in my cluster.
>>> >>
>>> >> Issue 2: In the same workflow, I found the regionserver nodes are very
>>> >> busy, under very heavy load. The HRegionServer process takes almost all
>>> >> of the CPU being used.
>>> >> In my previous test round I ran 4 map tasks per node to insert data;
>>> >> now I run just 2 map tasks per node. I think the performance of
>>> >> HRegionServer is not good enough for some of my applications, because
>>> >> besides the insert procedure we also query rows from the same table for
>>> >> online applications, which need low latency. In my tests, the latency
>>> >> of an online query (a scan of about 50 rows with a startRow and endRow)
>>> >> is 2-6 seconds for the first query and about 500 ms for a repeat query.
>>> >> When the cluster is busy inserting data into the table, the query
>>> >> latency can be as long as 10-30 seconds, and when the table is
>>> >> splitting/compacting, maybe longer.
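>>> >>
>>> >> The query side is essentially this sketch (using the 0.19 client API
>>> >> as I understand it; the column family and row-key bounds are
>>> >> illustrative):
>>> >>
>>> >>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>> >>   import org.apache.hadoop.hbase.client.HTable;
>>> >>   import org.apache.hadoop.hbase.client.Scanner;
>>> >>   import org.apache.hadoop.hbase.io.RowResult;
>>> >>   import org.apache.hadoop.hbase.util.Bytes;
>>> >>
>>> >>   HTable table = new HTable(new HBaseConfiguration(), "TABA");
>>> >>   String startRow = "startKey", endRow = "stopKey";  // illustrative bounds
>>> >>   // Scan the ~50 rows between the two keys.
>>> >>   Scanner scanner = table.getScanner(
>>> >>       new byte[][] { Bytes.toBytes("data:") },
>>> >>       Bytes.toBytes(startRow), Bytes.toBytes(endRow));
>>> >>   try {
>>> >>     for (RowResult row : scanner) {
>>> >>       // handle each returned row
>>> >>     }
>>> >>   } finally {
>>> >>     scanner.close();
>>> >>   }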
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
