Jean-Daniel,
1. If I keep the retries at 10, a task may wait for a long time (up to 142
seconds). So I want to try configuring
hbase.client.pause = 2
hbase.client.retries.number = 5
Then the waiting time would be at most 14 seconds.
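
For reference, here is a minimal sketch of how these overrides could be set
from client code, assuming the 0.19-era HBaseConfiguration/HTable API. Note
the unit caveat: some HBase versions interpret hbase.client.pause in
milliseconds, so check hbase-default.xml before copying these values. The
backoff table is the one quoted from the HBase source later in this thread.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;

  public class ClientRetryTuning {
    // Backoff multiplier table, copied from the HBase client source
    // quoted later in this thread.
    static final int[] RETRY_BACKOFF = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };

    // Worst-case total wait before RetriesExhaustedException, in pause units:
    // retries=10, pause=2 -> 2 * (1+1+1+2+2+4+4+8+16+32) = 142
    // retries=5,  pause=2 -> 2 * (1+1+1+2+2)             = 14
    static long worstCaseWait(int pause, int retries) {
      long total = 0;
      for (int i = 0; i < retries && i < RETRY_BACKOFF.length; i++) {
        total += (long) pause * RETRY_BACKOFF[i];
      }
      return total;
    }

    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      conf.setInt("hbase.client.pause", 2);          // base pause between retries (check units!)
      conf.setInt("hbase.client.retries.number", 5); // give up after 5 attempts
      System.out.println(worstCaseWait(2, 5));       // prints 14
      HTable table = new HTable(conf, "TABA");       // table name taken from this thread
      // ... writes now fail fast instead of blocking the whole map task ...
    }
  }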

2. On my Dell 2950s, each node has 1 CPU with 4 cores, except for one node
(a slave) that has 2 CPUs, i.e. 2*4 = 8 cores.

3. The issues with Hadoop 0.19.1 MapReduce are:
   (1) Sometimes, some task slots on some TaskTrackers become unusable.
        For example, my total map task capacity is 20 and there are 40 maps
to process, but at most only 16 or 13 tasks run at once.
        I suspect this is caused by the TaskTracker blacklist feature; the
issue can occur after some tasks have failed (a possible workaround is
sketched at the end of this email).
   (2) I am currently checking the MapReduce logs. Since last night, my
MapReduce job has been hung: it has been stuck at 79.9% for about 13 hours.
        The issue looks like this one:
https://issues.apache.org/jira/browse/HADOOP-5367
        After MapReduce has been running for a long time (about 200 jobs
completed), a job hangs forever.

       The JobTracker repeatedly logs:
       2009-03-17 16:29:39,997 INFO org.apache.hadoop.mapred.JobTracker:
Serious problem. While updating status, cannot find taskid
attempt_200903171247_0387_m_000015_1

       All the TaskTrackers may have hung at the same time, since the logs
of every TaskTracker stop at that point.

       Before the hang, I can also find scattered JobTracker log entries
like this one:

       2009-03-17 16:29:21,767 INFO org.apache.hadoop.mapred.JobTracker:
Serious problem. While updating status, cannot find taskid
attempt_200903171247_0387_m_000015_1
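
Regarding the blacklist suspicion in (1), here is a hedged sketch of one
possible workaround: raising the per-job failure threshold that triggers
per-tracker blacklisting. This assumes Hadoop 0.19's JobConf API
(setMaxTaskFailuresPerTracker; if I recall the defaults correctly, 4
failures blacklist a tracker for a job). It is an illustration, not a
confirmed fix, and it does not address the hang in (2).

  import org.apache.hadoop.mapred.JobConf;

  public class BlacklistTuning {
    public static JobConf configure(JobConf job) {
      // By default a TaskTracker is blacklisted for a job after a few task
      // failures, which can silently shrink the usable map capacity from
      // 20 slots down to the 16 or 13 running tasks observed above.
      job.setMaxTaskFailuresPerTracker(8); // tolerate more failures per tracker
      // Optionally give individual map tasks a few more attempts as well.
      job.setMaxMapAttempts(6);
      return job;
    }
  }
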
Schubert

On Wed, Mar 18, 2009 at 9:46 AM, Jean-Daniel Cryans <[email protected]> wrote:

> Schubert,
>
> Regarding your previous email, I didn't notice anything obvious. You
> should keep the retries at 10 since you could easily have to wait more
> than 10 seconds under heavy load.
>
> How many CPUs do you have on your Dell 2950?
>
> Regarding your last email, HBase 0.19 isn't compatible with Hadoop 0.18.
> Why do you say it's buggy?
>
> Thx,
>
> J-D
>
> On Tue, Mar 17, 2009 at 9:33 PM, schubert zhang <[email protected]> wrote:
> > Hi Jean-Daniel,
> > I want to try HBase 0.19.1.
> > And since the MapReduce of Hadoop 0.19.1 is buggy, I want to use the
> > stable Hadoop 0.18.3. Can HBase 0.19.x work on Hadoop 0.18.3?
> >
> > Schubert
> >
> > On Wed, Mar 18, 2009 at 12:26 AM, schubert zhang <[email protected]> wrote:
> >
> >> Jean-Daniel,
> >> Thank you very much; I will re-set up my cluster with HBase 0.19.1
> >> immediately.
> >> And I will change my swappiness configuration immediately.
> >>
> >> For the RetriesExhaustedException, I double-checked the logs of
> >> each node.
> >>
> >> For example, one such exception occurred at 09/03/17 16:29:10 for region:
> >> 13975557...@2009-03-17 13:42:36.412
> >>
> >> 09/03/17 16:29:10 INFO mapred.JobClient:  map 69% reduce 0%
> >> 09/03/17 16:29:10 INFO mapred.JobClient: Task Id :
> >> attempt_200903171247_0387_m_000009_0, Status : FAILED
> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> >> contact region server Some server for region
> >> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007, row
> >> '13975559...@2009-03-17 16:25:42.772', but failed after 4 attempts.
> >> Exceptions:
> >>
> >>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
> >>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
> >>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
> >>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
> >>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
> >>         at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
> >>         at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
> >>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:123)
> >>         at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> >>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >>
> >> HMaster log:
> >> 2009-03-17 16:29:16,501 INFO org.apache.hadoop.hbase.master.ServerManager:
> >> Received MSG_REPORT_CLOSE: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> >> from 10.24.1.20:60020
> >> 2009-03-17 16:29:18,707 INFO org.apache.hadoop.hbase.master.RegionManager:
> >> assigning region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> >> to server 10.24.1.14:60020
> >> 2009-03-17 16:29:19,795 INFO org.apache.hadoop.hbase.master.BaseScanner:
> >> RegionManager.metaScanner scanning meta region {regionname: .META.,,1,
> >> startKey: <>, server: 10.24.1.12:60020}
> >> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner:
> >> RegionManager.metaScanner scan of 34 row(s) of meta region {regionname:
> >> .META.,,1, startKey: <>, server: 10.24.1.12:60020} complete
> >> 2009-03-17 16:29:19,857 INFO org.apache.hadoop.hbase.master.BaseScanner:
> >> All 1 .META. region(s) scanned
> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager:
> >> Received MSG_REPORT_PROCESS_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> >> from 10.24.1.14:60020
> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ServerManager:
> >> Received MSG_REPORT_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> >> from 10.24.1.14:60020
> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1:
> >> TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 open on 10.24.1.14:60020
> >> 2009-03-17 16:29:21,723 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1:
> >> updating row TABA,13975557...@2009-03-17 13:42:36.412,1237270434007 in region
> >> .META.,,1 with startcode 1237274685627 and server 10.24.1.14:60020
> >>
> >> one node:
> >> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> >> 2009-03-17 16:28:43,617 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> Worker: MSG_REGION_OPEN: TABA,13975557...@2009-03-17 13:42:36.412,1237270434007
> >> 2009-03-17 16:28:44,604 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> >> region TABA,13975557...@2009-03-17 13:42:36.412,1237270434007/168627948 available
> >> 2009-03-17 16:29:01,682 INFO org.apache.hadoop.hbase.regionserver.HLog: Closed
> >> hdfs://nd0-rack0-cloud:9000/hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278496283,
> >> entries=100008. New log writer:
> >> /hbase/log_10.24.1.14_1237274685627_60020/hlog.dat.1237278541680
> >> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> >> compaction completed on region TABA,13675519...@2009-03-17 13:39:35.123,1237270466558
> >> in 28sec
> >> 2009-03-17 16:29:04,920 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> >> starting compaction on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714
> >> 2009-03-17 16:29:04,927 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> >> compaction completed on region TABA,13702099...@2009-03-17 14:00:04.791,1237270570714
> >> in 0sec
> >>
> >> another node:
> >> 2009-03-17 16:28:56,462 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> >> starting compaction on region CDR,13975557...@2009-03-17 13:42:36.412,1237270434007
> >> 2009-03-17 16:28:58,304 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
> >> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
> >> 2009-03-17 16:28:58,305 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> Worker: MSG_REGION_OPEN: CDR,13775569...@2009-03-17 13:13:57.898,1237278535949
> >> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> >> region CDR,13775569...@2009-03-17 13:13:57.898,1237278535949/1626944070 available
> >> 2009-03-17 16:28:58,343 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> Worker: MSG_REGION_OPEN: CDR,13775545...@2009-03-17 13:03:35.830,1237278535949
> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,380 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,384 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,386 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,387 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> >> 2009-03-17 16:28:58,388 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> >> region CDR,13775545...@2009-03-17 13:03:35.830,1237278535949/1656930599 available
> >> 2009-03-17 16:29:01,325 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> MSG_REGION_CLOSE: CDR,13975557...@2009-03-17 13:42:36.412,1237270434007: Overloaded
> >> On Tue, Mar 17, 2009 at 8:02 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >>
> >>> Schubert,
> >>>
> >>> Yeah, latency has always been an issue in HBase since reliability was
> >>> our focus. This is going to be fixed in 0.20 with a new file format and
> >>> caching of hot cells.
> >>>
> >>> Regarding your problem, please enable DEBUG and look in your
> >>> regionserver logs to see what's happening. Usually, if you get a
> >>> RetriesExhausted, it's because something is taking too long to get a
> >>> region reassigned.
> >>>
> >>> I also suggest that you upgrade to HBase 0.19.1 candidate 2, which has
> >>> two fixes for massive uploads like yours.
> >>>
> >>> I also see that there was some swapping; please try setting the swappiness
> >>> to something like 20. It helped me a lot!
> >>> http://www.sollers.ca/blog/2008/swappiness/
> >>>
> >>> J-D
> >>>
> >>> On Tue, Mar 17, 2009 at 3:49 AM, schubert zhang <[email protected]> wrote:
> >>> > This is the "top" info of a regionserver/datanode/tasktracker node. We
> >>> > can see the HRegionServer node is very heavily loaded.
> >>> >
> >>> > top - 15:44:17 up 23:46,  3 users,  load average: 1.72, 0.55, 0.23
> >>> > Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
> >>> > Cpu(s): 56.2%us,  3.1%sy,  0.0%ni, 26.6%id, 12.7%wa,  0.2%hi,  1.2%si,  0.0%st
> >>> > Mem:   4043484k total,  2816028k used,  1227456k free,    18308k buffers
> >>> > Swap:  2097144k total,    22944k used,  2074200k free,  1601760k cached
> >>> >
> >>> >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>> >
> >>> > 21797 schubert  24   0 3414m 796m 9848 S  218 20.2   2:55.21 java (HRegionServer)
> >>> > 20545 schubert  24   0 3461m  98m 9356 S   15  2.5  10:49.64 java (DataNode)
> >>> >
> >>> > 22677 schubert  24   0 1417m 107m 9224 S    9  2.7   0:08.86 java (MapReduce Child)
> >>> >
> >>> > On Tue, Mar 17, 2009 at 3:38 PM, schubert zhang <[email protected]> wrote:
> >>> >
> >>> >> Hi all,
> >>> >>
> >>> >> I am running a MapReduce job to read files and insert rows into an
> >>> >> HBase table. It is like an ETL procedure in the database world.
> >>> >>
> >>> >> Hadoop 0.19.1, HBase 0.19.0, 5 slaves and 1 master, Dell 2950 servers
> >>> >> with 4GB memory and 1TB disk on each node.
> >>> >>
> >>> >> Issue 1.
> >>> >>
> >>> >> Each MapReduce job consumes some files, and I want to keep this
> >>> >> pipeline balanced (I have defined the HBase table with a TTL), since
> >>> >> many new incoming files need to be consumed. But sometimes many of the
> >>> >> following RetriesExhaustedExceptions happen and the job is blocked for
> >>> >> a long time, so the MapReduce job falls behind and cannot catch up
> >>> >> with the rate of incoming files.
> >>> >>
> >>> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> >>> >> contact region server Some server for region
> >>> >> TABA,113976305...@2009-03-17 13:11:04.135,1237268430402, row
> >>> >> '113976399...@2009-03-17 13:40:12.213', but failed after 10 attempts.
> >>> >>
> >>> >> Exceptions:
> >>> >>
> >>> >>       at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:942)
> >>> >>       at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1372)
> >>> >>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1316)
> >>> >>       at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1296)
> >>> >>       at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:21)
> >>> >>       at net.sandmill.control.TableRowRecordWriter.write(TableRowRecordWriter.java:11)
> >>> >>       at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:395)
> >>> >>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:117)
> >>> >>       at net.sandmill.control.IndexingJob.map(IndexingJob.java:27)
> >>> >>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>> >>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> >>> >>       at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >>> >>
> >>> >> I checked the code and found that the retry policy behind this
> >>> >> exception is like the BSD TCP syn backoff policy:
> >>> >>
> >>> >>   /**
> >>> >>    * This is a retry backoff multiplier table similar to the BSD TCP syn
> >>> >>    * backoff table, a bit more aggressive than simple exponential backoff.
> >>> >>    */
> >>> >>   public static int RETRY_BACKOFF[] = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32 };
> >>> >>
> >>> >> So 10 retries will take a long time: 2*(1+1+1+2+2+4+4+8+16+32) = 2*71
> >>> >> = 142 seconds. So I think the following two parameters can be changed,
> >>> >> and I reduced them:
> >>> >> hbase.client.pause = 1 (second)
> >>> >> hbase.client.retries.number = 4
> >>> >> Now the blocking time should be 1*(1+1+1+2) = 5 seconds.
> >>> >>
> >>> >> In fact, I have also sometimes met this exception in:
> >>> >> getRegionServerWithRetries, and
> >>> >> getRegionLocationForRowWithRetries, and
> >>> >> processBatchOfRows
> >>> >>
> >>> >> I want to know what causes the above exception. Region splitting?
> >>> >> Compaction? Or a region moving to another regionserver?
> >>> >> This exception happens very frequently in my cluster.
> >>> >>
> >>> >> Issue 2: In the same workflow, I found the regionserver nodes are very
> >>> >> busy, with a very heavy load.
> >>> >> The HRegionServer process takes almost all of the used CPU.
> >>> >> In my previous test round, I ran 4 map tasks to insert data on each
> >>> >> node; now I run just 2 map tasks on each node. I think the performance
> >>> >> of HRegionServer is not good enough for some of my applications,
> >>> >> because besides the insert/input procedure, we also query rows from
> >>> >> the same table for online applications, which need low latency. In my
> >>> >> tests, I find the online query latency (scanning about 50 rows with
> >>> >> startRow and endRow) is 2-6 seconds for the first query and about
> >>> >> 500ms for a re-query. When the cluster is busy inserting data into the
> >>> >> table, the query latency may be as long as 10-30 seconds, and when the
> >>> >> table is splitting/compacting, maybe longer.
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
>
