Re: how to do parallel scanning in map reduce using hbase as input?

Stack Mon, 21 Jul 2014 22:38:28 -0700

On Mon, Jul 21, 2014 at 10:32 PM, Li Li <[email protected]> wrote:

> 1. yes, I have 20 concurrent running mappers.
> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> set 8 mappers, it hit oov exception and load average is high
>



What is OOV?

Do you have to have a reducer?

Load average is high?  How high?



> 3. fast mapper only use 1 minute. following is the statistics
>


So each region is only taking 1 minute to scan?  1.4 GB scanned?
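
A rough per-task scan rate can be worked out from the counters quoted further down in this message. The sketch below is only an estimate: it assumes the roughly 62 seconds recorded in MILLIS_BETWEEN_NEXTS approximates the task's scan wall time (the poster reports about 1 minute for a fast mapper), and the function name scan_rates is made up for illustration.

```python
# Counter values copied from the quoted task report in this thread.
BYTES_IN_RESULTS = 1_380_694_667
MAP_INPUT_RECORDS = 5_208_607
# Assumption: used here as a stand-in for the task's scan wall time.
MILLIS_BETWEEN_NEXTS = 62_415


def scan_rates(total_bytes, rows, millis):
    """Return (bytes per second, rows per second, average bytes per row)."""
    seconds = millis / 1000.0
    return total_bytes / seconds, rows / seconds, total_bytes / rows


bytes_per_sec, rows_per_sec, bytes_per_row = scan_rates(
    BYTES_IN_RESULTS, MAP_INPUT_RECORDS, MILLIS_BETWEEN_NEXTS
)
print(
    f"{bytes_per_sec / 1e6:.1f} MB/s, "
    f"{rows_per_sec:,.0f} rows/s, "
    f"{bytes_per_row:.0f} bytes/row"
)
```

Under that assumption this one task reads on the order of 22 MB/s and about 83,000 rows per second. Running the same arithmetic on a slow task's counters would show whether all map tasks scan at a similar rate.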

Can you add other counters to your MR job so we can get more of an idea of
what is going on in it?

Please answer my other questions.

Thanks,
St.Ack


> HBase Counters
> REMOTE_RPC_CALLS 0
> RPC_CALLS 523
> RPC_RETRIES 0
> NOT_SERVING_REGION_EXCEPTION 0
> NUM_SCANNER_RESTARTS 0
> MILLIS_BETWEEN_NEXTS 62,415
> BYTES_IN_RESULTS 1,380,694,667
> BYTES_IN_REMOTE_RESULTS 0
> REGIONS_SCANNED 1
> REMOTE_RPC_RETRIES 0
>
> FileSystemCounters
> FILE_BYTES_READ 120,508,552
> HDFS_BYTES_READ 176
> FILE_BYTES_WRITTEN 241,000,600
>
> File Input Format Counters
> Bytes Read 0
>
> Map-Reduce Framework
> Map output materialized bytes 120,448,992
> Combine output records 0
> Map input records 5,208,607
> Physical memory (bytes) snapshot 965,730,304
> Spilled Records 10,417,214
> Map output bytes 282,122,973
> CPU time spent (ms) 82,610
> Total committed heap usage (bytes) 1,061,158,912
> Virtual memory (bytes) snapshot 1,681,047,552
> Combine input records 0
> Map output records 5,208,607
> SPLIT_RAW_BYTES 176
>
>
> On Tue, Jul 22, 2014 at 12:11 PM, Stack <[email protected]> wrote:
> > How many regions now?
> >
> > You still have 20 concurrent mappers running?  Are your machines loaded w/
> > 4 map tasks on each?  Can you up the number of concurrent mappers?  Can you
> > get an idea of your scan rates?  Are all map tasks scanning at same rate?
> > Does one task lag the others?  Do you emit stats on each map task such as
> > rows processed? Can you figure your bottleneck? Are you seeking disk all
> > the time?  Anything else running while this big scan is going on?  How big
> > are your cells?  Do you have one or more column families?  How many columns?
> >
> > For average region size, do du on the hdfs region directories and then sum
> > and divide by region count.
> >
> > St.Ack
> >
> >
> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <[email protected]> wrote:
> >
> >> anyone could help? now I have about 1.1 billion nodes and it takes 2
> >> hours to finish a map reduce job.
> >>
> >> ---------- Forwarded message ----------
> >> From: Li Li <[email protected]>
> >> Date: Thu, Jun 26, 2014 at 3:34 PM
> >> Subject: how to do parallel scanning in map reduce using hbase as input?
> >> To: [email protected]
> >>
> >>
> >> my table has about 700 million rows and about 80 regions. each task
> >> tracker is configured with 4 mappers and 4 reducers at the same time.
> >> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> >> mappers running. it takes more than an hour to finish mapper stage.
> >> The hbase cluster's load is very low, about 2,000 request per second.
> >> I think one mapper for a region is too small. How can I run more than
> >> one mapper for a region so that it can take full advantage of
> >> computing resources?
> >>
>
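
The region-size estimate suggested in the quoted message above (du on the region directories, sum, divide by region count) can be sketched as follows. This is a minimal sketch under stated assumptions: the du output is taken to have the size in the first column and the path in the last, region directories are recognised by a 32-character hex name, and the paths and sizes in SAMPLE are invented for illustration, not taken from the poster's cluster.

```python
import re

# Assumption: region directories are named by a 32-character hex string.
REGION_DIR = re.compile(r"^[0-9a-f]{32}$")


def average_region_size(du_output: str) -> float:
    """Average the sizes of region directories found in du-style output.

    Each line is expected to start with a byte count and end with a path;
    any columns in between are ignored.
    """
    sizes = []
    for line in du_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        size, path = fields[0], fields[-1]
        name = path.rstrip("/").rsplit("/", 1)[-1]
        # Skip entries that are not region directories (e.g. table metadata).
        if size.isdigit() and REGION_DIR.match(name):
            sizes.append(int(size))
    if not sizes:
        raise ValueError("no region directories found")
    return sum(sizes) / len(sizes)


# Hypothetical du output for a table directory; paths and sizes are made up.
SAMPLE = """\
1400000000  /hbase/mytable/0123456789abcdef0123456789abcdef
1000000000  /hbase/mytable/fedcba9876543210fedcba9876543210
312  /hbase/mytable/.tableinfo.0000000001
"""

print(average_region_size(SAMPLE))
```

In practice the text passed in would be the captured output of a du over the table's directory in HDFS; the exact directory layout depends on the HBase version, so the path shown is only a placeholder.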
