On Mon, Jul 21, 2014 at 10:53 PM, Li Li <[email protected]> wrote:

> Sorry, I enter tab and it send my unfinished post. See the following
> mail for answers of other questions.
>
> I forget the exception's detail. It throws exception in terminal.

What exception is thrown?

> The default io.sort.mb is 100 and I set it to 500 to speed up reducer.

Do you have to have a reducer? If you could skip the shuffle...

> So I set mapred.child.java.opts to 1g
>
> The datanode/regionserver has 16GB memory but free memory

Does the RS use the 16G?

> for map-reduce is about 5gb. So I can't add more mappers

How much RAM in these machines?

St.Ack

> On Tue, Jul 22, 2014 at 1:37 PM, Stack <[email protected]> wrote:
> > On Mon, Jul 21, 2014 at 10:32 PM, Li Li <[email protected]> wrote:
> >
> >> 1. yes, I have 20 concurrent running mappers.
> >> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> >> set 8 mappers, it hit oov exception and load average is high
> >
> > What is OOV?
> >
> > Do you have to have a reducer?
> >
> > Load average is high? How high?
> >
> >> 3. fast mapper only use 1 minute. following is the statistics
> >
> > So each region is only taking 1 minute to scan? 1.4Gs scanned?
> >
> > Can you add other counters to your MR job so we can get more of an idea
> > of what is going on in it?
> >
> > Please answer my other questions.
> >
> > Thanks,
> > St.Ack
> >
> >> HBase Counters
> >> REMOTE_RPC_CALLS 0
> >> RPC_CALLS 523
> >> RPC_RETRIES 0
> >> NOT_SERVING_REGION_EXCEPTION 0
> >> NUM_SCANNER_RESTARTS 0
> >> MILLIS_BETWEEN_NEXTS 62,415
> >> BYTES_IN_RESULTS 1,380,694,667
> >> BYTES_IN_REMOTE_RESULTS 0
> >> REGIONS_SCANNED 1
> >> REMOTE_RPC_RETRIES 0
> >>
> >> FileSystemCounters
> >> FILE_BYTES_READ 120,508,552
> >> HDFS_BYTES_READ 176
> >> FILE_BYTES_WRITTEN 241,000,600
> >>
> >> File Input Format Counters
> >> Bytes Read 0
> >>
> >> Map-Reduce Framework
> >> Map output materialized bytes 120,448,992
> >> Combine output records 0
> >> Map input records 5,208,607
> >> Physical memory (bytes) snapshot 965,730,304
> >> Spilled Records 10,417,214
> >> Map output bytes 282,122,973
> >> CPU time spent (ms) 82,610
> >> Total committed heap usage (bytes) 1,061,158,912
> >> Virtual memory (bytes) snapshot 1,681,047,552
> >> Combine input records 0
> >> Map output records 5,208,607
> >> SPLIT_RAW_BYTES 176
> >>
> >> On Tue, Jul 22, 2014 at 12:11 PM, Stack <[email protected]> wrote:
> >> > How many regions now?
> >> >
> >> > You still have 20 concurrent mappers running? Are your machines loaded
> >> > w/ 4 map tasks on each? Can you up the number of concurrent mappers?
> >> > Can you get an idea of your scan rates? Are all map tasks scanning at
> >> > same rate? Does one task lag the others? Do you emit stats on each map
> >> > task such as rows processed? Can you figure your bottleneck? Are you
> >> > seeking disk all the time? Anything else running while this big scan
> >> > is going on? How big are your cells? Do you have one or more column
> >> > families? How many columns?
> >> >
> >> > For average region size, do du on the hdfs region directories and then
> >> > sum and divide by region count.
> >> >
> >> > St.Ack
> >> >
> >> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <[email protected]> wrote:
> >> >
> >> >> anyone could help? now I have about 1.1 billion nodes and it takes 2
> >> >> hours to finish a map reduce job.
> >> >>
> >> >> ---------- Forwarded message ----------
> >> >> From: Li Li <[email protected]>
> >> >> Date: Thu, Jun 26, 2014 at 3:34 PM
> >> >> Subject: how to do parallel scanning in map reduce using hbase as input?
> >> >> To: [email protected]
> >> >>
> >> >> my table has about 700 million rows and about 80 regions. each task
> >> >> tracker is configured with 4 mappers and 4 reducers at the same time.
> >> >> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> >> >> mappers running. it takes more than an hour to finish mapper stage.
> >> >> The hbase cluster's load is very low, about 2,000 request per second.
> >> >> I think one mapper for a region is too small. How can I run more than
> >> >> one mapper for a region so that it can take full advantage of
> >> >> computing resources?
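[Editor's note] The thread's core question — running more than one mapper per region — is usually answered by subclassing TableInputFormat and overriding getSplits() so that each region's [startKey, endKey) range is cut into several sub-ranges, one split per sub-range (HBase ships a Bytes.split() utility for exactly this). As a self-contained illustration of the key math involved, here is a sketch of computing a midpoint row key between two region boundary keys; the class and method names are hypothetical, not HBase API:

```java
import java.util.Arrays;

public class RegionSplitSketch {

    // Compute a row key roughly halfway between lo and hi, treating the
    // keys as big-endian unsigned integers padded on the right with zeros.
    // A custom getSplits() could use this to turn one region into two scans:
    // [startKey, mid) and [mid, endKey).
    static byte[] midpoint(byte[] lo, byte[] hi) {
        int len = Math.max(lo.length, hi.length);
        int[] a = new int[len], b = new int[len];
        for (int i = 0; i < len; i++) {
            a[i] = i < lo.length ? lo[i] & 0xff : 0;
            b[i] = i < hi.length ? hi[i] & 0xff : 0;
        }
        // sum = a + b, with carry propagation, one extra leading digit.
        int[] sum = new int[len + 1];
        int carry = 0;
        for (int i = len - 1; i >= 0; i--) {
            int s = a[i] + b[i] + carry;
            sum[i + 1] = s & 0xff;
            carry = s >> 8;
        }
        sum[0] = carry;
        // mid = sum / 2, long division over base-256 digits.
        byte[] mid = new byte[len];
        int rem = sum[0];
        for (int i = 1; i <= len; i++) {
            int cur = (rem << 8) + sum[i];
            mid[i - 1] = (byte) (cur >> 1);
            rem = cur & 1;
        }
        return mid;
    }

    public static void main(String[] args) {
        byte[] mid = midpoint(new byte[]{0x00}, new byte[]{0x40});
        System.out.println(Arrays.toString(mid)); // midpoint of 0x00 and 0x40 is 0x20
    }
}
```

Note that this doubles scanner parallelism without adding regions; whether it helps depends on whether the bottleneck is really scanner count rather than disk or mapper slots, which is what Stack's questions above are probing.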

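[Editor's note] Stack's "do you have to have a reducer?" suggestion can be sketched as a map-only job: with zero reducers there is no shuffle or sort, so io.sort.mb and record spilling stop mattering and each mapper writes its output file directly. This is a configuration sketch only, not runnable without a cluster; the table name "mytable" and the mapper body are placeholders, and it assumes the org.apache.hadoop.hbase.mapreduce client API:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapOnlyScan {

    // Placeholder mapper: emits the row key for every row scanned.
    static class MyMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(row.get()), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "map-only-scan");
        job.setJarByClass(MapOnlyScan.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fewer next() RPCs per mapper
        scan.setCacheBlocks(false);  // don't churn the RS block cache from MR

        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, MyMapper.class, Text.class, Text.class, job);

        // Zero reducers: no shuffle, no sort, no io.sort.mb spills.
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the job genuinely needs aggregation, the shuffle stays, but the counters quoted above (Spilled Records at exactly twice Map output records) suggest each record is spilled and re-merged, which is part of where the time is going.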