On Mon, Jul 21, 2014 at 11:08 PM, Li Li <[email protected]> wrote:

> On Tue, Jul 22, 2014 at 1:54 PM, Stack <[email protected]> wrote:
> > On Mon, Jul 21, 2014 at 10:47 PM, Li Li <[email protected]> wrote:
> >
> >> Sorry, I had not finished it.
> >> 1. Yes, I have 20 concurrently running mappers.
> >> 2. I can't add more mappers: I set io.sort.mb to 500 MB, and if I set
> >> 8 mappers per node it hits an OOM exception and the load average is
> >> high.
> >> 3. A fast mapper takes only 1 minute. Its statistics follow:
> >>
> >> HBase Counters
> >>   REMOTE_RPC_CALLS              0
> >>   RPC_CALLS                     523
> >>   RPC_RETRIES                   0
> >>   NOT_SERVING_REGION_EXCEPTION  0
> >>   NUM_SCANNER_RESTARTS          0
> >>   MILLIS_BETWEEN_NEXTS          62,415
> >>   BYTES_IN_RESULTS              1,380,694,667
> >>   BYTES_IN_REMOTE_RESULTS       0
> >>   REGIONS_SCANNED               1
> >>   REMOTE_RPC_RETRIES            0
> >>
> >> FileSystemCounters
> >>   FILE_BYTES_READ               120,508,552
> >>   HDFS_BYTES_READ               176
> >>   FILE_BYTES_WRITTEN            241,000,600
> >>
> >> File Input Format Counters
> >>   Bytes Read                    0
> >>
> >> Map-Reduce Framework
> >>   Map output materialized bytes        120,448,992
> >>   Combine output records               0
> >>   Map input records                    5,208,607
> >>   Physical memory (bytes) snapshot     965,730,304
> >>   Spilled Records                      10,417,214
> >>   Map output bytes                     282,122,973
> >>   CPU time spent (ms)                  82,610
> >>   Total committed heap usage (bytes)   1,061,158,912
> >>   Virtual memory (bytes) snapshot      1,681,047,552
> >>   Combine input records                0
> >>   Map output records                   5,208,607
> >>   SPLIT_RAW_BYTES                      176
> >>
> >> A slow mapper takes 25 minutes.
> >
> > So some mappers take 1 minute and others take 25 minutes?
>
> Yes.
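As a reference for the memory trade-off described above, here is a minimal sketch using Hadoop 1.x property names (the cluster here runs 0.96.2-hadoop1). The concrete values are illustrative assumptions, not tuning advice:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class JobMemoryKnobs {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // A smaller sort buffer leaves more heap per map task, which can
    // make room for more concurrent mappers without hitting OOM.
    conf.setInt("io.sort.mb", 200);
    // Heap for each child task JVM; it must cover io.sort.mb plus the
    // mapper's own working set.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    // Note: the number of concurrent map slots per TaskTracker comes
    // from mapred.tasktracker.map.tasks.maximum in mapred-site.xml,
    // a cluster-side setting rather than a per-job one.
    System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
  }
}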
If you look at the job on its tail, is it waiting on a single mapper to finish, or do all the mappers finish at around the same time? If they finish together, you need to work on either speeding up the scan or upping the parallelism. If skew is adding a long tail to your completion time, get to know your key space better and do splits so the data is spread evenly among the map tasks.

I have asked a few times whether the reduce phase is necessary. If you could redo your job so it was not needed, that would save a bunch.

> > Do the map tasks balance each other out as they run, or are you
> > waiting on one to complete, a really big one?
>
> Sorry, the fastest mapper takes 6 minutes and the slowest mapper takes
> 45 minutes. For the fastest, map input records is 4,469,570; for the
> slowest, 22,335,536.

Yeah, but is the skew the reason your job takes too long to finish?

> >> HBase Counters
> >>   REMOTE_RPC_CALLS              0
> >>   RPC_CALLS                     2,268
> >>   RPC_RETRIES                   0
> >>   NOT_SERVING_REGION_EXCEPTION  0
> >>   NUM_SCANNER_RESTARTS          0
> >>   MILLIS_BETWEEN_NEXTS          907,402
> >>   BYTES_IN_RESULTS              9,459,568,932
> >>   BYTES_IN_REMOTE_RESULTS       0
> >>   REGIONS_SCANNED               1
> >>   REMOTE_RPC_RETRIES            0
> >>
> >> FileSystemCounters
> >>   FILE_BYTES_READ               2,274,832,004
> >>   HDFS_BYTES_READ               161
> >>   FILE_BYTES_WRITTEN            3,770,108,961
> >>
> >> File Input Format Counters
> >>   Bytes Read                    0
> >>
> >> Map-Reduce Framework
> >>   Map output materialized bytes        1,495,451,997
> >>   Combine output records               0
> >>   Map input records                    22,659,551
> >>   Physical memory (bytes) snapshot     976,842,752
> >>   Spilled Records                      57,085,847
> >>   Map output bytes                     3,348,373,811
> >>   CPU time spent (ms)                  1,134,640
> >>   Total committed heap usage (bytes)   945,291,264
> >>   Virtual memory (bytes) snapshot      1,699,991,552
> >>   Combine input records                0
> >>   Map output records                   22,644,687
> >>   SPLIT_RAW_BYTES                      161
> >>
> >> 4. I have about 1.1 billion rows taking 1.3 TB of HDFS usage, and the
> >> replication factor is 2.
> >
> > Make it 3 to be safe?
>
> Yes, but I don't have enough disk; total disk usage (including non-HDFS
> data) is about 60%.

In general, you may need more machines in the mix if you want to meet your requirement. Or do you think the current hardware is underutilized?

> >> 5. For block information, one column family's files:
> >>
> >> Name                              Type  Size       Repl.  Block Size  Modified          Permission  Owner   Group
> >> b8297e0a415a4ddc811009e70aa30371  file  195.43 MB  2      64 MB       2014-07-22 10:16  rw-r--r--   hadoop  supergroup
> >> dea1d498ec6d46ea84ad35ea6cc3cf6e  file  5.12 GB    2      64 MB       2014-07-20 20:24  rw-r--r--   hadoop  supergroup
> >> ee01947bad6f450d89bd71be84d9d60a  file  2.68 MB    2      64 MB       2014-07-22 13:18  rw-r--r--   hadoop  supergroup
> >>
> >> Another example:
> >> 1923bdcf47ed40879ec4a2f6d314167e  file  729.43 MB  2      64 MB       2014-07-18 20:32  rw-r--r--   hadoop  supergroup
> >> 532d56af4457492194c5336f1f1d8359  file  372.27 MB  2      64 MB       2014-07-21 20:55  rw-r--r--   hadoop  supergroup
> >> 55e92aef7b754059be9fc7e4692832ec  file  117.45 MB  2      64 MB       2014-07-22 13:19  rw-r--r--   hadoop  supergroup
> >> c927509f280a4cb3bc5c6db2feea5c16  file  7.87 GB    2      64 MB       2014-07-12 06:55  rw-r--r--   hadoop  supergroup
> >>
> >> 6. I have only one column family for this table.
> >>
> >> 7. Each row has fewer than 10 columns.
> >
> > OK. Does each cell have many versions or just one?
>
> Just one version.

But are you overwriting the data, so that even though you fetch only one version, there may be many versions present?
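On the "do splits" suggestion above: a hedged sketch of splitting a hot region of an existing table at a chosen row key with HBaseAdmin. The table name url_db comes from this thread; the split point is a made-up example, and real split points should come from sampling the actual key distribution:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitHotRegion {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      // Ask the master to split url_db at an example mid-point key so a
      // skewed region's data ends up spread over two map tasks.
      admin.split(Bytes.toBytes("url_db"), Bytes.toBytes("http://m"));
    } finally {
      admin.close();
    }
  }
}

Pre-splitting at table-creation time with a list of split keys is the same idea applied up front.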
> >> 8. Region info from the web UI:
> >>
> >> ServerName                    Stores  Storefiles  Storefile Size  Uncompressed  Index Size  Bloom Size
> >> mphbase1,60020,1405730850512  46      103         126528m         126567mb      94993k      329266k
> >> mphbase2,60020,1405730850549  45      100         157746m         157789mb      117250k     432066k
> >> mphbase3,60020,1405730850546  46      46          53592m          53610mb       42858k      110748k
> >> mphbase4,60020,1405730850585  43      101         109790m         109827mb      83236k      295068k
> >> mphbase5,60020,1405730850652  41      81          89073m          89099mb       66622k      243354k
> >>
> >> 9. url_db has 84 regions.
> >
> > What version of HBase? You've set scan caching to be a decent number,
> > 1000 or so (presuming cells are not massive)?
>
> 0.96.2-hadoop1, r1581096. I have set the cache to 10,000.

Good.
St.Ack

> > St.Ack
> >
> >> On Tue, Jul 22, 2014 at 1:32 PM, Li Li <[email protected]> wrote:
> >> > [...]
> >> >
> >> > On Tue, Jul 22, 2014 at 12:11 PM, Stack <[email protected]> wrote:
> >> >> How many regions now?
> >> >>
> >> >> You still have 20 concurrent mappers running? Are your machines
> >> >> loaded w/ 4 map tasks on each? Can you up the number of concurrent
> >> >> mappers? Can you get an idea of your scan rates? Are all map tasks
> >> >> scanning at the same rate? Does one task lag the others? Do you
> >> >> emit stats on each map task, such as rows processed? Can you figure
> >> >> out your bottleneck? Are you seeking disk all the time? Anything
> >> >> else running while this big scan is going on? How big are your
> >> >> cells? Do you have one or more column families? How many columns?
> >> >>
> >> >> For average region size, do a du on the HDFS region directories,
> >> >> then sum and divide by the region count.
> >> >>
> >> >> St.Ack
> >> >>
> >> >> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <[email protected]> wrote:
> >> >>
> >> >>> Could anyone help? Now I have about 1.1 billion nodes and it takes
> >> >>> 2 hours to finish a MapReduce job.
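Tying together the scan-caching and per-task stats suggestions above, here is a minimal sketch of a map-only scan over url_db that sets caching on the Scan and emits a rows-processed counter per task. The class and counter names are illustrative, and the caching value is the 1000 suggested in the thread rather than the 10,000 actually in use:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanUrlDb {
  // Mapper that does nothing but count rows, so per-task scan rates
  // show up in the job UI (one of the diagnostics asked about above).
  static class RowCountMapper
      extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result row,
        Context context) throws IOException, InterruptedException {
      context.getCounter("scan", "ROWS_PROCESSED").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan url_db");
    job.setJarByClass(ScanUrlDb.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // rows fetched per RPC
    scan.setCacheBlocks(false);  // don't churn the block cache on full scans

    TableMapReduceUtil.initTableMapperJob("url_db", scan,
        RowCountMapper.class, NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);    // map-only, per the "is reduce needed?" question
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}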
> >> >>> To: [email protected] > >> >>> > >> >>> > >> >>> my table has about 700 million rows and about 80 regions. each task > >> >>> tracker is configured with 4 mappers and 4 reducers at the same > time. > >> >>> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20 > >> >>> mappers running. it takes more than an hour to finish mapper stage. > >> >>> The hbase cluster's load is very low, about 2,000 request per > second. > >> >>> I think one mapper for a region is too small. How can I run more > than > >> >>> one mapper for a region so that it can take full advantage of > >> >>> computing resources? > >> >>> > >> >
