On Mon, Jul 21, 2014 at 10:47 PM, Li Li <[email protected]> wrote:

> sorry, I have not finished it.
> 1. yes, I have 20 concurrent running mappers.
> 2. I can't add more mappers because I set io.sort.mb to 500mb, and if I
> set 8 mappers, it hits an OOM exception and the load average is high
> 3. a fast mapper takes only 1 minute. following are the statistics
>
> HBase Counters
>   REMOTE_RPC_CALLS              0
>   RPC_CALLS                     523
>   RPC_RETRIES                   0
>   NOT_SERVING_REGION_EXCEPTION  0
>   NUM_SCANNER_RESTARTS          0
>   MILLIS_BETWEEN_NEXTS          62,415
>   BYTES_IN_RESULTS              1,380,694,667
>   BYTES_IN_REMOTE_RESULTS       0
>   REGIONS_SCANNED               1
>   REMOTE_RPC_RETRIES            0
>
> FileSystemCounters
>   FILE_BYTES_READ               120,508,552
>   HDFS_BYTES_READ               176
>   FILE_BYTES_WRITTEN            241,000,600
>
> File Input Format Counters
>   Bytes Read                    0
>
> Map-Reduce Framework
>   Map output materialized bytes        120,448,992
>   Combine output records               0
>   Map input records                    5,208,607
>   Physical memory (bytes) snapshot     965,730,304
>   Spilled Records                      10,417,214
>   Map output bytes                     282,122,973
>   CPU time spent (ms)                  82,610
>   Total committed heap usage (bytes)   1,061,158,912
>   Virtual memory (bytes) snapshot      1,681,047,552
>   Combine input records                0
>   Map output records                   5,208,607
>   SPLIT_RAW_BYTES                      176
>
> a slow mapper costs 25 minutes
So some mappers take 1 minute and others take 25 minutes? Do the map tasks
balance out as they run, or are you waiting on one really big one to
complete?

> HBase Counters
>   REMOTE_RPC_CALLS              0
>   RPC_CALLS                     2,268
>   RPC_RETRIES                   0
>   NOT_SERVING_REGION_EXCEPTION  0
>   NUM_SCANNER_RESTARTS          0
>   MILLIS_BETWEEN_NEXTS          907,402
>   BYTES_IN_RESULTS              9,459,568,932
>   BYTES_IN_REMOTE_RESULTS       0
>   REGIONS_SCANNED               1
>   REMOTE_RPC_RETRIES            0
>
> FileSystemCounters
>   FILE_BYTES_READ               2,274,832,004
>   HDFS_BYTES_READ               161
>   FILE_BYTES_WRITTEN            3,770,108,961
>
> File Input Format Counters
>   Bytes Read                    0
>
> Map-Reduce Framework
>   Map output materialized bytes        1,495,451,997
>   Combine output records               0
>   Map input records                    22,659,551
>   Physical memory (bytes) snapshot     976,842,752
>   Spilled Records                      57,085,847
>   Map output bytes                     3,348,373,811
>   CPU time spent (ms)                  1,134,640
>   Total committed heap usage (bytes)   945,291,264
>   Virtual memory (bytes) snapshot      1,699,991,552
>   Combine input records                0
>   Map output records                   22,644,687
>   SPLIT_RAW_BYTES                      161
>
> 4. I have about 11 billion rows; it takes 1.3TB (hdfs usage) and the
> replication factor is 2

Make it 3 to be safe?

> 5. block information for one column-family file:
>
>   Name                              Type  Size       Replication  Block Size  Modification Time  Permission  Owner   Group
>   b8297e0a415a4ddc811009e70aa30371  file  195.43 MB  2            64 MB       2014-07-22 10:16   rw-r--r--   hadoop  supergroup
>   dea1d498ec6d46ea84ad35ea6cc3cf6e  file  5.12 GB    2            64 MB       2014-07-20 20:24   rw-r--r--   hadoop  supergroup
>   ee01947bad6f450d89bd71be84d9d60a  file  2.68 MB    2            64 MB       2014-07-22 13:18   rw-r--r--   hadoop  supergroup
>
> another example:
>
>   1923bdcf47ed40879ec4a2f6d314167e  file  729.43 MB  2            64 MB       2014-07-18 20:32   rw-r--r--   hadoop  supergroup
>   532d56af4457492194c5336f1f1d8359  file  372.27 MB  2            64 MB       2014-07-21 20:55   rw-r--r--   hadoop  supergroup
>   55e92aef7b754059be9fc7e4692832ec  file  117.45 MB  2            64 MB       2014-07-22 13:19   rw-r--r--   hadoop  supergroup
>   c927509f280a4cb3bc5c6db2feea5c16  file  7.87 GB    2            64 MB       2014-07-12 06:55   rw-r--r--   hadoop  supergroup
>
> 6. I have only one column family for this table
>
> 7. each row has less than 10 columns

Ok. Does each cell have many versions or just one?

> 8. region info in the web ui:
>
>   ServerName                    Num. Stores  Num. Storefiles  Storefile Size  Uncompressed Storefile Size  Index Size  Bloom Size
>   mphbase1,60020,1405730850512  46           103              126528m         126567mb                     94993k      329266k
>   mphbase2,60020,1405730850549  45           100              157746m         157789mb                     117250k     432066k
>   mphbase3,60020,1405730850546  46           46               53592m          53610mb                      42858k      110748k
>   mphbase4,60020,1405730850585  43           101              109790m         109827mb                     83236k      295068k
>   mphbase5,60020,1405730850652  41           81               89073m          89099mb                      66622k      243354k
>
> 9. url_db has 84 regions

What version of HBase? You've set scan caching to be a decent number? 1000 or
so (presuming cells are not massive)?

St.Ack
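For reference, scan caching for a TableMapper job is configured on the Scan
object handed to TableMapReduceUtil. Below is a minimal sketch against the
2014-era (0.94/0.96) APIs; the job class and the trivial row-counting mapper
are invented for illustration, not from the thread, but the counter is one
way to get the per-task scan rates asked about earlier:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class UrlDbScanJob {

      // Trivial mapper: counts rows so per-task scan rates show up
      // in the job counters.
      static class CountMapper
          extends TableMapper<ImmutableBytesWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value,
            Context context) throws IOException, InterruptedException {
          context.getCounter("app", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "url_db scan");
        job.setJarByClass(UrlDbScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(1000);       // rows per RPC: the "decent number" above
        scan.setCacheBlocks(false);  // don't churn the block cache on a full scan

        TableMapReduceUtil.initTableMapperJob(
            "url_db", scan, CountMapper.class,
            ImmutableBytesWritable.class, LongWritable.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);
        job.waitForCompletion(true);
      }
    }

setCacheBlocks(false) matters on a full-table scan: otherwise the scan evicts
the hot working set from the region servers' block cache.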
> On Tue, Jul 22, 2014 at 1:32 PM, Li Li <[email protected]> wrote:
> > 1. yes, I have 20 concurrent running mappers.
> > 2. I can't add more mappers because I set io.sort.mb to 500mb, and if I
> > set 8 mappers, it hits an OOM exception and the load average is high
> > 3. a fast mapper takes only 1 minute. following are the statistics
> >
> > HBase Counters
> >   REMOTE_RPC_CALLS              0
> >   RPC_CALLS                     523
> >   RPC_RETRIES                   0
> >   NOT_SERVING_REGION_EXCEPTION  0
> >   NUM_SCANNER_RESTARTS          0
> >   MILLIS_BETWEEN_NEXTS          62,415
> >   BYTES_IN_RESULTS              1,380,694,667
> >   BYTES_IN_REMOTE_RESULTS       0
> >   REGIONS_SCANNED               1
> >   REMOTE_RPC_RETRIES            0
> >
> > FileSystemCounters
> >   FILE_BYTES_READ               120,508,552
> >   HDFS_BYTES_READ               176
> >   FILE_BYTES_WRITTEN            241,000,600
> >
> > File Input Format Counters
> >   Bytes Read                    0
> >
> > Map-Reduce Framework
> >   Map output materialized bytes        120,448,992
> >   Combine output records               0
> >   Map input records                    5,208,607
> >   Physical memory (bytes) snapshot     965,730,304
> >   Spilled Records                      10,417,214
> >   Map output bytes                     282,122,973
> >   CPU time spent (ms)                  82,610
> >   Total committed heap usage (bytes)   1,061,158,912
> >   Virtual memory (bytes) snapshot      1,681,047,552
> >   Combine input records                0
> >   Map output records                   5,208,607
> >   SPLIT_RAW_BYTES                      176
> >
> > On Tue, Jul 22, 2014 at 12:11 PM, Stack <[email protected]> wrote:
> >> How many regions now?
> >>
> >> You still have 20 concurrent mappers running? Are your machines loaded
> >> w/ 4 map tasks on each? Can you up the number of concurrent mappers?
> >> Can you get an idea of your scan rates? Are all map tasks scanning at
> >> the same rate? Does one task lag the others? Do you emit stats on each
> >> map task, such as rows processed? Can you figure out your bottleneck?
> >> Are you seeking disk all the time? Anything else running while this big
> >> scan is going on? How big are your cells? Do you have one or more
> >> column families? How many columns?
> >>
> >> For average region size, do du on the hdfs region directories, then sum
> >> and divide by the region count.
> >>
> >> St.Ack
> >>
> >> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <[email protected]> wrote:
> >>
> >>> can anyone help? now I have about 1.1 billion nodes and it takes 2
> >>> hours to finish a map reduce job.
> >>>
> >>> ---------- Forwarded message ----------
> >>> From: Li Li <[email protected]>
> >>> Date: Thu, Jun 26, 2014 at 3:34 PM
> >>> Subject: how to do parallel scanning in map reduce using hbase as input?
> >>> To: [email protected]
> >>>
> >>> my table has about 700 million rows and about 80 regions. each task
> >>> tracker is configured to run 4 mappers and 4 reducers at a time. The
> >>> hadoop/hbase cluster has 5 nodes, so it has 20 mappers running at any
> >>> one time. it takes more than an hour to finish the mapper stage. The
> >>> hbase cluster's load is very low, about 2,000 requests per second. I
> >>> think one mapper per region is too few. How can I run more than one
> >>> mapper for a region so that the job takes full advantage of the
> >>> computing resources?
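Stack's "du on the hdfs region directories" suggestion in the quoted message
is easy to script. A sketch with the plain FileSystem API, assuming the
0.94-style /hbase/<table>/<region> layout (0.96+ moves tables under
/hbase/data/<namespace>/ instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AvgRegionSize {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long totalBytes = 0;
        int regions = 0;
        for (FileStatus st : fs.listStatus(new Path("/hbase/url_db"))) {
          // Region dirs are the hex-named subdirectories of the table dir;
          // skip files like .tableinfo that sit alongside them.
          if (!st.isDirectory() || st.getPath().getName().startsWith(".")) {
            continue;
          }
          // getLength() is pre-replication bytes, i.e. what du reports.
          totalBytes += fs.getContentSummary(st.getPath()).getLength();
          regions++;
        }
        if (regions > 0) {
          System.out.printf("%d regions, avg %.1f MB per region%n",
              regions, totalBytes / (double) regions / (1 << 20));
        }
      }
    }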
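As for the original question of running more than one mapper per region: one
common approach is to subclass TableInputFormat and cut each region's split
into smaller row ranges, so several map tasks scan disjoint parts of the same
region. A rough sketch, assuming the 0.94-era TableSplit constructor (byte[]
table name, start row, end row, location); later versions change the
signature and grow built-in split balancing options:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // Doubles the mapper count by cutting each per-region split in two.
    public class HalvingTableInputFormat extends TableInputFormat {
      @Override
      public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> out = new ArrayList<InputSplit>();
        for (InputSplit split : super.getSplits(context)) {
          TableSplit ts = (TableSplit) split;
          byte[] start = ts.getStartRow();
          byte[] end = ts.getEndRow();
          // First and last regions have empty boundary keys, and Bytes.split
          // may fail on pathological ranges; keep those regions whole.
          byte[][] cut = (start.length == 0 || end.length == 0)
              ? null : Bytes.split(start, end, 1);
          if (cut == null) {
            out.add(ts);
          } else {
            // cut = {start, midpoint, end}; both halves stay on the same
            // region server, so locality is preserved.
            out.add(new TableSplit(ts.getTableName(), start, cut[1],
                ts.getRegionLocation()));
            out.add(new TableSplit(ts.getTableName(), cut[1], end,
                ts.getRegionLocation()));
          }
        }
        return out;
      }
    }

Wire it in after TableMapReduceUtil.initTableMapperJob with
job.setInputFormatClass(HalvingTableInputFormat.class). Whether it helps
depends on the real bottleneck: if the region server disks are already
saturated, more mappers per region just contend for the same spindles.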
