On Tue, Jul 22, 2014 at 1:54 PM, Stack <[email protected]> wrote:

> On Mon, Jul 21, 2014 at 10:47 PM, Li Li <[email protected]> wrote:
>
>> sorry, I have not finished it.
>> 1. yes, I have 20 concurrent running mappers.
>> 2. I can't add more mappers because I set io.sort.mb to 500 MB, and if I
>> set 8 mappers per node it hits an OOM exception and the load average is
>> high.
>> 3. a fast mapper takes only 1 minute. the statistics follow:
>>
>> HBase Counters
>> REMOTE_RPC_CALLS 0
>> RPC_CALLS 523
>> RPC_RETRIES 0
>> NOT_SERVING_REGION_EXCEPTION 0
>> NUM_SCANNER_RESTARTS 0
>> MILLIS_BETWEEN_NEXTS 62,415
>> BYTES_IN_RESULTS 1,380,694,667
>> BYTES_IN_REMOTE_RESULTS 0
>> REGIONS_SCANNED 1
>> REMOTE_RPC_RETRIES 0
>>
>> FileSystemCounters
>> FILE_BYTES_READ 120,508,552
>> HDFS_BYTES_READ 176
>> FILE_BYTES_WRITTEN 241,000,600
>>
>> File Input Format Counters
>> Bytes Read 0
>>
>> Map-Reduce Framework
>> Map output materialized bytes 120,448,992
>> Combine output records 0
>> Map input records 5,208,607
>> Physical memory (bytes) snapshot 965,730,304
>> Spilled Records 10,417,214
>> Map output bytes 282,122,973
>> CPU time spent (ms) 82,610
>> Total committed heap usage (bytes) 1,061,158,912
>> Virtual memory (bytes) snapshot 1,681,047,552
>> Combine input records 0
>> Map output records 5,208,607
>> SPLIT_RAW_BYTES 176
>>
>> a slow mapper takes 25 minutes
>
> So some mappers take 1 minute and others take 25 minutes?

yes

> Do the map tasks balance each other out as they run or are you waiting on
> one to complete, a really big one?
sorry, the fastest mapper takes 6 minutes and the slowest mapper takes 45
minutes. for the fastest, map input records is 4,469,570; for the slowest
one, input records is 22,335,536.

>> HBase Counters
>> REMOTE_RPC_CALLS 0
>> RPC_CALLS 2,268
>> RPC_RETRIES 0
>> NOT_SERVING_REGION_EXCEPTION 0
>> NUM_SCANNER_RESTARTS 0
>> MILLIS_BETWEEN_NEXTS 907,402
>> BYTES_IN_RESULTS 9,459,568,932
>> BYTES_IN_REMOTE_RESULTS 0
>> REGIONS_SCANNED 1
>> REMOTE_RPC_RETRIES 0
>>
>> FileSystemCounters
>> FILE_BYTES_READ 2,274,832,004
>> HDFS_BYTES_READ 161
>> FILE_BYTES_WRITTEN 3,770,108,961
>>
>> File Input Format Counters
>> Bytes Read 0
>>
>> Map-Reduce Framework
>> Map output materialized bytes 1,495,451,997
>> Combine output records 0
>> Map input records 22,659,551
>> Physical memory (bytes) snapshot 976,842,752
>> Spilled Records 57,085,847
>> Map output bytes 3,348,373,811
>> CPU time spent (ms) 1,134,640
>> Total committed heap usage (bytes) 945,291,264
>> Virtual memory (bytes) snapshot 1,699,991,552
>> Combine input records 0
>> Map output records 22,644,687
>> SPLIT_RAW_BYTES 161
>>
>> 4. I have about 11 billion rows and it takes 1.3 TB (hdfs usage) and the
>> replication factor is 2
>
> Make it 3 to be safe?

yes, but I don't have enough disk. the total disk usage (including non-hdfs)
is about 60%.

>> 5.
>> for block information,
>> one column family file:
>>
>> Name Type Size Replication Block Size Modification Time Permission Owner Group
>> b8297e0a415a4ddc811009e70aa30371 file 195.43 MB 2 64 MB 2014-07-22 10:16 rw-r--r-- hadoop supergroup
>> dea1d498ec6d46ea84ad35ea6cc3cf6e file 5.12 GB 2 64 MB 2014-07-20 20:24 rw-r--r-- hadoop supergroup
>> ee01947bad6f450d89bd71be84d9d60a file 2.68 MB 2 64 MB 2014-07-22 13:18 rw-r--r-- hadoop supergroup
>>
>> another example:
>> 1923bdcf47ed40879ec4a2f6d314167e file 729.43 MB 2 64 MB 2014-07-18 20:32 rw-r--r-- hadoop supergroup
>> 532d56af4457492194c5336f1f1d8359 file 372.27 MB 2 64 MB 2014-07-21 20:55 rw-r--r-- hadoop supergroup
>> 55e92aef7b754059be9fc7e4692832ec file 117.45 MB 2 64 MB 2014-07-22 13:19 rw-r--r-- hadoop supergroup
>> c927509f280a4cb3bc5c6db2feea5c16 file 7.87 GB 2 64 MB 2014-07-12 06:55 rw-r--r-- hadoop supergroup
>
>> 6. I have only one column family for this table
>>
>> 7. each row has fewer than 10 columns
>
> Ok.
>
> Does each cell have many versions or just one?

just one version

>> 8. region info in web ui:
>>
>> ServerName Num. Stores Num. Storefiles Storefile Size Uncompressed Storefile Size Index Size Bloom Size
>> mphbase1,60020,1405730850512 46 103 126528m 126567mb 94993k 329266k
>> mphbase2,60020,1405730850549 45 100 157746m 157789mb 117250k 432066k
>> mphbase3,60020,1405730850546 46 46 53592m 53610mb 42858k 110748k
>> mphbase4,60020,1405730850585 43 101 109790m 109827mb 83236k 295068k
>> mphbase5,60020,1405730850652 41 81 89073m 89099mb 66622k 243354k
>>
>> 9. url_db has 84 regions

> What version of HBase? You've set scan caching to be a decent number?
> 1000 or so (presuming cells are not massive)?

0.96.2-hadoop1, r1581096. I have set scan caching to 10,000.
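The caching setting can be sanity-checked against the slow mapper's counters quoted earlier in this thread: 22,659,551 rows fetched 10,000 at a time should need roughly 2,266 next() RPCs, which lines up with the observed RPC_CALLS of 2,268 (the small difference being scanner open/close overhead). A minimal sketch of that arithmetic, in plain Java with the numbers copied from the counters above:

```java
public class ScanRpcEstimate {
    public static void main(String[] args) {
        long rows = 22_659_551L;      // slow mapper's "Map input records"
        int caching = 10_000;         // the Scan caching value reported in the thread
        // ceiling division: one next() RPC per batch of `caching` rows
        long nextCalls = (rows + caching - 1) / caching;
        System.out.println("estimated next() RPCs: " + nextCalls);

        // observed RPC_CALLS was 2,268, so caching is behaving as configured
        long bytes = 9_459_568_932L;  // BYTES_IN_RESULTS
        System.out.println("avg bytes per RPC: " + bytes / nextCalls);
    }
}
```

So each RPC is carrying roughly 4 MB of results, which suggests the scan itself is configured sensibly and the bottleneck is elsewhere (region skew, spills).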
>
> St.Ack
>
>> On Tue, Jul 22, 2014 at 1:32 PM, Li Li <[email protected]> wrote:
>> > 1. yes, I have 20 concurrent running mappers.
>> > 2. I can't add more mappers because I set io.sort.mb to 500 MB, and if
>> > I set 8 mappers per node it hits an OOM exception and the load average
>> > is high.
>> > 3. a fast mapper takes only 1 minute. the statistics follow:
>> > [counters snipped -- identical to the fast-mapper statistics quoted above]
>> >
>> > On Tue, Jul 22, 2014 at 12:11 PM, Stack <[email protected]> wrote:
>> >> How many regions now?
>> >>
>> >> You still have 20 concurrent mappers running? Are your machines loaded
>> >> w/ 4 map tasks on each? Can you up the number of concurrent mappers?
>> >> Can you get an idea of your scan rates? Are all map tasks scanning at
>> >> the same rate? Does one task lag the others? Do you emit stats on each
>> >> map task such as rows processed? Can you figure your bottleneck? Are
>> >> you seeking disk all the time? Anything else running while this big
>> >> scan is going on? How big are your cells? Do you have one or more
>> >> column families? How many columns?
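The fast- and slow-mapper counters quoted earlier in the thread quantify the imbalance Stack is asking about. A small plain-Java check, with the numbers copied from those counters:

```java
public class MapperSkew {
    public static void main(String[] args) {
        // "Map input records" from the two mappers reported above
        long fastestInput = 4_469_570L;
        long slowestInput = 22_335_536L;
        double skew = (double) slowestInput / fastestInput;
        System.out.printf("input skew: %.1fx%n", skew);

        // Spilled Records vs. Map output records: a higher ratio for the
        // slow mapper suggests extra spill/merge passes, i.e. io.sort.mb
        // pressure on top of the larger region.
        System.out.printf("fast mapper spill ratio: %.1f%n",
                (double) 10_417_214L / 5_208_607L);
        System.out.printf("slow mapper spill ratio: %.1f%n",
                (double) 57_085_847L / 22_644_687L);
    }
}
```

A 5x spread in input records per mapper, with one mapper per region, means the job's wall-clock time is set by the single largest region regardless of cluster capacity.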
>> >>
>> >> For average region size, do du on the hdfs region directories and then
>> >> sum and divide by region count.
>> >>
>> >> St.Ack
>> >>
>> >> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <[email protected]> wrote:
>> >>
>> >>> anyone could help? now I have about 1.1 billion nodes and it takes 2
>> >>> hours to finish a map reduce job.
>> >>>
>> >>> ---------- Forwarded message ----------
>> >>> From: Li Li <[email protected]>
>> >>> Date: Thu, Jun 26, 2014 at 3:34 PM
>> >>> Subject: how to do parallel scanning in map reduce using hbase as input?
>> >>> To: [email protected]
>> >>>
>> >>> my table has about 700 million rows and about 80 regions. each task
>> >>> tracker is configured with 4 mappers and 4 reducers at the same time.
>> >>> the hadoop/hbase cluster has 5 nodes, so at any given time it has 20
>> >>> mappers running. it takes more than an hour to finish the mapper
>> >>> stage, yet the hbase cluster's load is very low, about 2,000 requests
>> >>> per second. I think one mapper per region is too few. how can I run
>> >>> more than one mapper for a region so that the job takes full advantage
>> >>> of the computing resources?
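On the original question -- running more than one mapper per region -- one common approach is to subclass TableInputFormat and override getSplits() so that each region's [startKey, endKey) range is cut into several sub-ranges, emitting one TableSplit per sub-range. The sketch below shows only the range-cutting arithmetic, using a one-byte key prefix for readability; the class and method names are illustrative, not HBase API, and real code would interpolate over full byte[] row keys (HBase's Bytes.split utility can help with that).

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: the key-range arithmetic at the heart of a custom
// getSplits() that assigns several mappers to one region. Keys are
// modeled as a single integer prefix (0x00..0xFF) for clarity.
public class RegionSubSplits {
    static List<int[]> subSplits(int startKey, int endKey, int perRegion) {
        List<int[]> splits = new ArrayList<>();
        int span = endKey - startKey;
        for (int i = 0; i < perRegion; i++) {
            int s = startKey + span * i / perRegion;
            // the last sub-range must end exactly at the region's end key
            int e = (i == perRegion - 1) ? endKey
                                         : startKey + span * (i + 1) / perRegion;
            splits.add(new int[]{s, e});
        }
        return splits;
    }

    public static void main(String[] args) {
        // one region covering key prefixes 0x00..0xFF, cut for 4 mappers
        for (int[] se : subSplits(0x00, 0x100, 4)) {
            System.out.printf("split [%02x, %02x)%n", se[0], se[1]);
        }
    }
}
```

Each mapper then opens its own Scan bounded by its sub-range, so a 5x-oversized region is worked on by several tasks at once instead of gating the whole job.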
