On Mon, Jul 21, 2014 at 10:47 PM, Li Li <[email protected]> wrote:

> sorry, I have not finished it.
> 1. yes, I have 20 concurrent running mappers.
> 2. I can't add more mappers because I set io.sort.mb to 500mb, and if I
> set 8 mappers, it hits an OOM exception and the load average is high
> 3. a fast mapper takes only 1 minute. following are the statistics
>
> HBase Counters
>   REMOTE_RPC_CALLS              0
>   RPC_CALLS                     523
>   RPC_RETRIES                   0
>   NOT_SERVING_REGION_EXCEPTION  0
>   NUM_SCANNER_RESTARTS          0
>   MILLIS_BETWEEN_NEXTS          62,415
>   BYTES_IN_RESULTS              1,380,694,667
>   BYTES_IN_REMOTE_RESULTS       0
>   REGIONS_SCANNED               1
>   REMOTE_RPC_RETRIES            0
>
> FileSystemCounters
>   FILE_BYTES_READ               120,508,552
>   HDFS_BYTES_READ               176
>   FILE_BYTES_WRITTEN            241,000,600
>
> File Input Format Counters
>   Bytes Read                    0
>
> Map-Reduce Framework
>   Map output materialized bytes        120,448,992
>   Combine output records               0
>   Map input records                    5,208,607
>   Physical memory (bytes) snapshot     965,730,304
>   Spilled Records                      10,417,214
>   Map output bytes                     282,122,973
>   CPU time spent (ms)                  82,610
>   Total committed heap usage (bytes)   1,061,158,912
>   Virtual memory (bytes) snapshot      1,681,047,552
>   Combine input records                0
>   Map output records                   5,208,607
>   SPLIT_RAW_BYTES                      176
>
> a slow mapper costs 25 minutes
So some mappers take 1 minute and others take 25 minutes? Do the map tasks
balance out as they run, or are you waiting on one really big one to
complete?

> HBase Counters
>   REMOTE_RPC_CALLS              0
>   RPC_CALLS                     2,268
>   RPC_RETRIES                   0
>   NOT_SERVING_REGION_EXCEPTION  0
>   NUM_SCANNER_RESTARTS          0
>   MILLIS_BETWEEN_NEXTS          907,402
>   BYTES_IN_RESULTS              9,459,568,932
>   BYTES_IN_REMOTE_RESULTS       0
>   REGIONS_SCANNED               1
>   REMOTE_RPC_RETRIES            0
>
> FileSystemCounters
>   FILE_BYTES_READ               2,274,832,004
>   HDFS_BYTES_READ               161
>   FILE_BYTES_WRITTEN            3,770,108,961
>
> File Input Format Counters
>   Bytes Read                    0
>
> Map-Reduce Framework
>   Map output materialized bytes        1,495,451,997
>   Combine output records               0
>   Map input records                    22,659,551
>   Physical memory (bytes) snapshot     976,842,752
>   Spilled Records                      57,085,847
>   Map output bytes                     3,348,373,811
>   CPU time spent (ms)                  1,134,640
>   Total committed heap usage (bytes)   945,291,264
>   Virtual memory (bytes) snapshot      1,699,991,552
>   Combine input records                0
>   Map output records                   22,644,687
>   SPLIT_RAW_BYTES                      161
>
> 4. I have about 11 billion rows; it takes 1.3TB (hdfs usage) and the
> replication factor is 2

Make it 3 to be safe?

> 5. block information for one column-family file:
>
>   Name                              Type  Size       Replication  Block Size  Modification Time  Permission  Owner   Group
>   b8297e0a415a4ddc811009e70aa30371  file  195.43 MB  2            64 MB       2014-07-22 10:16   rw-r--r--   hadoop  supergroup
>   dea1d498ec6d46ea84ad35ea6cc3cf6e  file  5.12 GB    2            64 MB       2014-07-20 20:24   rw-r--r--   hadoop  supergroup
>   ee01947bad6f450d89bd71be84d9d60a  file  2.68 MB    2            64 MB       2014-07-22 13:18   rw-r--r--   hadoop  supergroup
>
> another example:
>
>   1923bdcf47ed40879ec4a2f6d314167e  file  729.43 MB  2            64 MB       2014-07-18 20:32   rw-r--r--   hadoop  supergroup
>   532d56af4457492194c5336f1f1d8359  file  372.27 MB  2            64 MB       2014-07-21 20:55   rw-r--r--   hadoop  supergroup
>   55e92aef7b754059be9fc7e4692832ec  file  117.45 MB  2            64 MB       2014-07-22 13:19   rw-r--r--   hadoop  supergroup
>   c927509f280a4cb3bc5c6db2feea5c16  file  7.87 GB    2            64 MB       2014-07-12 06:55   rw-r--r--   hadoop  supergroup
>
> 6. I have only one column family for this table
>
> 7. each row has less than 10 columns

Ok. Does each cell have many versions or just one?

> 8. region info in the web ui:
>
>   ServerName                    Num. Stores  Num. Storefiles  Storefile Size  Uncompressed Storefile Size  Index Size  Bloom Size
>   mphbase1,60020,1405730850512  46           103              126528m         126567mb                     94993k      329266k
>   mphbase2,60020,1405730850549  45           100              157746m         157789mb                     117250k     432066k
>   mphbase3,60020,1405730850546  46           46               53592m          53610mb                      42858k      110748k
>   mphbase4,60020,1405730850585  43           101              109790m         109827mb                     83236k      295068k
>   mphbase5,60020,1405730850652  41           81               89073m          89099mb                      66622k      243354k
>
> 9. url_db has 84 regions

What version of HBase? You've set scan caching to be a decent number? 1000 or
so (presuming cells are not massive)?

St.Ack
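For reference, scan caching for a TableMapper job is configured on the Scan
object handed to TableMapReduceUtil. Below is a minimal sketch against the
2014-era (0.94/0.96) APIs; the job class and the trivial row-counting mapper
are invented for illustration, not from the thread, but the counter is one
way to get the per-task scan rates asked about earlier:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class UrlDbScanJob {

      // Trivial mapper: counts rows so per-task scan rates show up
      // in the job counters.
      static class CountMapper
          extends TableMapper<ImmutableBytesWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value,
            Context context) throws IOException, InterruptedException {
          context.getCounter("app", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "url_db scan");
        job.setJarByClass(UrlDbScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(1000);       // rows per RPC: the "decent number" above
        scan.setCacheBlocks(false);  // don't churn the block cache on a full scan

        TableMapReduceUtil.initTableMapperJob(
            "url_db", scan, CountMapper.class,
            ImmutableBytesWritable.class, LongWritable.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);
        job.waitForCompletion(true);
      }
    }

setCacheBlocks(false) matters on a full-table scan: otherwise the scan evicts
the hot working set from the region servers' block cache.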
> On Tue, Jul 22, 2014 at 1:32 PM, Li Li <[email protected]> wrote:
> > 1. yes, I have 20 concurrent running mappers.
> > 2. I can't add more mappers because I set io.sort.mb to 500mb, and if I
> > set 8 mappers, it hits an OOM exception and the load average is high
> > 3. a fast mapper takes only 1 minute. following are the statistics
> >
> > HBase Counters
> >   REMOTE_RPC_CALLS              0
> >   RPC_CALLS                     523
> >   RPC_RETRIES                   0
> >   NOT_SERVING_REGION_EXCEPTION  0
> >   NUM_SCANNER_RESTARTS          0
> >   MILLIS_BETWEEN_NEXTS          62,415
> >   BYTES_IN_RESULTS              1,380,694,667
> >   BYTES_IN_REMOTE_RESULTS       0
> >   REGIONS_SCANNED               1
> >   REMOTE_RPC_RETRIES            0
> >
> > FileSystemCounters
> >   FILE_BYTES_READ               120,508,552
> >   HDFS_BYTES_READ               176
> >   FILE_BYTES_WRITTEN            241,000,600
> >
> > File Input Format Counters
> >   Bytes Read                    0
> >
> > Map-Reduce Framework
> >   Map output materialized bytes        120,448,992
> >   Combine output records               0
> >   Map input records                    5,208,607
> >   Physical memory (bytes) snapshot     965,730,304
> >   Spilled Records                      10,417,214
> >   Map output bytes                     282,122,973
> >   CPU time spent (ms)                  82,610
> >   Total committed heap usage (bytes)   1,061,158,912
> >   Virtual memory (bytes) snapshot      1,681,047,552
> >   Combine input records                0
> >   Map output records                   5,208,607
> >   SPLIT_RAW_BYTES                      176
> >
> > On Tue, Jul 22, 2014 at 12:11 PM, Stack <[email protected]> wrote:
> >> How many regions now?
> >>
> >> You still have 20 concurrent mappers running? Are your machines loaded
> >> w/ 4 map tasks on each? Can you up the number of concurrent mappers?
> >> Can you get an idea of your scan rates? Are all map tasks scanning at
> >> the same rate? Does one task lag the others? Do you emit stats on each
> >> map task, such as rows processed? Can you figure out your bottleneck?
> >> Are you seeking disk all the time? Anything else running while this big
> >> scan is going on? How big are your cells? Do you have one or more
> >> column families? How many columns?
> >>
> >> For average region size, do du on the hdfs region directories, then sum
> >> and divide by the region count.
> >>
> >> St.Ack
> >>
> >> On Mon, Jul 21, 2014 at 7:30 PM, Li Li <[email protected]> wrote:
> >>
> >>> can anyone help? now I have about 1.1 billion nodes and it takes 2
> >>> hours to finish a map reduce job.
> >>>
> >>> ---------- Forwarded message ----------
> >>> From: Li Li <[email protected]>
> >>> Date: Thu, Jun 26, 2014 at 3:34 PM
> >>> Subject: how to do parallel scanning in map reduce using hbase as input?
> >>> To: [email protected]
> >>>
> >>> my table has about 700 million rows and about 80 regions. each task
> >>> tracker is configured to run 4 mappers and 4 reducers at a time. The
> >>> hadoop/hbase cluster has 5 nodes, so it has 20 mappers running at any
> >>> one time. it takes more than an hour to finish the mapper stage. The
> >>> hbase cluster's load is very low, about 2,000 requests per second. I
> >>> think one mapper per region is too few. How can I run more than one
> >>> mapper for a region so that the job takes full advantage of the
> >>> computing resources?
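Stack's "du on the hdfs region directories" suggestion in the quoted message
is easy to script. A sketch with the plain FileSystem API, assuming the
0.94-style /hbase/<table>/<region> layout (0.96+ moves tables under
/hbase/data/<namespace>/ instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AvgRegionSize {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long totalBytes = 0;
        int regions = 0;
        for (FileStatus st : fs.listStatus(new Path("/hbase/url_db"))) {
          // Region dirs are the hex-named subdirectories of the table dir;
          // skip files like .tableinfo that sit alongside them.
          if (!st.isDirectory() || st.getPath().getName().startsWith(".")) {
            continue;
          }
          // getLength() is pre-replication bytes, i.e. what du reports.
          totalBytes += fs.getContentSummary(st.getPath()).getLength();
          regions++;
        }
        if (regions > 0) {
          System.out.printf("%d regions, avg %.1f MB per region%n",
              regions, totalBytes / (double) regions / (1 << 20));
        }
      }
    }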
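As for the original question of running more than one mapper per region: one
common approach is to subclass TableInputFormat and cut each region's split
into smaller row ranges, so several map tasks scan disjoint parts of the same
region. A rough sketch, assuming the 0.94-era TableSplit constructor (byte[]
table name, start row, end row, location); later versions change the
signature and grow built-in split balancing options:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // Doubles the mapper count by cutting each per-region split in two.
    public class HalvingTableInputFormat extends TableInputFormat {
      @Override
      public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> out = new ArrayList<InputSplit>();
        for (InputSplit split : super.getSplits(context)) {
          TableSplit ts = (TableSplit) split;
          byte[] start = ts.getStartRow();
          byte[] end = ts.getEndRow();
          // First and last regions have empty boundary keys, and Bytes.split
          // may fail on pathological ranges; keep those regions whole.
          byte[][] cut = (start.length == 0 || end.length == 0)
              ? null : Bytes.split(start, end, 1);
          if (cut == null) {
            out.add(ts);
          } else {
            // cut = {start, midpoint, end}; both halves stay on the same
            // region server, so locality is preserved.
            out.add(new TableSplit(ts.getTableName(), start, cut[1],
                ts.getRegionLocation()));
            out.add(new TableSplit(ts.getTableName(), cut[1], end,
                ts.getRegionLocation()));
          }
        }
        return out;
      }
    }

Wire it in after TableMapReduceUtil.initTableMapperJob with
job.setInputFormatClass(HalvingTableInputFormat.class). Whether it helps
depends on the real bottleneck: if the region server disks are already
saturated, more mappers per region just contend for the same spindles.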
