On Mon, Jul 21, 2014 at 10:53 PM, Li Li <[email protected]> wrote:

> Sorry, I enter tab and it send my unfinished post. See the following
> mail for answers of other questions.
>
> I forget the exception's detail. It throws exception in terminal.

What exception is thrown?

> The default io.sort.mb is 100 and I set it to 500 to speed up reducer.

Do you have to have a reducer? If you could skip the shuffle...

> So I set mapred.child.java.opts to 1g
>
> The datanode/regionserver has 16GB memory but free memory

Does the RS use the 16G?

> for map-reduce is about 5gb. So I can't add more mappers

How much RAM in these machines?

St.Ack

> On Tue, Jul 22, 2014 at 1:37 PM, Stack <[email protected]> wrote:
> > On Mon, Jul 21, 2014 at 10:32 PM, Li Li <[email protected]> wrote:
> >
> >> 1. yes, I have 20 concurrent running mappers.
> >> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
> >> set 8 mappers, it hit oov exception and load average is high
> >
> > What is OOV?
> >
> > Do you have to have a reducer?
> >
> > Load average is high? How high?
> >
> >> 3. fast mapper only use 1 minute. following is the statistics
> >
> > So each region is only taking 1 minute to scan? 1.4Gs scanned?
> >
> > Can you add other counters to your MR job so we can get more of an idea
> > of what is going on in it?
> >
> > Please answer my other questions.
> >
> > Thanks,
> > St.Ack
> >
> >> HBase Counters
> >> REMOTE_RPC_CALLS 0
> >> RPC_CALLS 523
> >> RPC_RETRIES 0
> >> NOT_SERVING_REGION_EXCEPTION 0
> >> NUM_SCANNER_RESTARTS 0
> >> MILLIS_BETWEEN_NEXTS 62,415
> >> BYTES_IN_RESULTS 1,380,694,667
> >> BYTES_IN_REMOTE_RESULTS 0
> >> REGIONS_SCANNED 1
> >> REMOTE_RPC_RETRIES 0
> >>
> >> FileSystemCounters
> >> FILE_BYTES_READ 120,508,552
> >> HDFS_BYTES_READ 176
> >> FILE_BYTES_WRITTEN 241,000,600
> >>
> >> File Input Format Counters
> >> Bytes Read 0
> >>
> >> Map-Reduce Framework
> >> Map output materialized bytes 120,448,992
> >> Combine output records 0
> >> Map input records 5,208,607
> >> Physical memory (bytes) snapshot 965,730,304
> >> Spilled Records 10,417,214
> >> Map output bytes 282,122,973
> >> CPU time spent (ms) 82,610
> >> Total committed heap usage (bytes) 1,061,158,912
> >> Virtual memory (bytes) snapshot 1,681,047,552
> >> Combine input records 0
> >> Map output records 5,208,607
> >> SPLIT_RAW_BYTES 176
> >>
> >> On Tue, Jul 22, 2014 at 12:11 PM, Stack <[email protected]> wrote:
> >> > How many regions now?
> >> >
> >> > You still have 20 concurrent mappers running? Are your machines loaded
> >> > w/ 4 map tasks on each? Can you up the number of concurrent mappers?
> >> > Can you get an idea of your scan rates? Are all map tasks scanning at
> >> > same rate? Does one task lag the others? Do you emit stats on each map
> >> > task such as rows processed? Can you figure your bottleneck? Are you
> >> > seeking disk all the time? Anything else running while this big scan
> >> > is going on? How big are your cells? Do you have one or more column
> >> > families? How many columns?
> >> >
> >> > For average region size, do du on the hdfs region directories and then
> >> > sum and divide by region count.
> >> >
> >> > St.Ack
> >> >
> >> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <[email protected]> wrote:
> >> >
> >> >> anyone could help? now I have about 1.1 billion nodes and it takes 2
> >> >> hours to finish a map reduce job.
> >> >>
> >> >> ---------- Forwarded message ----------
> >> >> From: Li Li <[email protected]>
> >> >> Date: Thu, Jun 26, 2014 at 3:34 PM
> >> >> Subject: how to do parallel scanning in map reduce using hbase as input?
> >> >> To: [email protected]
> >> >>
> >> >> my table has about 700 million rows and about 80 regions. each task
> >> >> tracker is configured with 4 mappers and 4 reducers at the same time.
> >> >> The hadoop/hbase cluster has 5 nodes so at the same time, it has 20
> >> >> mappers running. it takes more than an hour to finish mapper stage.
> >> >> The hbase cluster's load is very low, about 2,000 request per second.
> >> >> I think one mapper for a region is too small. How can I run more than
> >> >> one mapper for a region so that it can take full advantage of
> >> >> computing resources?
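[Editor's note] The thread's core question — running more than one mapper per region — is usually answered by subclassing TableInputFormat and overriding getSplits() so that each region's [startKey, endKey) range is cut into several sub-ranges, one split per sub-range (HBase ships a Bytes.split() utility for exactly this). As a self-contained illustration of the key math involved, here is a sketch of computing a midpoint row key between two region boundary keys; the class and method names are hypothetical, not HBase API:

```java
import java.util.Arrays;

public class RegionSplitSketch {

    // Compute a row key roughly halfway between lo and hi, treating the
    // keys as big-endian unsigned integers padded on the right with zeros.
    // A custom getSplits() could use this to turn one region into two scans:
    // [startKey, mid) and [mid, endKey).
    static byte[] midpoint(byte[] lo, byte[] hi) {
        int len = Math.max(lo.length, hi.length);
        int[] a = new int[len], b = new int[len];
        for (int i = 0; i < len; i++) {
            a[i] = i < lo.length ? lo[i] & 0xff : 0;
            b[i] = i < hi.length ? hi[i] & 0xff : 0;
        }
        // sum = a + b, with carry propagation, one extra leading digit.
        int[] sum = new int[len + 1];
        int carry = 0;
        for (int i = len - 1; i >= 0; i--) {
            int s = a[i] + b[i] + carry;
            sum[i + 1] = s & 0xff;
            carry = s >> 8;
        }
        sum[0] = carry;
        // mid = sum / 2, long division over base-256 digits.
        byte[] mid = new byte[len];
        int rem = sum[0];
        for (int i = 1; i <= len; i++) {
            int cur = (rem << 8) + sum[i];
            mid[i - 1] = (byte) (cur >> 1);
            rem = cur & 1;
        }
        return mid;
    }

    public static void main(String[] args) {
        byte[] mid = midpoint(new byte[]{0x00}, new byte[]{0x40});
        System.out.println(Arrays.toString(mid)); // midpoint of 0x00 and 0x40 is 0x20
    }
}
```

Note that this doubles scanner parallelism without adding regions; whether it helps depends on whether the bottleneck is really scanner count rather than disk or mapper slots, which is what Stack's questions above are probing.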

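[Editor's note] Stack's "do you have to have a reducer?" suggestion can be sketched as a map-only job: with zero reducers there is no shuffle or sort, so io.sort.mb and record spilling stop mattering and each mapper writes its output file directly. This is a configuration sketch only, not runnable without a cluster; the table name "mytable" and the mapper body are placeholders, and it assumes the org.apache.hadoop.hbase.mapreduce client API:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapOnlyScan {

    // Placeholder mapper: emits the row key for every row scanned.
    static class MyMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(row.get()), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "map-only-scan");
        job.setJarByClass(MapOnlyScan.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fewer next() RPCs per mapper
        scan.setCacheBlocks(false);  // don't churn the RS block cache from MR

        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, MyMapper.class, Text.class, Text.class, job);

        // Zero reducers: no shuffle, no sort, no io.sort.mb spills.
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the job genuinely needs aggregation, the shuffle stays, but the counters quoted above (Spilled Records at exactly twice Map output records) suggest each record is spilled and re-merged, which is part of where the time is going.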