Re: how to do parallel scanning in map reduce using hbase as input?

Li Li Mon, 21 Jul 2014 22:54:06 -0700

Sorry, I hit Tab and it sent my unfinished post. See my next
mail for answers to the other questions.


I forget the details of the exception. It was thrown in the terminal.
The default io.sort.mb is 100 and I set it to 500 to speed up the
reducer, so I set mapred.child.java.opts to 1g.
Each datanode/regionserver has 16 GB of memory, but only about 5 GB is
free for map-reduce, so I can't add more mappers.
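
A rough sketch of that memory budget, using only the numbers above (5 GB
free, 1g child heap, io.sort.mb of 500). The helper name and the
overhead_fraction parameter are hypothetical, and the real per-JVM
footprint depends on non-heap memory that is not measured here.

```python
# Rough memory budget for MapReduce child JVMs on one node.
# Figures are taken from this mail; the helper itself is hypothetical.

GIB = 1024 ** 3
MIB = 1024 ** 2


def max_child_jvms(free_bytes, heap_bytes, overhead_fraction=0.0):
    """Return how many child JVMs fit into free_bytes.

    overhead_fraction is an assumed allowance for non-heap JVM memory
    (0.25 means each JVM is budgeted at 1.25x its heap).
    """
    per_child = int(heap_bytes * (1 + overhead_fraction))
    return free_bytes // per_child


free_for_mapreduce = 5 * GIB   # "about 5 GB" free per node
child_heap = 1 * GIB           # mapred.child.java.opts set to 1g
sort_buffer = 500 * MIB        # io.sort.mb = 500

slots = max_child_jvms(free_for_mapreduce, child_heap)
sort_share = sort_buffer / child_heap

print(f"child JVMs that fit in free memory: {slots}")
print(f"share of each heap taken by the sort buffer: {sort_share:.0%}")
```

With no overhead allowance this gives 5 child JVMs per node, counting
mappers and reducers together, and the sort buffer alone takes about
half of each 1g heap.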



On Tue, Jul 22, 2014 at 1:37 PM, Stack <st...@duboce.net> wrote:
> On Mon, Jul 21, 2014 at 10:32 PM, Li Li <fancye...@gmail.com> wrote:
>
>> 1. yes, I have 20 concurrent running mappers.
>> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
>> set 8 mappers, it hit oov exception and load average is high
>>
>
>
> What is OOV?
>
> Do you have to have a reducer?
>
> Load average is high?  How high?
>
>
>
>> 3. A fast mapper takes only 1 minute. Its statistics follow.
>>
>
>
> So each region is only taking 1 minute to scan?  1.4Gs scanned?
>
> Can you add other counters to your MR job so we can get more of an idea of
> what is going on in it?
>
> Please answer my other questions.
>
> Thanks,
> St.Ack
>
>
>> HBase Counters
>> REMOTE_RPC_CALLS 0
>> RPC_CALLS 523
>> RPC_RETRIES 0
>> NOT_SERVING_REGION_EXCEPTION 0
>> NUM_SCANNER_RESTARTS 0
>> MILLIS_BETWEEN_NEXTS 62,415
>> BYTES_IN_RESULTS 1,380,694,667
>> BYTES_IN_REMOTE_RESULTS 0
>> REGIONS_SCANNED 1
>> REMOTE_RPC_RETRIES 0
>>
>> FileSystemCounters
>> FILE_BYTES_READ 120,508,552
>> HDFS_BYTES_READ 176
>> FILE_BYTES_WRITTEN 241,000,600
>>
>> File Input Format Counters
>> Bytes Read 0
>>
>> Map-Reduce Framework
>> Map output materialized bytes 120,448,992
>> Combine output records 0
>> Map input records 5,208,607
>> Physical memory (bytes) snapshot 965,730,304
>> Spilled Records 10,417,214
>> Map output bytes 282,122,973
>> CPU time spent (ms) 82,610
>> Total committed heap usage (bytes) 1,061,158,912
>> Virtual memory (bytes) snapshot 1,681,047,552
>> Combine input records 0
>> Map output records 5,208,607
>> SPLIT_RAW_BYTES 176
>>
>>
>> On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
>> > How many regions now?
>> >
>> > You still have 20 concurrent mappers running?  Are your machines loaded w/
>> > 4 map tasks on each?  Can you up the number of concurrent mappers?  Can you
>> > get an idea of your scan rates?  Are all map tasks scanning at same rate?
>> > Does one task lag the others?  Do you emit stats on each map task such as
>> > rows processed? Can you figure your bottleneck? Are you seeking disk all
>> > the time?  Anything else running while this big scan is going on?  How big
>> > are your cells?  Do you have one or more column families?  How many columns?
>> >
>> > For average region size, do du on the hdfs region directories and then sum
>> > and divide by region count.
>> >
>> > St.Ack
>> >
>> >
>> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fancye...@gmail.com> wrote:
>> >
>> >> Could anyone help? I now have about 1.1 billion nodes and it takes 2
>> >> hours to finish a map reduce job.
>> >>
>> >> ---------- Forwarded message ----------
>> >> From: Li Li <fancye...@gmail.com>
>> >> Date: Thu, Jun 26, 2014 at 3:34 PM
>> >> Subject: how to do parallel scanning in map reduce using hbase as input?
>> >> To: u...@hbase.apache.org
>> >>
>> >>
>> >> My table has about 700 million rows and about 80 regions. Each task
>> >> tracker is configured to run 4 mappers and 4 reducers at the same time.
>> >> The hadoop/hbase cluster has 5 nodes, so it runs 20 mappers at a time.
>> >> It takes more than an hour to finish the map stage. The hbase cluster's
>> >> load is very low, about 2,000 requests per second. I think one mapper
>> >> per region is too few. How can I run more than one mapper per region
>> >> so that the job takes full advantage of the computing resources?
>> >>
>>
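
On the scan-rate question raised in the thread, a minimal sketch that
derives per-mapper rates from the counters quoted above. It assumes that
MILLIS_BETWEEN_NEXTS approximates the elapsed time of the scan, which is
consistent with the "1 minute" reported for a fast mapper but is not
confirmed in the thread. The function name is hypothetical.

```python
# Per-mapper scan statistics derived from the job counters quoted in
# this thread. Assumption: MILLIS_BETWEEN_NEXTS approximates the elapsed
# time of the scan for this map task.

counters = {
    "MILLIS_BETWEEN_NEXTS": 62_415,
    "BYTES_IN_RESULTS": 1_380_694_667,
    "RPC_CALLS": 523,
    "Map input records": 5_208_607,
    "Map output records": 5_208_607,
    "Spilled Records": 10_417_214,
}


def scan_stats(c):
    """Return derived rates for one map task (hypothetical helper)."""
    seconds = c["MILLIS_BETWEEN_NEXTS"] / 1000.0
    rows = c["Map input records"]
    return {
        "rows_per_sec": rows / seconds,
        "bytes_per_sec": c["BYTES_IN_RESULTS"] / seconds,
        "bytes_per_row": c["BYTES_IN_RESULTS"] / rows,
        "rows_per_rpc": rows / c["RPC_CALLS"],
        # Records written to local disk per map output record.
        "spill_ratio": c["Spilled Records"] / c["Map output records"],
    }


stats = scan_stats(counters)
for name, value in stats.items():
    print(f"{name}: {value:,.1f}")
```

Under that assumption the fast mapper scans roughly 83,000 rows per
second, or about 22 MB per second, at about 265 bytes per row. The spill
ratio of 2.0 means each map output record was written to local disk
twice on average, which may point at the sort/spill phase rather than
the scan itself.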
