Re: how to do parallel scanning in map reduce using hbase as input?

Li Li Mon, 21 Jul 2014 23:12:07 -0700
On Tue, Jul 22, 2014 at 1:57 PM, Stack <st...@duboce.net> wrote:
> On Mon, Jul 21, 2014 at 10:53 PM, Li Li <fancye...@gmail.com> wrote:
>
>> Sorry, I pressed Tab and it sent my unfinished post. See the following
>> mail for answers to the other questions.
>>
>> I forget the details of the exception. It was thrown in the terminal.
>
>
> What exception is thrown?
I forget it. Maybe I can retry with the 8-mapper configuration. It
seemed to be an out-of-memory exception.
>
>
>
>> The
>> default io.sort.mb is 100 and I set it to 500 to speed up reducer.
>
>
> Do you have to have a reducer?  If you could skip the shuffle...
I have 8 reducers
>
>
>
>> So
>> I set mapred.child.java.opts to 1g
>> The datanode/regionserver has 16GB memory but free memory
>
>
> Does the RS use the 16G?
The RS uses 8G, and a datanode and a tasktracker also run on this machine.
>
>
>
>> for
>> map-reduce is about 5gb. So I can't add more mappers
>>
>>
> How much RAM in these machines?
16GB
> St.Ack
>
>
>
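The memory figures in this exchange can be checked with some quick arithmetic. The following is only a rough sketch in Python: the 5 GB free for map-reduce, the 1g child heap and io.sort.mb = 500 come from this thread, while the 25% allowance for non-heap JVM memory is an assumption made for illustration, not a measured value.

```python
# Rough per-node memory budget using the figures quoted in this thread.
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_task_slots(free_bytes, child_heap_bytes, jvm_overhead_fraction=0.25):
    """Return how many task JVMs fit into free_bytes.

    jvm_overhead_fraction is an ASSUMED allowance for non-heap JVM memory
    (thread stacks, native buffers); it is not taken from the thread.
    """
    per_task = child_heap_bytes * (1 + jvm_overhead_fraction)
    return int(free_bytes // per_task)


free_for_mr = 5 * GIB   # "free memory for map-reduce is about 5gb"
child_heap = 1 * GIB    # mapred.child.java.opts set to 1g
io_sort = 500 * MIB     # io.sort.mb raised from 100 to 500

slots = max_task_slots(free_for_mr, child_heap)
sort_share = io_sort / child_heap  # fraction of each task heap used as sort buffer

print(f"task JVMs that fit: {slots}")
print(f"sort buffer share of heap: {sort_share:.0%}")
```

Under these assumptions only about four 1g task JVMs fit in 5 GB, so 4 map slots plus 4 reduce slots per node would overcommit memory if all of them were busy at once, and roughly half of each task heap is reserved for the sort buffer. That would be consistent with an out-of-memory error at 8 mappers, although the thread does not confirm the cause.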
>>
>> On Tue, Jul 22, 2014 at 1:37 PM, Stack <st...@duboce.net> wrote:
>> > On Mon, Jul 21, 2014 at 10:32 PM, Li Li <fancye...@gmail.com> wrote:
>> >
>> >> 1. yes, I have 20 concurrent running mappers.
>> >> 2. I can't add more mappers because I set io.sort.mb to 500mb and if I
>> >> set 8 mappers, it hit oov exception and load average is high
>> >>
>> >
>> >
>> > What is OOV?
>> >
>> > Do you have to have a reducer?
>> >
>> > Load average is high?  How high?
>> >
>> >
>> >
>> >> 3. A fast mapper takes only 1 minute. The following are its statistics:
>> >>
>> >
>> >
>> > So each region is only taking 1 minute to scan?  1.4Gs scanned?
>> >
>> > Can you add other counters to your MR job so we can get more of an idea
>> > of what is going on in it?
>> >
>> > Please answer my other questions.
>> >
>> > Thanks,
>> > St.Ack
>> >
>> >
>> >> HBase Counters
>> >> REMOTE_RPC_CALLS 0
>> >> RPC_CALLS 523
>> >> RPC_RETRIES 0
>> >> NOT_SERVING_REGION_EXCEPTION 0
>> >> NUM_SCANNER_RESTARTS 0
>> >> MILLIS_BETWEEN_NEXTS 62,415
>> >> BYTES_IN_RESULTS 1,380,694,667
>> >> BYTES_IN_REMOTE_RESULTS 0
>> >> REGIONS_SCANNED 1
>> >> REMOTE_RPC_RETRIES 0
>> >>
>> >> FileSystemCounters
>> >> FILE_BYTES_READ 120,508,552
>> >> HDFS_BYTES_READ 176
>> >> FILE_BYTES_WRITTEN 241,000,600
>> >>
>> >> File Input Format Counters
>> >> Bytes Read 0
>> >>
>> >> Map-Reduce Framework
>> >> Map output materialized bytes 120,448,992
>> >> Combine output records 0
>> >> Map input records 5,208,607
>> >> Physical memory (bytes) snapshot 965,730,304
>> >> Spilled Records 10,417,214
>> >> Map output bytes 282,122,973
>> >> CPU time spent (ms) 82,610
>> >> Total committed heap usage (bytes) 1,061,158,912
>> >> Virtual memory (bytes) snapshot 1,681,047,552
>> >> Combine input records 0
>> >> Map output records 5,208,607
>> >> SPLIT_RAW_BYTES 176
>> >>
>> >>
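The counters above already allow a rough scan-rate estimate for this one map task. The sketch below (Python) uses only the numbers posted in this message. Treating MILLIS_BETWEEN_NEXTS as an approximation of the scan duration is an assumption; it roughly matches the "1 minute" reported for a fast mapper.

```python
# Counter values copied from the message above (one fast map task).
counters = {
    "MILLIS_BETWEEN_NEXTS": 62_415,
    "BYTES_IN_RESULTS": 1_380_694_667,
    "RPC_CALLS": 523,
    "Map input records": 5_208_607,
    "Map output records": 5_208_607,
    "Spilled Records": 10_417_214,
}

# ASSUMPTION: time between scanner next() calls approximates scan duration.
seconds = counters["MILLIS_BETWEEN_NEXTS"] / 1000.0

mb_per_sec = counters["BYTES_IN_RESULTS"] / seconds / 1e6
rows_per_sec = counters["Map input records"] / seconds
rows_per_rpc = counters["Map input records"] / counters["RPC_CALLS"]
bytes_per_row = counters["BYTES_IN_RESULTS"] / counters["Map input records"]
spill_ratio = counters["Spilled Records"] / counters["Map output records"]

print(f"scan rate:     {mb_per_sec:.1f} MB/s, {rows_per_sec:,.0f} rows/s")
print(f"rows per RPC:  {rows_per_rpc:,.0f}")
print(f"bytes per row: {bytes_per_row:.0f}")
print(f"spill ratio:   {spill_ratio:.1f}")
```

On these numbers the task scans roughly 22 MB/s with about 265 bytes per row. The spill ratio of exactly 2 means every map output record was written to local disk twice, which usually points to an extra merge pass on the map side, so the shuffle may cost as much as the scan itself.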
>> >> On Tue, Jul 22, 2014 at 12:11 PM, Stack <st...@duboce.net> wrote:
>> >> > How many regions now?
>> >> >
>> >> > You still have 20 concurrent mappers running?  Are your machines loaded
>> >> > w/ 4 map tasks on each?  Can you up the number of concurrent mappers?
>> >> > Can you get an idea of your scan rates?  Are all map tasks scanning at
>> >> > same rate?  Does one task lag the others?  Do you emit stats on each map
>> >> > task such as rows processed? Can you figure your bottleneck? Are you
>> >> > seeking disk all the time?  Anything else running while this big scan is
>> >> > going on?  How big are your cells?  Do you have one or more column
>> >> > families?  How many columns?
>> >> >
>> >> > For average region size, do du on the hdfs region directories and then
>> >> > sum and divide by region count.
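The du-then-average step can be scripted. This is a minimal Python sketch that assumes the du output has the byte count in the first column of each line; the sample paths and sizes are made up for illustration. Entries under the table directory that are not regions would need to be filtered out before feeding the output in.

```python
def average_region_size(du_output, region_count=None):
    """Average the first-column byte counts in `du`-style output.

    ASSUMPTION: each data line starts with a size in bytes. Lines that do
    not start with a number (blank lines, "Found N items") are skipped.
    """
    sizes = []
    for line in du_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue
        sizes.append(int(fields[0]))
    count = region_count if region_count is not None else len(sizes)
    if count == 0:
        raise ValueError("no region directories found")
    return sum(sizes) / count


# Hypothetical sample; real paths and sizes will differ.
sample = """\
Found 3 items
1400000000  hdfs://nn/hbase/mytable/region-a
1300000000  hdfs://nn/hbase/mytable/region-b
1500000000  hdfs://nn/hbase/mytable/region-c
"""

avg = average_region_size(sample)
print(f"average region size: {avg / 1e9:.2f} GB")
```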
>> >> >
>> >> > St.Ack
>> >> >
>> >> >
>> >> > On Mon, Jul 21, 2014 at 7:30 PM, Li Li <fancye...@gmail.com> wrote:
>> >> >
>> >> >> Could anyone help? I now have about 1.1 billion nodes, and it takes 2
>> >> >> hours to finish a map reduce job.
>> >> >>
>> >> >> ---------- Forwarded message ----------
>> >> >> From: Li Li <fancye...@gmail.com>
>> >> >> Date: Thu, Jun 26, 2014 at 3:34 PM
>> >> >> Subject: how to do parallel scanning in map reduce using hbase as input?
>> >> >> To: u...@hbase.apache.org
>> >> >>
>> >> >>
>> >> >> My table has about 700 million rows and about 80 regions. Each task
>> >> >> tracker is configured to run 4 mappers and 4 reducers at the same time.
>> >> >> The hadoop/hbase cluster has 5 nodes, so at any one time it has 20
>> >> >> mappers running. It takes more than an hour to finish the mapper stage.
>> >> >> The hbase cluster's load is very low, about 2,000 requests per second.
>> >> >> I think one mapper per region is too few. How can I run more than
>> >> >> one mapper for a region so that it can take full advantage of
>> >> >> computing resources?
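One way to get more than one mapper per region is to cut each region's row-key range into several sub-ranges and scan each as its own input split. The Python sketch below only illustrates the key-range arithmetic. It is not HBase API: the function name and the `width` parameter are hypothetical, and it assumes row keys are spread fairly evenly across the range. In a real job this logic would sit in a custom input format that overrides getSplits.

```python
def split_key_range(start, end, parts, width=8):
    """Split the row-key range [start, end) into `parts` contiguous sub-ranges.

    Keys are bytes. An empty `end` means "open-ended", as for the last
    region of a table. Only the first `width` bytes of each key are used
    to compute cut points (ASSUMPTION: that prefix is enough to tell the
    boundaries apart and keys are roughly uniformly distributed).
    """
    if parts < 1:
        raise ValueError("parts must be >= 1")
    lo = int.from_bytes(start[:width].ljust(width, b"\x00"), "big")
    if end:
        hi = int.from_bytes(end[:width].ljust(width, b"\x00"), "big")
    else:
        hi = 1 << (8 * width)
    if hi - lo < parts:
        raise ValueError("key range too narrow for this many parts")
    # Evenly spaced interior boundaries, strictly between start and end.
    cuts = [lo + (hi - lo) * i // parts for i in range(1, parts)]
    keys = [start] + [c.to_bytes(width, "big") for c in cuts] + [end]
    return [(keys[i], keys[i + 1]) for i in range(parts)]


# Example: one region covering [b"a", b"e") split four ways.
ranges = split_key_range(b"a", b"e", 4, width=1)
for sub_start, sub_end in ranges:
    print(sub_start, sub_end)
```

Each (start, end) pair would become the start and stop row of one mapper's scan. If the keys are skewed, evenly spaced cut points give uneven mappers, and the extra mappers only help when the region servers and task slots have spare capacity.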
>> >> >>
>> >>
>>