RE: How to best decide mapper output/reducer input for a huge string?

John Lilley Sat, 21 Sep 2013 16:09:23 -0700

Pavan,
How large are the rows in HBase?  22 million rows is not very much but you 
mentioned "huge strings".  Can you tell which part of the processing is the 
limiting factor (read from HBase, mapper output, reducers)?
John

From: Pavan Sudheendra [mailto:pavan0...@gmail.com]
Sent: Saturday, September 21, 2013 2:17 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map 
output compressed? Yes, the tables in HBase are compressed.
Although there's no real bottleneck, the time it takes to process the entire 
table is huge. I have to keep checking whether I can optimize it somehow.
OK, I'll implement a custom Writable. Apart from that, do you see anything 
wrong with my design? Does it require any kind of rework? Thank you so 
much for helping.

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota 
<pradeep...@gmail.com> wrote:
One thing that comes to mind is that your keys are Strings which are highly 
inefficient. You might get a lot better performance if you write a custom 
writable for your Key object using the appropriate data types. For example, use 
a long (LongWritable) for timestamps. This should make (de)serialization a lot 
faster. If HouseHoldId is an integer, your speed of comparisons for sorting 
will also go up.
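
A minimal sketch of such a key, assuming HouseHoldId fits in an int and the 
timestamp is a long. The class name HouseholdKey and the toBytes/fromBytes 
helpers are hypothetical, and only two of the ten fields are shown. So that it 
compiles without Hadoop on the classpath, the class implements plain 
Comparable; in a real job it would instead declare 
"implements WritableComparable<HouseholdKey>", which requires the same 
write, readFields and compareTo methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical composite key; field types are assumptions, not from the thread.
public class HouseholdKey implements Comparable<HouseholdKey> {
    private int householdId;
    private long timestamp;

    // Hadoop instantiates keys reflectively, so a no-arg constructor is needed.
    public HouseholdKey() { }

    public HouseholdKey(int householdId, long timestamp) {
        this.householdId = householdId;
        this.timestamp = timestamp;
    }

    // Same shape as Writable.write: fixed-width binary instead of text.
    public void write(DataOutput out) throws IOException {
        out.writeInt(householdId);
        out.writeLong(timestamp);
    }

    // Same shape as Writable.readFields: read fields in the order written.
    public void readFields(DataInput in) throws IOException {
        householdId = in.readInt();
        timestamp = in.readLong();
    }

    // Numeric comparison: household first, then timestamp.
    @Override
    public int compareTo(HouseholdKey other) {
        int byHousehold = Integer.compare(householdId, other.householdId);
        return byHousehold != 0 ? byHousehold : Long.compare(timestamp, other.timestamp);
    }

    // Hash on householdId only, so a hash-based partitioner keeps one
    // household on one reducer.
    @Override
    public int hashCode() {
        return Integer.hashCode(householdId);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof HouseholdKey && compareTo((HouseholdKey) o) == 0;
    }

    public int getHouseholdId() { return householdId; }

    public long getTimestamp() { return timestamp; }

    // Helper for trying the round trip outside Hadoop.
    public static byte[] toBytes(HouseholdKey key) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            key.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static HouseholdKey fromBytes(byte[] bytes) {
        try {
            HouseholdKey key = new HouseholdKey();
            key.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return key;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

These two fields serialize to 12 bytes (4 for the int, 8 for the long), 
whatever their values.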

Ensure that your map outputs are being compressed. Are your tables in HBase 
compressed? Do you have a combiner?
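
On map output compression: it is controlled by two job properties. The sketch 
below only collects their names and values in a plain map so that it compiles 
without Hadoop; the class name MapOutputCompression is hypothetical, and the 
choice of Snappy is an assumption (it needs the native libraries installed on 
the cluster). Both the Hadoop 2.x names and the older MR1 names are listed, 
since which one applies depends on the cluster version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: property names and values for compressing map output.
public class MapOutputCompression {
    public static Map<String, String> settings() {
        Map<String, String> props = new LinkedHashMap<>();
        // Hadoop 2.x property names.
        props.put("mapreduce.map.output.compress", "true");
        props.put("mapreduce.map.output.compress.codec",
                  "org.apache.hadoop.io.compress.SnappyCodec");
        // Older MR1 names for the same two settings.
        props.put("mapred.compress.map.output", "true");
        props.put("mapred.map.output.compression.codec",
                  "org.apache.hadoop.io.compress.SnappyCodec");
        return props;
    }
}
```

In the job driver, each entry would be applied with conf.set(name, value) on 
the job's Configuration before the Job object is created.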

Have you been able to profile your code to see where the bottlenecks are?

On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra 
<pavan0...@gmail.com> wrote:
Hi Pradeep,
Yes. Basically I'm only writing the key part as the map output; the V of 
<K,V> is not of much use to me. I'm open to changing that if it leads to 
faster execution. I'm fairly new to this, so I'm looking to make the 
map/reduce job run a lot faster.
Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But when 
I write a map output for each and every row of a 19 million row HBase table, 
it takes nearly a day to complete (21 mappers and 21 reducers).

I have looked at both Pig and Hive for the job, but I'm supposed to do this as 
an MR job, so I cannot use either of them. Is there something you would 
recommend I try, given that I have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota 
<pradeep...@gmail.com> wrote:
I'm sorry but I don't understand your question. Is the output of the mapper 
you're describing the key portion? If it is the key, then your data should 
already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator will tell Hadoop how to sort your data. So you use this if 
you have a need for a non-lexical sort order. The GroupingComparator will tell 
Hadoop how to group your data for the reducer. All KV-pairs from the same group 
will be given to the same Reducer.

If your reduce computation needs all the KV-pairs for the same HouseHoldId, 
then you will need to write a GroupingComparator.
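
The difference between the two comparators can be tried out in plain Java, 
without Hadoop. In the sketch below, the class GroupingSketch, its Key type and 
the SORT and GROUP names are all hypothetical. SORT plays the role of the 
SortComparator and orders by (householdId, timestamp); GROUP plays the role of 
the GroupingComparator and compares householdId only. In a real job these would 
be comparator classes registered with job.setSortComparatorClass and 
job.setGroupingComparatorClass, and the partitioner would also have to send all 
keys of one household to the same reducer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GroupingSketch {
    // Hypothetical key holding only the two fields that matter for ordering.
    public static final class Key {
        public final int householdId;
        public final long timestamp;

        public Key(int householdId, long timestamp) {
            this.householdId = householdId;
            this.timestamp = timestamp;
        }
    }

    // Sort order: household first, then timestamp within a household.
    public static final Comparator<Key> SORT =
        Comparator.comparingInt((Key k) -> k.householdId)
                  .thenComparingLong(k -> k.timestamp);

    // Grouping: keys with the same householdId count as equal.
    public static final Comparator<Key> GROUP =
        Comparator.comparingInt((Key k) -> k.householdId);

    // Sorts with SORT, then starts a new group whenever GROUP sees a change.
    // Each inner list corresponds to the values of one reduce() call.
    public static List<List<Key>> sortAndGroup(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<Key>> groups = new ArrayList<>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }
}
```

With this arrangement, each group arrives already ordered by timestamp, so the 
reducer does not need to sort a household's records itself.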

Also, have you considered using a higher level abstraction on Hadoop such as 
Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT easier 
to write in these languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra 
<pavan0...@gmail.com> wrote:

I need to improve my MR job, which uses HBase as both source and sink.

Basically, I'm reading data from 3 HBase tables in the mapper, writing them out 
as one huge string for the reducer to do some computation on and dump into an 
HBase table.

Table1 ~ 19 million rows.

Table2 ~ 2 million rows.

Table3 ~ 900,000 rows.

The output of the mapper is something like this :

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting it on the basis of the HouseHoldID value, so I'm 
using this technique. I'm not interested in the V part of the pair, so I'm more 
or less ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

My MR job takes 22 hours to complete, which is not desirable at all. I'm 
supposed to optimize it to run a lot faster.

scan.setCaching(750);
scan.setCacheBlocks(false);

TableMapReduceUtil.initTableMapperJob(
        Table1,                     // input HBase table name
        scan,
        AnalyzeMapper.class,        // mapper
        Text.class,                 // mapper output key
        IntWritable.class,          // mapper output value
        job);

TableMapReduceUtil.initTableReducerJob(
        OutputTable,                // output table
        AnalyzeReducerTable.class,  // reducer class
        job);

job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 
8-node Cloudera cluster.

Should I use a custom SortComparator or a GroupingComparator?

--
Regards-
Pavan