On Thu, May 10, 2012 at 7:46 PM, Michael Segel
<michael_se...@hotmail.com> wrote:
> "Writing, it may make sense to avoid the reduce step and write yourself back 
> into HBase from inside your map. You'd do this when your job does not need 
> the sort and collation that mapreduce does on the map emitted data; on 
> insert, HBase 'sorts' so there is no point double-sorting (and shuffling data 
> around your mapreduce cluster) unless you need to. If you do not need the 
> reduce, you might just have your map emit counts of records processed just so 
> the framework's report at the end of your job has meaning or set the number 
> of reduces to zero and use TableOutputFormat. See example code below. If 
> running the reduce step makes sense in your case, it's usually better to have 
> lots of reducers so load is spread across the HBase cluster."
>
> This isn't 100% true.
>
> I'd lose the quotes around 'sorts' because the data is sorted on key values. 
> period.
>

Sounds good.


> I'd ask that you reconsider the following phrase...
> "You'd do this when your job does not need the sort and collation that 
> mapreduce does on the map emitted data;"
>

What would you suggest instead?

> I realize I went to this little midwestern school (tOSU), where ENG meant you 
> were in the college of engineering and not an English Major, so I'm not sure 
> if I am parsing that statement correctly.
>

ditto

The above phrase is mine.  I'm bad at writing so need help.


> If you refactor your M/R, HBase can be used for the 'collation'. (If you 
> make your Mapper's output a NullWritable and manually write the output to 
> HBase within Mapper.map(), you can write to N tables without a problem. So 
> you can write the record out, update a table where you are keeping counters, 
> stats, etc.) So I am still at a loss to find an example of where you would 
> need a reducer.
>

Can you make a patch?

I'm for making a stronger statement about reduce: that it's rare, if
ever, that it's needed.  Let's get it in the doc.
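
To make the map-side write pattern concrete, here is a toy sketch in
plain Java. It is not HBase or Hadoop code: a TreeMap stands in for a
table that keeps rows in key order, and every name in it
(MapSideWriteSketch, Tables, map, run) is hypothetical. It only shows
that when the store orders rows on insert, a map-only pass can write
records and keep counters in a second "table" without a shuffle, sort,
or reduce phase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Toy model only: TreeMap stands in for a table whose rows are kept in key order.
// None of these names are HBase or Hadoop APIs.
class MapSideWriteSketch {

    // Hypothetical stand-ins for two tables: one for records, one for counters.
    static final class Tables {
        final TreeMap<String, String> records = new TreeMap<>();
        final TreeMap<String, Long> counters = new TreeMap<>();
    }

    // Plays the role of Mapper.map(): writes straight to the "tables", emits nothing.
    static void map(String inputLine, Tables tables) {
        int sep = inputLine.indexOf('=');
        if (sep < 0) {
            // Count bad input instead of passing it on to a reducer.
            tables.counters.merge("malformed", 1L, Long::sum);
            return;
        }
        String rowKey = inputLine.substring(0, sep);
        String value = inputLine.substring(sep + 1);
        tables.records.put(rowKey, value);               // store orders by key on insert
        tables.counters.merge("written", 1L, Long::sum); // second "table": job statistics
    }

    // Map-only "job": no shuffle, no sort, no reduce phase.
    static Tables run(List<String> input) {
        Tables tables = new Tables();
        for (String line : input) {
            map(line, tables);
        }
        return tables;
    }

    public static void main(String[] args) {
        Tables t = run(Arrays.asList("row-c=3", "row-a=1", "oops", "row-b=2"));
        System.out.println(t.records);  // rows iterate in key order, not arrival order
        System.out.println(t.counters);
    }
}
```

In a real job the two TreeMaps would be HBase tables written from
inside the mapper, with the number of reduce tasks set to zero.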


> So one has to ask what would cause a write to be blocked.
> GC? Eran says he's already tuned it.
> MSLABS? Eran says that's covered.
>
> Table splits?
> Eran says that the table's region sizes are 256MB (default) and the other 
> table is 512MB.
> If the table is constantly splitting, then you need to increase the region 
> size. Again we don't have enough information to diagnose if this is the issue.
>
> We don't know things about his cluster like the number of nodes, how much 
> memory on each node, as well as which version of HBase.
>
> I realize that these are all pretty basic issues, but sometimes it's the 
> little things that will trip you up.
>

Above is generally good advice.

Thanks Michael.

St.Ack
