Stack, 

Since you brought it up...
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink.

"Writing, it may make sense to avoid the reduce step and write yourself back 
into HBase from inside your map. You'd do this when your job does not need the 
sort and collation that mapreduce does on the map emitted data; on insert, 
HBase 'sorts' so there is no point double-sorting (and shuffling data around 
your mapreduce cluster) unless you need to. If you do not need the reduce, you 
might just have your map emit counts of records processed just so the 
framework's report at the end of your job has meaning or set the number of 
reduces to zero and use TableOutputFormat. See example code below. If running 
the reduce step makes sense in your case, its usually better to have lots of 
reducers so load is spread across the HBase cluster."

This isn't 100% true. 

I'd lose the quotes around 'sorts' because the data is sorted on key values, 
period.

I'd ask that you reconsider the following phrase...
"You'd do this when your job does not need the sort and collation that 
mapreduce does on the map emitted data;" 

I realize I went to this little Midwestern school (tOSU), where ENG meant you 
were in the college of engineering and not an English major, so I'm not sure I 
am parsing that statement correctly.

If you refactor your M/R, HBase can be used for the 'collation'. (If you make 
your Mapper's output NullWritable and manually write the output to HBase within 
Mapper.map(), you can write to N tables without a problem. So you can write the 
record out, update a table where you are keeping counters, stats, etc. ...) So 
I am still at a loss to find an example of where you would need a reducer.
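To make that concrete, here's a rough sketch of the map-only pattern I mean. 
The table names ("records", "stats"), the column families, and the row keys are 
made up for illustration, and it assumes the 0.90-era client API -- it's not 
anyone's actual job, just the shape of it:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MapOnlyHBaseWrite {

  // The mapper emits nothing to the framework; everything goes straight to HBase.
  public static class WriteMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable records;
    private HTable stats;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = HBaseConfiguration.create(context.getConfiguration());
      records = new HTable(conf, "records");  // hypothetical table names
      stats = new HTable(conf, "stats");
      records.setAutoFlush(false);            // buffer puts client-side
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException {
      String line = value.toString();

      // Write the record itself; the row key here is just a placeholder.
      Put p = new Put(Bytes.toBytes(line.hashCode()));
      p.add(Bytes.toBytes("d"), Bytes.toBytes("raw"), Bytes.toBytes(line));
      records.put(p);

      // Update a second table of counters/stats -- the 'collation' -- no reducer needed.
      stats.incrementColumnValue(Bytes.toBytes("lineCount"),
          Bytes.toBytes("c"), Bytes.toBytes("n"), 1L);

      // Bump a framework counter so the job report at the end still means something.
      context.getCounter("app", "records written").increment(1);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      records.flushCommits();
      records.close();
      stats.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "map-only hbase write");
    job.setJarByClass(MapOnlyHBaseWrite.class);
    job.setMapperClass(WriteMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);  // nothing comes out of the map
    job.setNumReduceTasks(0);                          // no reduce step at all
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}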

Don't get me wrong. I do believe that there are cases where you may need a 
reducer, just as I believe that there is intelligent life on other planets. I 
just haven't found it yet. Of course YMMV.

Which is why I ask you to think really long and hard on this issue. 


With respect to Eran's problem... 

He's writing sorted output to HBase. 
He stated that this problem happens with heavy writes, 
and that it's worse when he has more reducers (something recommended in the 
paragraph quoted above).

So one has to ask what would cause a write to be blocked. 
GC? Eran says he's already tuned it. 
MSLAB? Eran says that's covered. 

Table splits? 
Eran says that one table's region size is 256MB (the default) and the other 
table's is 512MB. 
If the table is constantly splitting, then you need to increase the region 
size. Again, we don't have enough information to diagnose whether this is the issue.
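If constant splitting does turn out to be the culprit, something along these 
lines would bump the region size for one table. This is only a sketch against 
the 0.90-era admin API; the table name is a placeholder, older releases want 
the table disabled before you alter it, and the cluster-wide default lives in 
hbase.hregion.max.filesize in hbase-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class RaiseRegionSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    String table = "records";  // placeholder table name

    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes(table));
    desc.setMaxFileSize(1024L * 1024L * 1024L);  // 1GB regions instead of 256MB

    admin.disableTable(table);                   // table must be offline to alter it
    admin.modifyTable(Bytes.toBytes(table), desc);
    admin.enableTable(table);
  }
}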

We don't know basic things about his cluster, like the number of nodes, how 
much memory is on each node, or which version of HBase he is running.

I realize that these are all pretty basic issues, but sometimes it's the little 
things that will trip you up. 

HTH

-Mike
