On Thu, May 10, 2012 at 7:46 PM, Michael Segel <michael_se...@hotmail.com> wrote:
> "Writing, it may make sense to avoid the reduce step and write yourself back
> into HBase from inside your map. You'd do this when your job does not need
> the sort and collation that mapreduce does on the map emitted data; on
> insert, HBase 'sorts' so there is no point double-sorting (and shuffling data
> around your mapreduce cluster) unless you need to. If you do not need the
> reduce, you might just have your map emit counts of records processed just so
> the framework's report at the end of your job has meaning or set the number
> of reduces to zero and use TableOutputFormat. See example code below. If
> running the reduce step makes sense in your case, its usually better to have
> lots of reducers so load is spread across the HBase cluster."
>
> This isn't 100% true.
>
> I'd lose the quotes around 'sorts' because the data is sorted on key values.
> period.
>

Sounds good.

> I'd ask that you reconsider the following phrase...
> "You'd do this when your job does not need the sort and collation that
> mapreduce does on the map emitted data;"

What would you suggest instead?

> I realize I went to this little midwestern school (tOSU), where ENG meant you
> were in the college of engineering and not an English Major, so I'm not sure
> if I am parsing that statement correctly.
> ditto

The above phrase is mine. I'm bad at writing, so I need help.

> If you refactor your M/R, HBase can be used for the 'collation'. (If you
> make your Mapper a null writable and manually write the output to HBase
> within Mapper.map(), you can write to N tables without a problem. So you can
> write the record out, update a table where you are keeping counters, stats,
> etc ... ) So I am still at a loss to find an example of where you would need
> a reducer.

Can you make a patch? I'm for making a stronger statement about reduce: that
it's rarely, if ever, needed. Let's get it in the doc.

> So one has to ask what would cause a write to be blocked
> GC? Eran says he's already tuned it.
> MSLABS? Eran says that's covered.
>
> Table splits?
> Eran says that the table's region sizes are 256MB (default) and the other
> table is 512MB.
> If the table is constantly splitting, then you need to increase the region
> size. Again we don't have enough information to diagnose if this is the issue.
>
> We don't know things about his cluster like the number of nodes, how much
> memory on each node, as well as which version of HBase.
>
> I realize that these are all pretty basic issues, but sometimes it's the
> little things that will trip you up.

Above is generally good advice. Thanks Michael.

St.Ack
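
For readers following the thread, here is a toy sketch of the point being argued: if the store keeps rows ordered by key on insert, a map step can write records and counters straight to it and the rows still come back sorted, with no shuffle or reduce. Everything in it is an assumption made for illustration. `ToyTable` is a hypothetical stand-in backed by a `TreeMap`, the `"rowKey,value"` record format is invented, and none of this is HBase or Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Toy model only: a TreeMap stands in for a table that keeps rows ordered
// by row key on insert. Nothing here is HBase or Hadoop API.
class MapOnlyWriteSketch {

    // Hypothetical stand-in for a table: row key -> value, sorted by key.
    static final class ToyTable {
        private final TreeMap<String, Long> rows = new TreeMap<>();

        // Insert or overwrite a row; the TreeMap keeps keys in sorted order.
        void put(String rowKey, long value) {
            rows.put(rowKey, value);
        }

        // Counter-style update, in the spirit of keeping stats in a second table.
        void increment(String rowKey, long delta) {
            rows.merge(rowKey, delta, Long::sum);
        }

        // Row keys in the table's own (sorted) order.
        List<String> rowKeys() {
            return new ArrayList<>(rows.keySet());
        }

        // Value for a row key, or null if the row does not exist.
        Long get(String rowKey) {
            return rows.get(rowKey);
        }
    }

    // "Map" step that writes directly to two tables; no shuffle, sort or reduce.
    // The "rowKey,value" input format is an assumption for this sketch.
    static void map(String record, ToyTable records, ToyTable counters) {
        String[] parts = record.split(",", 2);
        records.put(parts[0], Long.parseLong(parts[1].trim()));
        counters.increment("records_processed", 1L);
    }

    public static void main(String[] args) {
        ToyTable records = new ToyTable();
        ToyTable counters = new ToyTable();
        // Input arrives in arbitrary order.
        for (String r : Arrays.asList("row3,30", "row1,10", "row2,20")) {
            map(r, records, counters);
        }
        System.out.println(records.rowKeys());
        System.out.println(counters.get("records_processed"));
    }
}
```

In a real job the equivalent would be a map-only configuration (zero reduces, with TableOutputFormat or direct writes from the mapper), as the quoted documentation text describes; the sketch only shows why the extra sort buys nothing when the store orders rows itself.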