Re: your question #1, you won't be able to pass information from mappers to reducers through a static variable. Map tasks run in different JVM instances than reduce tasks, so the value of the static variable is never shipped from the mapper JVM to the reducer JVM. It might appear to work in standalone mode, where everything runs in a single JVM, but that's almost certainly not how your production cluster runs.
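If you do need an exact row count, one pattern that does work is two passes: let a first job finish, read its built-in map-input-records counter from the driver, and hand the derived split size to the second job through the Configuration, which is serialized and shipped to every task. A minimal driver-side sketch, assuming the newer (Hadoop 2-style) mapreduce API; on older releases the Job construction and the counter enum are spelled differently, and the class name and the "splitter.split.size" key are just mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Pass 1: a plain pass over the input; we only need its row counter.
    Job countJob = Job.getInstance(conf, "count rows");
    countJob.setJarByClass(TwoPassDriver.class);
    FileInputFormat.addInputPath(countJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(countJob, new Path(args[1]));
    if (!countJob.waitForCompletion(true)) {
      System.exit(1);
    }

    // Read the built-in counter from the finished job on the client side.
    long totalRows = countJob.getCounters()
        .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();

    // Pass 2: the Configuration travels to every task, so the reducer can
    // read the value via context.getConfiguration().getLong(...). This is
    // why it works where the static variable doesn't.
    conf.setLong("splitter.split.size", Math.max(1, totalRows / 300));
    Job splitJob = Job.getInstance(conf, "write split keys");
    // ... configure mapper, single reducer, and paths as in your sketch ...
  }
}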
Re: your question #2, google for "hadoop secondary sort."

Some rough advice on your algorithm for determining the best splits: if the splits don't need to be optimal, try randomly sampling your keys instead of processing all of them. Depending on your data volume, that might not even require MapReduce. There's a sketch of the sampling idea after your quoted message below.

Best,
Dave

On Sat, May 12, 2012 at 4:21 PM, Something Something <mailinglist...@gmail.com> wrote:

> Hello,
>
> This is really a MapReduce question, but the output from this will be used
> to create regions for an HBase table. Here's what I want to do:
>
> Take an input file that contains data about users.
> Sort this file by a key (which consists of a few fields from the row).
> After every x rows, write the key.
>
> Here's how I was going to structure my MapReduce:
>
> public class Splitter {
>
>     static int counter;
>
>     private Mapper {
>         map() {
>             // Build key by concatenating fields
>             // Write key
>             increment counter;
>         }
>     }
>
>     // # of reducers will be set to 1. My understanding is that this will
>     // send the lines to the reducer in sorted order one at a time - is
>     // this a correct assumption?
>     private Reducer {
>         static long i;
>         reduce() {
>             static long splitSize = counter / 300; // 300 is the region count
>             if (i == 0 || i == splitSize) {
>                 Write key; // this will be used as a 'startkey'
>                 i = 0;
>             }
>             i++;
>         }
>     }
> }
>
> To summarize, there are 2 questions:
>
> 1) I am passing the # of rows processed by the Mapper to the Reducer via a
> static counter. Would this work? Is there a better way?
>
> 2) If I set the # of reducers to 1, would the lines be sent to the reducer
> in sorted order, one at a time?
>
> Thanks in advance for the help.
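P.S. To make the sampling suggestion concrete, here's a rough sketch. Each mapper emits a small random sample of the keys; the single reducer receives the sampled keys in sorted order (yes, a lone reducer does see keys sorted, which answers your #2 for keys), buffers them, and writes evenly spaced split points in cleanup(). The class names, the tab-delimited three-field key, and the 1% rate are all my assumptions; at a 1% sample the buffered list should fit comfortably in the reducer's memory.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SplitKeySampler {

  // Emit roughly 1% of the keys; tune the rate to your data volume.
  public static class SampleMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Random random = new Random();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (random.nextDouble() < 0.01) {
        String[] fields = line.toString().split("\t");
        // Hypothetical: the key is the first three fields, concatenated.
        Text key = new Text(fields[0] + "|" + fields[1] + "|" + fields[2]);
        context.write(key, NullWritable.get());
      }
    }
  }

  // Run with a single reducer so all sampled keys arrive in sorted order.
  public static class SplitReducer
      extends Reducer<Text, NullWritable, Text, NullWritable> {
    private static final int NUM_REGIONS = 300;
    private final List<String> sample = new ArrayList<String>();

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values,
        Context context) {
      sample.add(key.toString());
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      // Write NUM_REGIONS - 1 evenly spaced keys as region start keys.
      for (int i = 1; i < NUM_REGIONS; i++) {
        int idx = (int) ((long) i * sample.size() / NUM_REGIONS);
        context.write(new Text(sample.get(idx)), NullWritable.get());
      }
    }
  }
}

Note this never needs the total row count at all, which sidesteps your question #1 entirely: the reducer picks split points from the positions within its own buffered sample.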