>>Why do you need to know this?  Were you trying to do a percentage of rows
per region?
Yes, that's exactly why I want to know this.  Calculating percentages is,
I think, the best way of distributing keys evenly across regions.
Everything else is just approximation.


>>Otherwise just have a member variable of your reducer class and increment
it on each call to reduce().
Tried it, but it didn't work.  Basically, we should be able to find out in
a reducer how many rows were created by all the Mappers.  I am a bit
surprised the MR framework doesn't provide this.

>>I think you'll be better off finding a way to do it not using percentage
if possible.
Yes, if everything else fails, approximation is the way to go.

>>Try calculating the size of the data instead perhaps.
Correct, but as I said in a previous email, I would rather not run another
MR job just to calculate COUNT(*).  There are over 8 billion rows, and
sorting them is a fairly expensive operation.

>>we tallied up the KeyValue.getLength() for each KeyValue in a row until
the size reached a certain limit.
The keyword in this line is "certain".  How do you come up with that
"certain" number?  By approximation, right?  If everything else fails,
that's what I will do.
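That said, the size-based variant need not be pure guesswork: the total size in bytes of the input is available from HDFS without running a job, so the "certain" limit could be total bytes divided by the desired number of regions.  A rough sketch of the tallying loop, with hypothetical stand-ins for the row keys and per-row sizes (in the real job each size would come from summing KeyValue.getLength() over the row's KeyValues):

```java
import java.util.ArrayList;
import java.util.List;

class SizeSplitter {
    // Walk the rows in sorted key order, accumulating each row's byte
    // size; whenever the running total reaches the per-region limit,
    // record the current row's key as a split point and reset the tally.
    static List<String> splitKeys(List<String> sortedKeys,
                                  List<Long> rowSizes,
                                  long regionSizeLimit) {
        List<String> splits = new ArrayList<>();
        long bytesInRegion = 0;
        for (int i = 0; i < sortedKeys.size(); i++) {
            bytesInRegion += rowSizes.get(i);
            if (bytesInRegion >= regionSizeLimit) {
                splits.add(sortedKeys.get(i));
                bytesInRegion = 0;
            }
        }
        return splits;
    }
}
```

This trades the exact-count problem for a limit you can derive up front, at the cost of regions being balanced by bytes rather than by row count.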

Thanks for your help.

On Sun, May 13, 2012 at 9:35 AM, Bryan Beaudreault <bbeaudrea...@hubspot.com
> wrote:

> Why do you need to know this?  Were you trying to do a percentage of rows
> per region?  Otherwise just have a member variable of your reducer class
> and increment it on each call to reduce().  I think you'll be better off
> finding a way to do it not using percentage if possible.  Try calculating
> the size of the data instead perhaps.  You should have that available since
> you are trying to bulkload anyway (which requires Put or KeyValue values,
> both of which you can get the size from).
>
> On Sun, May 13, 2012 at 2:11 AM, Something Something <
> mailinglist...@gmail.com> wrote:
>
> > Is there no way to find out inside a single reducer how many records were
> > created by all the Mappers?  I tried several ways but nothing works.  For
> > example, I tried this:
> >
> > reporter.getCounter(Task.Counter.REDUCE_INPUT_RECORDS).getValue();
> >
> > It's not working for me.  Should this have worked?  Am I just doing
> > something dumb?  I would rather not create another MR job just to count #
> > of lines.
> >
> >
> > On Sat, May 12, 2012 at 7:07 PM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com
> > > wrote:
> >
> > > I did a very similar approach and it worked fine for me.  Just spot
> check
> > > the regions after to make sure they look lexicographically sorted.  I
> > used
> > > ImmutableBytesWritable as my key, and the default hadoop sorting for
> that
> > > turned out to sort lexicographically as required.  Our hbase rows
> varied
> > in
> > > size, so instead of doing a count of the number of rows, we tallied up
> > the
> > > KeyValue.getLength() for each KeyValue in a row until the size reached
> a
> > > certain limit.
> > >
> > > On Sat, May 12, 2012 at 7:21 PM, Something Something <
> > > mailinglist...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > This is really a MapReduce question, but the output from this will be
> > > used
> > > > to create regions for an HBase table.  Here's what I want to do:
> > > >
> > > > Take an input file that contains data about users.
> > > > Sort this file by a key (which consists of a few fields from the row)
> > > > After every x # of rows write the key.
> > > >
> > > >
> > > > Here's how I was going to structure my MapReduce:
> > > >
> > > > public Splitter {
> > > >
> > > >    static int counter;
> > > >
> > > >    private Mapper {
> > > >        map() {
> > > >            Build key by concatenating fields
> > > >            Write key
> > > >            increment counter;
> > > >        }
> > > >    }
> > > >
> > > >    //  # of reducers will be set to 1.  My understanding is that this
> > > will
> > > > send the lines to reducer in sorted order one at a time - is this a
> > > correct
> > > > assumption?
> > > >    private Reducer {
> > > >         static long i;
> > > >         reduce() {
> > > >             static long splitSize = counter / 300;  //  300 is the
> > > > number of regions
> > > >             if (i == 0 || i == splitSize) {
> > > >                 Write key;  // this will be used as a 'startkey'.
> > > >                  i = 0;
> > > >             }
> > > >             i++;
> > > >         }
> > > >    }
> > > > }
> > > >
> > > > To summarize, there are 2 questions:
> > > >
> > > > 1)  I am passing # of rows processed by Mapper to Reducer via a
> static
> > > > counter.  Would this work?  Is there a better way?
> > > > 2)  If I set # of reducers to 1, would the lines be sent to reducer
> in
> > > > sorted order one at a time?
> > > >
> > > > Thanks in advance for the help.
> > > >
> > >
> >
>
