Re: MapReduce: Reducers partitions.

Jean-Marc Spaggiari Wed, 10 Apr 2013 12:02:16 -0700

Hi Greame,

No. The reducer will simply write on the table the same way you are doing a
regular Put. If a split is required because of the size, then the region
will be split, but at the end, there will not necessary be any region
split.


In the usecase described below, all the 600 lines will "simply" go into the
only region in the table and no split will occur.

The goal is to partition the data for the reducer only. Not in the table.

JM

2013/4/10 Graeme Wallace <graeme.wall...@farecompare.com>

> Whats the behavior then if you return hash % num_reducers and you have no
> splits defined. When the reducer writes to the table does the region server
> local to the reducer create a new region ?
>
> Graeme
>
>
> On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
>
> > So.
> >
> > I looked at the code, and I have one comment/suggestion here.
> >
> > If the table we are outputing to has regions, then partitions are build
> > around that, and that's fine. But if the table is totally empty with a
> > single region, even if we setNumReduceTasks to 2 or more, all the keys
> will
> > go on the same first reducer because of this:
> >     if (this.startKeys.length == 1){
> >       return 0;
> >     }
> > I think it will be better to return something like keycrc%numPartitions
> > instead. That still allow the application to spread the reducing process
> > over multinode(racks) even if there is only one region in the table.
> >
> > In my usecase, I have millions of lines producing some statistics. At the
> > end, I will have only about 600 lines, but it will take a lot of map and
> > reduce time to go from millions to 600, that's why I'm looking to have
> more
> > than one reducer. However, with only 600 lines, it's very difficult to
> > pre-split the table. Keys are all very close.
> >
> > Does anyone see anything wrong with changing this default behaviour when
> > startKeys.length == 1? If not, I will open a JIRA and upload a patch.
> >
> > JM
> >
> > 2013/4/10 Jean-Marc Spaggiari <jean-m...@spaggiari.org>
> >
> > > Thanks Ted.
> > >
> > > It's exactly where I was looking at now. I was close. I will take a
> > deeper
> > > look.
> > >
> > > Thanks Nitin for the link. I will read that too.
> > >
> > > JM
> > >
> > > 2013/4/10 Nitin Pawar <nitinpawar...@gmail.com>
> > >
> > >> To add what Ted said,
> > >>
> > >> the same discussion happened on the question Jean asked
> > >>
> > >> https://issues.apache.org/jira/browse/HBASE-1287
> > >>
> > >>
> > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>
> > >> > Jean-Marc:
> > >> > Take a look at HRegionPartitioner which is in both mapred and
> > mapreduce
> > >> > packages:
> > >> >
> > >> >  * This is used to partition the output keys into groups of keys.
> > >> >
> > >> >  * Keys are grouped according to the regions that currently exist
> > >> >
> > >> >  * so that each reducer fills a single region so load is
> distributed.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > >> > jean-m...@spaggiari.org> wrote:
> > >> >
> > >> > > Hi Nitin,
> > >> > >
> > >> > > You got my question correctly.
> > >> > >
> > >> > > However, I'm wondering how it's working when it's done into HBase.
> > Do
> > >> > > we have defaults partionners so we have the same garantee that
> > records
> > >> > > mapping to one key go to the same reducer. Or do we have to
> > implement
> > >> > > this one our own.
> > >> > >
> > >> > > JM
> > >> > >
> > >> > > 2013/4/10 Nitin Pawar <nitinpawar...@gmail.com>:
> > >> > > > I hope i understood what you are asking is this . If not then
> > >> pardon me
> > >> > > :)
> > >> > > > from the hadoop developer handbook few lines
> > >> > > >
> > >> > > > The*Partitioner* class determines which partition a given (key,
> > >> value)
> > >> > > pair
> > >> > > > will go to. The default partitioner computes a hash value for
> the
> > >> key
> > >> > and
> > >> > > > assigns the partition based on this result. It garantees that
> all
> > >> the
> > >> > > > records mapping to one key go to same reducer
> > >> > > >
> > >> > > > You can write your custom partitioner as well
> > >> > > > here is the link :
> > >> > > >
> > >> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
> > >> > > > jean-m...@spaggiari.org> wrote:
> > >> > > >
> > >> > > >> Hi,
> > >> > > >>
> > >> > > >> quick question. How are the data from the map tasks
> partitionned
> > >> for
> > >> > > >> the reducers?
> > >> > > >>
> > >> > > >> If there is 1 reducer, it's easy, but if there is more, are all
> > >> they
> > >> > > >> same keys garanteed to end on the same reducer? Or not
> necessary?
> > >>  If
> > >> > > >> they are not, how can we provide a partionning function?
> > >> > > >>
> > >> > > >> Thanks,
> > >> > > >>
> > >> > > >> JM
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Nitin Pawar
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Nitin Pawar
> > >>
> > >
> > >
> >
>
>
>
> --
> Graeme Wallace
> CTO
> FareCompare.com
> O: 972 588 1414
> M: 214 681 9018
>

Re: MapReduce: Reducers partitions.

Reply via email to