Re: Custom Input Split

Rakhi Khatwani Wed, 22 Apr 2009 10:16:31 -0700

Hi Stack,
      Yes, I needed the result to feed a program.
Thanks for the suggestions, though. I'll try out the Counter.ROWS thing
tomorrow.


Thanks,
Raakhi

On Wed, Apr 22, 2009 at 10:36 PM, stack <st...@duboce.net> wrote:

> So you need the result to feed a program?
>
> Maybe someone else knows how to ask a finished mapreduce job questions
> about its counters?  There must be a way?
>
> Or, yeah, I suppose, I don't believe RowCounter writes the count to the
> filesystem.  You'd need to add that if you can't figure a way to ask the
> finished RowCounter job what the value of its Counter.ROWS counter was.
>
> St.Ack
>
> On Wed, Apr 22, 2009 at 9:50 AM, Rakhi Khatwani <rakhi.khatw...@gmail.com> wrote:
>
> > Hi St Ack,
> >          Well, I did go through the usage, where we are supposed to
> > mention 3 parameters: OutputDir, TableName and Columns.
> > What I actually wanted is an int count holding the number of rows in
> > the table.
> > I guess this program stores the output in some output dir... correct me
> > if I am wrong.
> >
> > Thanks,
> > Raakhi
> >
> > On Wed, Apr 22, 2009 at 8:25 AM, stack <st...@duboce.net> wrote:
> >
> > > Oh, and the reason to use an MR job to count rows is that if there
> > > are many rows, a single process would take too long (if you know you
> > > have a small table, use the 'count' command in the shell).
> > >
> > > St.Ack
> > >
> > > On Wed, Apr 22, 2009 at 9:06 AM, Stack <saint....@gmail.com> wrote:
> > >
> > > > If you run
> > > >
> > > > ./bin/hadoop jar hbase.jar rowcounter
> > > >
> > > > It will emit usage.  You are a smart fellow. I think you can take it
> > > > from there.
> > > >
> > > > Stack
> > > >
> > > >
> > > >
> > > >
> > > > On Apr 22, 2009, at 5:48, Rakhi Khatwani <rakhi.khatw...@gmail.com> wrote:
> > > >
> > > > >> Hi Lars,
> > > > >>          Thanks for the suggestion. I also figured out my problem
> > > > >> using TableInputFormatBase.
> > > > >>
> > > > >> My table has only one region, but I still want to split the input
> > > > >> into 4 maps, so I am basically overriding the getSplits() method in
> > > > >> TableInputFormatBase.
> > > >>
> > > > >> One more question:
> > > > >> is there any method in the HBase API which can count the number of
> > > > >> rows in a table?
> > > > >> I tried googling it and all I came across is a RowCounter class,
> > > > >> which is a mapreduce job to count the number of rows, but I really
> > > > >> don't know how to use it. Any suggestions?
> > > >>
> > > >> thanks,
> > > >> Raakhi
> > > >>
> > > >>
> > > > >> On Wed, Apr 22, 2009 at 4:30 AM, Lars George <l...@worldlingo.com> wrote:
> > > >>
> > > > >>> Hi Rakhi,
> > > > >>>
> > > > >>> This is all done in the TableInputFormatBase class, which you can
> > > > >>> extend and then override the getSplits() function:
> > > > >>>
> > > > >>> http://hadoop.apache.org/hbase/docs/r0.19.1/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
> > > >>>
> > > > >>> This is where you can then specify how many rows per map are
> > > > >>> assigned. Really straightforward as I see it. I have used it to
> > > > >>> implement a special "only use N regions" support where I can run
> > > > >>> an MR job against a sample subset. For example, only map 5 out of
> > > > >>> 8K regions of a table.
> > > >>>
> > > > >>> The default one will always split all regions into N maps. Hence
> > > > >>> the recommendation to set the number of maps to the number of
> > > > >>> regions in a table. If you set it to something lower, then it will
> > > > >>> split the regions into a smaller number of maps but with more rows
> > > > >>> per map, i.e. each map gets more than one region to process.
> > > >>>
> > > > >>> Look into the source of the above class and it should be obvious -
> > > > >>> I hope.
> > > >>>
> > > >>> Lars
> > > >>>
> > > >>>
> > > >>>
> > > >>> Rakhi Khatwani wrote:
> > > >>>
> > > > >>>> Hi,
> > > > >>>>   I have a table with N records.
> > > > >>>>   Now I want to run a map reduce job with 4 maps and 0 reduces.
> > > > >>>>   Is there a way I can create my own custom input split so that I
> > > > >>>> can send 'n' records to each map?
> > > > >>>>   If there is a way, can I have a sample code snippet to gain a
> > > > >>>> better understanding?
> > > >>>>
> > > >>>> Thanks
> > > >>>> Raakhi.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > >
> >
>
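
The thread never shows the snippet that was asked for, so here is a minimal
sketch of the boundary arithmetic a getSplits() override would need in order
to cut one region into several maps. It is plain Java with no HBase
dependency. RowRangeSplitter, Range and split() are hypothetical names made
up for illustration, not part of HBase, and the sketch assumes the sorted row
keys (or a sample of them) are already known, which in practice means
scanning the table first. An empty string stands for an open-ended start or
end row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: turns a sorted list of row keys
// into contiguous [startRow, endRow) ranges of near-equal size.
class RowRangeSplitter {

    /** One half-open row range; "" means unbounded on that side. */
    static final class Range {
        final String startRow;
        final String endRow;

        Range(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }

        @Override
        public String toString() {
            return "[" + startRow + ", " + endRow + ")";
        }
    }

    /**
     * Splits sortedKeys into at most numSplits ranges. The first range is
     * open at the start and the last range is open at the end, so together
     * they cover the whole table.
     */
    static List<Range> split(List<String> sortedKeys, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        int n = sortedKeys.size();
        // Never create more ranges than there are rows; always create one.
        int parts = Math.max(1, Math.min(numSplits, n));
        int base = n / parts;   // rows every range gets
        int extra = n % parts;  // the first 'extra' ranges get one more row

        List<Range> ranges = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < parts; i++) {
            int end = start + base + (i < extra ? 1 : 0);
            String startRow = (i == 0) ? "" : sortedKeys.get(start);
            String endRow = (i == parts - 1) ? "" : sortedKeys.get(end);
            ranges.add(new Range(startRow, endRow));
            start = end;
        }
        return ranges;
    }
}
```

Inside an overridden getSplits(), each Range would then be wrapped in a
table split carrying the table name and the region's location. That wiring is
left out here because it depends on the HBase version in use.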
