I guess the list part could be trivial.  I did a very simple test program
that used an input file listing one key per line as the source for my
MapReduce program.

Basically, I did a FileInputFormat.addInputPaths() to point at the input
file, and my Mapper did a bunch of Get()s, one for each key.  I'm guessing
that's fine for a small set of keys you're looking to operate on.
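
Roughly, the Mapper side would look something like the sketch below.  It's
only a minimal sketch, not the exact code I ran; "mytable", "cf" and "qual"
are placeholder names you'd replace with your own:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Reads one row key per input line (plain TextInputFormat) and issues an
  // HBase Get for each one.
  public class KeyLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
      // Open the table once per map task, not once per record.
      table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
          "mytable");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String rowId = line.toString().trim();
      if (rowId.isEmpty()) {
        return;
      }

      Result result = table.get(new Get(Bytes.toBytes(rowId)));
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
      if (value != null) {
        // Emit the key plus whatever you pulled out of the row.
        context.write(new Text(rowId), new Text(Bytes.toString(value)));
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.close();
    }
  }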

The problem I'd see, though, is that with only a few thousand input lines,
just a single Map task would be kicked off to process them.  I guess it
depends on where the brunt of your work will be done, in the Map or the
Reduce phase.  Based on the 1k record count, I'd almost recommend doing the
brunt of the work in the Reduce, and manually specifying the number of
reducers via Job.setNumReduceTasks().
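
On the driver side, something along these lines would do it.  Again just a
sketch: KeyLookupDriver and HeavyLiftingReducer are made-up names, and the
reducer count of 8 is an arbitrary number you'd tune for your cluster:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class KeyLookupDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "key-lookup");
      job.setJarByClass(KeyLookupDriver.class);

      // args[0] = file with one row key per line, args[1] = output directory.
      FileInputFormat.addInputPaths(job, args[0]);
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      job.setMapperClass(KeyLookupMapper.class);
      job.setReducerClass(HeavyLiftingReducer.class);  // your reducer goes here
      job.setNumReduceTasks(8);                        // tune for your cluster

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }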

Hope that helps

--Rick


On 4/20/10 12:36 PM, "Michael Segel" <michael_se...@hotmail.com> wrote:

> 
> Going back to the OP's question... using get() within an M/R, the answer is
> yes.
> 
> However, you have a problem in that you somehow need to determine which
> row_id you want to retrieve.
> 
> Since you're starting with a list of row_ids, that list should be the source
> for your m/r.
> So you'd have to work out your mapper to take the data from this list as its
> source, and then in each mapper's setup() you connect to HBase so the
> connection can be used in each iteration of map().
> 
> I have a process where I scan one column family in a table and, based on
> information in the record, I have to perform a get(), so what you want to do
> is possible in an M/R.
> 
> I don't have a good code example for your specific use case. The issue isn't
> in connecting to hbase or doing the get. (That's trivial) The hard part is
> writing a mapper that takes a list in memory as its input source.
> 
> Now here's the point where someone from Cloudera, Yahoo! or somewhere else
> says that even that piece is trivial and here's how to do it. :-)
> 
> -Mike
> 
> 
>> Date: Tue, 20 Apr 2010 10:15:52 +0200
>> Subject: Re: Get operation in HBase Map-Reduce methods
>> From: jdcry...@apache.org
>> To: hbase-user@hadoop.apache.org
>>
>> What are the numbers like? Is it 1k rows you need to process? 1M? 10B?
>> Your question is more about scaling (or the need to).
>>
>> J-D
>>
>> On Tue, Apr 20, 2010 at 8:39 AM, Andrey <atimerb...@gmx.net> wrote:
>>> Dear All,
>>>
>>> Assume I've got a list of rowIDs of an HBase table. I want to get each row
>>> by its rowID, do some operations with its values, and subsequently store
>>> the results somewhere. Is there a good way to do this in a Map-Reduce
>>> manner?
>>>
>>> As far as I understand, a mapper usually takes a Scan to form its inputs.
>>> It is quite possible to create such a Scan containing a lot of RowFilters,
>>> each set to be EQUAL to a particular <rowId>. Such a strategy will work
>>> for sure, but it is inefficient, since each filter will be matched against
>>> every row found.
>>>
>>> So, is there a good Map-Reduce practice for this kind of situation? (E.g.
>>> making a Get operation inside a map() method.) If yes, could you kindly
>>> point to a good code example?
>>>
>>> Thank you in advance.
