I guess the list part could be trivial. I did a very simple test program that used an input file, one key per line, as the input to my MapReduce program.
Basically, I did a FileInputFormat.addInputPaths() to point at the input file, and my Mapper did a Get() for each key. I'm guessing that's fine for a small set of keys you're looking to operate on.

The problem I'd see, though, is that a few thousand lines still fit in a single input split, so only a single map task would be kicked off to process them. It depends on whether the brunt of your work happens in the map or the reduce phase. Given a count of around 1k records, I'd almost recommend doing the heavy lifting in the reduce phase and manually setting the number of reducers via Job.setNumReduceTasks(). Rough sketches of both the mapper and the driver are below the quoted thread.

Hope that helps
--Rick

On 4/20/10 12:36 PM, "Michael Segel" <michael_se...@hotmail.com> wrote:

> Going back to the OP's question... using get() within a M/R, the answer is
> yes.
>
> However, you have a problem in that you need to somehow determine which
> row_id you want to retrieve.
>
> Since you're starting with a list of row_ids, that list should be the source
> for your M/R. So you'd have to write your mapper to take the data from this
> list as its source, and then in each mapper's setup() you connect to HBase,
> so the connection can be used in each iteration of map().
>
> I have a process where I scan one column family in a table and, based on
> information in the record, I have to perform a get(), so what you want to do
> is possible in a M/R.
>
> I don't have a good code example for your specific use case. The issue isn't
> in connecting to HBase or doing the get (that's trivial). The hard part is
> writing a mapper that takes a list in memory as its input source.
>
> Now here's the point where someone from Cloudera, Yahoo! or somewhere else
> says that even that piece is trivial and here's how to do it. :-)
>
> -Mike
>
>> Date: Tue, 20 Apr 2010 10:15:52 +0200
>> Subject: Re: Get operation in HBase Map-Reduce methods
>> From: jdcry...@apache.org
>> To: hbase-user@hadoop.apache.org
>>
>> What are the numbers like? Is it 1k rows you need to process? 1M? 10B?
>> Your question is really about scaling (or the need to).
>>
>> J-D
>>
>> On Tue, Apr 20, 2010 at 8:39 AM, Andrey <atimerb...@gmx.net> wrote:
>>> Dear All,
>>>
>>> Assume I've got a list of rowIDs of an HBase table. I want to get each
>>> row by its rowID, do some operations on its values, and then store the
>>> results somewhere. Is there a good way to do this in a Map-Reduce manner?
>>>
>>> As far as I understand, a mapper usually takes a Scan to form its inputs.
>>> It is quite possible to create a Scan with a RowFilter testing EQUAL
>>> against each particular <rowId>. Such a strategy will certainly work, but
>>> it is inefficient, since every filter will be matched against every row
>>> found.
>>>
>>> So, is there a good Map-Reduce practice for this kind of situation?
>>> (E.g., making a Get operation inside a map() method.) If so, could you
>>> kindly point me to a good code example?
>>>
>>> Thank you in advance.
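
P.S. Here's a rough, untested sketch of the mapper I described. I've written it against the newer-style client API (HBaseConfiguration.create() / HTable(Configuration, String)), and the table name, column family, and qualifier are made up, so adjust them to your schema. The connection is opened once per task in setup(), as Mike describes, and closed in cleanup():

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GetByKeyMapper extends Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Open the HBase connection once per map task, not once per record.
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    table = new HTable(conf, "mytable"); // made-up table name
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // With TextInputFormat, each value is one line of the key file.
    String rowKey = line.toString().trim();
    if (rowKey.isEmpty()) {
      return;
    }
    Result result = table.get(new Get(Bytes.toBytes(rowKey)));
    if (!result.isEmpty()) {
      // Do the cheap per-row work here; push the heavy work to the reducer.
      byte[] v = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual")); // made-up column
      context.write(new Text(rowKey), new Text(v == null ? "" : Bytes.toString(v)));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.close();
  }
}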
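
And a matching driver sketch. The key-list file rides in through the default TextInputFormat, and the heavy lifting is pushed to an explicit number of reducers via setNumReduceTasks(). The paths come from the command line and the reducer class is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GetByKeyJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "get-by-key-list");
    job.setJarByClass(GetByKeyJob.class);

    // The key-list file (one row key per line) is the job's input.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(GetByKeyMapper.class);
    job.setReducerClass(MyHeavyLiftingReducer.class); // placeholder -- your reducer here
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // With ~1k keys you'll likely get a single map task, so push the
    // expensive work into the reduce phase and size it explicitly.
    job.setNumReduceTasks(8);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}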