As I understand it, you have a table and need to perform a set of operations on a subset of its rowIDs. One idea (and I'm new to this too) would be to create a temporary table, write the rows in question into it, and then use a TableMapper on the temp table. Another would be to create a SequenceFile containing the rowIDs, use a SequenceFileInputFormat to drive your MapReduce job, and do Get operations to read the rows from within your mapper method. A third idea would be to override getSplits in TableInputFormat so that you scan the table and create splits only where you have blocks of contiguous rows to process. This last idea probably only makes sense if the rows you need to process are not randomly distributed, but rather occur in ranges.
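For the third idea, the core of a custom getSplits is just grouping the sorted rowIDs into contiguous ranges. Here's a minimal sketch of that grouping step in plain Java (no HBase dependency; class and method names are illustrative, and a real implementation would turn each range into a TableSplit):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Groups a sorted list of numeric rowIDs into contiguous [start, end]
 *  ranges, one range per would-be split. */
public class RangeSplitter {

    public static List<long[]> toRanges(List<Long> sortedIds) {
        List<long[]> ranges = new ArrayList<long[]>();
        if (sortedIds.isEmpty()) return ranges;
        long start = sortedIds.get(0);
        long prev = start;
        for (int i = 1; i < sortedIds.size(); i++) {
            long id = sortedIds.get(i);
            if (id != prev + 1) {                  // gap found: close the current range
                ranges.add(new long[] { start, prev });
                start = id;
            }
            prev = id;
        }
        ranges.add(new long[] { start, prev });    // close the last open range
        return ranges;
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, 2L, 3L, 7L, 8L, 20L);
        for (long[] r : toRanges(ids)) {
            System.out.println(r[0] + "-" + r[1]); // prints 1-3, 7-8, 20-20
        }
    }
}
```

With ranges in hand, each split covers one [start, end] span, so the scan touches only the rows you actually need.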
-geoff

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com]
Sent: Tuesday, April 20, 2010 9:37 AM
To: hbase-user@hadoop.apache.org
Subject: RE: Get operation in HBase Map-Reduce methods

Going back to the OP's question... using get() within a M/R job, the answer is yes. However, you first need to somehow determine which row_id you want to retrieve. Since you're starting with a list of row_ids, that list should be the source for your M/R job. You'd have to write your mapper to take the data from this list as its input source, then connect to HBase in each mapper's setup() so the connection can be reused in each iteration of map().

I have a process where I scan one column family in a table and, based on information in the record, perform a get(), so what you want to do is possible in a M/R job. I don't have a good code example for your specific use case. The issue isn't connecting to HBase or doing the get (that's trivial); the hard part is writing a mapper that takes an in-memory list as its input source. Now here's the point where someone from Cloudera, Yahoo! or somewhere else says that even that piece is trivial and here's how to do it. :-)

-Mike

> Date: Tue, 20 Apr 2010 10:15:52 +0200
> Subject: Re: Get operation in HBase Map-Reduce methods
> From: jdcry...@apache.org
> To: hbase-user@hadoop.apache.org
>
> What are the numbers like? Is it 1k rows you need to process? 1M? 10B?
> Your question is more about scaling (or the need to).
>
> J-D
>
> On Tue, Apr 20, 2010 at 8:39 AM, Andrey <atimerb...@gmx.net> wrote:
> > Dear All,
> >
> > Assume I've got a list of rowIDs of an HBase table. I want to get
> > each row by its rowID, do some operations on its values, and then
> > store the results somewhere. Is there a good way to do this in a Map-Reduce manner?
> >
> > As far as I understand, a mapper usually takes a Scan to form its
> > inputs.
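The setup()-then-get() lifecycle Mike describes can be sketched without a cluster. Below, a plain HashMap stands in for the HBase table so the pattern is runnable as-is; in the real mapper, setup() would open an HTable and map() would call table.get(new Get(rowId)). All names here are illustrative, not HBase API:

```java
import java.util.Map;

/** Shape of the mapper: open the table connection once in setup(),
 *  then do one lookup per map() call. A Map<String,String> stands in
 *  for the HBase table so this sketch runs without a cluster. */
public class GetPerRowMapper {

    private Map<String, String> table;   // real code: HTable, opened in setup()

    /** Called once per mapper, before any map() calls. */
    public void setup(Map<String, String> fakeTable) {
        // real code: this.table = new HTable(conf, "mytable");
        this.table = fakeTable;
    }

    /** Called once per rowId arriving from the input source
     *  (e.g. a SequenceFile of rowIDs, one per record). */
    public String map(String rowId) {
        // real code: Result r = table.get(new Get(Bytes.toBytes(rowId)));
        return table.get(rowId);
    }
}
```

The point of the pattern is that the expensive connection setup happens once per mapper, while map() runs once per input record and pays only for the get itself.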
> > It is possible to create a Scan that contains a long list of
> > RowFilters, each testing EQUAL against a particular <rowId>. Such a
> > strategy will certainly work, but it is inefficient, since every filter is tried against every row the scan returns.
> >
> > So, is there a good Map-Reduce practice for this kind of situation?
> > (E.g., doing a Get operation inside a map() method.) If yes, could
> > you kindly point me to a good code example?
> >
> > Thank you in advance.
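To see why the filter-per-rowID Scan is wasteful, here's a toy comparison in plain Java (no HBase; the counting is a simplification of what the region server would do). A filtered full scan tests every wanted rowID against every scanned row, while direct gets do one point lookup per wanted rowID:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Counts the work done by a filtered full scan versus direct gets. */
public class ScanVsGet {

    /** Filtered scan: each row is compared against the wanted ids
     *  (one RowFilter per id) until a match or the list is exhausted. */
    public static int filteredScan(List<String> tableRows, Set<String> wanted, List<String> out) {
        int comparisons = 0;
        for (String row : tableRows) {
            for (String w : wanted) {
                comparisons++;
                if (row.equals(w)) { out.add(row); break; }
            }
        }
        return comparisons;
    }

    /** Direct gets: one point lookup per wanted rowID. */
    public static int directGets(Set<String> tableRows, Set<String> wanted, List<String> out) {
        int lookups = 0;
        for (String w : wanted) {
            lookups++;
            if (tableRows.contains(w)) out.add(w);
        }
        return lookups;
    }
}
```

Even on a 100-row table with 3 wanted rows, the filtered scan does hundreds of comparisons where the gets do three lookups; the gap only widens with table size, which is why the Get-inside-map() approach scales with the size of the rowID list rather than the size of the table.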