Hi,
This is assuming there is no easier way to do it (someone from the HBase
team can comment better!).
The usual way to handle this in MapReduce is to write a composite
input format, which delegates to the underlying input formats to generate
the splits and then returns the matching record reader for each split.
I have not done this for HBase myself, but looking at
TableInputFormatBase, it looks possible to implement.
Specifically for HBase, something along the lines of:
--- start dirty pseudo code ---
CustomTableInputFormat wraps one TableInputFormatBase per table and
implements setConf() to configure the table(s) required.

public class CustomTableInputFormat
    extends InputFormat<ImmutableBytesWritable, Result>
    implements Configurable {

  private Configuration conf;
  // One delegate per table; each is a normal single-table input format.
  private TableInputFormatBase delegate1;
  private TableInputFormatBase delegate2;

  public void setConf(Configuration conf) {
    this.conf = conf;
    // createTableNInputFormat() are placeholders, not existing API.
    delegate1 = createTable1InputFormat(conf);
    delegate2 = createTable2InputFormat(conf);
  }

  public Configuration getConf() { return conf; }

  public List<InputSplit> getSplits(JobContext context)
      throws IOException {
    // The job's splits are simply the splits of both tables.
    List<InputSplit> retval = new LinkedList<InputSplit>();
    retval.addAll(delegate1.getSplits(context));
    retval.addAll(delegate2.getSplits(context));
    return retval;
  }

  public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // Route each split back to the delegate that produced it.
    if (split for table1)
      return delegate1.createRecordReader(split, context);
    else if (split for table2)
      return delegate2.createRecordReader(split, context);
    else throw exception
  }
}
--- end pseudo code ---
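To make the delegation idea concrete without pulling in Hadoop or HBase,
here is a small plain-Java model of it. Treat it as a sketch only: Split,
TableSource, SingleTableSource and CompositeSource are names made up for
this example (rough stand-ins for InputSplit, InputFormat and a per-table
input format), and the "reader" is just a string. The point it shows is
that each split remembers which table it came from, so the composite can
hand it back to the right delegate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a table split: it records its source table.
final class Split {
    final String table;
    final String startRow;

    Split(String table, String startRow) {
        this.table = table;
        this.startRow = startRow;
    }
}

// Hypothetical stand-in for an input format over table data.
interface TableSource {
    List<Split> getSplits();

    // A real implementation would return a record reader; a label is
    // enough here to show which delegate handled the split.
    String createReader(Split split);
}

// One table, one split per (made-up) region start key.
final class SingleTableSource implements TableSource {
    private final String table;
    private final List<String> regionStarts;

    SingleTableSource(String table, String... regionStarts) {
        this.table = table;
        this.regionStarts = Arrays.asList(regionStarts);
    }

    public List<Split> getSplits() {
        List<Split> out = new ArrayList<>();
        for (String start : regionStarts) {
            out.add(new Split(table, start));
        }
        return out;
    }

    public String createReader(Split split) {
        return "reader(" + split.table + "@" + split.startRow + ")";
    }
}

// The composite: concatenates the delegates' splits and routes each
// split back to the delegate registered for its table.
final class CompositeSource implements TableSource {
    private final Map<String, TableSource> delegates = new LinkedHashMap<>();

    void add(String table, TableSource source) {
        delegates.put(table, source);
    }

    public List<Split> getSplits() {
        List<Split> all = new ArrayList<>();
        for (TableSource delegate : delegates.values()) {
            all.addAll(delegate.getSplits());
        }
        return all;
    }

    public String createReader(Split split) {
        TableSource delegate = delegates.get(split.table);
        if (delegate == null) {
            throw new IllegalArgumentException(
                "No delegate for table " + split.table);
        }
        return delegate.createReader(split);
    }
}

class CompositeDemo {
    public static void main(String[] args) {
        CompositeSource both = new CompositeSource();
        both.add("A", new SingleTableSource("A", "", "m"));
        both.add("B", new SingleTableSource("B", "", "g", "t"));
        for (Split split : both.getSplits()) {
            System.out.println(both.createReader(split));
        }
    }
}
```

In the real thing, the "which table is this split for" check would
probably compare the table name carried by the split (I believe
TableSplit exposes it via getTableName(), but please verify against your
HBase version) rather than a plain string field.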
Regards,
Mridul
john smith wrote:
Mridul,
Can you be more clear? I didn't get you!
On Fri, Jan 8, 2010 at 6:13 PM, Mridul Muralidharan wrote:
If you just want to scan both tables for your mapper, and assuming there
is no easier way to do it, can't you write a composite input format which
delegates to both tables' input formats?
Regards,
Mridul
john smith wrote:
Stack,
The requirement is that I need to scan two tables, A and B, for an MR
job. Order is not important. That is, the reduce phase receives keys
from both A and B.
Presently I am using TableMap for "A", and in one of the mappers I am
reading the entire B using a scanner. But this is a big overhead, right?
Non-local B data will be transferred (over the network) to the machine
executing that map task. Instead, I was thinking there might be some
variant of TableMap which scans both A and B and emits the corresponding
keys. Order is not at all important, and there are no random lookups. I
need the entire B table's keys in some way or the other with the least
overhead!
There's also one more solution I was thinking of. Suppose I am scanning
some particular region using TableMap. I can get that region's name
using some function in the API, then build a scanner on B over that
particular region and emit all the keys from B. This doesn't require any
network transfer of data. Is this solution feasible? If yes, any hints on
which classes to use from the API?
Thanks ,
J-S
On Fri, Jan 8, 2010 at 10:46 AM, stack wrote:
This is a little tough. Do both tables have the same number of regions?
Are you walking through the two tables serially in your mapreduce, or do
you want to do random lookups into the second table dependent on the row
you are currently processing in table one?
St.Ack
On Thu, Jan 7, 2010 at 7:51 PM, john smith wrote:
Hi all,
My requirement is that I must read two tables (belonging to the same
region server) in the same map.
Normally TableMap supports only one table at a time, and right now I am
reading the entire second table in one of the maps. This is a big
overhead. Can anyone suggest some modification of TableMap, or a different
approach, which can read two tables at the same time? This would be very
useful to us!
Thanks
J-S