Mridul, it seems feasible, but I am not 100% clear. Can you please show us your implementation in Hadoop so that we can get some idea and implement the same for HBase? Thanks for your help.
J-S

On Sat, Jan 9, 2010 at 12:26 AM, Mridul Muralidharan <[email protected]> wrote:

> Hi,
>
> This is assuming there is no easier way to do it (someone from the hbase
> team can comment better!).
>
> But the usual way to handle this for mapreduce is to create a composite
> input format: one which delegates to the underlying formats to generate
> the splits, and the corresponding record readers based on the split.
>
> I have not done this for hbase though - but looking at
> TableInputFormatBase, it looks possible to implement ...
>
> Specifically for hbase, something along the lines of:
>
> --- start dirty pseudo code ---
>
> CustomTableInputFormat extends TableInputFormatBase and implements
> setConf() to configure the table(s) required.
>
> public class CompositeTableInputFormat extends
>     InputFormat<ImmutableBytesWritable, Result> {
>
>   private CustomTableInputFormat delegate1;
>   private CustomTableInputFormat delegate2;
>
>   public void setConf() {
>     delegate1 = createTable1InputFormat();
>     delegate2 = createTable2InputFormat();
>   }
>
>   public List<InputSplit> getSplits(JobContext context) throws IOException {
>     List<InputSplit> retval = new LinkedList<InputSplit>();
>     retval.addAll(delegate1.getSplits(context));
>     retval.addAll(delegate2.getSplits(context));
>     return retval;
>   }
>
>   public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
>       InputSplit split, TaskAttemptContext context)
>       throws IOException, InterruptedException {
>     if (split for table1) return delegate1.createRecordReader(split, context);
>     else if (split for table2) return delegate2.createRecordReader(split, context);
>     else throw exception
>   }
> }
>
> --- end pseudo code ---
>
> Regards,
> Mridul
>
> john smith wrote:
>
>> Mridul
>>
>> Can you be more clear .. I didn't get you!
>>
>> On Fri, Jan 8, 2010 at 6:13 PM, Mridul Muralidharan <[email protected]> wrote:
>>
>>> If you just want to scan both tables for your mapper, assuming there is
>>> no easier way to do it - can you not write a composite input format
>>> which delegates to both tables' input formats?
>>>
>>> Regards,
>>> Mridul
>>>
>>> john smith wrote:
>>>
>>>> Stack,
>>>>
>>>> The requirement is that I need to scan two tables A, B for an MR job;
>>>> order is not important. That is, the reduce phase contains keys from
>>>> both A and B.
>>>>
>>>> Presently I am using TableMap for "A", and in one of the mappers I am
>>>> reading the entire B using a scanner. But this is a big overhead,
>>>> right? Because non-local B data will be transferred (over the network)
>>>> to the machine executing that map phase. Instead, what I was thinking
>>>> is that there is some kind of variant of TableMap which scans both A
>>>> and B and emits the corresponding keys. Order is not at all important
>>>> and there are no random lookups. I need all the B table keys in some
>>>> way or the other with the least overhead!
>>>>
>>>> There's also one more solution I was thinking of. Suppose I am scanning
>>>> some particular region using a table map. I can get that region's name
>>>> using some function in the API, then I can build a scanner on B over
>>>> that particular region and emit all the keys from B. This doesn't
>>>> require any network transfer of data. Is this solution feasible? If
>>>> yes, any hints on which classes to use from the API?
>>>>
>>>> Thanks,
>>>> J-S
>>>>
>>>> On Fri, Jan 8, 2010 at 10:46 AM, stack <[email protected]> wrote:
>>>>
>>>>> This is a little tough. Do both tables have the same number of
>>>>> regions? Are you walking through the two tables serially in your
>>>>> mapreduce, or do you want to do random lookups into the second table
>>>>> dependent on the row you are currently processing in table one?
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Thu, Jan 7, 2010 at 7:51 PM, john smith <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> My requirement is that I must read two tables (belonging to the same
>>>>>> region server) in the same Map.
>>>>>>
>>>>>> Normally TableMap supports only one table at a time, and right now I
>>>>>> am reading the entire 2nd table in any one of the maps, which is a
>>>>>> big overhead. So can anyone suggest some modification of TableMap, or
>>>>>> a different approach, which can read 2 tables simultaneously? This
>>>>>> can be very useful to us!
>>>>>>
>>>>>> Thanks
>>>>>> J-S
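Fleshing out Mridul's pseudo code, a composite input format for two HBase tables might look roughly like the sketch below. It is untested and written against the HBase 0.20 mapreduce API; the class name TwoTableInputFormat and the SECOND_TABLE property are made up for this example, and routing a split back to its delegate via TableSplit.getTableName() is an assumption about what TableSplit exposes.

--- start sketch: composite input format for two tables ---

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TwoTableInputFormat
    extends InputFormat<ImmutableBytesWritable, Result> implements Configurable {

  // Hypothetical job property naming the second table; any key would do.
  public static final String SECOND_TABLE = "example.mapreduce.second.table";

  private Configuration conf;
  private TableInputFormat delegate1;   // scans the table in hbase.mapreduce.inputtable
  private TableInputFormat delegate2;   // scans the table named by SECOND_TABLE
  private byte[] table1;
  private byte[] table2;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    table1 = Bytes.toBytes(conf.get(TableInputFormat.INPUT_TABLE));
    table2 = Bytes.toBytes(conf.get(SECOND_TABLE));

    // First delegate reads the usual TableInputFormat configuration as-is.
    delegate1 = new TableInputFormat();
    delegate1.setConf(conf);

    // Second delegate gets a copy of the configuration pointed at table B.
    Configuration conf2 = new Configuration(conf);
    conf2.set(TableInputFormat.INPUT_TABLE, conf.get(SECOND_TABLE));
    delegate2 = new TableInputFormat();
    delegate2.setConf(conf2);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException {
    // One split list covering the regions of both tables, so each map task
    // scans exactly one region of either table.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    splits.addAll(delegate1.getSplits(context));
    splits.addAll(delegate2.getSplits(context));
    return splits;
  }

  @Override
  public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // TableInputFormat hands out TableSplit, which carries its table name;
    // use that to route the split back to the delegate that produced it.
    byte[] tableName = ((TableSplit) split).getTableName();
    if (Bytes.equals(tableName, table1)) {
      return delegate1.createRecordReader(split, context);
    } else if (Bytes.equals(tableName, table2)) {
      return delegate2.createRecordReader(split, context);
    }
    throw new IOException("Split does not belong to either configured table");
  }
}

--- end sketch ---

Putting both tables' splits into one list is what should keep the reads region-local: each map task gets one region of either table, and the record reader it receives is the one built by that table's delegate.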

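If that works, a driver could be wired up roughly as below. Again a sketch only: the table names "A" and "B", the KeyMapper, and the map-only/NullOutputFormat setup are placeholders, and TwoTableInputFormat/SECOND_TABLE are the hypothetical names from the sketch above, not part of the HBase API. A real job would plug in its own mapper, reducer, and output format.

--- start sketch: driver ---

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class TwoTableScanJob {

  // Emits every row key it sees, whichever table the split came from.
  public static class KeyMapper
      extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(key, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();     // HBaseConfiguration.create() on later releases
    conf.set(TableInputFormat.INPUT_TABLE, "A");       // table scanned by delegate1
    conf.set(TwoTableInputFormat.SECOND_TABLE, "B");   // table scanned by delegate2

    Job job = new Job(conf, "scan-two-tables");
    job.setJarByClass(TwoTableScanJob.class);
    job.setInputFormatClass(TwoTableInputFormat.class);
    job.setMapperClass(KeyMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(0);                          // map-only for the sketch
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

--- end sketch ---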