Mridul, it seems feasible, but I am not 100% clear. Can you please show us your implementation in Hadoop so that we can get some idea and implement the same for HBase? Thanks for your help.
J-S

On Sat, Jan 9, 2010 at 12:26 AM, Mridul Muralidharan <[email protected]> wrote:

> Hi,
>
> This is assuming there is no easier way to do it (someone from the hbase
> team can comment better!).
>
> But the usual way to handle this for mapreduce is to create a composite
> input format: one which delegates to the underlying formats to generate
> the splits, and the corresponding record readers based on the split.
>
> I have not done this for hbase though - but looking at
> TableInputFormatBase, it looks possible to implement ...
>
> Specifically for hbase, something along the lines of:
>
> --- start dirty pseudo code ---
>
> CustomTableInputFormat extends TableInputFormatBase and implements
> setConf() to configure the table(s) required.
>
> public class CompositeTableInputFormat extends
>     InputFormat<ImmutableBytesWritable, Result> {
>
>   private CustomTableInputFormat delegate1;
>   private CustomTableInputFormat delegate2;
>
>   public void setConf() {
>     delegate1 = createTable1InputFormat();
>     delegate2 = createTable2InputFormat();
>   }
>
>   public List<InputSplit> getSplits(JobContext context) throws IOException {
>     List<InputSplit> retval = new LinkedList<InputSplit>();
>     retval.addAll(delegate1.getSplits(context));
>     retval.addAll(delegate2.getSplits(context));
>     return retval;
>   }
>
>   public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
>       InputSplit split, TaskAttemptContext context)
>       throws IOException, InterruptedException {
>     if (split for table1) return delegate1.createRecordReader(split, context);
>     else if (split for table2) return delegate2.createRecordReader(split, context);
>     else throw exception
>   }
> }
>
> --- end pseudo code ---
>
> Regards,
> Mridul
>
> john smith wrote:
>
>> Mridul
>>
>> Can you be more clear .. I didn't get you!
>>
>> On Fri, Jan 8, 2010 at 6:13 PM, Mridul Muralidharan <[email protected]> wrote:
>>
>>> If you just want to scan both tables for your mapper, assuming there is
>>> no easier way to do it - can you not write a composite input format
>>> which delegates to both tables' input formats?
>>>
>>> Regards,
>>> Mridul
>>>
>>> john smith wrote:
>>>
>>>> Stack,
>>>>
>>>> The requirement is that I need to scan two tables A, B for an MR job;
>>>> order is not important. That is, the reduce phase contains keys from
>>>> both A and B.
>>>>
>>>> Presently I am using TableMap for "A", and in one of the mappers I am
>>>> reading the entire B using a scanner. But this is a big overhead,
>>>> right? Because non-local B data will be transferred (over the network)
>>>> to the machine executing that map phase. Instead, what I was thinking
>>>> is that there is some kind of variant of TableMap which scans both A
>>>> and B and emits the corresponding keys. Order is not at all important
>>>> and there are no random lookups. I need all the B table keys in some
>>>> way or the other with the least overhead!
>>>>
>>>> There's also one more solution I was thinking of. Suppose I am scanning
>>>> some particular region using a table map. I can get that region's name
>>>> using some function in the API, then I can build a scanner on B over
>>>> that particular region and emit all the keys from B. This doesn't
>>>> require any network transfer of data. Is this solution feasible? If
>>>> yes, any hints on which classes to use from the API?
>>>>
>>>> Thanks,
>>>> J-S
>>>>
>>>> On Fri, Jan 8, 2010 at 10:46 AM, stack <[email protected]> wrote:
>>>>
>>>>> This is a little tough. Do both tables have the same number of
>>>>> regions? Are you walking through the two tables serially in your
>>>>> mapreduce, or do you want to do random lookups into the second table
>>>>> dependent on the row you are currently processing in table one?
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Thu, Jan 7, 2010 at 7:51 PM, john smith <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> My requirement is that I must read two tables (belonging to the same
>>>>>> region server) in the same Map.
>>>>>>
>>>>>> Normally TableMap supports only one table at a time, and right now I
>>>>>> am reading the entire 2nd table in any one of the maps, which is a
>>>>>> big overhead. So can anyone suggest some modification of TableMap, or
>>>>>> a different approach, which can read 2 tables simultaneously? This
>>>>>> can be very useful to us!
>>>>>>
>>>>>> Thanks
>>>>>> J-S
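Fleshing out Mridul's pseudo code, a composite input format for two HBase tables might look roughly like the sketch below. It is untested and written against the HBase 0.20 mapreduce API; the class name TwoTableInputFormat and the SECOND_TABLE property are made up for this example, and routing a split back to its delegate via TableSplit.getTableName() is an assumption about what TableSplit exposes.

--- start sketch: composite input format for two tables ---

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TwoTableInputFormat
    extends InputFormat<ImmutableBytesWritable, Result> implements Configurable {

  // Hypothetical job property naming the second table; any key would do.
  public static final String SECOND_TABLE = "example.mapreduce.second.table";

  private Configuration conf;
  private TableInputFormat delegate1;   // scans the table in hbase.mapreduce.inputtable
  private TableInputFormat delegate2;   // scans the table named by SECOND_TABLE
  private byte[] table1;
  private byte[] table2;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    table1 = Bytes.toBytes(conf.get(TableInputFormat.INPUT_TABLE));
    table2 = Bytes.toBytes(conf.get(SECOND_TABLE));

    // First delegate reads the usual TableInputFormat configuration as-is.
    delegate1 = new TableInputFormat();
    delegate1.setConf(conf);

    // Second delegate gets a copy of the configuration pointed at table B.
    Configuration conf2 = new Configuration(conf);
    conf2.set(TableInputFormat.INPUT_TABLE, conf.get(SECOND_TABLE));
    delegate2 = new TableInputFormat();
    delegate2.setConf(conf2);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException {
    // One split list covering the regions of both tables, so each map task
    // scans exactly one region of either table.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    splits.addAll(delegate1.getSplits(context));
    splits.addAll(delegate2.getSplits(context));
    return splits;
  }

  @Override
  public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // TableInputFormat hands out TableSplit, which carries its table name;
    // use that to route the split back to the delegate that produced it.
    byte[] tableName = ((TableSplit) split).getTableName();
    if (Bytes.equals(tableName, table1)) {
      return delegate1.createRecordReader(split, context);
    } else if (Bytes.equals(tableName, table2)) {
      return delegate2.createRecordReader(split, context);
    }
    throw new IOException("Split does not belong to either configured table");
  }
}

--- end sketch ---

Putting both tables' splits into one list is what should keep the reads region-local: each map task gets one region of either table, and the record reader it receives is the one built by that table's delegate.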

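If that works, a driver could be wired up roughly as below. Again a sketch only: the table names "A" and "B", the KeyMapper, and the map-only/NullOutputFormat setup are placeholders, and TwoTableInputFormat/SECOND_TABLE are the hypothetical names from the sketch above, not part of the HBase API. A real job would plug in its own mapper, reducer, and output format.

--- start sketch: driver ---

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class TwoTableScanJob {

  // Emits every row key it sees, whichever table the split came from.
  public static class KeyMapper
      extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(key, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();     // HBaseConfiguration.create() on later releases
    conf.set(TableInputFormat.INPUT_TABLE, "A");       // table scanned by delegate1
    conf.set(TwoTableInputFormat.SECOND_TABLE, "B");   // table scanned by delegate2

    Job job = new Job(conf, "scan-two-tables");
    job.setJarByClass(TwoTableScanJob.class);
    job.setInputFormatClass(TwoTableInputFormat.class);
    job.setMapperClass(KeyMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(0);                          // map-only for the sketch
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

--- end sketch ---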