Hi Justin,

   If it is not feasible for you to do as Praveen suggested, here is how you
can proceed.

1. You can write a customized InputFormat that creates different connections
for the different data sources and returns splits from those data source
tables. Internally, your customized InputFormat can use DBInputFormat for
each data source where possible.

2. If your mapper input is not the same for the two data sources, you can
write one wrapper mapper that internally delegates to the mapper
corresponding to the input split (you can refer to MultipleInputs for this
pattern).

MultipleInputs doesn't support DBInputFormat; it supports only input
formats that use a file path as the input path.
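To make the delegation idea in point 2 concrete, here is a rough sketch in plain Java. Note this is only a simplified model of the pattern, not actual Hadoop code: the Split and RecordMapper interfaces below are hypothetical stand-ins for Hadoop's InputSplit and Mapper classes, used so the shape of the wrapper is visible in a few lines.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for Hadoop's InputSplit and Mapper, to show the shape only.
interface Split { String sourceName(); }            // a split tagged with its data source
interface RecordMapper { String map(String record); }

// One wrapper mapper that picks the real mapper based on the split's source tag,
// mirroring what MultipleInputs does internally for file-based input formats.
class DelegatingMapper {
    private final Map<String, RecordMapper> delegates = new HashMap<>();

    void register(String source, RecordMapper mapper) {
        delegates.put(source, mapper);
    }

    String map(Split split, String record) {
        RecordMapper m = delegates.get(split.sourceName());
        if (m == null) {
            throw new IllegalStateException("no mapper registered for " + split.sourceName());
        }
        return m.map(record);
    }
}

public class Demo {
    public static void main(String[] args) {
        DelegatingMapper dm = new DelegatingMapper();
        dm.register("orders",    r -> "order:" + r);     // per-source mappers
        dm.register("customers", r -> "customer:" + r);

        Split ordersSplit = () -> "orders";              // split tagged "orders"
        System.out.println(dm.map(ordersSplit, "42"));   // prints order:42
    }
}
```

In real Hadoop code the source tag would be carried in your custom InputSplit's serialized fields, and the wrapper's map() would instantiate and invoke the delegate Mapper with the real Context.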

If you explain your use case in more detail, I may be able to help you better.



Devaraj K 

-----Original Message-----
From: Praveen Sripati [mailto:[email protected]] 
Sent: Tuesday, December 06, 2011 4:11 PM
To: [email protected]
Subject: Re: Multiple Mappers for Multiple Tables

MultipleInputs takes multiple Paths (files) and not a DB as input. As
mentioned earlier, export the tables into HDFS either using Sqoop or a
native DB export tool and then do the processing. Sqoop is configured to
use the native DB export tool whenever possible.

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent <[email protected]> wrote:

> Thanks Bejoy,
> I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
> Path parameter. Are these paths just ignored here?
>
> On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks <[email protected]> wrote:
>
> > Hi Justin,
> >            Just to add on to my response. If you need to fetch data from
> > an RDBMS in your mapper using your own custom MapReduce code, you can use
> > DBInputFormat in your mapper class with MultipleInputs. You have to be
> > careful with the number of mappers for your application, as DBs are
> > constrained by a limit on maximum simultaneous connections. Also, you
> > need to ensure that the same query is not executed n times in n mappers
> > all fetching the same data; that would just be a waste of network
> > bandwidth. Sqoop + Hive would be my recommendation and a good
> > combination for such use cases. If you have Pig competency, you can also
> > look into Pig instead of Hive.
> >
> > Hope it helps!...
> >
> > Regards
> > Bejoy.K.S
> >
> > On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks <[email protected]> wrote:
> >
> > > Justin
> > >         If I get your requirement right, you need to get in data from
> > > multiple RDBMS sources and do a join on the same, and maybe some more
> > > custom operations on top of this. For this you don't need to go in for
> > > writing your own custom MapReduce code unless it is really required.
> > > You can achieve the same in two easy steps:
> > > - Import data from the RDBMS into Hive using Sqoop (import)
> > > - Use Hive to do the join and processing on this data
> > >
> > > Hope it helps!..
> > >
> > > Regards
> > > Bejoy.K.S
> > >
> > >
> > > On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent <[email protected]
> > >wrote:
> > >
> > >> I would like to join some DB tables, possibly from different
> > >> databases, in a MR job.
> > >>
> > >> I would essentially like to use MultipleInputs, but that seems file
> > >> oriented. I need a different mapper for each DB table.
> > >>
> > >> Suggestions?
> > >>
> > >> Thanks!
> > >>
> > >> Justin Vincent
> > >>
> > >
> > >
> >
>