Hi Justin,

If it is not feasible for you to do as Praveen suggested, here is how you can go about it.
1. You can write a customized InputFormat which creates different connections for the different data sources and returns splits from those data source tables. Internally, your customized InputFormat can use DBInputFormat for each data source where possible.

2. If the mapper input is not the same for the two data sources, you can write one mapper which internally delegates to the appropriate mapper based on the input split (you can refer to MultipleInputs for this). Note that MultipleInputs doesn't support DBInputFormat; it supports only input formats which use a file path as the input path.

If you explain your use case in more detail, I may be able to help you better.

Devaraj K

-----Original Message-----
From: Praveen Sripati [mailto:[email protected]]
Sent: Tuesday, December 06, 2011 4:11 PM
To: [email protected]
Subject: Re: Multiple Mappers for Multiple Tables

MultipleInputs takes multiple Paths (files) and not a DB as input. As mentioned earlier, export the tables into HDFS either using Sqoop or a native DB export tool and then do the processing. Sqoop is configured to use the native DB export tool whenever possible.

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent <[email protected]> wrote:

> Thanks Bejoy,
> I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
> Path parameter. Are these paths just ignored here?
>
> On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks <[email protected]> wrote:
>
> > Hi Justin,
> > Just to add on to my response. If you need to fetch data from an
> > rdbms in your mapper using your own custom mapreduce code, you can use
> > DBInputFormat in your mapper class with MultipleInputs. You have to be
> > careful with the number of mappers for your application, as dbs are
> > constrained by a limit on maximum simultaneous connections. You also need
> > to ensure that the same query is not executed n times in n mappers all
> > fetching the same data; that would just be a waste of network bandwidth.
> > Sqoop + Hive would be my recommendation and a good combination for such
> > use cases. If you have Pig competency you can also look into Pig instead
> > of Hive.
> >
> > Hope it helps!
> >
> > Regards
> > Bejoy.K.S
> >
> > On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks <[email protected]> wrote:
> >
> > > Justin
> > > If I get your requirement right, you need to get in data from
> > > multiple rdbms sources and do a join on the same, and maybe some more
> > > custom operations on top of that. For this you don't need to write
> > > custom mapreduce code unless it is really required. You can achieve
> > > the same in two easy steps:
> > > - Import the data from the RDBMSs into Hive using Sqoop (import)
> > > - Use Hive to do the join and any further processing on this data
> > >
> > > Hope it helps!
> > >
> > > Regards
> > > Bejoy.K.S
> > >
> > > On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent <[email protected]> wrote:
> > >
> > >> I would like to join some db tables, possibly from different
> > >> databases, in a MR job.
> > >>
> > >> I would essentially like to use MultipleInputs, but that seems file
> > >> oriented. I need a different mapper for each db table.
> > >>
> > >> Suggestions?
> > >>
> > >> Thanks!
> > >>
> > >> Justin Vincent
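P.S. To make point 2 concrete, below is a rough stand-alone sketch of the delegating-mapper idea. All names here (DelegatingMapperSketch, the "orders"/"customers" tags, the handler lambdas) are made up for illustration. In a real job the source tag would be carried by a custom InputSplit and the handlers would be actual Mapper subclasses; this is roughly what MultipleInputs does internally with its DelegatingMapper, just modeled with plain Java so it can run on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a single "delegating" mapper: one entry point that routes
// each record to table-specific handling, chosen by which source the
// input split came from. The source tag is modeled as a plain String.
public class DelegatingMapperSketch {

    // One handler per source table; in Hadoop these would be Mapper subclasses.
    static final Map<String, Function<String, String>> HANDLERS = new HashMap<>();
    static {
        HANDLERS.put("orders",    rec -> "order:" + rec.toUpperCase());
        HANDLERS.put("customers", rec -> "customer:" + rec.toLowerCase());
    }

    // Route a record to the handler registered for the split's source tag.
    static String map(String sourceTag, String record) {
        Function<String, String> h = HANDLERS.get(sourceTag);
        if (h == null) {
            throw new IllegalArgumentException("no mapper for source " + sourceTag);
        }
        return h.apply(record);
    }

    public static void main(String[] args) {
        System.out.println(map("orders", "a17"));     // order:A17
        System.out.println(map("customers", "BOB"));  // customer:bob
    }
}
```

The point is that the job sees one mapper class, while per-table logic stays in separate handlers keyed by the split's origin.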

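P.P.S. Bejoy's two-step Sqoop + Hive route would look roughly like the commands below. This is only a sketch; the host, database, and table names (dbhost, sales_db, orders, customers) and the join columns are made up for illustration, so adjust the JDBC URL, credentials, and query to your setup.

```shell
# Step 1: import each source table into Hive with Sqoop
sqoop import --connect jdbc:mysql://dbhost/sales_db \
  --username etl -P --table orders \
  --hive-import --hive-table orders

sqoop import --connect jdbc:mysql://dbhost/sales_db \
  --username etl -P --table customers \
  --hive-import --hive-table customers

# Step 2: do the join in Hive instead of hand-written MapReduce
hive -e "SELECT o.id, c.name FROM orders o JOIN customers c ON (o.customer_id = c.id);"
```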