MultipleInputs is available from Hadoop 0.19 onwards (in org.apache.hadoop.mapred.lib, or org.apache.hadoop.mapreduce.lib.input for the new API in later versions).
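To illustrate the delegating idea discussed in the quoted thread, here is a minimal sketch in plain Java with no Hadoop dependency. It only shows the principle of picking a mapper from the path a record came from. The class name DelegatingMapperSketch, the methods register and mapRecord, and the example paths are all hypothetical and are not part of Hadoop. In a real job the per-path registration is done with MultipleInputs.addInputPath, which takes the job configuration, an input path, an InputFormat class and a Mapper class, so no if statement inside map() should be needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only: no Hadoop classes are used here.
// All names in this sketch are hypothetical.
class DelegatingMapperSketch {
    // Associates an input path prefix with the "mapper" for records under it.
    private final Map<String, Function<String, String>> mappersByPath = new LinkedHashMap<>();

    // Registers one mapper per input path, similar in spirit to calling
    // MultipleInputs.addInputPath once per path in a Hadoop driver.
    void register(String pathPrefix, Function<String, String> mapper) {
        mappersByPath.put(pathPrefix, mapper);
    }

    // Delegates to the first registered mapper whose path prefix matches
    // the path the record was read from.
    String mapRecord(String sourcePath, String record) {
        for (Map.Entry<String, Function<String, String>> e : mappersByPath.entrySet()) {
            if (sourcePath.startsWith(e.getKey())) {
                return e.getValue().apply(record);
            }
        }
        throw new IllegalArgumentException("No mapper registered for " + sourcePath);
    }

    public static void main(String[] args) {
        DelegatingMapperSketch d = new DelegatingMapperSketch();
        // Hypothetical input directories, one per XML layout.
        d.register("/input/first", r -> "first:" + r);
        d.register("/input/second", r -> "second:" + r);
        System.out.println(d.mapRecord("/input/first/part-0.xml", "<startTag>a</startTag>"));
        System.out.println(d.mapRecord("/input/second/part-0.xml", "<startTag>b</startTag>"));
    }
}
```

With the real class, each of the two XML inputs would get its own addInputPath call with its own mapper, and both would run in a single job.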
Tom

On Wed, Nov 4, 2009 at 8:07 AM, Mark Vigeant <mark.vige...@riskmetrics.com> wrote:
> Amogh,
>
> That sounds so awesome! Yeah I wish I had that class now. Do you have any
> tips on how to create such a delegating class? The best I can come up with is
> to just submit both files to the mapper using multiple input paths and then
> having an if statement at the beginning of the map that checks which file it's
> dealing with but I'm skeptical that I can even make that work... Is there a
> way you know of that I could submit 2 mapper classes to the job?
>
> -----Original Message-----
> From: Amogh Vasekar [mailto:am...@yahoo-inc.com]
> Sent: Wednesday, November 04, 2009 1:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Multiple Input Paths
>
> Hi Mark,
> A future release of Hadoop will have a MultipleInputs class, akin to
> MultipleOutputs. This would allow you to have a different inputformat, mapper
> depending on the path you are getting the split from. It uses special
> Delegating[mapper/input] classes to resolve this. I understand backporting
> this is more or less out of question, but the ideas there might provide
> pointers to help you solve your current problem.
> Just a thought :)
>
> Amogh
>
>
> On 11/3/09 8:44 PM, "Mark Vigeant" <mark.vige...@riskmetrics.com> wrote:
>
> Hey Vipul
>
> No I haven't concatenated my files yet, and I was just thinking over how to
> approach the issue of multiple input paths.
>
> I actually did what Amandeep hinted at which was we wrote our own
> XMLInputFormat and XMLRecordReader. When configuring the job in my driver I
> set job.setInputFormatClass(XMLFileInputFormat.class) and what it does is
> send chunks of XML to the mapper as opposed to lines of text or whole files.
> So I specified the Line Delimiter in the XMLRecordReader (i.e. <startTag>) and
> everything in between the tags <startTag> and </startTag> is sent to the
> mapper. Inside the map function is where to parse the data and write it to
> the table.
>
> What I have to do now is just figure out how to set the Line Delimiter to be
> something common in both XML files I'm reading. Currently I have 2 mapper
> classes and thus 2 submitted jobs which is really inefficient and time
> consuming.
>
> Make sense at all? Sorry if it doesn't, feel free to ask more questions
>
> Mark
>
> -----Original Message-----
> From: Vipul Sharma [mailto:sharmavi...@gmail.com]
> Sent: Monday, November 02, 2009 7:48 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Multiple Input Paths
>
> Mark,
>
> were you able to concatenate both the xml files together? What did you do to
> keep the resulting xml well formed?
>
> Regards,
> Vipul Sharma,
> Cell: 281-217-0761