Re: Multiple Input Paths

Tom White Sun, 08 Nov 2009 20:40:12 -0800

MultipleInputs is available from Hadoop 0.19 onwards (in
org.apache.hadoop.mapred.lib, or org.apache.hadoop.mapreduce.lib.input
for the new API in later versions).
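
In the new API the driver registers each input with
MultipleInputs.addInputPath(job, path, inputFormatClass, mapperClass), and
the mapper is then chosen by the path each split came from. For anyone on
an older release, here is a rough sketch of that delegating idea in plain
Java, with no Hadoop dependency. The class name DelegatingSketch, its
methods, and the /data/feedA/ and /data/feedB/ directories are hypothetical
names made up for illustration; this is not the Hadoop implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: pick per-record logic by the path the record came from.
// Not Hadoop code. In a real job the path would come from the input split.
class DelegatingSketch {
    // Registered input directory -> record handler (a stand-in for a Mapper class).
    private final Map<String, Function<String, String>> handlers = new LinkedHashMap<>();

    // Loosely mirrors the idea of registering one mapper per input path.
    void addInputPath(String dir, Function<String, String> handler) {
        handlers.put(dir, handler);
    }

    // Dispatch one record to the handler registered for its source directory.
    String map(String filePath, String record) {
        for (Map.Entry<String, Function<String, String>> e : handlers.entrySet()) {
            if (filePath.startsWith(e.getKey())) {
                return e.getValue().apply(record);
            }
        }
        throw new IllegalArgumentException("No handler registered for " + filePath);
    }

    public static void main(String[] args) {
        DelegatingSketch d = new DelegatingSketch();
        d.addInputPath("/data/feedA/", r -> "A:" + r);
        d.addInputPath("/data/feedB/", r -> "B:" + r);
        System.out.println(d.map("/data/feedA/part-0.xml", "<x/>"));
        System.out.println(d.map("/data/feedB/part-0.xml", "<y/>"));
    }
}
```

This is the same check as an if statement at the top of map(), just moved
into a lookup table so that each input keeps its own handler.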


Tom

On Wed, Nov 4, 2009 at 8:07 AM, Mark Vigeant
<mark.vige...@riskmetrics.com> wrote:
> Amogh,
>
> That sounds so awesome! Yeah I wish I had that class now. Do you have any 
> tips on how to create such a delegating class? The best I can come up with is 
> to just submit both files to the mapper using multiple input paths and then 
> have an if statement at the beginning of the map that checks which file it's 
> dealing with but I'm skeptical that I can even make that work... Is there a 
> way you know of that I could submit 2 mapper classes to the job?
>
> -----Original Message-----
> From: Amogh Vasekar [mailto:am...@yahoo-inc.com]
> Sent: Wednesday, November 04, 2009 1:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Multiple Input Paths
>
> Hi Mark,
> A future release of Hadoop will have a MultipleInputs class, akin to 
> MultipleOutputs. This would allow you to have a different input format and 
> mapper depending on the path you are getting the split from. It uses special 
> Delegating[mapper/input] classes to resolve this. I understand backporting 
> this is more or less out of the question, but the ideas there might provide 
> pointers to help you solve your current problem.
> Just a thought :)
>
> Amogh
>
>
> On 11/3/09 8:44 PM, "Mark Vigeant" <mark.vige...@riskmetrics.com> wrote:
>
> Hey Vipul
>
> No I haven't concatenated my files yet, and I was just thinking over how to 
> approach the issue of multiple input paths.
>
> I actually did what Amandeep hinted at: we wrote our own XMLInputFormat and 
> XMLRecordReader. When configuring the job in my driver I set 
> job.setInputFormatClass(XMLFileInputFormat.class), which sends chunks of XML 
> to the mapper as opposed to lines of text or whole files. I specified the 
> Line Delimiter in the XMLRecordReader (i.e. <startTag>), and everything 
> between the tags <startTag> and </startTag> is sent to the mapper. The map 
> function is where the data is parsed and written to the table.
>
> What I have to do now is just figure out how to set the Line Delimiter to be 
> something common in both XML files I'm reading. Currently I have 2 mapper 
> classes and thus 2 submitted jobs which is really inefficient and time 
> consuming.
>
> Make sense at all? Sorry if it doesn't, feel free to ask more questions
>
> Mark
>
> -----Original Message-----
> From: Vipul Sharma [mailto:sharmavi...@gmail.com]
> Sent: Monday, November 02, 2009 7:48 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Multiple Input Paths
>
> Mark,
>
> Were you able to concatenate both the XML files together? What did you do to
> keep the resulting XML well formed?
>
> Regards,
> Vipul Sharma,
> Cell: 281-217-0761
>
>
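
The record splitting Mark describes in the quoted thread above (each
<startTag> ... </startTag> span is sent to the mapper as one record) can be
sketched roughly as below. This is a simplified, in-memory illustration and
not Mark's XMLRecordReader: the class name XmlChunkSketch and the <rec> tag
are hypothetical, the tags are assumed to be kept in each chunk, and a real
record reader would read from a stream and handle split boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cut a string into records delimited by a start and end tag.
class XmlChunkSketch {
    // Returns every span that begins with startTag and ends with endTag, in order.
    static List<String> chunks(String text, String startTag, String endTag) {
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) {
                break; // no more records
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: dropped in this sketch
            }
            end += endTag.length();
            out.add(text.substring(start, end)); // tags included (an assumption)
            from = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<a><rec>1</rec>junk<rec>2</rec></a>";
        for (String chunk : chunks(xml, "<rec>", "</rec>")) {
            System.out.println(chunk);
        }
    }
}
```

Text outside the tags is skipped, so two files with different wrappers
could share one reader as long as they share the record tag.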
