Re: Multiple Input Paths

2009-11-08 Thread Tom White
MultipleInputs is available from Hadoop 0.19 onwards (in
org.apache.hadoop.mapred.lib, or org.apache.hadoop.mapreduce.lib.input
for the new API in later versions).
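For anyone finding this thread later, a minimal old-API driver sketch using MultipleInputs might look like the following. The mapper class names and input paths are placeholders, and the reducer/output setup is elided; this is a sketch of the pattern, not a complete job.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class TwoFormatDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TwoFormatDriver.class);
        // Each input path gets its own InputFormat and Mapper;
        // FirstMapper and SecondMapper are hypothetical classes.
        MultipleInputs.addInputPath(conf, new Path(args[0]),
                TextInputFormat.class, FirstMapper.class);
        MultipleInputs.addInputPath(conf, new Path(args[1]),
                TextInputFormat.class, SecondMapper.class);
        // ...set reducer, output key/value types, output path,
        // then JobClient.runJob(conf);
    }
}
```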

Tom

On Wed, Nov 4, 2009 at 8:07 AM, Mark Vigeant
 wrote:
> Amogh,
>
> That sounds so awesome! Yeah, I wish I had that class now. Do you have any 
> tips on how to create such a delegating class? The best I can come up with is 
> to just submit both files to the mapper using multiple input paths and then 
> have an if statement at the beginning of the map that checks which file it's 
> dealing with, but I'm skeptical that I can even make that work... Is there a 
> way you know of that I could submit 2 mapper classes to the job?
>
> -Original Message-
> From: Amogh Vasekar [mailto:am...@yahoo-inc.com]
> Sent: Wednesday, November 04, 2009 1:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Multiple Input Paths
>
> Hi Mark,
> A future release of Hadoop will have a MultipleInputs class, akin to 
> MultipleOutputs. This would allow you to have a different InputFormat and 
> Mapper depending on the path the split comes from. It uses special 
> Delegating[Mapper/InputFormat] classes to resolve this. I understand 
> backporting this is more or less out of the question, but the ideas there 
> might provide pointers to help you solve your current problem.
> Just a thought :)
>
> Amogh
>
>
> On 11/3/09 8:44 PM, "Mark Vigeant"  wrote:
>
> Hey Vipul
>
> No I haven't concatenated my files yet, and I was just thinking over how to 
> approach the issue of multiple input paths.
>
> I actually did what Amandeep hinted at, which was to write our own 
> XMLInputFormat and XMLRecordReader. When configuring the job in my driver I 
> set job.setInputFormatClass(XMLFileInputFormat.class), and what it does is 
> send chunks of XML to the mapper as opposed to lines of text or whole files. 
> So I specified the line delimiter in the XMLRecordReader (i.e., the closing 
> tag of each record), and everything in between the opening and closing tags 
> is sent to the mapper. Inside the map function is where I parse the data and 
> write it to the table.
>
> What I have to do now is just figure out how to set the Line Delimiter to be 
> something common in both XML files I'm reading. Currently I have 2 mapper 
> classes and thus 2 submitted jobs, which is really inefficient and 
> time-consuming.
>
> Make sense at all? Sorry if it doesn't, feel free to ask more questions
>
> Mark
>
> -Original Message-
> From: Vipul Sharma [mailto:sharmavi...@gmail.com]
> Sent: Monday, November 02, 2009 7:48 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Multiple Input Paths
>
> Mark,
>
> were you able to concatenate both the XML files together? What did you do to
> keep the resulting XML well formed?
>
> Regards,
> Vipul Sharma,
> Cell: 281-217-0761
>
>


RE: Multiple Input Paths

2009-11-04 Thread Mark Vigeant
Amogh,

That sounds so awesome! Yeah, I wish I had that class now. Do you have any tips 
on how to create such a delegating class? The best I can come up with is to 
just submit both files to the mapper using multiple input paths and then have 
an if statement at the beginning of the map that checks which file it's dealing 
with, but I'm skeptical that I can even make that work... Is there a way you 
know of that I could submit 2 mapper classes to the job?

-Original Message-
From: Amogh Vasekar [mailto:am...@yahoo-inc.com] 
Sent: Wednesday, November 04, 2009 1:50 AM
To: common-user@hadoop.apache.org
Subject: Re: Multiple Input Paths

Hi Mark,
A future release of Hadoop will have a MultipleInputs class, akin to 
MultipleOutputs. This would allow you to have a different InputFormat and 
Mapper depending on the path the split comes from. It uses special 
Delegating[Mapper/InputFormat] classes to resolve this. I understand 
backporting this is more or less out of the question, but the ideas there 
might provide pointers to help you solve your current problem.
Just a thought :)

Amogh


On 11/3/09 8:44 PM, "Mark Vigeant"  wrote:

Hey Vipul

No I haven't concatenated my files yet, and I was just thinking over how to 
approach the issue of multiple input paths.

I actually did what Amandeep hinted at, which was to write our own 
XMLInputFormat and XMLRecordReader. When configuring the job in my driver I 
set job.setInputFormatClass(XMLFileInputFormat.class), and what it does is 
send chunks of XML to the mapper as opposed to lines of text or whole files. 
So I specified the line delimiter in the XMLRecordReader (i.e., the closing 
tag of each record), and everything in between the opening and closing tags is 
sent to the mapper. Inside the map function is where I parse the data and 
write it to the table.

What I have to do now is just figure out how to set the Line Delimiter to be 
something common in both XML files I'm reading. Currently I have 2 mapper 
classes and thus 2 submitted jobs, which is really inefficient and 
time-consuming.

Make sense at all? Sorry if it doesn't, feel free to ask more questions

Mark

-Original Message-
From: Vipul Sharma [mailto:sharmavi...@gmail.com]
Sent: Monday, November 02, 2009 7:48 PM
To: common-user@hadoop.apache.org
Subject: RE: Multiple Input Paths

Mark,

were you able to concatenate both the XML files together? What did you do to
keep the resulting XML well formed?

Regards,
Vipul Sharma,
Cell: 281-217-0761



Re: Multiple Input Paths

2009-11-03 Thread Amogh Vasekar
Hi Mark,
A future release of Hadoop will have a MultipleInputs class, akin to 
MultipleOutputs. This would allow you to have a different InputFormat and 
Mapper depending on the path the split comes from. It uses special 
Delegating[Mapper/InputFormat] classes to resolve this. I understand 
backporting this is more or less out of the question, but the ideas there 
might provide pointers to help you solve your current problem.
Just a thought :)

Amogh


On 11/3/09 8:44 PM, "Mark Vigeant"  wrote:

Hey Vipul

No I haven't concatenated my files yet, and I was just thinking over how to 
approach the issue of multiple input paths.

I actually did what Amandeep hinted at, which was to write our own 
XMLInputFormat and XMLRecordReader. When configuring the job in my driver I 
set job.setInputFormatClass(XMLFileInputFormat.class), and what it does is 
send chunks of XML to the mapper as opposed to lines of text or whole files. 
So I specified the line delimiter in the XMLRecordReader (i.e., the closing 
tag of each record), and everything in between the opening and closing tags is 
sent to the mapper. Inside the map function is where I parse the data and 
write it to the table.

What I have to do now is just figure out how to set the Line Delimiter to be 
something common in both XML files I'm reading. Currently I have 2 mapper 
classes and thus 2 submitted jobs, which is really inefficient and 
time-consuming.

Make sense at all? Sorry if it doesn't, feel free to ask more questions

Mark

-Original Message-
From: Vipul Sharma [mailto:sharmavi...@gmail.com]
Sent: Monday, November 02, 2009 7:48 PM
To: common-user@hadoop.apache.org
Subject: RE: Multiple Input Paths

Mark,

were you able to concatenate both the XML files together? What did you do to
keep the resulting XML well formed?

Regards,
Vipul Sharma,
Cell: 281-217-0761



RE: Multiple Input Paths

2009-11-03 Thread vipul sharma
Mark,

thanks for the pointer. So as far as I understand, you are not using Hadoop's
default split but your own split of one record, as specified by everything
between the start tag and the end tag in your XML? So in a way you have one
map per record? In my case this will not be efficient, since my XML files are
small. What I would want to do is have a split that includes multiple files,
so that I can use one map for around 64 MB of data, and do the parsing inside
map. I will update you once it makes more sense to even me.
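For packing many small files into roughly 64 MB splits, one candidate (from Hadoop 0.20 onwards) is CombineFileInputFormat. A heavily hedged sketch of a subclass is below; SmallXmlReader is a hypothetical per-file RecordReader that would need to be written, and the exact constructor shapes should be checked against your Hadoop version's javadoc.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class SmallXmlInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

    public SmallXmlInputFormat() {
        // Target roughly 64 MB of small files per split.
        setMaxSplitSize(64L * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf conf, Reporter reporter)
            throws IOException {
        // CombineFileRecordReader runs one per-file reader over each file
        // in the combined split; SmallXmlReader is a hypothetical class.
        return new CombineFileRecordReader<LongWritable, Text>(
                conf, (CombineFileSplit) split, reporter, SmallXmlReader.class);
    }
}
```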

-- 
Vipul Sharma
sharmavipul AT gmail DOT com


RE: Multiple Input Paths

2009-11-03 Thread Mark Vigeant
Hey Vipul

No I haven't concatenated my files yet, and I was just thinking over how to 
approach the issue of multiple input paths.

I actually did what Amandeep hinted at, which was to write our own 
XMLInputFormat and XMLRecordReader. When configuring the job in my driver I 
set job.setInputFormatClass(XMLFileInputFormat.class), and what it does is 
send chunks of XML to the mapper as opposed to lines of text or whole files. 
So I specified the line delimiter in the XMLRecordReader (i.e., the closing 
tag of each record), and everything in between the opening and closing tags is 
sent to the mapper. Inside the map function is where I parse the data and 
write it to the table.

What I have to do now is just figure out how to set the Line Delimiter to be 
something common in both XML files I'm reading. Currently I have 2 mapper 
classes and thus 2 submitted jobs, which is really inefficient and 
time-consuming.
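A driver for the setup described above might be sketched as follows. The configuration keys xmlinput.start/xmlinput.end, the tag names, and XmlJobDriver are all hypothetical; they mirror the description rather than any class shipped with Hadoop, and the custom XMLFileInputFormat would have to read those keys itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class XmlJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical keys the custom XMLRecordReader would read to find
        // record boundaries; the tag names are placeholders.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");
        Job job = new Job(conf, "xml-to-table");
        job.setInputFormatClass(XMLFileInputFormat.class); // custom class
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ...mapper, output settings, then job.waitForCompletion(true);
    }
}
```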

Make sense at all? Sorry if it doesn't, feel free to ask more questions

Mark

-Original Message-
From: Vipul Sharma [mailto:sharmavi...@gmail.com] 
Sent: Monday, November 02, 2009 7:48 PM
To: common-user@hadoop.apache.org
Subject: RE: Multiple Input Paths

Mark,

were you able to concatenate both the XML files together? What did you do to
keep the resulting XML well formed?

Regards,
Vipul Sharma,
Cell: 281-217-0761


RE: Multiple Input Paths

2009-11-02 Thread Vipul Sharma
Mark,

were you able to concatenate both the XML files together? What did you do to
keep the resulting XML well formed?

Regards,
Vipul Sharma,
Cell: 281-217-0761


RE: Multiple Input Paths

2009-11-02 Thread Mark Vigeant
Ok, thank you very much Amogh, I will redesign my program.

-Original Message-
From: Amogh Vasekar [mailto:am...@yahoo-inc.com] 
Sent: Monday, November 02, 2009 11:45 AM
To: common-user@hadoop.apache.org
Subject: Re: Multiple Input Paths

Mark,
Set-up for a mapred job consumes a considerable amount of time and resources, 
so, if possible, a single job is preferred.
You can add multiple paths to your job, and if you need different processing 
logic depending on the input being consumed, you can use the parameter 
map.input.file in your mapper to decide.

Amogh


On 11/2/09 8:53 PM, "Mark Vigeant"  wrote:

Hey, quick question:

I'm writing a program that parses data from 2 different files and puts the data 
into a table. Currently I have 2 different map functions and so I submit 2 
separate jobs to the job client. Would it be more efficient to add both paths 
to the same mapper and only submit one job? Thanks a lot!

Mark Vigeant
RiskMetrics Group, Inc.



Re: Multiple Input Paths

2009-11-02 Thread Amogh Vasekar
Mark,
Set-up for a mapred job consumes a considerable amount of time and resources, 
so, if possible, a single job is preferred.
You can add multiple paths to your job, and if you need different processing 
logic depending on the input being consumed, you can use the parameter 
map.input.file in your mapper to decide.
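A rough old-API sketch of that suggestion is below; the path substring "format-a" and the class name are placeholders, and the actual parsing logic is elided.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DualSourceMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private boolean isFormatA;

    @Override
    public void configure(JobConf job) {
        // map.input.file holds the path of the file this split came from.
        String path = job.get("map.input.file", "");
        isFormatA = path.contains("format-a"); // placeholder substring
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        if (isFormatA) {
            // ...parse records from the first file format...
        } else {
            // ...parse records from the second file format...
        }
    }
}
```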

Amogh


On 11/2/09 8:53 PM, "Mark Vigeant"  wrote:

Hey, quick question:

I'm writing a program that parses data from 2 different files and puts the data 
into a table. Currently I have 2 different map functions and so I submit 2 
separate jobs to the job client. Would it be more efficient to add both paths 
to the same mapper and only submit one job? Thanks a lot!

Mark Vigeant
RiskMetrics Group, Inc.



RE: Multiple Input Paths

2009-11-02 Thread Mark Vigeant
Yes, the structure is similar. They're both XML log files documenting the same 
set of data, just in different ways.

That's a really cool idea though, to combine them. How exactly would I go about 
doing that?

-Original Message-
From: L [mailto:archit...@galatea.com] 
Sent: Monday, November 02, 2009 10:27 AM
To: common-user@hadoop.apache.org
Subject: Re: Multiple Input Paths

Mark,

Is the structure of both files the same? It makes even more sense to 
combine the files, if you can, as I have seen a considerable speed up 
when I've done that (at least when I've had small files to deal with).

Lajos


Mark Vigeant wrote:
> Hey, quick question:
> 
> I'm writing a program that parses data from 2 different files and puts the 
> data into a table. Currently I have 2 different map functions and so I submit 
> 2 separate jobs to the job client. Would it be more efficient to add both 
> paths to the same mapper and only submit one job? Thanks a lot!
> 
> Mark Vigeant
> RiskMetrics Group, Inc.
> 


Re: Multiple Input Paths

2009-11-02 Thread L

Mark,

Is the structure of both files the same? It makes even more sense to 
combine the files, if you can, as I have seen a considerable speed up 
when I've done that (at least when I've had small files to deal with).


Lajos


Mark Vigeant wrote:

Hey, quick question:

I'm writing a program that parses data from 2 different files and puts the data 
into a table. Currently I have 2 different map functions and so I submit 2 
separate jobs to the job client. Would it be more efficient to add both paths 
to the same mapper and only submit one job? Thanks a lot!

Mark Vigeant
RiskMetrics Group, Inc.






Multiple Input Paths

2009-11-02 Thread Mark Vigeant
Hey, quick question:

I'm writing a program that parses data from 2 different files and puts the data 
into a table. Currently I have 2 different map functions and so I submit 2 
separate jobs to the job client. Would it be more efficient to add both paths 
to the same mapper and only submit one job? Thanks a lot!

Mark Vigeant
RiskMetrics Group, Inc.