Re: Hadoop job using multiple input files

2009-02-06 Thread Billy Pearson

If it were me, I would prefix the map output values with a: and n:
(a: for address and n: for name).
Then in the reduce you can test each value with if statements to see whether
it is the address or the name. There is no need to worry about which one comes
first; just make sure both have been set before you write output in the reduce.
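
The prefix test described above can be sketched in plain Python. This is only a local simulation of the reduce step, not Hadoop API code; the function names, the sample data, and the assumption of at most one name and one address per number are all illustrative.

```python
from collections import defaultdict

def reduce_join(values):
    """Reduce step: test each value's prefix, so arrival order does not matter."""
    name = None
    address = None
    for value in values:
        if value.startswith("n:"):
            name = value[2:]
        elif value.startswith("a:"):
            address = value[2:]
    # Only emit once both sides have been set (assumes one of each per key).
    if name is not None and address is not None:
        return (name, address)
    return None

def join_tagged_pairs(pairs):
    """Group (number, tagged value) pairs by number, as the shuffle would,
    then reduce each group."""
    groups = defaultdict(list)
    for number, value in pairs:
        groups[number].append(value)
    joined = {}
    for number, values in groups.items():
        result = reduce_join(values)
        if result is not None:
            joined[number] = result
    return joined
```

Because the reducer looks at the prefix rather than the position, a group that arrives as (address, name) gives the same result as one that arrives as (name, address). Numbers that appear in only one file produce no output.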


Billy

"Amandeep Khurana"  wrote in 
message news:35a22e220902061646m941a545o554b189ed5bdb...@mail.gmail.com...

Ok. I was able to get this to run but have a slight problem.

*File 1*
1   10
2   20
3   30
3   35
4   40
4   45
4   49
5   50

*File 2*

a   10   123
b   20   21321
c   45   2131
d   40   213

I want to join the above two based on the second column of file 1. Here's
what I am getting as the output.

*Output*
1   a   123
b   21321   2
3
3
4   d   213
c   2131   4
4
5

The lines "1   a   123" and "4   d   213" (red in the original message) are in
the format I want. The lines "b   21321   2" and "c   2131   4" (blue) have their
order reversed. How can I get them to be in the correct order too?
Basically, the order in which the iterator iterates over the values is not
consistent. How can I get this to be consistent?

Amandeep


Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Ok. I was able to get this to run but have a slight problem.

*File 1*
1   10
2   20
3   30
3   35
4   40
4   45
4   49
5   50

*File 2*

a   10   123
b   20   21321
c   45   2131
d   40   213

I want to join the above two based on the second column of file 1. Here's
what I am getting as the output.

*Output*
1   a   123
b   21321   2
3
3
4   d   213
c   2131   4
4
5

The lines "1   a   123" and "4   d   213" (red in the original message) are in
the format I want. The lines "b   21321   2" and "c   2131   4" (blue) have their
order reversed. How can I get them to be in the correct order too?
Basically, the order in which the iterator iterates over the values is not
consistent. How can I get this to be consistent?

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana  wrote:

> Ok. Got it.
>
> Now, how would my reducer know whether the name is coming first or the
> address? Is it going to be in the same order in the iterator as the files
> are read (alphabetically) in the mapper?


Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Ok. Got it.

Now, how would my reducer know whether the name is coming first or the
address? Is it going to be in the same order in the iterator as the files
are read (alphabetically) in the mapper?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher wrote:

> You put the files into a common directory, and use that as your input to
> the MapReduce job. You write a single Mapper class that has an "if"
> statement examining the map.input.file property, outputting "number" as
> the key for both files, but "address" as the value for one and "name" for
> the other. By using a common key ("number"), you'll ensure that the name
> and address make it to the same reducer after the shuffle. In the reducer,
> you'll then have the relevant information (in the values) you need to
> create the (name, address) pair.


Re: Hadoop job using multiple input files

2009-02-06 Thread Ian Soboroff
Amandeep Khurana  writes:

> Is it possible to write a map reduce job using multiple input files?
>
> For example:
> File 1 has data like - Name, Number
> File 2 has data like - Number, Address
>
> Using these, I want to create a third file which has something like - Name,
> Address
>
> How can a map reduce job be written to do this?

Have one map job read both files in sequence, and map them to (number,
name or address).  Then reduce on number.

Ian



Re: Hadoop job using multiple input files

2009-02-06 Thread Jeff Hammerbacher
You put the files into a common directory, and use that as your input to the
MapReduce job. You write a single Mapper class that has an "if" statement
examining the map.input.file property, outputting "number" as the key for
both files, but "address" as the value for one and "name" for the other. By
using a common key ("number"), you'll ensure that the name and address make
it to the same reducer after the shuffle. In the reducer, you'll then have the
relevant information (in the values) you need to create the (name, address)
pair.
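
The "if" statement on the input file can be sketched in plain Python. This is a minimal illustration rather than Hadoop Mapper code: the file names names.txt and addresses.txt, the comma-separated line layout, and the function name are assumptions, and in a real job the file name would come from the map.input.file property instead of a function argument.

```python
import os

def map_record(input_file, line):
    """Emit (number, value) for one input line, choosing the layout by source file.

    Assumes hypothetical file names: 'names.txt' holds "name,number" lines
    and 'addresses.txt' holds "number,address" lines.
    """
    base = os.path.basename(input_file)
    # Split on the first comma only, so an address may itself contain commas.
    fields = [field.strip() for field in line.split(",", 1)]
    if base == "names.txt":
        name, number = fields
        return number, ("name", name)
    if base == "addresses.txt":
        number, address = fields
        return number, ("address", address)
    raise ValueError("unexpected input file: " + input_file)
```

Both branches emit the number as the key, which is what sends the matching name and address to the same reducer. Labelling each value with its kind also lets the reducer tell the two apart without relying on arrival order.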

On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana  wrote:

> Thanks Jeff...
> I am not 100% clear about the first solution you have given. How do I get
> the multiple files to be read and then fed into a single reducer? Should I
> have multiple mappers in the same class and have different job configs for
> them, run two separate jobs with one outputting the key as (name, number) and
> the other outputting the value as (number, address) into the reducer?
> I'm not clear what I'll be doing with map.input.file here...


Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Thanks Jeff...
I am not 100% clear about the first solution you have given. How do I get
the multiple files to be read and then fed into a single reducer? Should I
have multiple mappers in the same class and have different job configs for
them, run two separate jobs with one outputting the key as (name, number) and
the other outputting the value as (number, address) into the reducer?
I'm not clear what I'll be doing with map.input.file here...

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher wrote:

> Hey Amandeep,
>
> You can get the file name for a task via the "map.input.file" property. For
> the join you're doing, you could inspect this property and output (number,
> name) and (number, address) as your (key, value) pairs, depending on the
> file you're working with. Then you can do the combination in your reducer.
>
> You could also check out the join package in contrib/utils (
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html),
> but I'd say your job is simple enough that you'll get it done faster with
> the above method.
>
> This task would be a simple join in Hive, so you could consider using Hive
> to manage the data and perform the join.
>
> Later,
> Jeff


Re: Hadoop job using multiple input files

2009-02-06 Thread Jeff Hammerbacher
Hey Amandeep,

You can get the file name for a task via the "map.input.file" property. For
the join you're doing, you could inspect this property and output (number,
name) and (number, address) as your (key, value) pairs, depending on the
file you're working with. Then you can do the combination in your reducer.

You could also check out the join package in contrib/utils (
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html),
but I'd say your job is simple enough that you'll get it done faster with
the above method.

This task would be a simple join in Hive, so you could consider using Hive
to manage the data and perform the join.

Later,
Jeff

On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana  wrote:

> Is it possible to write a map reduce job using multiple input files?
>
> For example:
> File 1 has data like - Name, Number
> File 2 has data like - Number, Address
>
> Using these, I want to create a third file which has something like - Name,
> Address
>
> How can a map reduce job be written to do this?
>
> Amandeep
>
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>


Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Is it possible to write a map reduce job using multiple input files?

For example:
File 1 has data like - Name, Number
File 2 has data like - Number, Address

Using these, I want to create a third file which has something like - Name,
Address

How can a map reduce job be written to do this?

Amandeep



Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: Multiple input files

2008-09-06 Thread Owen O'Malley
You can give a comma-separated list of files and directories to the
FileInputFormats, such as TextInputFormat. Directories are expanded
one level, so dir1 becomes dir1/*, but not dir1/*/*.
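
To illustrate what "expanded one level" means, here is a small Python sketch that mimics the rule on a local filesystem. It is not Hadoop code and does not touch HDFS; the function name is hypothetical, and skipping nested directories is a simplification of this sketch, not a statement about what Hadoop does when it meets one.

```python
import os

def expand_inputs(spec):
    """Expand a comma-separated list of files and directories into a file list.

    A directory contributes only the files directly inside it (dir1/*);
    nested directories are not descended into (no dir1/*/*).
    """
    files = []
    for path in spec.split(","):
        path = path.strip()
        if not path:
            continue
        if os.path.isdir(path):
            for entry in sorted(os.listdir(path)):
                child = os.path.join(path, entry)
                # Assumption of this sketch: nested directories are skipped.
                if os.path.isfile(child):
                    files.append(child)
        else:
            files.append(path)
    return files
```

Given "dir1,single.txt", the result lists the files directly under dir1 followed by single.txt, while a file such as dir1/sub/c.txt is left out.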


-- Owen




Re: Multiple input files

2008-09-06 Thread Ryan LeCompte
Hi Sayali,

Yes, you can submit a collection of files from HDFS as input to the
job. Please take a look at the WordCount example in the Map/Reduce
tutorial for an example:

http://hadoop.apache.org/core/docs/r0.18.0/mapred_tutorial.html#Example%3A+WordCount+v1.0

Ryan


On Sat, Sep 6, 2008 at 9:03 AM, Sayali Kulkarni wrote:
> Hello,
> When starting a Hadoop job, I need to specify an input file and an output
> file. Can I instead specify a list of input files?
> For example, I have the input distributed in the files:
> file000,
> file001,
> file002,
> file003,
> ...
> So can I specify the input files as file*? I can add all my files to HDFS.
>
> Thanks in advance!
> --Sayali
>
>


Multiple input files

2008-09-06 Thread Sayali Kulkarni
Hello,
When starting a Hadoop job, I need to specify an input file and an output file.
Can I instead specify a list of input files?
For example, I have the input distributed in the files:
file000,
file001,
file002,
file003,
...
So can I specify the input files as file*? I can add all my files to HDFS.

Thanks in advance!
--Sayali