How to get jobconf variables in streaming's mapper/reducer?

2009-05-15 Thread Steve Gao
I am using streaming with Perl, and I want to get jobconf variable values. Many 
tutorials say they are available in the environment, but I cannot get them.

For example, in the reducer:
while (<STDIN>) {
  my $part = $ENV{"mapred.task.partition"};
  print("$part\n");
}

It turns out that $ENV{"mapred.task.partition"} is not defined.
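
(If streaming's documented habit of replacing "." with "_" in environment variable 
names applies here, the value may be available under mapred_task_partition instead, 
which would also explain why the dot-free name "arg" below works. A minimal Perl 
sketch under that assumption, not verified:)

while (<STDIN>) {
  # Hedged sketch: jobconf values are said to reach the task environment with
  # non-alphanumeric characters (such as ".") replaced by "_".
  my $part = $ENV{"mapred_task_partition"};
  print("$part\n");
}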

However, I can get the value of a variable I defined myself. For example:

 $HADOOP_HOME/bin/hadoop \
 jar $HADOOP_HOME/hadoop-streaming.jar \
     -input file1 \
     -output myOutputDir \
     -mapper mapper \
     -reducer reducer \
     -jobconf arg=test

In the reducer:

while (<STDIN>) {
  my $part2 = $ENV{"arg"};
  print("$part2\n");
}


It works.

Does anybody know why that is? How do I get jobconf variables in streaming? Thanks 
a lot!

Re: [Interesting] One reducer randomly hangs on getting 0 mapper output

2009-04-10 Thread Steve Gao
Does anybody have a clue? Thanks a lot.

--- On Thu, 4/9/09, Steve Gao  wrote:

From: Steve Gao 
Subject: [Interesting] One reducer randomly hangs on getting 0 mapper output
To: core-user@hadoop.apache.org
Date: Thursday, April 9, 2009, 6:04 PM

Re: [Interesting] One reducer randomly hangs on getting 0 mapper output

2009-04-09 Thread Steve Gao
I am using 0.17.0.
I think the problem is basically that the reducer falls into an infinite loop trying 
to fetch mapper output when the mapper is somehow unavailable or dead. Doesn't Hadoop 
have a solution for this?

--- On Thu, 4/9/09, Steve Gao  wrote:

From: Steve Gao 
Subject: [Interesting] One reducer randomly hangs on getting 0 mapper output
To: core-user@hadoop.apache.org
Date: Thursday, April 9, 2009, 6:04 PM

[Interesting] One reducer randomly hangs on getting 0 mapper output

2009-04-09 Thread Steve Gao

I have Hadoop jobs where the last reducer randomly hangs while getting 0 mapper 
output. By "randomly" I mean the job sometimes works correctly; other times the last 
reducer keeps trying to read map output but always gets 0 data. It can hang for up to 
100 hours getting 0 data until I kill it. After I kill it and re-run, it may run 
correctly. The hung reducer can happen on any machine in my cluster.

I attach the tail of the problematic reducer's log here. Does anybody have a hint 
about what happened?

syslog logs

2009-04-09 21:57:46,445 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0 Need 15 map output(s)
2009-04-09 21:57:46,446 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0: Got 0 new map-outputs & 0 obsolete 
map-outputs from tasktracker and 0 map-outputs from previous failures
2009-04-09 21:57:46,446 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0 Got 0 known map output location(s); 
scheduling...
2009-04-09 21:57:46,446 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0 Scheduled 0 of 0 known outputs (0 slow hosts 
and 0 dup hosts)

2009-04-09 21:57:51,453 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0 Need 15 map output(s)
2009-04-09 21:57:51,460 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0: Got 0 new map-outputs & 0 obsolete 
map-outputs from tasktracker and 0 map-outputs from previous failures
2009-04-09 21:57:51,460 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0 Got 0 known map output location(s); 
scheduling...
2009-04-09 21:57:51,460 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200902022141_50382_r_08_0 Scheduled 0 of 0 known outputs (0 slow hosts 
and 0 dup hosts)


... (forever)

Re: Does HDFS provide a way to append A file to B ?

2009-03-17 Thread Steve Gao
Thanks, Bryan. Does 0.18.3 have a built-in "append" command?

--- On Tue, 3/17/09, Bryan Duxbury  wrote:
From: Bryan Duxbury 
Subject: Re: Does HDFS provide a way to append A file to B ?
To: core-user@hadoop.apache.org
Date: Tuesday, March 17, 2009, 8:04 PM

I believe the last word on appends right now is that the patch that was
committed broke a lot of other things, so it's been disabled. As such, there
is no working append in HDFS, and certainly not in hadoop-17.x.

-Bryan

On Mar 17, 2009, at 4:50 PM, Steve Gao wrote:

> Thanks, but I was told there is an append command, isn't there? But I
> don't know how to apply this patch
> https://issues.apache.org/jira/browse/HADOOP-1700
> 
> --- On Tue, 3/17/09, Bo Shi  wrote:
> From: Bo Shi 
> Subject: Re: Does HDFS provide a way to append A file to B ?
> To: core-user@hadoop.apache.org
> Date: Tuesday, March 17, 2009, 7:42 PM
> 
> what about an identity mapper taking A and B as inputs?  this will
> likely mix rows of A and B together though...
> 
> On Tue, Mar 17, 2009 at 7:35 PM, Steve Gao  wrote:
>> BTW, I am using hadoop 0.17.0 and jdk 1.6
>> 
>> --- On Tue, 3/17/09, Steve Gao  wrote:
>> From: Steve Gao 
>> Subject: Does HDFS provide a way to append A file to B ?
>> To: core-user@hadoop.apache.org
>> Date: Tuesday, March 17, 2009, 7:22 PM
>> 
>> I need to append file A to file B in HDFS without downloading/uploading
>> them to local disk. Is there a way?

Re: How to apply a patch to my hadoop?

2009-03-17 Thread Steve Gao
Thank you, Ravi and Amandeep.
Can I apply a patch on the fly, or do I need to shut down HDFS & the job tracker 
beforehand?


--- On Tue, 3/17/09, Ravi Phulari  wrote:
From: Ravi Phulari 
Subject: Re: How to apply a patch to my hadoop?
To: "core-user@hadoop.apache.org" , 
"steve@yahoo.com" 
Date: Tuesday, March 17, 2009, 7:52 PM

Hello Steve.

Assuming you are using *nix.

To apply the patch:
patch -p0 -E < HADOOP-X.patch

To remove the patch:
patch -p0 --reverse -E < HADOOP-X.patch


Hope this helps.

Regards,
Ravi



On 3/17/09 4:48 PM, "Steve Gao"  wrote:

I want to apply this patch https://issues.apache.org/jira/browse/HADOOP-1700
to my Hadoop 0.17.0.

Would anybody tell me how to do it? Thanks!


Re: Does HDFS provide a way to append A file to B ?

2009-03-17 Thread Steve Gao
Thanks, but I was told there is an append command, isn't there? But I don't 
know how to apply this patch https://issues.apache.org/jira/browse/HADOOP-1700 

--- On Tue, 3/17/09, Bo Shi  wrote:
From: Bo Shi 
Subject: Re: Does HDFS provide a way to append A file to B ?
To: core-user@hadoop.apache.org
Date: Tuesday, March 17, 2009, 7:42 PM

what about an identity mapper taking A and B as inputs?  this will
likely mix rows of A and B together though...

On Tue, Mar 17, 2009 at 7:35 PM, Steve Gao  wrote:
> BTW, I am using hadoop 0.17.0 and jdk 1.6
>
> --- On Tue, 3/17/09, Steve Gao  wrote:
> From: Steve Gao 
> Subject: Does HDFS provide a way to append A file to B ?
> To: core-user@hadoop.apache.org
> Date: Tuesday, March 17, 2009, 7:22 PM
>
> I need to append file A to file B in HDFS without downloading/uploading
> them to local disk. Is there a way?

How to apply a patch to my hadoop?

2009-03-17 Thread Steve Gao
I want to apply this patch https://issues.apache.org/jira/browse/HADOOP-1700
to my Hadoop 0.17.0.

Would anybody tell me how to do it? Thanks!


Re: Does HDFS provide a way to append A file to B ?

2009-03-17 Thread Steve Gao
BTW, I am using hadoop 0.17.0 and jdk 1.6

--- On Tue, 3/17/09, Steve Gao  wrote:
From: Steve Gao 
Subject: Does HDFS provide a way to append A file to B ?
To: core-user@hadoop.apache.org
Date: Tuesday, March 17, 2009, 7:22 PM

I need to append file A to file B in HDFS without downloading/uploading them to
local disk. Is there a way?


Does HDFS provide a way to append A file to B ?

2009-03-17 Thread Steve Gao
I need to append file A to file B in HDFS without downloading/uploading them to 
local disk. Is there a way?


RE: Is there a way to know the input filename at Hadoop Streaming?

2008-10-23 Thread Steve Gao
Thanks, Amogh. But my case is slightly different. The command-line inputs are 2 
files: file1 and file2. I need to tell, in the mapper, which file the current line 
came from:
#In mapper
while (<STDIN>) {
  # how to tell whether the current line is from file1 or file2?
}

The map.input.file jobconf param does not seem to help in this case, 
because file1 and file2 are both inputs.
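
(For reference, a minimal sketch of what checking the per-split variable might look 
like in the mapper, assuming streaming exports map.input.file to the environment as 
map_input_file, with dots replaced by underscores; each map task would then see the 
path of the file its current split came from:)

#In mapper: hedged sketch, assuming map_input_file is set per input split
while (<STDIN>) {
  my $file = $ENV{"map_input_file"};   # e.g. an HDFS path ending in file1 or file2
  if ($file =~ /file1$/) {
    # this line came from file1
  } else {
    # this line came from file2
  }
}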

-Steve

--- On Thu, 10/23/08, Amogh Vasekar <[EMAIL PROTECTED]> wrote:
From: Amogh Vasekar <[EMAIL PROTECTED]>
Subject: RE: Is there a way to know the input filename at Hadoop Streaming?
To: [EMAIL PROTECTED]
Date: Thursday, October 23, 2008, 12:11 AM

Personally I haven't worked with streaming, but I guess your jobconf's
map.input.file param should do it for you.
-Original Message-----
From: Steve Gao [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 23, 2008 7:26 AM
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Subject: Is there a way to know the input filename at Hadoop Streaming?

I am using Hadoop Streaming. The input are multiple files.
Is there a way to get the current filename in mapper?

For example:
$HADOOP_HOME/bin/hadoop  \
jar $HADOOP_HOME/hadoop-streaming.jar \
-input file1 \
-input file2 \
-output myOutputDir \
-mapper mapper \
-reducer reducer

In mapper:
while (<STDIN>) {
  # how to tell whether the current line is from file1 or file2?
}


[Help needed] Is there a way to know the input filename at Hadoop Streaming?

2008-10-23 Thread Steve Gao
Sorry for the email. Thanks for any help or hint.

    I am using Hadoop Streaming. The input are multiple files.
    Is there a way to get the current filename in mapper?

    For example:
    $HADOOP_HOME/bin/hadoop  \
    jar $HADOOP_HOME/hadoop-streaming.jar \
    -input file1 \
    -input file2 \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer

    In mapper:
    while (<STDIN>) {
      # how to tell whether the current line is from file1 or file2?
    }


Is there a way to know the input filename at Hadoop Streaming?

2008-10-22 Thread Steve Gao
I am using Hadoop Streaming. The input are multiple files.
Is there a way to get the current filename in mapper?

For example:
$HADOOP_HOME/bin/hadoop  \
jar $HADOOP_HOME/hadoop-streaming.jar \
-input file1 \
-input file2 \
-output myOutputDir \
-mapper mapper \
-reducer reducer

In mapper:
while (<STDIN>) {
  # how to tell whether the current line is from file1 or file2?
}


Help: How to change number of mappers in Hadoop streaming?

2008-10-16 Thread Steve Gao

Would anybody help me?
Can I use -jobconf mapred.map.tasks=50 in the streaming command to change the job's 
number of mappers?

I don't have a Hadoop cluster at hand and cannot verify it. Thanks for your help.

--- On Wed, 10/15/08, Steve Gao <[EMAIL PROTECTED]> wrote:
From: Steve Gao <[EMAIL PROTECTED]>
Subject: How to change number of mappers in Hadoop streaming?
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Date: Wednesday, October 15, 2008, 7:25 PM

Is there a way to change number of mappers in Hadoop streaming command line?
I know I can change hadoop-default.xml:


<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>


But that's for all jobs. What if I just want each job to have its own number of 
mappers? Thanks


How to change number of mappers in Hadoop streaming?

2008-10-15 Thread Steve Gao
Is there a way to change number of mappers in Hadoop streaming command line?
I know I can change hadoop-default.xml:


<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>


But that's for all jobs. What if I just want each job to have its own number of 
mappers? Thanks


Re: Hadoop User Group (Bay Area) Oct 15th

2008-10-15 Thread Steve Gao
I am excited to see the slides. Would you send me a copy? Thanks.

--- On Wed, 10/15/08, Nishant Khurana <[EMAIL PROTECTED]> wrote:
From: Nishant Khurana <[EMAIL PROTECTED]>
Subject: Re: Hadoop User Group (Bay Area) Oct 15th
To: core-user@hadoop.apache.org
Date: Wednesday, October 15, 2008, 9:45 AM

I would love to see the slides too. I am specially interested in
implementing database joins with Map Reduce.

On Wed, Oct 15, 2008 at 7:24 AM, Johan Oskarsson <[EMAIL PROTECTED]>
wrote:

> Since I'm not based in San Francisco I would love to see the slides
> from this meetup uploaded somewhere. Especially the database join
> techniques talk sounds very interesting to me.
>
> /Johan
>
> Ajay Anand wrote:
> > The next Bay Area User Group meeting is scheduled for October 15th at
> > Yahoo! 2821 Mission College Blvd, Santa Clara, Building 1, Training
> > Rooms 3 & 4 from 6:00-7:30 pm.
> >
> > Agenda:
> > - Exploiting database join techniques for analytics with Hadoop: Jun
> > Rao, IBM
> > - Jaql Update: Kevin Beyer, IBM
> > - Experiences moving a Petabyte Data Center: Sriram Rao, Quantcast
> >
> > Look forward to seeing you there!
> > Ajay
>
>


-- 
Nishant Khurana
Candidate for Masters in Engineering (Dec 2009)
Computer and Information Science
School of Engineering and Applied Science
University of Pennsylvania


Are There Books of Hadoop/Pig?

2008-10-14 Thread Steve Gao
Does anybody know if there are books about Hadoop or Pig? The wiki and manual are 
kind of ad hoc and hard to comprehend; for example, "I want to know how to apply 
patches to my Hadoop, but can't find out how to do it," that kind of thing.

Would anybody help? Thanks.


Re: How to concatenate hadoop files to a single hadoop file

2008-10-02 Thread Steve Gao
Does anybody know? Thanks a lot.

--- On Thu, 10/2/08, Steve Gao <[EMAIL PROTECTED]> wrote:
From: Steve Gao <[EMAIL PROTECTED]>
Subject: How to concatenate hadoop files to a single hadoop file
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Date: Thursday, October 2, 2008, 3:17 PM

Suppose I have 3 files in Hadoop that I want to "cat" into a single file. I know it 
can be done with "hadoop dfs -cat" to a local file and then uploading it back to 
Hadoop, but that's very expensive for large files. Is there an internal way to do 
this within Hadoop itself? Thanks


How to concatenate hadoop files to a single hadoop file

2008-10-02 Thread Steve Gao
Suppose I have 3 files in Hadoop that I want to "cat" into a single file. I know it 
can be done with "hadoop dfs -cat" to a local file and then uploading it back to 
Hadoop, but that's very expensive for large files. Is there an internal way to do 
this within Hadoop itself? Thanks


Is there a way to pause a running hadoop job?

2008-10-01 Thread Steve Gao
I have 5 running jobs, each with 2 reducers. Because I set the maximum number of 
reducers to 10, any incoming job will be held until some of the 5 jobs finish and 
release reducer quota.

Now the problem is that an incoming job has a higher priority, so I want to pause 
some of the 5 jobs, let the new job finish, and then resume the old ones.

Is this doable in Hadoop? Thanks!


Re: [Streaming] How to pass arguments to a map/reduce script

2008-08-21 Thread Steve Gao
Unfortunately this does not work. Hadoop complains:
08/08/21 18:04:46 ERROR streaming.StreamJob: Unexpected arg1 while processing 
-input|-output|-mapper|-combiner|-reducer|-file|-dfs|-jt|-additionalconfspec|-inputformat|-outputformat|-partitioner|-numReduceTasks|-inputreader|-mapdebug|-reducedebug|||-cacheFile|-cacheArchive|-verbose|-info|-debug|-inputtagged|-help

--- On Thu, 8/21/08, Yuri Pradkin <[EMAIL PROTECTED]> wrote:
From: Yuri Pradkin <[EMAIL PROTECTED]>
Subject: Re: [Streaming] How to pass arguments to a map/reduce script
To: core-user@hadoop.apache.org
Cc: "Gopal Gandhi" <[EMAIL PROTECTED]>
Date: Thursday, August 21, 2008, 1:43 PM

On Thursday 21 August 2008 00:14:56 Gopal Gandhi wrote:
> I am using Hadoop streaming and I need to pass arguments to my map/reduce
> script. Because a map/reduce script is triggered by hadoop, like hadoop
>   -file MAPPER -mapper "$MAPPER" -file REDUCER -reducer "$REDUCER" ...
> How can I pass arguments to MAPPER?
>
> I tried -cmdenv name=val , but it does not work.
> Anybody can help me? Thanks lot.

I think you can simply do:
  -file MAPPER -mapper "$MAPPER arg1 arg2" -file REDUCER -reducer "$REDUCER" ...
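
(If the quoting issue reported above can be resolved and the extra tokens do reach 
the script, a Perl mapper would presumably see them as ordinary command-line 
arguments; a minimal sketch under that assumption, with arg1/arg2 as placeholder 
names:)

# Hedged sketch: read the extra arguments from @ARGV, then process STDIN as usual.
my ($arg1, $arg2) = @ARGV;
while (<STDIN>) {
  chomp;
  print "$arg1\t$_\n";   # use the arguments however the job needs
}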


Re: [Streaming] How to pass arguments to a map/reduce script

2008-08-21 Thread Steve Gao
That's interesting. Suppose the mapper script is a Perl script; how do you assign 
the value of "my.mapper.arg1" to a variable $x?
$x = $my.mapper.arg1
I just tried that, and my Perl script does not recognize $my.mapper.arg1.
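
(Assuming the scheme described below, where -jobconf my.mapper.arg1="foobar" surfaces 
in the task environment as my_mapper_arg1, the dotted name is not a Perl variable; 
the value would be read from %ENV instead. A minimal sketch under that assumption:)

# Hedged sketch: look the value up in the environment, with "." replaced by "_".
my $x = $ENV{"my_mapper_arg1"};
print "$x\n";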

--- On Thu, 8/21/08, Rong-en Fan <[EMAIL PROTECTED]> wrote:
From: Rong-en Fan <[EMAIL PROTECTED]>
Subject: Re: [Streaming] How to pass arguments to a map/reduce script
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Date: Thursday, August 21, 2008, 11:09 AM

On Thu, Aug 21, 2008 at 3:14 PM, Gopal Gandhi <[EMAIL PROTECTED]> wrote:
> I am using Hadoop streaming and I need to pass arguments to my map/reduce
> script. Because a map/reduce script is triggered by hadoop, like
> hadoop   -file MAPPER -mapper "$MAPPER" -file REDUCER -reducer "$REDUCER" ...
> How can I pass arguments to MAPPER?
>
> I tried -cmdenv name=val , but it does not work.
> Anybody can help me? Thanks lot.

I use -jobconf, for example

hadoop ... -jobconf my.mapper.arg1="foobar"

and in the map script, I get this by reading the environment variable

my_mapper_arg1

Hope this helps,
Rong-En Fan

Re: [Streaming]What is the difference between streaming options: -file and -CacheFile ?

2008-07-18 Thread Steve Gao
One more little question: why is Hadoop streaming designed this way, with 2 different 
options that do the same thing (i.e. control the number of reducers)? What's the 
point?
Thanks

--- On Fri, 7/18/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:
From: Arun C Murthy <[EMAIL PROTECTED]>
Subject: Re: [Streaming]What is the difference between streaming options: -file 
and -CacheFile ?
To: core-user@hadoop.apache.org, "Steve Gao" <[EMAIL PROTECTED]>
Date: Friday, July 18, 2008, 8:27 PM

On Jul 18, 2008, at 4:53 PM, Steve Gao wrote:

> Hi All,
> I am using Hadoop Streaming. I am confused by streaming  
> options: -file and -CacheFile. Seems that they mean the same thing,  
> right?
>

The difference is that -file will 'ship' your file (local file) to  
the cluster, while -cachefile assumes that it is already present on  
HDFS at the given path.

> Another misleading options are : -NumReduceTasks and -jobconf  
> mapred.reduce.tasks. Both are used to control (or give hit to) the  
> number of reducers.
>

Yes, they are both equivalent.

hth,
Arun


[Streaming]What is the difference between streaming options: -file and -CacheFile ?

2008-07-18 Thread Steve Gao
Hi All,  
    I am using Hadoop Streaming. I am confused by the streaming options -file and 
-CacheFile. They seem to mean the same thing, right?

    Another confusing pair of options is -NumReduceTasks and -jobconf 
mapred.reduce.tasks. Both are used to control (or give a hint about) the number of 
reducers.

  Thanks


What is the difference between streaming options: -file and -CacheFile ?

2008-07-18 Thread Steve Gao
They seem to mean the same thing, right?
Another confusing pair of options is -NumReduceTasks and -jobconf 
mapred.reduce.tasks. Both are used to control (or give a hint about) the number of 
reducers.