How are jobs copied to other nodes?

2008-03-07 Thread Ben Kucinich
I am interested in the internal workings of Hadoop regarding job
distribution. How are the jobs copied to other nodes?

Is the class file copied to all of the other nodes where the tasks are executed?


RE: Map/Reduce Type Mismatch error

2008-03-07 Thread Jeff Eastman
The key provided by the default FileInputFormat is not Text, but an
integer offset into the split (which is not very useful, IMHO). Try
changing your mapper's input key type back to LongWritable. If you are
expecting the file name to be the key, you will (I think) need to write
your own InputFormat.
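
For reference, here is a minimal mapper along those lines (an illustrative,
untested sketch against the old org.apache.hadoop.mapred API of that era, not
code from this thread):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // The default TextInputFormat hands the mapper a LongWritable byte offset
  // as the key and the line contents as Text.
  public class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable offset, Text line,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(line, ONE);   // emit (line text, 1)
    }
  }

A separate, editorial observation (not part of the reply above): since the map
emits IntWritable values while the job's final output value class is Text, the
job configuration would likely also need

  job.setMapOutputValueClass(IntWritable.class);

so the framework knows the map output type differs from the reduce output type.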

Jeff

-Original Message-
From: Prasan Ary [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 07, 2008 3:50 PM
To: hadoop
Subject: Map/Reduce Type Mismatch error

  Hi All,
  I am running a Map/Reduce job on a text file.
  Map takes <Text, Text> as its (key, value) input pair, and outputs
<Text, IntWritable> as its (key, value) output pair.

  Reduce takes <Text, IntWritable> as its (key, value) input pair, and outputs
<Text, Text> as its (key, value) output pair.
   
  I am getting a type mismatch error.
   
  Any suggestion?
   
   
  JobConf job = new JobConf(..
   
  job.setOutputKeyClass(Text.class); 
  job.setOutputValueClass(Text.class); 
   
  -
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, IntWritable> {
    ..
    public void map(Text key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      ..
      output.collect(key, new IntWritable(1));
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, Text> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      ..
      output.collect(key, new Text("SomeText"));
    }
  }

   


Map/Reduce Type Mismatch error

2008-03-07 Thread Prasan Ary
  Hi All,
  I am running a Map/Reduce job on a text file.
  Map takes <Text, Text> as its (key, value) input pair, and outputs
<Text, IntWritable> as its (key, value) output pair.

  Reduce takes <Text, IntWritable> as its (key, value) input pair, and outputs
<Text, Text> as its (key, value) output pair.
   
  I am getting a type mismatch error.
   
  Any suggestion?
   
   
  JobConf job = new JobConf(..
   
  job.setOutputKeyClass(Text.class); 
  job.setOutputValueClass(Text.class); 
   
  -
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, IntWritable> {
    ..
    public void map(Text key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      ..
      output.collect(key, new IntWritable(1));
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, Text> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      ..
      output.collect(key, new Text("SomeText"));
    }
  }

   

Re: Does Hadoop Honor Reserved Space?

2008-03-07 Thread Jimmy Wan
Unfortunately, I had to clean up my HDFS in order to get some work done, but
I was running Hadoop 0.16.0 on a Linux box. My configuration is two machines:
one has the JobTracker/NameNode and a TaskTracker instance all running on the
same machine, and the other machine is just running a TaskTracker.

Replication was set to 2 for both the default and the max.

--
Jimmy

On Thu, 06 Mar 2008 16:01:16 -0600, Hairong Kuang <[EMAIL PROTECTED]>  
wrote:


In addition to the version, could you please send us a copy of the datanode
report by running the command bin/hadoop dfsadmin -report?

Thanks,
Hairong


On 3/6/08 11:56 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:


But intermediate data is stored in a different directory from dfs/data
(something like mapred/local by default, I think).

What version are you running?


-Original Message-
From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
Sent: Thu 3/6/2008 10:14 AM
To: core-user@hadoop.apache.org
Subject: RE: Does Hadoop Honor Reserved Space?

I've run into a similar issue in the past. From what I understand, this
parameter only controls the HDFS space usage. However, the intermediate
data in the map reduce job is stored on the local file system (not HDFS)
and is not subject to this configuration.

In the past I have used mapred.local.dir.minspacekill and
mapred.local.dir.minspacestart to control the amount of space that is
allowable for use by this temporary data.

Not sure if that is the best approach though, so I'd love to hear what
other people have done. In your case, you have a map-red job that will
consume too much space (without setting a limit, you didn't have enough
disk capacity for the job), so looking at mapred.output.compress and
mapred.compress.map.output might be useful to decrease the job's disk
requirements.
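
Those knobs normally go in hadoop-site.xml; a hedged example (the byte values
below are purely illustrative, not recommendations) might look like:

  <property>
    <name>mapred.local.dir.minspacestart</name>
    <value>10737418240</value>  <!-- stop accepting new tasks below ~10 GB free -->
  </property>
  <property>
    <name>mapred.local.dir.minspacekill</name>
    <value>5368709120</value>   <!-- start killing tasks below ~5 GB free -->
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>         <!-- compress intermediate map output -->
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>         <!-- compress the final job output -->
  </property>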

--Ash

-Original Message-
From: Jimmy Wan [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 06, 2008 9:56 AM
To: core-user@hadoop.apache.org
Subject: Does Hadoop Honor Reserved Space?

I've got 2 datanodes setup with the following configuration parameter:

 <property>
   <name>dfs.datanode.du.reserved</name>
   <value>429496729600</value>
   <description>Reserved space in bytes per volume. Always leave this much
   space free for non dfs use.</description>
 </property>

Both are housed on 800GB volumes, so I thought this would keep about half
the volume free for non-HDFS usage.

After some long running jobs last night, both disk volumes were completely
filled. The bulk of the data was in:
${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data

This is running as the user hadoop.

Am I interpreting these parameters incorrectly?

I noticed this issue, but it is marked as closed:
http://issues.apache.org/jira/browse/HADOOP-2549


Re: using a perl script with argument variables which point to config files on the DFS as a mapper

2008-03-07 Thread Theodore Van Rooy
So I've read up on -cacheFile and -File and I still can't quite get my
script to work.  I'm running it as follows:

hstream -input basedir/finegrain/validation.txt.head
 -output basedir/output
 -mapper "Evaluate_linux.pl segs.xml config.txt"
 -numReduceTasks 0
 -jobconf mapred.job.name="Evaluate"
 -file Evaluate_linux.pl
 -cacheFile
hdfs://servername:9008/user/tvan/basedir/custom/final_segs.20080305.xml#segs.xml
 -cacheFile hdfs://servername:9008/user/tvan/basedir/config.txt#config.txt

The job starts, but all map tasks fail with the same error:

java.io.IOException: log:null
R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=tvanrooy
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Fri Mar 07 15:47:37 EST 2008
java.io.IOException: Broken pipe


Is this an indication that my script isn't finding the files I pass it?
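
(A quick way to find out, sketched here as an editorial suggestion rather than
something from the thread: streaming captures each task's STDERR in its task
log, so an explicit check at the top of Evaluate_linux.pl turns an opaque
broken pipe into a readable message.)

  # assumes the two cache symlinks are passed as the script's arguments,
  # exactly as in the -mapper line above
  my ($segs, $config) = @ARGV;
  for my $f ($segs, $config) {
      unless (-r $f) {
          print STDERR "cannot read cached file '$f' in " . `pwd`;
          exit 1;
      }
  }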


On Thu, Mar 6, 2008 at 5:17 PM, Lohit <[EMAIL PROTECTED]> wrote:

> you could use -cacheFile or -file option for this. Check streaming doc
> for examples.
>
>
>
>
>
> On Mar 6, 2008, at 2:32 PM, "Theodore Van Rooy" <[EMAIL PROTECTED]>
> wrote:
>
> > I would like to convert a perl script that currently uses argument
> > variables to run with Hadoop Streaming.
> >
> > Normally I would use the script like
> >
> > 'cat datafile.txt | myscript.pl folder/myfile1.txt folder/myfile2.txt'
> >
> > where the two argument variables are actually the names of configuration
> > files for the myscript.pl.
> >
> > The question I have is, how do I get the perl script to either look in the
> > local directory for the config files, or how would I go about getting them
> > to look on the DFS for the config files? Once the configurations are passed
> > in there is no problem using the STDIN to process the datafile passed into
> > it by hadoop.
>
>


-- 
Theodore Van Rooy
Green living isn't just for hippies...
http://greentheo.scroggles.com


Re: Equivalent of cmdline head or tail?

2008-03-07 Thread Ted Dunning

I thought so as well until I reflected for a moment.

But if you include the top N from every combiner, then you are guaranteed to
have the global top N in the output of all of the combiners.
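
A small illustration of that pruning argument (an editorial sketch, not code
from this thread): each map-side pass keeps only its local top N in a bounded
structure, and because every global winner is necessarily a local winner
somewhere, merging the per-combiner survivors can never drop it.

  import java.util.TreeMap;

  // Toy bounded top-N holder. Ties on score overwrite each other here; a
  // real job would key on (score, item) or keep a list per score.
  public class BoundedTopN {
    private final int n;
    private final TreeMap<Long, String> top = new TreeMap<Long, String>();

    public BoundedTopN(int n) { this.n = n; }

    public void offer(long score, String item) {
      top.put(score, item);
      if (top.size() > n) {
        top.remove(top.firstKey());   // evict the current minimum score
      }
    }

    public TreeMap<Long, String> topN() { return top; }
  }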


On 3/6/08 11:50 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:

> 
> On Mar 6, 2008, at 5:02 PM, Ted Dunning wrote:
> 
>>  I don't know if the combiner sees things in
>> order.  IF it does, then you can prune on both levels to minimize data
>> transfer.
> 
> The input to the combiners is sorted. However, when filtering to the
> top N, you need to be careful to include enough that the partial view
> doesn't distort the global view.
> 
> -- O



Re: Copying files from remote machine to dfs

2008-03-07 Thread Marco Nicosia
If uploading files from a non-HDFS file system:

Install the hadoop distribution, configure it to talk to your namenode, make
sure there are no firewall restrictions (tcp ports 8020, 50010, 50070,
50075), and then simply run "hadoop dfs -put <localsrc> <dst>".
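
For example (the paths here are purely illustrative):

  hadoop dfs -put /local/data/mylogs.txt /user/ved/input/mylogs.txt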

On 3/7/08 03:43, "Ved Prakash" <[EMAIL PROTECTED]> wrote:

> Hi Friends,
> 
> Can we copy files residing on a remote machine to dfs?
> 
> Thanks
> 
> Ved

-- 
   Marco Nicosia - Grid Services Ops
   Systems, Tools, and Services Group




Custom Input Formats

2008-03-07 Thread Dan Tamowski
Hello,

First, I am currently subscribed to the digest, could you please cc me at
[EMAIL PROTECTED] with any replies. I really appreciate it.

I have a few questions regarding input formats. Specifically, I want each map
input record to be one complete text file. I understand that I must implement
both FileInputFormat and RecordReader. From there, however, I am not sure what
to do. Can I include these in my MR project, or do I need to keep them in a
separate jar and reference that in HADOOP-CLASSPATH? Also, should
HADOOP-CLASSPATH point to a directory of jars, or does it mimic the
space-delimited manifest.mf? Finally, are there any examples of user-defined
input formats available anywhere?

Thanks,

Dan
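
For reference, a minimal whole-file input format against the older
org.apache.hadoop.mapred API might look roughly like the sketch below
(untested, with illustrative class names, and assuming each file fits
comfortably in memory):

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  // Delivers one record per input file: key = file path, value = file bytes.
  public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;                       // never split: one map record per file
    }

    public RecordReader<Text, BytesWritable> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
      private final FileSplit split;
      private final JobConf job;
      private boolean done = false;

      WholeFileRecordReader(FileSplit split, JobConf job) {
        this.split = split;
        this.job = job;
      }

      public Text createKey() { return new Text(); }
      public BytesWritable createValue() { return new BytesWritable(); }

      public boolean next(Text key, BytesWritable value) throws IOException {
        if (done) {
          return false;
        }
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(job);
        byte[] contents = new byte[(int) split.getLength()];
        FSDataInputStream in = fs.open(path);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        key.set(path.toString());                 // key = file path
        value.set(contents, 0, contents.length);  // value = whole file contents
        done = true;
        return true;
      }

      public long getPos() { return done ? split.getLength() : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() { }
    }
  }

The jar containing it can simply be bundled into (or alongside) the job jar so
it is on the job's classpath, and it is selected with
conf.setInputFormat(WholeFileInputFormat.class).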


[HOD] Collecting MapReduce logs

2008-03-07 Thread Luca Telloli

Hello everyone,
	I wonder what the meaning of hodring.log-destination-uri is versus
hodring.log-dir. I'd like to collect MapReduce UI logs after a job has
been run, and the only attribute for that seems to be hod.hadoop-ui-log-dir,
in the hod section.


With that attribute specified, the logs are all grabbed into that directory,
producing a large number of HTML files. Is there a way to collect them,
maybe as a .tar.gz, in a location somehow tied to the user?


Additionally, how do administrators specify variables in these values, and
which interpreter expands them? For instance, variables specified in a bash
fashion, like $USER, work well in the hodring or ringmaster sections (I guess
they are interpreted by bash itself), but not in the hod section: I tried

[hod]
hadoop-ui-log-dir=/somedir/$USER

but any hod command fails, displaying an error on that line.

Cheers,
Luca




Re: RE: clustering problem

2008-03-07 Thread Ved Prakash
Hi,

I found the solution to the problem I had posted; I am posting the
resolution here so that others may benefit from it.

The incompatibility showing up on my slave was caused by an incompatible Java
installation on the slave. I removed the current Java installation from the
slave, installed the same version as I have on my master, and that solved the
problem.

Thanks all, for your responses.

Ved

2008/3/5 Ved Prakash <[EMAIL PROTECTED]>:

> Hi Miles,
>
> Yes, I have hadoop-0.15.2 installed on both my systems.
>
> Ved
>
> 2008/3/5 Miles Osborne <[EMAIL PROTECTED]>:
>
> Did you use exactly the same version of Hadoop on each and every node?
> >
> > Miles
> >
> > On 05/03/2008, Ved Prakash <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi Zhang,
> > >
> > > Thanks for your reply. I tried this, but to no avail; it still throws
> > > Incompatible build versions.
> > >
> > > I removed the dfs local directory on the slave and issued start-dfs.sh on
> > > the server, and when I checked the logs it showed the same problem.
> > >
> > > Do you guys need some more information from my side to get a better
> > > understanding of the problem?
> > >
> > > Please let me know,
> > >
> > > Thanks
> > >
> > > Ved
> > >
> > > 2008/3/5 Zhang, Guibin <[EMAIL PROTECTED]>:
> > >
> > > > You can delete the DFS local dir on the slave (the local directory
> > > > should be ${hadoop.tmp.dir}/dfs/) and try again.
> > > >
> > > >
> > > > -Original Message-
> > > > From: Ved Prakash [mailto:[EMAIL PROTECTED]
> > > > Sent: March 5, 2008 14:51
> > > > To: core-user@hadoop.apache.org
> > > > Subject: clustering problem
> > > >
> > > > Hi Guys,
> > > >
> > > > I am having problems creating clusters on 2 machines
> > > >
> > > > Machine configuration :
> > > > Master : OS: Fedora core 7
> > > > hadoop-0.15.2
> > > >
> > > > hadoop-site.xml listing
> > > >
> > > > <configuration>
> > > >   <property>
> > > >     <name>fs.default.name</name>
> > > >     <value>anaconda:50001</value>
> > > >   </property>
> > > >   <property>
> > > >     <name>mapred.job.tracker</name>
> > > >     <value>anaconda:50002</value>
> > > >   </property>
> > > >   <property>
> > > >     <name>dfs.replication</name>
> > > >     <value>2</value>
> > > >   </property>
> > > >   <property>
> > > >     <name>dfs.secondary.info.port</name>
> > > >     <value>50003</value>
> > > >   </property>
> > > >   <property>
> > > >     <name>dfs.info.port</name>
> > > >     <value>50004</value>
> > > >   </property>
> > > >   <property>
> > > >     <name>mapred.job.tracker.info.port</name>
> > > >     <value>50005</value>
> > > >   </property>
> > > >   <property>
> > > >     <name>tasktracker.http.port</name>
> > > >     <value>50006</value>
> > > >   </property>
> > > > </configuration>
> > > >
> > > > conf/masters
> > > > localhost
> > > >
> > > > conf/slaves
> > > > anaconda
> > > > v-desktop
> > > >
> > > > The datanode, namenode, and secondarynamenode seem to be working fine
> > > > on the master, but on the slave this is not the case.
> > > >
> > > > slave
> > > > OS: Ubuntu
> > > >
> > > > hadoop-site.xml listing
> > > >
> > > > same as master
> > > >
> > > > in the logs on slave machine I see this
> > > >
> > > > 2008-03-05 12:15:25,705 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> > > > Initializing JVM Metrics with processName=DataNode, sessionId=null
> > > > 2008-03-05 12:15:25,920 FATAL org.apache.hadoop.dfs.DataNode:
> > > > Incompatible build versions: namenode BV = Unknown; datanode BV = 607333
> > > > 2008-03-05 12:15:25,926 ERROR org.apache.hadoop.dfs.DataNode:
> > > > java.io.IOException: Incompatible build versions: namenode BV = Unknown;
> > > > datanode BV = 607333
> > > >    at org.apache.hadoop.dfs.DataNode.handshake(DataNode.java:316)
> > > >    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:238)
> > > >    at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:206)
> > > >    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1575)
> > > >    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1519)
> > > >    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1540)
> > > >    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1711)
> > > >
> > > > Can someone help me with this please.
> > > >
> > > > Thanks
> > > >
> > > > Ved
> > > >
> > >
> >
> >
> >
> > --
> > The University of Edinburgh is a charitable body, registered in
> > Scotland,
> > with registration number SC005336.
> >
>
>


Copying files from remote machine to dfs

2008-03-07 Thread Ved Prakash
Hi Friends,

Can we copy files residing on a remote machine to dfs?

Thanks

Ved


Re: Pipes task being killed

2008-03-07 Thread Owen O'Malley


On Mar 5, 2008, at 9:31 AM, Rahul Sood wrote:


Hi,

We have a Pipes C++ application where the Reduce task does a lot of
computation. After some time the task gets killed by the Hadoop
framework. The job output shows the following error:

Task task_200803051654_0001_r_00_0 failed to report status for 604
seconds. Killing!

Is there any way to send a heartbeat to the TaskTracker from a Pipes
application. I believe this is possible in Java using
org.apache.hadoop.util.Progress and we're looking for something
equivalent in the C++ Pipes API.


The context object has a progress method that should be called during  
long computations...

http://tinyurl.com/yt7hyx
search for progress...
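
As a rough illustration (an editorial sketch modeled on the Pipes wordcount
example, not code from this thread):

  #include "hadoop/Pipes.hh"
  #include "hadoop/StringUtils.hh"

  class SlowReducer : public HadoopPipes::Reducer {
  public:
    SlowReducer(HadoopPipes::TaskContext& context) {}

    void reduce(HadoopPipes::ReduceContext& context) {
      int processed = 0;
      while (context.nextValue()) {
        // ... expensive per-value computation ...
        if (++processed % 1000 == 0) {
          context.progress();   // heartbeat so the TaskTracker keeps the task alive
          context.setStatus("processed " + HadoopUtils::toString(processed));
        }
      }
      context.emit(context.getInputKey(), HadoopUtils::toString(processed));
    }
  };

It would be wired up in main() in the usual way, e.g.
HadoopPipes::runTask(HadoopPipes::TemplateFactory<SomeMapper, SlowReducer>()),
where SomeMapper is whatever mapper the job already uses.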


-- Owen