RE: How do I copy files from my linux file system to HDFS using a java prog?

2008-05-01 Thread Babu, Suresh

Try this program. Modify the HDFS configuration if it is different from
the default.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HadoopDFSFileReadWrite {

  static void usage() {
    System.out.println("Usage : HadoopDFSFileReadWrite <input file> <output file>");
    System.exit(1);
  }

  static void printAndExit(String str) {
    System.err.println(str);
    System.exit(1);
  }

  public static void main(String[] argv) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "localhost:9000"); // adjust if your namenode differs
    FileSystem fs = FileSystem.get(conf);

    // List what is currently in the user's HDFS home directory
    FileStatus[] fileStatus = fs.listStatus(fs.getHomeDirectory());
    for (FileStatus status : fileStatus) {
      System.out.println("File: " + status.getPath());
    }

    if (argv.length != 2)
      usage();

    // HDFS deals with Path objects
    Path inFile = new Path(argv[0]);
    Path outFile = new Path(argv[1]);

    // Check that input/output are valid
    if (!fs.exists(inFile))
      printAndExit("Input file not found");
    if (!fs.isFile(inFile))
      printAndExit("Input should be a file");
    if (fs.exists(outFile))
      printAndExit("Output already exists");

    // Read from the input file and write to the new file
    FSDataInputStream in = fs.open(inFile);
    FSDataOutputStream out = fs.create(outFile);
    byte[] buffer = new byte[256];
    try {
      int bytesRead = 0;
      while ((bytesRead = in.read(buffer)) > 0) {
        out.write(buffer, 0, bytesRead);
      }
    } catch (IOException e) {
      System.out.println("Error while copying file");
    } finally {
      in.close();
      out.close();
    }
  }
}
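
For the original question (copying a file from the local Linux file system into
HDFS), you do not need to stream the bytes yourself; the FileSystem API can do
the copy. A minimal sketch, assuming the same default configuration as above
(the class name and paths below are only examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyLocalToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "localhost:9000"); // adjust to your namenode

    FileSystem hdfs = FileSystem.get(conf);

    // Source lives on the local file system; the file:// scheme makes that explicit.
    Path src = new Path("file:///home/user/data/input.txt"); // example local path
    Path dst = new Path("/user/hadoop/input.txt");           // example HDFS path

    // false = do not delete the source after copying
    hdfs.copyFromLocalFile(false, src, dst);
  }
}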

Suresh


-Original Message-
From: Ajey Shah [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 01, 2008 3:31 AM
To: core-user@hadoop.apache.org
Subject: How do I copy files from my linux file system to HDFS using a
java prog?


Hello all,

I need to copy files from my linux file system to HDFS in a java program
and not manually. This is the piece of code that I have.

try {
    FileSystem hdfs = FileSystem.get(new Configuration());

    LocalFileSystem ls = null;
    ls = hdfs.getLocal(hdfs.getConf());

    hdfs.copyFromLocalFile(false, new Path(fileName), new Path(outputFile));
} catch (Exception e) {
    e.printStackTrace();
}

The problem is that it searches for the input path on the HDFS and not
my linux file system.

Can someone point out where I may be wrong. I feel it's some
configuration issue but have not been able to figure it out. 

Thanks.



Re: Hadoop Cluster Administration Tools?

2008-05-01 Thread Allen Wittenauer
On 5/1/08 5:00 PM, "Bradford Stephens" <[EMAIL PROTECTED]> wrote:
> *Very* cool information. As someone who's leading the transition to
> open-source and cluster-orientation  at a company of about 50 people,
> finding good tools for the IT staff to use is essential. Thanks so much for
> the continued feedback.

Hmm.  I should upload my slides.




Re: Hadoop Cluster Administration Tools?

2008-05-01 Thread Bradford Stephens
*Very* cool information. As someone who's leading the transition to
open-source and cluster-orientation  at a company of about 50 people,
finding good tools for the IT staff to use is essential. Thanks so much for
the continued feedback.

On Thu, May 1, 2008 at 6:10 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> Khalil Honsali wrote:
>
> > Thanks Mr. Steve, and everyone..
> >
> > I actually have just 16 machines (normal P4 PCs), so in case I need to
> > do
> > things manually it takes half an hour (for example when installing
> > sun-java,
> > I had to type that 'yes' for each .bin install)
> > but for now i'm ok with pssh or just a simple custom script, however,
> > I'm
> > afraid things will get complicated soon enough...
> >
> > You said:
> > "you can automate rpm install using pure "rpm" command, and check for
> > installed artifacts yourself"
> > Could you please explain more, I understand you run the same rpm against
> > all
> > machines provided the cluster is homogeneous.
> >
> >
> 1. you can push out the same RPM files to all machines.
>
> 2. if you use rpmbuild (Ant's <rpm> task does this), you can build your
> own RPMs and push them out, possibly with scp, then run ssh to install them.
> http://wiki.smartfrog.org/wiki/display/sf/RPM+Files
>
> 3. A lot of linux distros have adopted Yum
> http://wiki.smartfrog.org/wiki/display/sf/Pattern+-+Yum
>
>
> I was discussing Yum support on the Config-Management list last week,
> funnily enough
> http://lopsa.org/pipermail/config-mgmt/2008-April/000662.html
>
> Nobody likes automating it much, as:
>  - it doesn't provide much state information
>  - it doesn't let you roll back very easily, or fix what you want
>
> Most people in that group -the CM tool authors - prefer to automate RPM
> install/rollback themselves, so they can stay in control.
>
> Having a look at how our build.xml file manages test RPMs -that is from
> the build VMware image to a clean test image, we  and then  the
> operations
>
>
>passphrase="${rpm.ssh.passphrase}"
>keyfile="${rpm.ssh.keyfile}"
>trust="${rpm.ssh.trust}"
>verbose="${rpm.ssh.verbose}">
>  
>
>
>
>
>  
>command="cd ${rpm.full.ssh.dir};rpm --upgrade --force
> ${rpm.verbosity} smartfrog-*.rpm"
>outputProperty="rpm.result.all"/>
>
>  
>
>
> The  preset runs a remote root command
>
>
>username="${rpm.ssh.user}"
>  passphrase="${rpm.ssh.passphrase}"
>  trust="${rpm.ssh.trust}"
>  keyfile="${rpm.ssh.keyfile}"
>  timeout="${ssh.command.timeout}"
>  />
>
>
>
>username="root"
>  timeout="${ssh.rpm.command.timeout}"
>  />
>
>
> More troublesome is how we check for errors. No simple exit code here,
> instead I have to scan for strings in the response.
>
>
>  
>  
>
>  @{result}
>
>
>  
>string="@{result}"
>substring="does not exist"/>
>  
>  The rpm contains files belonging to an unknown user.
>
>  
>
>
> Then, once everything is installed, I do something even scarier - run lots
> of query commands and look for error strings. I do need to automate this
> better; its on my todo list and one of the things I might use as a test
> project would be automating creating custom hadoop EC2 images, something
> like
>
> -bring up the image
> -push out new RPMs and ssh keys, including JVM versions.
> -create the new AMI
> -set the AMI access rights up.
> -delete the old one.
>
> Like I said, on the todo list.
>
>
>
>
> --
> Steve Loughran  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action   http://antbook.org/
>


Workaround for Hadoop 0.16.3 org.xml.sax.SAXParseException on Mac OS X?

2008-05-01 Thread Craig E. Ward
I successfully used Hadoop 0.15.3 on Mac OS X 10.4 with Java 5, but I get a 
strange error when trying to upgrade to 0.16.x:


$ hadoop jar hadoop-*examples.jar grep input output 'dfs[a-z.]+'
08/05/01 15:22:21 INFO mapred.FileInputFormat: Total input paths to process : 11
org.apache.hadoop.ipc.RemoteException: java.io.IOException: 
java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference 
"" is an invalid XML character.


Is there a workaround for this error? Seems rather an odd thing to happen.

Thanks,

Craig
--
[EMAIL PROTECTED]


Re: OOM error with large # of map tasks

2008-05-01 Thread sam rash
Hi,
In fact we verified it is our jobconf--we have about 800k in input paths
(11k files for a few TB of data).
We'll indeed up the heap size to about 2048m and we can also do some
significant optimizations on the file paths (use wildcards and others).

Is there any plan to make the storage of the JobConf objects more
memory-efficient?  Perhaps they can be serialized to disk, or, if the jobconf
doesn't change per task (i.e., it's inherited and not changed), why not keep
one per job in a tasktracker (or, if it does change, 'share' the common
parts)?  This would greatly help us: we have 3 jobs of 20k tasks each, and if
one gets halfway and we bump another job up, we end up with 1000s of completed
tasks (but no complete jobs) per tasktracker.  Even with our trimming of our
jobconf object and increasing the heap size, we'll hit a limit pretty quickly.

thx,
-sr

On Thu, May 1, 2008 at 12:58 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:

> Hi Lili, sorry that I missed one important detail in my last response -
> tasks that complete successfully on tasktrackers are marked as
> COMMIT_PENDING by the tasktracker itself. The JobTracker takes those
> COMMIT_PENDING tasks, promotes their output (if applicable), and then
> marks
> them as SUCCEEDED. However, tasktrackers are not notified about these and
> the state of the tasks in the tasktrackers don't change, i.e., they remain
> in COMMIT_PENDING state. In short, COMMIT_PENDING at the tasktracker's end
> doesn't necessarily mean the job is stuck.
>
> The tasktracker keeps in its memory the objects corresponding to tasks it
> runs. Those objects are purged on job completion/failure only. This
> explains
> why you see so many tasks in the COMMIT_PENDING state. I believe it will
> create one jobconf for every task it launches.
>
> I am only concerned about the memory consumption by the jobconf objects.
> As
> per your report, it is ~1.6 MB per jobconf.
>
> You could try things out with an increased heap size for the
> tasktrackers/tasks. You could increase the heap size for the tasktracker
> by
> changing the value of HADOOP_HEAPSIZE in hadoop-env.sh, and the tasks'
> heap
> size can be increased by tweaking the value of mapred.child.java.opts in
> the
> hadoop-site.xml for your job.
>
> > -Original Message-
> > From: Lili Wu [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 01, 2008 4:19 AM
> > To: core-user@hadoop.apache.org
> > Subject: Re: OOM error with large # of map tasks
> >
> > Hi Devaraj,
> >
> > We don't have any special configuration on the job conf...
> >
> > We only allow 3 map tasks and 3 reduce tasks in *one* node at
> > any time.  So we are puzzled why there are 572 job confs on
> > *one* node?  From the heap dump, we see there are 569 MapTask
> > and 3 ReduceTask, (and that corresponds to 1138 MapTaskStatus
> > and 6 ReduceTaskStatus.)
> >
> > We *think* many Map tasks were stuck in COMMIT_PENDING stage,
> > because in heap dump, we saw a lot of MapTaskStatus objects
> > being in either "UNASSIGNED" or "COMMIT_PENDING" state (the
> > runState variable in
> > MapTaskStatus).   Then we took a look at another node on UI
> > just now,  for a
> > given task tracker, under "Non-runnign tasks", there are at
> > least 200 or 300 COMMIT_PENDING tasks.  It appears they stuck too.
> >
> > Thanks a lot for your help!
> >
> > Lili
> >
> >
> > On Wed, Apr 30, 2008 at 2:14 PM, Devaraj Das
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Lili, the jobconf memory consumption seems quite high. Could you
> > > please let us know if you pass anything in the jobconf of jobs that
> > > you run? I think you are seeing the 572 objects since a job
> > is running
> > > and the TaskInProgress objects for tasks of the running job
> > are kept
> > > in memory (but I need to double check this).
> > > Regarding COMMIT_PENDING, yes it means that tasktracker has
> > finished
> > > executing the task but the jobtracker hasn't committed the
> > output yet.
> > > In
> > > 0.16 all tasks have to necessarily take the transition from
> > > RUNNING->COMMIT_PENDING->SUCCEEDED. This behavior has been
> > improved in
> > > 0.17
> > > (hadoop-3140) to include only tasks that generate output,
> > i.e., a task
> > > is marked as SUCCEEDED if it doesn't generate any output in
> > its output path.
> > >
> > > Devaraj
> > >
> > > > -Original Message-
> > > > From: Lili Wu [mailto:[EMAIL PROTECTED]
> > > > Sent: Thursday, May 01, 2008 2:09 AM
> > > > To: core-user@hadoop.apache.org
> > > > Cc: [EMAIL PROTECTED]
> > > > Subject: OOM error with large # of map tasks
> > > >
> > > > We are using hadoop 0.16 and are seeing a consistent problem:
> > > >  out of memory errors when we have a large # of map tasks.
> > > > The specifics of what is submitted when we reproduce this:
> > > >
> > > > three large jobs:
> > > > 1. 20,000 map tasks and 10 reduce tasks 2. 17,000 map
> > tasks and 10
> > > > reduce tasks 3. 10,000 map tasks and 10 reduce tasks
> > > >
> > > > these are at normal priority and periodically we 

Re: ClassNotFoundException while running jar file

2008-05-01 Thread Jason Venner
You need to add your class or a class in your jar to the constructor for 
your JobConf object.


JobConf

public JobConf(Class exampleClass)

    Construct a map/reduce job configuration.

    Parameters:
        exampleClass - a class whose containing jar is used as the job's jar.


JobConf

public JobConf(Configuration conf,
               Class exampleClass)

    Construct a map/reduce job configuration.

    Parameters:
        conf - a Configuration whose settings will be inherited.
        exampleClass - a class whose containing jar is used as the job's jar.
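
As a concrete sketch of the above (using the 0.16-era JobConf path setters; the
class name GetFeatures and the default identity mapper/reducer are just for
illustration), the driver would look something like:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class GetFeatures {
  public static void main(String[] args) throws Exception {
    // Passing GetFeatures.class tells Hadoop which jar to ship:
    // the jar containing this class becomes the job's jar.
    JobConf conf = new JobConf(GetFeatures.class);
    conf.setJobName("get-features");

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
  }
}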
 



chaitanya krishna wrote:

Hi,

I wanted to run my own java code in hadoop. The following are the commands
that I executed and errors occurred.

mkdir temp

javac -Xlint -classpath hadoop-0.16.0-core.jar -d temp
GetFeatures.java   (GetFeatures.java is the code)

jar -cvf temp.jar temp

bin/hadoop jar temp.jar GetFeatures input/input.txt out

ERROR:


Exception in thread "main" java.lang.ClassNotFoundException: GetFeatures
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:242)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)


can you interpret the possible reason for the error?

Thank you.

  

--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Re: OOM error with large # of map tasks

2008-05-01 Thread Jason Venner
We have a problem with this in our application; in particular, sometimes
threads started by the map/reduce class block the tasktracker$child
process from exiting when the map/reduce is done.
JMX is the number one cause of this for us; badly behaving JNI tasks are
#2, MINA is #3.


We modify the tasktracker$child main to call System.exit when done, and this
solves a very large set of these OOMs for us. The JNI tasks can run the
machine OOM.


(our jni tasks have a 2.5gig working set each - don't ask...)

Devaraj Das wrote:

Long term we need to see how we can minimize the memory consumption by
objects corresponding to completed tasks in the tasktracker. 

  

-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 02, 2008 1:29 AM

To: 'core-user@hadoop.apache.org'
Subject: RE: OOM error with large # of map tasks

Hi Lili, sorry that I missed one important detail in my last 
response - tasks that complete successfully on tasktrackers 
are marked as COMMIT_PENDING by the tasktracker itself. The 
JobTracker takes those COMMIT_PENDING tasks, promotes their 
output (if applicable), and then marks them as SUCCEEDED. 
However, tasktrackers are not notified about these and the 
state of the tasks in the tasktrackers don't change, i.e., 
they remain in COMMIT_PENDING state. In short, COMMIT_PENDING 
at the tasktracker's end doesn't necessarily mean the job is stuck.


The tasktracker keeps in its memory the objects corresponding 
to tasks it runs. Those objects are purged on job 
completion/failure only. This explains why you see so many 
tasks in the COMMIT_PENDING state. I believe it will create 
one jobconf for every task it launches.


I am only concerned about the memory consumption by the 
jobconf objects. As per your report, it is ~1.6 MB per jobconf. 

You could try things out with an increased heap size for the 
tasktrackers/tasks. You could increase the heap size for the 
tasktracker by changing the value of HADOOP_HEAPSIZE in 
hadoop-env.sh, and the tasks' heap size can be increased by 
tweaking the value of mapred.child.java.opts in the 
hadoop-site.xml for your job.




-Original Message-
From: Lili Wu [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 01, 2008 4:19 AM
To: core-user@hadoop.apache.org
Subject: Re: OOM error with large # of map tasks

Hi Devaraj,

We don't have any special configuration on the job conf...

We only allow 3 map tasks and 3 reduce tasks in *one* node at any 
time.  So we are puzzled why there are 572 job confs on
*one* node?  From the heap dump, we see there are 569 MapTask and 3 
ReduceTask, (and that corresponds to 1138 MapTaskStatus and 6 
ReduceTaskStatus.)


We *think* many Map tasks were stuck in COMMIT_PENDING 
  
stage, because 

in heap dump, we saw a lot of MapTaskStatus objects being in either 
"UNASSIGNED" or "COMMIT_PENDING" state (the runState variable in
MapTaskStatus).   Then we took a look at another node on UI 
just now,  for a
given task tracker, under "Non-runnign tasks", there are at 
  
least 200 


or 300 COMMIT_PENDING tasks.  It appears they stuck too.

Thanks a lot for your help!

Lili


On Wed, Apr 30, 2008 at 2:14 PM, Devaraj Das <[EMAIL PROTECTED]> 
wrote:


  
Hi Lili, the jobconf memory consumption seems quite high. 

Could you 

please let us know if you pass anything in the jobconf of 

jobs that 


you run? I think you are seeing the 572 objects since a job


is running
  

and the TaskInProgress objects for tasks of the running job


are kept
  

in memory (but I need to double check this).
Regarding COMMIT_PENDING, yes it means that tasktracker has


finished
  

executing the task but the jobtracker hasn't committed the

output yet. 
  

In
0.16 all tasks have to necessarily take the transition from
RUNNING->COMMIT_PENDING->SUCCEEDED. This behavior has been


improved in
  

0.17
(hadoop-3140) to include only tasks that generate output,


i.e., a task
  

is marked as SUCCEEDED if it doesn't generate any output in


its output path.
  

Devaraj



-Original Message-
From: Lili Wu [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 01, 2008 2:09 AM
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Subject: OOM error with large # of map tasks

We are using hadoop 0.16 and are seeing a consistent problem:
 out of memory errors when we have a large # of map tasks.
The specifics of what is submitted when we reproduce this:

three large jobs:
1. 20,000 map tasks and 10 reduce tasks 2. 17,000 map
  

tasks and 10
  

reduce tasks 3. 10,000 map tasks and 10 reduce tasks

these are at normal priority and periodically we swap the
  

priorities
  

around to get some tasks started by each and let them complete.
other smaller jobs come  and go every hour or so (no more
  

than 200
  

map tasks, 4-10 reducers).

Our cluster consists of 

Re: JobConf: How to pass List/Map

2008-05-01 Thread Jason Venner
We have been serializing to a ByteArrayOutputStream, then base64-encoding
the underlying byte array and passing that string in the conf.

It is ugly, but it works well enough until 0.17 (which adds Stringifier).
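
A rough sketch of that approach, assuming commons-codec (which ships in
Hadoop's lib directory) for the base64 step and any Writable value; the class
name and key are only illustrative:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

public class ConfStringifier {

  // Serialize a Writable, base64-encode the bytes, and store the string under 'key'.
  public static void store(Configuration conf, Writable value, String key)
      throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    value.write(new DataOutputStream(bytes));
    conf.set(key, new String(Base64.encodeBase64(bytes.toByteArray())));
  }

  // Read the string back, base64-decode it, and repopulate 'value'.
  public static void load(Configuration conf, Writable value, String key)
      throws IOException {
    byte[] bytes = Base64.decodeBase64(conf.get(key).getBytes());
    value.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
  }
}

The driver calls store() before submitting the job, and the mapper or reducer
calls load() with the same key from its configure() method.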

Enis Soztutar wrote:
Yes Stringifier was committed in 0.17. What you can do in 0.16 is to 
simulate DefaultStringifier. The key feature of the Stringifier is 
that it can convert/restore any object to string using base64 encoding 
on the binary form of the object. If your objects can be easily 
converted to and from strings, then you can directly store them in 
conf. The other obvious alternative would be to switch to 0.17, once 
it is out.


Tarandeep Singh wrote:
On Wed, Apr 30, 2008 at 5:11 AM, Enis Soztutar 
<[EMAIL PROTECTED]> wrote:
 

Hi,

 There are many ways which you can pass objects using configuration.
Possibly the easiest way would be to use Stringifier interface.

 you can for example :

 DefaultStringifier.store(conf, variable ,"mykey");

 variable = DefaultStringifier.load(conf, "mykey", variableClass );



thanks... but I am using Hadoop-0.16 and Stringifier is a fix for 
0.17 version -

https://issues.apache.org/jira/browse/HADOOP-3048

Any thoughts on how to do this in 0.16 version ?

thanks,
Taran

 
 you should take into account that the variable you pass to 
configuration

should be serializable by the framework. That means it must implement
Writable or Serializable interfaces. In your particular case, you 
might want

to look at ArrayWritable and MapWritable classes.

 That said, you should however not pass large objects via 
configuration,
since it can seriously affect job overhead. If the data you want to 
pass is

large, then you should use other alternatives(such as DistributedCache,
HDFS, etc).



 Tarandeep Singh wrote:

   

Hi,

How can I set a list or map to JobConf that I can access in
Mapper/Reducer class ?
The get/setObject method from Configuration has been deprecated and
the documentation says -
"A side map of Configuration to Object should be used instead."
I could not follow this :(

Can someone please explain to me how to do this ?

Thanks,
Taran



  


  



--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


RE: OOM error with large # of map tasks

2008-05-01 Thread Devaraj Das
Long term we need to see how we can minimize the memory consumption by
objects corresponding to completed tasks in the tasktracker. 

> -Original Message-
> From: Devaraj Das [mailto:[EMAIL PROTECTED] 
> Sent: Friday, May 02, 2008 1:29 AM
> To: 'core-user@hadoop.apache.org'
> Subject: RE: OOM error with large # of map tasks
> 
> Hi Lili, sorry that I missed one important detail in my last 
> response - tasks that complete successfully on tasktrackers 
> are marked as COMMIT_PENDING by the tasktracker itself. The 
> JobTracker takes those COMMIT_PENDING tasks, promotes their 
> output (if applicable), and then marks them as SUCCEEDED. 
> However, tasktrackers are not notified about these and the 
> state of the tasks in the tasktrackers don't change, i.e., 
> they remain in COMMIT_PENDING state. In short, COMMIT_PENDING 
> at the tasktracker's end doesn't necessarily mean the job is stuck.
> 
> The tasktracker keeps in its memory the objects corresponding 
> to tasks it runs. Those objects are purged on job 
> completion/failure only. This explains why you see so many 
> tasks in the COMMIT_PENDING state. I believe it will create 
> one jobconf for every task it launches.
> 
> I am only concerned about the memory consumption by the 
> jobconf objects. As per your report, it is ~1.6 MB per jobconf. 
> 
> You could try things out with an increased heap size for the 
> tasktrackers/tasks. You could increase the heap size for the 
> tasktracker by changing the value of HADOOP_HEAPSIZE in 
> hadoop-env.sh, and the tasks' heap size can be increased by 
> tweaking the value of mapred.child.java.opts in the 
> hadoop-site.xml for your job.
> 
> > -Original Message-
> > From: Lili Wu [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 01, 2008 4:19 AM
> > To: core-user@hadoop.apache.org
> > Subject: Re: OOM error with large # of map tasks
> > 
> > Hi Devaraj,
> > 
> > We don't have any special configuration on the job conf...
> > 
> > We only allow 3 map tasks and 3 reduce tasks in *one* node at any 
> > time.  So we are puzzled why there are 572 job confs on
> > *one* node?  From the heap dump, we see there are 569 MapTask and 3 
> > ReduceTask, (and that corresponds to 1138 MapTaskStatus and 6 
> > ReduceTaskStatus.)
> > 
> > We *think* many Map tasks were stuck in COMMIT_PENDING 
> stage, because 
> > in heap dump, we saw a lot of MapTaskStatus objects being in either 
> > "UNASSIGNED" or "COMMIT_PENDING" state (the runState variable in
> > MapTaskStatus).   Then we took a look at another node on UI 
> > just now,  for a
> > given task tracker, under "Non-runnign tasks", there are at 
> least 200 
> > or 300 COMMIT_PENDING tasks.  It appears they stuck too.
> > 
> > Thanks a lot for your help!
> > 
> > Lili
> > 
> > 
> > On Wed, Apr 30, 2008 at 2:14 PM, Devaraj Das <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > Hi Lili, the jobconf memory consumption seems quite high. 
> Could you 
> > > please let us know if you pass anything in the jobconf of 
> jobs that 
> > > you run? I think you are seeing the 572 objects since a job
> > is running
> > > and the TaskInProgress objects for tasks of the running job
> > are kept
> > > in memory (but I need to double check this).
> > > Regarding COMMIT_PENDING, yes it means that tasktracker has
> > finished
> > > executing the task but the jobtracker hasn't committed the
> > output yet. 
> > > In
> > > 0.16 all tasks have to necessarily take the transition from
> > > RUNNING->COMMIT_PENDING->SUCCEEDED. This behavior has been
> > improved in
> > > 0.17
> > > (hadoop-3140) to include only tasks that generate output,
> > i.e., a task
> > > is marked as SUCCEEDED if it doesn't generate any output in
> > its output path.
> > >
> > > Devaraj
> > >
> > > > -Original Message-
> > > > From: Lili Wu [mailto:[EMAIL PROTECTED]
> > > > Sent: Thursday, May 01, 2008 2:09 AM
> > > > To: core-user@hadoop.apache.org
> > > > Cc: [EMAIL PROTECTED]
> > > > Subject: OOM error with large # of map tasks
> > > >
> > > > We are using hadoop 0.16 and are seeing a consistent problem:
> > > >  out of memory errors when we have a large # of map tasks.
> > > > The specifics of what is submitted when we reproduce this:
> > > >
> > > > three large jobs:
> > > > 1. 20,000 map tasks and 10 reduce tasks 2. 17,000 map
> > tasks and 10
> > > > reduce tasks 3. 10,000 map tasks and 10 reduce tasks
> > > >
> > > > these are at normal priority and periodically we swap the
> > priorities
> > > > around to get some tasks started by each and let them complete.
> > > > other smaller jobs come  and go every hour or so (no more
> > than 200
> > > > map tasks, 4-10 reducers).
> > > >
> > > > Our cluster consists of 23 nodes and we have 69 map tasks and
> > > > 69 reduce tasks.
> > > > Eventually, we see consistent oom errors in the task 
> logs and the 
> > > > task tracker itself goes down on as many as 14 of our nodes.
> > > >
> > > > We examined a heap dump after one of these crashes of a
> > TaskTracker

Re: Hadoop-On-Demand (HOD) newbie: question and curious about user experience

2008-05-01 Thread Alvin AuYoung

Hi Vinod,

thanks a lot for the reply. The updated description answers a lot of my 
questions -- I apologize for not finding it earlier.


On Wed, 30 Apr 2008, Vinod KV wrote:


On a first note, can you please tell which documentation you are looking at. 
'Coz hod interface is cleaned up and -o option is part of old interface. You 
can find the latest HOD documentation at 
http://hadoop.apache.org/core/docs/r0.16.3/hod.html


Oops. For some reason I was looking at the documentation for the 0.16.0 release, 
which had that -o option:


http://hadoop.apache.org/core/docs/r0.16.0/hod.html



The HOD interface lets you tell it how many nodes you want, and not the actual
nodes. HOD obtains the list of nodes and thus gets its components (hodring
and ringmaster) running by talking to a resource manager, torque
(http://www.clusterresources.com/pages/products/torque-resource-manager.php)
being the only one that hod works with currently.


So, it seems like HOD already supports the more sophisticated torque schedulers 
like Maui, since the user can specify particular options like wallclock time 
(which primarily seems useful for backfilling and such). Is this correct?


Your last point is right, each user can have his own HOD virtual cluster(s), 
with each cluster tied to a cluster directory that (s)he owns.


Please look here for a sample HOD session 
(http://hadoop.apache.org/core/docs/r0.16.3/hod_user_guide.html#A+typical+HOD+session).


HTH

-vinod


Yes, this certainly helped a lot. Thanks!

- Alvin


RE: OOM error with large # of map tasks

2008-05-01 Thread Devaraj Das
Hi Lili, sorry that I missed one important detail in my last response -
tasks that complete successfully on tasktrackers are marked as
COMMIT_PENDING by the tasktracker itself. The JobTracker takes those
COMMIT_PENDING tasks, promotes their output (if applicable), and then marks
them as SUCCEEDED. However, tasktrackers are not notified about these and
the state of the tasks in the tasktrackers don't change, i.e., they remain
in COMMIT_PENDING state. In short, COMMIT_PENDING at the tasktracker's end
doesn't necessarily mean the job is stuck.

The tasktracker keeps in its memory the objects corresponding to tasks it
runs. Those objects are purged on job completion/failure only. This explains
why you see so many tasks in the COMMIT_PENDING state. I believe it will
create one jobconf for every task it launches.

I am only concerned about the memory consumption by the jobconf objects. As
per your report, it is ~1.6 MB per jobconf. 

You could try things out with an increased heap size for the
tasktrackers/tasks. You could increase the heap size for the tasktracker by
changing the value of HADOOP_HEAPSIZE in hadoop-env.sh, and the tasks' heap
size can be increased by tweaking the value of mapred.child.java.opts in the
hadoop-site.xml for your job.
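
For the per-task heap, the same setting can also be made programmatically when
the job is set up; a small sketch (the class name and -Xmx value are just
examples):

import org.apache.hadoop.mapred.JobConf;

public class HeapSizeExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(HeapSizeExample.class);
    // Equivalent to setting mapred.child.java.opts in hadoop-site.xml;
    // each task JVM will be launched with this heap limit.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
  }
}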

> -Original Message-
> From: Lili Wu [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 01, 2008 4:19 AM
> To: core-user@hadoop.apache.org
> Subject: Re: OOM error with large # of map tasks
> 
> Hi Devaraj,
> 
> We don't have any special configuration on the job conf...
> 
> We only allow 3 map tasks and 3 reduce tasks in *one* node at 
> any time.  So we are puzzled why there are 572 job confs on 
> *one* node?  From the heap dump, we see there are 569 MapTask 
> and 3 ReduceTask, (and that corresponds to 1138 MapTaskStatus 
> and 6 ReduceTaskStatus.)
> 
> We *think* many Map tasks were stuck in COMMIT_PENDING stage, 
> because in heap dump, we saw a lot of MapTaskStatus objects 
> being in either "UNASSIGNED" or "COMMIT_PENDING" state (the 
> runState variable in
> MapTaskStatus).   Then we took a look at another node on UI 
> just now,  for a
> given task tracker, under "Non-runnign tasks", there are at 
> least 200 or 300 COMMIT_PENDING tasks.  It appears they stuck too.
> 
> Thanks a lot for your help!
> 
> Lili
> 
> 
> On Wed, Apr 30, 2008 at 2:14 PM, Devaraj Das 
> <[EMAIL PROTECTED]> wrote:
> 
> > Hi Lili, the jobconf memory consumption seems quite high. Could you 
> > please let us know if you pass anything in the jobconf of jobs that 
> > you run? I think you are seeing the 572 objects since a job 
> is running 
> > and the TaskInProgress objects for tasks of the running job 
> are kept 
> > in memory (but I need to double check this).
> > Regarding COMMIT_PENDING, yes it means that tasktracker has 
> finished 
> > executing the task but the jobtracker hasn't committed the 
> output yet. 
> > In
> > 0.16 all tasks have to necessarily take the transition from
> > RUNNING->COMMIT_PENDING->SUCCEEDED. This behavior has been 
> improved in
> > 0.17
> > (hadoop-3140) to include only tasks that generate output, 
> i.e., a task 
> > is marked as SUCCEEDED if it doesn't generate any output in 
> its output path.
> >
> > Devaraj
> >
> > > -Original Message-
> > > From: Lili Wu [mailto:[EMAIL PROTECTED]
> > > Sent: Thursday, May 01, 2008 2:09 AM
> > > To: core-user@hadoop.apache.org
> > > Cc: [EMAIL PROTECTED]
> > > Subject: OOM error with large # of map tasks
> > >
> > > We are using hadoop 0.16 and are seeing a consistent problem:
> > >  out of memory errors when we have a large # of map tasks.
> > > The specifics of what is submitted when we reproduce this:
> > >
> > > three large jobs:
> > > 1. 20,000 map tasks and 10 reduce tasks 2. 17,000 map 
> tasks and 10 
> > > reduce tasks 3. 10,000 map tasks and 10 reduce tasks
> > >
> > > these are at normal priority and periodically we swap the 
> priorities 
> > > around to get some tasks started by each and let them complete.
> > > other smaller jobs come  and go every hour or so (no more 
> than 200 
> > > map tasks, 4-10 reducers).
> > >
> > > Our cluster consists of 23 nodes and we have 69 map tasks and
> > > 69 reduce tasks.
> > > Eventually, we see consistent oom errors in the task logs and the 
> > > task tracker itself goes down on as many as 14 of our nodes.
> > >
> > > We examined a heap dump after one of these crashes of a 
> TaskTracker 
> > > and found something interesting--there were 572 instances of 
> > > JobConf's that
> > > accounted for 940mb of String objects.   This seems quite odd
> > > that there are
> > > so many instances of JobConf.  It seems to correlate with task in 
> > > the COMMIT_PENDING state as shown on the status for a 
> task tracker 
> > > node.  Has anyone observed something like this?
> > > can anyone explain what would cause tasks to remain in 
> this state? 
> > > (which also apparently is in-memory vs
> > > serialized to disk...).   In general, what does
> > > COMM

Re: Block reports: memory vs. file system, and Dividing offerService into 2 threads

2008-05-01 Thread Cagdas Gerede
 As far as I understand, the current focus is on how to reduce namenode's
CPU time to process block reports from a lot of datanodes.

Aren't we missing another issue? Doesn't the way a block report is computed
delay the master startup time? I have to make sure the master is up as
quickly as possible for maximum availability. The bottleneck seems to be the
scanning of the local disk. I wrote a simple Java program that only scanned
the datanode directories the way the Hadoop code does, and the time the Java
program took was equivalent to 90% of the time taken for block report
generation and sending. It seems scanning is very costly; it takes about 2-4
minutes.

 To address the problem, can we have *two types of block reports*? One is
generated from memory and the other from the local fs. For master startup, we
can trigger the block report that is generated from memory, and for periodic
ones we can trigger the block report that is computed from the local fs.



Another issue I have is that even if we do block reports only every 10 days,
once one happens, it will almost freeze the datanode functions. More
specifically, the datanode won't be able to report new blocks to the namenode
until this report is computed. This takes at least a couple of minutes in my
system for each datanode. As a result, the master thinks a block is not yet
replicated enough and rejects the addition of a new block to a file. Then,
since it does not wait long enough, this eventually causes the writing of a
file to fail. To address the first problem, can we move the scanning of the
underlying disk into a separate thread, distinct from the reporting of newly
received blocks?
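
A generic sketch of that separation (this is not the actual DataNode code, just
an illustration of keeping the slow periodic disk scan off the path that
reports newly received blocks; all names are hypothetical):

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BlockReporter {

  // Newly received blocks are queued here and reported immediately,
  // independent of the full scan below.
  private final BlockingQueue<String> newBlocks = new LinkedBlockingQueue<String>();

  private final ScheduledExecutorService scanner =
      Executors.newSingleThreadScheduledExecutor();

  public void start() {
    // Slow full scan of the local data directories, on its own thread.
    scanner.scheduleWithFixedDelay(new Runnable() {
      public void run() {
        List<String> fullReport = scanDataDirectories(); // the expensive part
        sendFullBlockReport(fullReport);
      }
    }, 0, 1, TimeUnit.HOURS);

    // Fast path: report new blocks as soon as they arrive.
    new Thread(new Runnable() {
      public void run() {
        try {
          while (true) {
            sendNewBlockReport(newBlocks.take());
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }).start();
  }

  public void blockReceived(String blockId) {
    newBlocks.add(blockId);
  }

  // Placeholders standing in for the real datanode operations.
  private List<String> scanDataDirectories() { return java.util.Collections.emptyList(); }
  private void sendFullBlockReport(List<String> report) { }
  private void sendNewBlockReport(String blockId) { }
}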

Dhruba points out
> This sequential nature is critical in ensuring that there is no erroneous
race condition in the Namenode

I do not have any insight into this.


Cagdas

-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info


ClassNotFoundException while running jar file

2008-05-01 Thread chaitanya krishna
Hi,

I wanted to run my own java code in hadoop. The following are the commands
that I executed and errors occurred.

mkdir temp

javac -Xlint -classpath hadoop-0.16.0-core.jar -d temp
GetFeatures.java   (GetFeatures.java is the code)

jar -cvf temp.jar temp

bin/hadoop jar temp.jar GetFeatures input/input.txt out

ERROR:


Exception in thread "main" java.lang.ClassNotFoundException: GetFeatures
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:242)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)


can you interpret the possible reason for the error?

Thank you.


Re: Hadoop Cluster Administration Tools?

2008-05-01 Thread Steve Loughran

Khalil Honsali wrote:

Thanks Mr. Steve, and everyone..

I actually have just 16 machines (normal P4 PCs), so in case I need to do
things manually it takes half an hour (for example when installing sun-java,
I had to type that 'yes' for each .bin install)
but for now i'm ok with pssh or just a simple custom script, however, I'm
afraid things will get complicated soon enough...

You said:
"you can automate rpm install using pure "rpm" command, and check for
installed artifacts yourself"
Could you please explain more, I understand you run the same rpm against all
machines provided the cluster is homogeneous.



1. you can push out the same RPM files to all machines.

2. if you use rpmbuild (Ant's <rpm> task does this), you can build your 
own RPMs and push them out, possibly with scp, then run ssh to install them.

http://wiki.smartfrog.org/wiki/display/sf/RPM+Files

3. A lot of linux distros have adopted Yum
http://wiki.smartfrog.org/wiki/display/sf/Pattern+-+Yum


I was discussing Yum support on the Config-Management list last week, 
funnily enough

http://lopsa.org/pipermail/config-mgmt/2008-April/000662.html

Nobody likes automating it much, as:
 - it doesn't provide much state information
 - it doesn't let you roll back very easily, or fix what you want

Most people in that group -the CM tool authors - prefer to automate RPM 
install/rollback themselves, so they can stay in control.


Having a look at how our build.xml file manages test RPMs -that is, from
the build VMware image to a clean test image- we copy the files over and then
run the operations:




  




  
command="cd ${rpm.full.ssh.dir};rpm --upgrade --force 
${rpm.verbosity} smartfrog-*.rpm"

outputProperty="rpm.result.all"/>

  


The  preset runs a remote root command


  



  


More troublesome is how we check for errors. No simple exit code here, 
instead I have to scan for strings in the response.



  
  

  @{result}


  

  
  The rpm contains files belonging to an unknown user.

  


Then, once everything is installed, I do something even scarier - run 
lots of query commands and look for error strings. I do need to automate 
this better; it's on my todo list and one of the things I might use as a 
test project would be automating creating custom hadoop EC2 images, 
something like


-bring up the image
-push out new RPMs and ssh keys, including JVM versions.
-create the new AMI
-set the AMI access rights up.
-delete the old one.

Like I said, on the todo list.




--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Input and Output types?

2008-05-01 Thread Sridhar Raman
Thanks.

On Fri, Apr 18, 2008 at 9:14 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

>
> On Apr 17, 2008, at 11:20 PM, Sridhar Raman wrote:
>
>  I am new to MapReduce and Hadoop, and I have managed to find my way
> > through
> > with a few programs.  But I still have some doubts that are constantly
> > clinging onto me.  I am not too sure whether these are basic doubts, or
> > just
> > some documentation that I missed somewhere.
> >
>
> Take a look at  http://tinyurl.com/4y7776 under InputFormats.
>
>  1)  Should my input _always_ be text files?  What if my input is in the
> > form
> > of Java objects?  Where do I handle this conversion?
> >
>
> You can define your own InputFormat that reads an arbitrary format, or use
> SequenceFileInputFormat that reads SequenceFiles. SequenceFiles are a file
> format defined by Hadoop to hold binary data consisting of Writable keys and
> values.
>
>  2)  How do I control how the output is written?  For example, if I want
> > to
> > output in a format that is my own, how do I do it?
> >
>
> That is controlled by the OutputFormat. It defaults to TextOutputFormat,
> but you can either use SequenceFileOutputFormat or make your own.
>
> -- Owen
>
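
For reference, a minimal sketch of wiring SequenceFile input and output into a
JobConf using the 0.16-era setters (the class name, key/value classes, and
paths are only illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class SequenceFileJobSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SequenceFileJobSetup.class);

    // Read binary <key, value> records instead of lines of text...
    conf.setInputFormat(SequenceFileInputFormat.class);
    // ...and write the job's output as a SequenceFile as well.
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));
  }
}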


Re: User accounts in Master and Slaves

2008-05-01 Thread Sridhar Raman
Though I am able to run MapReduce tasks without errors, I am still not able
to get stop-all to work.  It still says, "no tasktracker to stop, no
datanode to stop, ...".

And also, there are a lot of java processes running in my Task Manager which
I need to forcibly shut down.  Are these two problems related?

On Thu, Apr 24, 2008 at 3:06 PM, Sridhar Raman <[EMAIL PROTECTED]>
wrote:

> I tried following the instructions for a single-node cluster (as mentioned
> in the link).  I am facing a strange roadblock.
>
> In the hadoop-site.xml, I have set the value of hadoop.tmp.dir to
> /WORK/temp/hadoop/workspace/hadoop-${user.name}.
>
> After doing this, I run bin/hadoop namenode -format, and this now creates
> a hadoop-sridhar folder at under workspace.  This is fine as the user I've
> logged on as is "sridhar".
>
> Then I start my cluster by running bin/start-all.sh.  When I do this, the
> output I get is as mentioned in this 
> link,
> but a new folder called hadoop-SYSTEM is created under workspace.
>
> And then when I run bin/stop-all.sh, all that I get is a "no tasktracker
> to stop, no datanode to stop, ...".  Any idea why this can happen?
>
> Another point is that after starting the cluster, I did a netstat.  I
> found multiple entries of localhost:9000 and all had LISTENING.  Is this
> also expected behaviour?
>
>
> On Wed, Apr 23, 2008 at 9:13 PM, Norbert Burger <[EMAIL PROTECTED]>
> wrote:
>
> > Yes, this is the suggested configuration.  Hadoop relies on
> > password-less
> > SSH to be able to start tasks on slave machines.  You can find
> > instructions
> > on creating/transferring the SSH keys here:
> >
> >
> > http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> >
> > On Wed, Apr 23, 2008 at 4:39 AM, Sridhar Raman <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Ok, what about the issue regarding the users?  Do all the machines
> > need to
> > > be under the same user?
> > >
> > > On Wed, Apr 23, 2008 at 12:43 PM, Harish Mallipeddi <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Apr 23, 2008 at 3:03 PM, Sridhar Raman <
> > [EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > After trying out Hadoop in a single machine, I decided to run a
> > > > MapReduce
> > > > > across multiple machines.  This is the approach I followed:
> > > > > 1 Master
> > > > > 1 Slave
> > > > >
> > > > > (A doubt here:  Can my Master also be used to execute the
> > Map/Reduce
> > > > > functions?)
> > > > >
> > > >
> > > > If you add the master node to the list of slaves (conf/slaves), then
> > the
> > > > master node run will also run a TaskTracker.
> > > >
> > > >
> > > > >
> > > > > To do this, I set up the masters and slaves files in the conf
> > > directory.
> > > > > Following the instructions in this page -
> > > > > http://hadoop.apache.org/core/docs/current/cluster_setup.html, I
> > had
> > > set
> > > > > up
> > > > > sshd in both the machines, and was able to ssh from one to the
> > other.
> > > > >
> > > > > I tried to run bin/start-dfs.sh.  Unfortunately, this asked for a
> > > > password
> > > > > for [EMAIL PROTECTED], while in slave, there was only user2.  While in
> > > master,
> > > > > user1 was the logged on user.  How do I resolve this?  Should the
> > user
> > > > > accounts be present in all the machines?  Or can I specify this
> > > > somewhere?
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Harish Mallipeddi
> > > > circos.com : poundbang.in/blog/
> > > >
> > >
> >
>
>


Re: Inconsistency while running in eclipse and cygwin

2008-05-01 Thread Sridhar Raman
I managed to fix both problems.

1)  The one in Eclipse was happening because of the df command, which isn't
available on Windows.  Once I added cygwin/bin to the Path, this started
working.

2)  This one occurred because the output value classes of my Combiner and
Reducer were different.
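
For anyone hitting the same error: the combiner runs on map output, so whatever
it emits must match the declared map output key/value classes; only the
reducer's final output classes may differ. A minimal sketch of the relevant
declarations (Text is only a stand-in for the real value class; the
InterVector/Vector mismatch above is what triggers "wrong value class"):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class CombinerTypesExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(CombinerTypesExample.class);

    // The combiner must emit exactly these classes, because its output is
    // treated as map output.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);

    // The reducer's final output classes are declared separately and may differ.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
  }
}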

On Wed, Apr 30, 2008 at 6:49 PM, Sridhar Raman <[EMAIL PROTECTED]>
wrote:

> I am trying to run my MR task in a local machine.  I am doing this because
> this is my first foray into MapReduce, and want to debug completely before I
> move it to a multi-node cluser.  The initial hitch that I faced was in
> running it from Eclipse.  Thanks to this 
> response,
> I managed to get that working.
>
> But now I am facing a strange problem.
>
> My Mapper takes  as input, and returns  Vector>.
> I have a Combiner that takes the > and
> computes the intermediate output of .  This
> IntVector basically stores the sum of the Vectors present in the iterator,
> along with other information.
>
> Now to the problem.  If I run this in Eclipse, after the Mapper is done, I
> straightaway get this exception:
> Iteration 0
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
> at
> com.company.analytics.clustering.mr.core.KMeansDriver.runIteration(KMeansDriver.java:144)
> at
> com.company.analytics.clustering.mr.core.KMeansDriver.runJob(KMeansDriver.java:91)
> at
> com.company.analytics.clustering.mr.core.KMeansDriver.main(KMeansDriver.java:37)
>
> Whereas if I run this from the cygwin command prompt, the execution
> manages to go to the Combiner, but after that it gives this exception:
> java.io.IOException: wrong value class:
> com.company.analytics.clustering.feature.
> InterVector is not class com.company.analytics.clustering.feature.Vector
> at
> org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:938)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(MapTask.java:414)
> at
> com.company.analytics.clustering.mr.core.KMeansCombiner.reduce(KMeansCombiner.java:57)
> at
> com.company.analytics.clustering.mr.core.KMeansCombiner.reduce(KMeansCombiner.java:1)
>
> Any idea what could be the problem?  Both the cygwin and Eclipse runs take the
> same input folder.
> I would be glad if I get any assistance.
>
> Thanks,
> Sridhar
>


Re: One-node cluster with DFS on Debian

2008-05-01 Thread Steve Loughran

Richard Crowley wrote:
Problem fixed.  My machine's /etc/hostname file came without a 
fully-qualified domain name.  Why does Hadoop (or perhaps just 
java.net.InetAddress) rely on reverse DNS lookups?


Richard



Java networking is a mess. There are some implicit assumptions ("well-managed
network, static IPs, static proxy settings") that don't apply
everywhere... it takes testing on home boxes to bring these problems up,
like mine


"regression: Sf daemon longer works on my home machine due to networking 
changes"

http://jira.smartfrog.org/jira/browse/SFOS-697

If I look at our code to fix this, we handled failure by falling back to 
something else


        try {
            // this can still do a network reverse DNS lookup, and hence fail
            hostInetAddress = InetAddress.getLocalHost();
        } catch (UnknownHostException e) {
            // no, nothing there either
            hostInetAddress = InetAddress.getByName(null);
        }


Re: Getting jobTracker startTime from the JobClient

2008-05-01 Thread Steve Loughran

Pete Wyckoff wrote:

I'm looking for the actual #of seconds since (or absolute time) the job
tracker was ready to start accepting jobs.

I'm writing a utility to (attempt :)) more robustly run hadoop jobs and one
thing I want to detect is if the JobTracker has gone down while a job that
failed was running.

Since the script will also sleep and poll the job tracker till it comes back
up (or connectivity is restored), when I finally contact the job tracker, I
want to know how long its been running.

So, knowing whether the JT is initializing or not is useful, but also the
actual time it was ready.

Thanks, pete


Seems to me like something that should be remotely available and visible via
JMX... always good to use. Absolute time info, if sent as a time_t long,
should be in UTC to avoid confusion when talking to servers in different
time zones...
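
A generic sketch of exposing such a start timestamp over JMX (not the
JobTracker's actual code, just the pattern; the class names and ObjectName are
illustrative, and the value is kept as UTC milliseconds since the epoch):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class StartTimeJmx {

  // Standard MBean convention: implementation class name + "MBean".
  public interface StartTimeMBean {
    long getStartTime();   // UTC milliseconds since the epoch
  }

  public static class StartTime implements StartTimeMBean {
    private final long startTime = System.currentTimeMillis();
    public long getStartTime() { return startTime; }
  }

  public static void main(String[] args) throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    server.registerMBean(new StartTime(),
        new ObjectName("example:type=StartTime"));  // illustrative name
    // A remote JMX client (e.g. jconsole) can now read the StartTime attribute.
    Thread.sleep(Long.MAX_VALUE);
  }
}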


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/