too many open files error

2008-09-26 Thread Eric Zhang

Hi,
I encountered the following FileNotFoundException, caused by a "too many
open files" error, when I tried to run a job.  The job had run several
times before without problems.  I am confused by the exception because my
code closes all the files, and even if it didn't, the job has only 10-20
small input/output files.  The open-file limit on my box is 1024.  Besides,
the error seemed to happen even before the task was executed.  I am using
version 0.17.  I'd appreciate it if somebody could shed some light on this
issue.  BTW, the job ran OK after I restarted Hadoop.  Yes, the
hadoop-site.xml did exist in that directory.

java.lang.RuntimeException: java.io.FileNotFoundException: /home/y/conf/hadoop/hadoop-site.xml (Too many open files)
   at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:901)
   at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:804)
   at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:772)
   at org.apache.hadoop.conf.Configuration.get(Configuration.java:272)
   at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:414)
   at org.apache.hadoop.mapred.JobConf.getKeepFailedTaskFiles(JobConf.java:306)
   at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.setJobConf(TaskTracker.java:1487)
   at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:722)
   at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:716)
   at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)
Caused by: java.io.FileNotFoundException: /home/y/conf/hadoop/hadoop-site.xml (Too many open files)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:106)
   at java.io.FileInputStream.<init>(FileInputStream.java:66)
   at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
   at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
   at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
   at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:225)
   at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
   at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:832)
   ... 12 more


Sometimes it gave me this message:
java.io.IOException: Cannot run program bash: java.io.IOException: error=24, Too many open files
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
   at org.apache.hadoop.util.Shell.run(Shell.java:134)
   at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
   at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
   at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:133)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
   ... 6 more
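
For reference, this is roughly how the files are opened and closed in the
job code (a simplified, hypothetical sketch, not the actual mapper code):

    // Simplified sketch of the close-in-finally pattern the job follows
    // (hypothetical example, not the real job code).
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SideFileReader {
        public static String readFirstLine(String path) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                return in.readLine();
            } finally {
                in.close();  // always release the file descriptor, even on error
            }
        }
    }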



--
Eric Zhang
408-349-2466
Vespa Content team



Re: Questions about the MapReduce libraries and job schedulers inside JobTracker and JobClient running on Hadoop

2008-02-19 Thread Eric Zhang
The class is defined with package-level access, so it is not displayed in 
the javadoc.   The source code comes with the Hadoop installation under 
${HADOOP_INSTALLATION_DIR}/src/java/org/apache/hadoop/mapred.
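
In other words, the declaration omits the public modifier, roughly like
this (illustrative sketch only, not the actual source):

    package org.apache.hadoop.mapred;

    // No "public" modifier: the class is package-private, visible only to
    // other classes in org.apache.hadoop.mapred, so it is omitted from the
    // public javadoc.
    class JobInProgress {
        // ...
    }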


Eric
Andy Li wrote:

Thanks for both inputs.  My question actually focuses more on what Vivek
has mentioned.

I would like to work on the JobClient to see how it submits jobs to
different file systems and slaves in the same Hadoop cluster.

I am not sure if there is a complete document explaining the scheduler
underneath Hadoop; if not, I'll wrap up what I know and learn from the
source code, and submit it to the community once it is done.  Reviews and
comments are welcome.

As for the code, I couldn't find JobInProgress in the API index.  Could
anyone provide me a pointer to it?  Thanks.

On Fri, Feb 15, 2008 at 3:01 PM, Vivek Ratan [EMAIL PROTECTED] wrote:

  

I read Andy's question a little differently. For a given job, the
JobTracker decides which tasks go to which TaskTracker (the TTs ask for a
task to run and the JT decides which task is the most appropriate).
Currently, the JT favors a task whose input data is on the same host as
the TT (if there is more than one such task, it picks the one with the
largest input size). It also looks at failed tasks and certain other
criteria. This is very basic scheduling and there is a lot of scope for
improvement. There is currently a proposal to support rack awareness, so
that if the JT can't find a task whose input data is on the same host as
the TT, it looks for a task whose data is on the same rack.

You can clearly get more ambitious with your scheduling algorithm. As you
mention, you could use other criteria for scheduling a task: available CPU
or memory, for example. You could assign tasks to hosts that are the most
'free', or aim to distribute tasks across racks, or try some other
load-balancing techniques. I believe there are a few discussions on these
methods on Jira, but I don't think there's anything concrete yet.

BTW, the code that decides what task to run is primarily in
JobInProgress::findNewTask().
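
To make that concrete, here is a rough sketch of the locality preference
described above (a simplified illustration, not the actual findNewTask()
code; the class and method names below are made up):

    import java.util.List;

    // Simplified illustration of locality-preferring task selection.
    class LocalityScheduler {

        static class PendingTask {
            String inputHost;   // host holding the task's input split
            long inputSize;     // size of the input split in bytes
            PendingTask(String inputHost, long inputSize) {
                this.inputHost = inputHost;
                this.inputSize = inputSize;
            }
        }

        // Pick a task for the TaskTracker running on ttHost: prefer tasks
        // whose input lives on that host, and among those choose the one
        // with the largest input; otherwise fall back to any pending task.
        static PendingTask pickTask(String ttHost, List<PendingTask> pending) {
            PendingTask best = null;
            for (PendingTask t : pending) {
                if (t.inputHost.equals(ttHost)
                        && (best == null || t.inputSize > best.inputSize)) {
                    best = t;   // best data-local candidate so far
                }
            }
            if (best == null && !pending.isEmpty()) {
                best = pending.get(0);   // no local task available
            }
            return best;
        }
    }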


-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Friday, February 15, 2008 1:54 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about the MapReduce libraries and job schedulers
inside JobTracker and JobClient running on Hadoop


Core-user is the right place for this question.

Your description is mostly correct.  Jobs don't necessarily go to all of
your boxes in the cluster, but they may.

Non-uniform machine specs are a bit of a problem that is being (has been?)
addressed by allowing each machine to have a slightly different
hadoop-site.xml file.  That would allow different settings for storage
configuration and number of processes to run.
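
For example, because each TaskTracker builds its configuration from its own
local conf directory, two machines with different hadoop-site.xml files can
advertise different task capacities. A rough sketch of how a node picks up
its local settings (the property names here are the usual per-TaskTracker
slot settings; check hadoop-default.xml in your release to confirm them):

    import org.apache.hadoop.mapred.JobConf;

    public class LocalSlotSettings {
        public static void main(String[] args) {
            // JobConf loads hadoop-default.xml plus the node's own
            // hadoop-site.xml, so these values can differ per machine.
            JobConf conf = new JobConf();
            int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
            int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            System.out.println("map slots = " + mapSlots
                + ", reduce slots = " + reduceSlots);
        }
    }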

Even without that, you can level the load a bit by simply running more
jobs on the weak machines than you would otherwise prefer. Most map-reduce
programs are pretty light on memory usage, so all that happens is that you
get less throughput on the weak machines. Since there are normally more
map tasks than cores, this is no big deal; slow machines get fewer tasks,
and toward the end of the job their tasks are even replicated on other
machines in case they can be done more quickly.


On 2/15/08 1:25 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:




Hello,

This is my first time posting in this news group.  My question sounds more
like a MapReduce question than a Hadoop HDFS question.

To my understanding, the JobClient will submit all Mapper and Reducer
classes in a uniform way to the cluster?  Can I assume this is more like a
uniform scheduler for all the tasks?

For example, suppose I have a 100-node cluster: 1 master (namenode) and 99
slaves (datanodes).  When I do
JobClient.runJob(jconf)
the JobClient will uniformly distribute all Mapper and Reducer classes to
all 99 nodes.

The slaves will all have the same hadoop-site.xml and hadoop-default.xml.
Here comes the main concern: what if some of the nodes don't have the same
hardware spec, such as memory or CPU speed?  E.g. different batch
purchases and repairs over time can cause this.

Is there any way that the JobClient can be aware of this and submit a
different number of tasks to different slaves during start-up?  For
example, some slaves have 16-core CPUs instead of 8 cores.  The problem I
see here is that on the 16-core machines, only 8 cores are used.

P.S. I'm looking into the JobClient source code and JobProfile/JobTracker
to see if this can be done, but I am not sure if I am on the right track.

If this topic is more likely to be in the [EMAIL PROTECTED], please let me
know.  I'll send another one to that news group.

Regards,
-Andy


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Eric Zhang
This is very impressive.  Congrats!


Which version of Hadoop is this running on and what's the input data size?

Eric

Owen O'Malley wrote:
The link inversion and ranking algorithms for Yahoo Search are now 
being generated on Hadoop:


http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html 



Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes