Re: The reduce copier failed

2008-09-26 Thread Arun C Murthy


2008-09-25 17:12:18,250 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200809180916_0027_r_07_2: Got 2 new map-outputs  number
of known map outputs is 21
2008-09-25 17:12:18,251 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_200809180916_0027_r_07_2 Merge of the inmemory files threw
an exception: java.io.IOException: Intermedate merge failed
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
Caused by: org.apache.lucene.index.MergePolicy$MergeException: segment _bfu exists in external directory yet the MergeScheduler executed the merge in a separate thread
   at org.apache.lucene.index.IndexWriter.copyExternalSegments(IndexWriter.java:2362)
   at org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:2307)
   at org.apache.hadoop.contrib.index.mapred.IntermediateForm.process(IntermediateForm.java:135)
   at org.apache.hadoop.contrib.index.mapred.IndexUpdateCombiner.reduce(IndexUpdateCombiner.java:56)
   at org.apache.hadoop.contrib.index.mapred.IndexUpdateCombiner.reduce(IndexUpdateCombiner.java:38)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2160)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2120)
   ... 1 more

2008-09-25 17:12:19,087 INFO org.apache.hadoop.mapred.ReduceTask: Read 14523136 bytes from map-output for attempt_200809180916_0027_m_16_0
2008-09-25 17:12:19,087 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200809180916_0027_m_16_0 - (41, 10651153) from ars1dev6
2008-09-25 17:12:19,110 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 14506735 bytes (14506735 raw bytes) into RAM from attempt_200809180916_0027_m_09_0
2008-09-25 17:12:19,226 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
2008-09-25 17:12:19,228 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: attempt_200809180916_0027_r_07_2The reduce copier failed
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)


Thanks,
Joe


Hmm... could this be something specific to org.apache.lucene.index.MergePolicy?

Arun




Re: 1 file per record

2008-09-26 Thread chandravadana


hi

I'm writing an application that computes over the entire contents of a file,
so I don't want the file to be split; the whole file should go to one map
task. I've overridden isSplitable() to do this, and the file is no longer
being split.

Next, I stored the input values into an array (in the map function) and then
proceeded with my computation. When I displayed that array, only the last
line of the file showed up. Does this mean the data is read line by line by
the line reader rather than continuously?

If so, what should I do in order to read the complete contents of the file?

Thank you
Chandravadana S
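A minimal sketch of one way to hand the mapper the whole (unsplit) file as a
single record, assuming the 0.17/0.18 mapred API; the class name and the
Text/BytesWritable choice are illustrative, not from the thread, and it
assumes the file fits in memory:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

// Emits exactly one record per (unsplit) file: key = path, value = file bytes.
public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
  private final FileSplit split;
  private final JobConf conf;
  private boolean done = false;

  public WholeFileRecordReader(FileSplit split, JobConf conf) {
    this.split = split;
    this.conf = conf;
  }

  public boolean next(Text key, BytesWritable value) throws IOException {
    if (done) return false;
    Path file = split.getPath();
    // Assumes the whole file fits in a byte[] (i.e. length < 2 GB).
    byte[] contents = new byte[(int) split.getLength()];
    FSDataInputStream in = FileSystem.get(conf).open(file);
    try {
      in.readFully(0, contents);        // read the complete file in one call
    } finally {
      in.close();
    }
    key.set(file.toString());
    value.set(contents, 0, contents.length);
    done = true;
    return true;
  }

  public Text createKey() { return new Text(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return done ? split.getLength() : 0; }
  public float getProgress() { return done ? 1.0f : 0.0f; }
  public void close() { }
}

An InputFormat whose isSplitable() returns false (as already done above) would
return this reader from getRecordReader(), so each map() call receives the
entire file contents as one value instead of line-by-line records.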


Enis Soztutar wrote:
 
 Yes, you can use MultiFileInputFormat.
 
 You can extend the MultiFileInputFormat to return a RecordReader, which 
 reads a record for each file in the MultiFileSplit.
 
 Enis
 
 chandra wrote:
 hi..

 By setting isSplitable to false, we can send 1 file with n records to 1 mapper.

 Is there any way to set 1 complete file per record?

 Thanks in advance
 Chandravadana S





   
 
 
 

-- 
View this message in context: 
http://www.nabble.com/1-file-per-record-tp19644985p19685269.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Questions about Hadoop

2008-09-26 Thread Paco NATHAN
Edward,

Can you describe more about Hama, with respect to Hadoop?
I've read through the Incubator proposal and your blog -- it's a great approach.

Are there any benchmarks available?  E.g., size of data sets used,
kinds of operations performed, etc.

Will this project be able to make use of existing libraries?

Best,
Paco


On Thu, Sep 25, 2008 at 9:31 PM, Edward J. Yoon [EMAIL PROTECTED] wrote:
 The decision making system seems interesting to me. :)

 The question I want to ask is whether it is possible to perform statistical 
 analysis on the data using Hadoop and MapReduce.

 I'm sure Hadoop could do it. FYI, the Hama project aims to provide
 easy-to-use matrix algebra and its applications in statistical analysis on
 Hadoop and HBase. (It is still in its early stages.)

 /Edward


Re: small hadoop cluster setup question

2008-09-26 Thread Samuel Guo
could you please attach your configurations and logs?

On Fri, Sep 26, 2008 at 6:12 AM, Ski Gh3 [EMAIL PROTECTED] wrote:

 Hi all,

 I'm trying to set up a small cluster with 3 machines.  I'd like one machine
 to serve as the namenode and the jobtracker, while all 3 serve as datanodes
 and tasktrackers.

 After following the setup instructions, I got an exception running
 $HADOOP_HOME/bin/start-dfs.sh: java.net.BindException: Address already in
 use, and my secondarynamenode cannot be started.  But as far as I can see,
 the namenode started and the three datanode processes all started. Does
 this matter?

 However, when I go to the web interface at localhost:50070, I see only 1
 datanode.
 Is there any reason why the datanodes are not started on the other two
 machines?
 The site.xml file, the java path, the installation path etc. are all set up.

 Does anyone have this problem before? I 'd really appreciate any help!

 Thanks!



Re: Adding $CLASSPATH to Map/Reduce tasks

2008-09-26 Thread Samuel Guo
maybe you can use
bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args

see details:
http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html
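For reference, the DistributedCache route that Arun suggests further down in
the quoted thread looks roughly like this (a sketch against the 0.18-era API;
the jar paths are only examples and must already exist in HDFS):

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheClasspathSetup {
  // Adds jars that are already in HDFS to the classpath of every task JVM.
  public static void addDependencies(JobConf conf) throws java.io.IOException {
    DistributedCache.addFileToClassPath(new Path("/user/me/lib/mylib.jar"), conf);
    // An archive (e.g. a zip of several jars) can be added the same way:
    // DistributedCache.addArchiveToClassPath(new Path("/user/me/lib/deps.zip"), conf);
  }
}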

On Thu, Sep 25, 2008 at 12:26 PM, David Hall [EMAIL PROTECTED]wrote:

 On Sun, Sep 21, 2008 at 9:41 PM, David Hall [EMAIL PROTECTED]
 wrote:
  On Sun, Sep 21, 2008 at 9:35 PM, Arun C Murthy [EMAIL PROTECTED]
 wrote:
 
  On Sep 21, 2008, at 2:05 PM, David Hall wrote:
 
  (New to this list)
 
  Hi,
 
  My research group is setting up a small (20-node) cluster. All of
  these machines are linked by NFS. We have a fairly entrenched
  codebase/development cycle, and in particular we'd like to be able to
  access user $CLASSPATHs in the forked jvms run by the Map and Reduce
  tasks. However, TaskRunner.java (http://tinyurl.com/4enkg4) seems to
   disallow this by specifying its own.
 
 
  Using jars on NFS for too many tasks might hurt if you have thousands of
  tasks, causing too much load.
 
  The better solution might be to use the DistributedCache:
 
 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
 
  Specifically:
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration%29
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration%29
 
  Arun
 
  Good point.. I hadn't thought of that, but at the moment we're dealing
  with barrier-to-adoption rather than efficiency. We'll have to go back
  to PBS if we can't get users (read: picky phd students) on board. I'd
  rather avoid that scenario...
 
  In the meantime, I think I figured out a hack that I'm going to try.

 In case anyone's curious, the hack is to create a jar file with a
 manifest that has the Class-Path field set to all the directories and
 jars you want, and to put that in the lib/ folder of another jar, and
 pass that final jar in as the User Jar to a job.

 Works like a charm. :-)

 -- David
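 A sketch of how such a classpath-only jar could be built with plain
 java.util.jar (the class name and argument convention here are illustrative,
 not from the thread):

 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.util.jar.Attributes;
 import java.util.jar.JarOutputStream;
 import java.util.jar.Manifest;

 public class ClasspathJarBuilder {
   // args[0] = jar to create; remaining args = classpath entries
   // (directories must end with '/'; entries are space-separated in the manifest).
   public static void main(String[] args) throws IOException {
     Manifest manifest = new Manifest();
     Attributes attrs = manifest.getMainAttributes();
     attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
     StringBuilder classPath = new StringBuilder();
     for (int i = 1; i < args.length; i++) {
       if (classPath.length() > 0) classPath.append(' ');
       classPath.append(args[i]);
     }
     attrs.put(Attributes.Name.CLASS_PATH, classPath.toString());
     // The jar needs no entries of its own; the manifest does all the work.
     new JarOutputStream(new FileOutputStream(args[0]), manifest).close();
   }
 }

 The resulting jar would then be dropped into the lib/ directory of the job
 jar, as described above.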

 
  Thanks!
 
  -- David
 
 
  Is there any easy way to trick hadoop into making these visible? If
  not, if I were to submit a patch that would (optionally) add
  $CLASSPATH to the forked jvms' classpath, would it be considered?
 
  Thanks,
  David Hall
 
 
 



Re: Adding $CLASSPATH to Map/Reduce tasks

2008-09-26 Thread Joe Shaw
Hi,

On Fri, Sep 26, 2008 at 10:50 AM, Samuel Guo [EMAIL PROTECTED] wrote:
 maybe you can use
 bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args

 see details:
 http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html

Indeed, I was having the same issue trying to get a Lucene jar file
into a running task.  Despite what the docs say, it works with the
"jar" option to the hadoop command.  (The docs I read said it only
worked with "job" and a couple of other commands; unfortunately I don't
have a link to that page at the moment.)

Joe


Re: IPC Client error | Too many files open

2008-09-26 Thread Raghu Angadi


What does jstack show for this?

Probably better suited for jira discussion.
Raghu.
Goel, Ankur wrote:

Hi Folks,

We have developed a simple log writer in Java that is plugged into
Apache custom log and writes log entries directly to our hadoop cluster
(50 machines, quad core, each with 16 GB RAM and 800 GB hard disk; 1
machine as a dedicated Namenode, another machine as JobTracker &
TaskTracker + DataNode).

There are around 8 Apache servers dumping logs into HDFS via our writer.
Everything was working fine and we were getting around 15 - 20 MB log
data per hour from each server.

 


Recently we have been experiencing problems with 2-3 of our Apache
servers where a file is opened by log-writer in HDFS for writing but it
never receives any data.

Looking at apache error logs shows the following errors

08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
        at sun.nio.ch.IOUtil.initPipe(Native Method)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)

...

...

 Followed by connection errors saying 


Retrying to connect to server: hadoop-server.com:9000. Already tried
'n' times.

(same as above) ...



and is retrying constantly (log-writer set up so that it waits and
retries).

 


Doing an lsof on the log writer java process shows that it got stuck in
a lot of pipe/event poll and eventually ran out of file handles. 


Below is the part of the lsof output

 


lsof -p 2171
COMMAND  PID USER   FD   TYPE DEVICE SIZE     NODE NAME
java    2171 root   20r  FIFO    0,7       24090207 pipe
java    2171 root   21w  FIFO    0,7       24090207 pipe
java    2171 root   22r            0,8    0 24090208 eventpoll
java    2171 root   23r  FIFO    0,7       23323281 pipe
java    2171 root   24r  FIFO    0,7       23331536 pipe
java    2171 root   25w  FIFO    0,7       23306764 pipe
java    2171 root   26r            0,8    0 23306765 eventpoll
java    2171 root   27r  FIFO    0,7       23262160 pipe
java    2171 root   28w  FIFO    0,7       23262160 pipe
java    2171 root   29r            0,8    0 23262161 eventpoll
java    2171 root   30w  FIFO    0,7       23299329 pipe
java    2171 root   31r            0,8    0 23299330 eventpoll
java    2171 root   32w  FIFO    0,7       23331536 pipe
java    2171 root   33r  FIFO    0,7       23268961 pipe
java    2171 root   34w  FIFO    0,7       23268961 pipe
java    2171 root   35r            0,8    0 23268962 eventpoll
java    2171 root   36w  FIFO    0,7       23314889 pipe
...

...

...

What in DFS client (if any) could have caused this? Could it be
something else?

Is it not ideal to use an HDFS writer to directly write logs from Apache
into HDFS?

Is 'Chukwa (hadoop log collection and analysis framework contributed by
Yahoo) a better fit for our case?

 


I would highly appreciate help on any or all of the above questions.

 


Thanks and Regards

-Ankur
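On the DFS-client question: one thing worth auditing in a long-lived writer
like this is whether output streams and FileSystem handles are being leaked,
since every open stream pins the sockets and selector pipes of the kind
visible in the lsof output above. A hedged sketch of the pattern (not Ankur's
actual code; names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLogAppender {
  private final FileSystem fs;              // one shared instance, not one per entry

  public HdfsLogAppender(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
  }

  public void writeBatch(Path file, byte[] batch) throws IOException {
    FSDataOutputStream out = fs.create(file);
    try {
      out.write(batch);
    } finally {
      out.close();                          // an unclosed stream keeps its pipes and sockets alive
    }
  }
}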






job details

2008-09-26 Thread Shirley Cohen

Hi,

I'm trying to figure out which log files are used by the job  
tracker's web interface to display the following information:


Job Name: my job
Job File: hdfs://localhost:9000/tmp/hadoop-scohen/mapred/system/ 
job_200809260816_0001/job.xml

Status: Succeeded
Started at: Fri Sep 26 08:18:04 CDT 2008
Finished at: Fri Sep 26 08:18:25 CDT 2008
Finished in: 20sec

What I would like to do is back up the log files that are needed to
display this information so that we can look at it later if the need
arises.  When I copy everything from the hadoop home/logs directory
into another hadoop home/logs directory, the jobs show up in the
history page. However, all I see is the job name and starting time,
but not the completion time or the length of the job.


Does anyone have any suggestions?

Thanks,

Shirley


Re: job details

2008-09-26 Thread Steve Loughran

Shirley Cohen wrote:

Hi,

I'm trying to figure out which log files are used by the job tracker's 
web interface to display the following information:


Job Name: my job
Job File: 
hdfs://localhost:9000/tmp/hadoop-scohen/mapred/system/job_200809260816_0001/job.xml 


Status: Succeeded
Started at: Fri Sep 26 08:18:04 CDT 2008
Finished at: Fri Sep 26 08:18:25 CDT 2008
Finished in: 20sec

What I would like to do is backup the log files that are needed to 
display this information so that we can look at it later if the need 
arises.  When I copy everything from the hadoop home/logs directory into 
another hadoop home/logs directory, the jobs show up in the history 
page. However, all I see are the job name and starting time, but not the 
completion time or the length of the job.




Looks like the JobStatus structure. Perhaps you will need to tune the 
logging information to display this when the log finishes, if it is not 
listed adequately.


-steve
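If the goal is only to archive those fields, one option is to pull them from
the JobTracker while it still knows about the job, rather than reconstructing
them from logs. A sketch against the 0.18-era JobClient API (not what the web
UI itself does):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class JobSummaryDump {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    // getAllJobs() includes completed jobs the JobTracker still remembers.
    for (JobStatus status : client.getAllJobs()) {
      RunningJob job = client.getJob(status.getJobId());
      if (job == null) continue;
      System.out.println(status.getJobId() + "\t" + job.getJobFile()
          + "\tcomplete=" + job.isComplete()
          + "\tsuccessful=" + (job.isComplete() && job.isSuccessful()));
    }
  }
}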


RE: How to input a hdfs file inside a mapper?

2008-09-26 Thread Htin Hlaing
I would imagine something like:

FSDataInputStream inFileStream = dfsFileSystem.open(dfsFilePath);

Don't forget to close after.

Thanks,
Htin
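Fleshed out a little (a sketch using the same 0.18-era API; the helper name
and path handling are illustrative), this might look like:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class SideFileReader {
  // Opens an HDFS file from inside a map() or reduce() call and reads its first line.
  public static String readFirstLine(JobConf conf, String hdfsPath) throws IOException {
    FileSystem fs = FileSystem.get(conf);           // the job's default filesystem (HDFS)
    FSDataInputStream in = fs.open(new Path(hdfsPath));
    try {
      return new BufferedReader(new InputStreamReader(in)).readLine();
    } finally {
      in.close();                                   // don't leak streams across calls
    }
  }
}

The JobConf is available to the task through the Mapper's configure(JobConf)
method in the old API.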

-Original Message-
From: Amit_Gupta [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 26, 2008 5:47 AM
To: core-user@hadoop.apache.org
Subject: How to input a hdfs file inside a mapper?


How can I get an input stream on a file stored in HDFS inside a mapper or a
reducer?

thanks 

Amit
-- 
View this message in context:
http://www.nabble.com/How-to-input-a-hdfs-file-inside-a-mapper--tp19687785
p19687785.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Adding $CLASSPATH to Map/Reduce tasks

2008-09-26 Thread David Hall
On Fri, Sep 26, 2008 at 7:50 AM, Samuel Guo [EMAIL PROTECTED] wrote:
 maybe you can use
 bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args

 see details:
 http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html

Most of our classes aren't packaged in jars. I suppose it wouldn't be too bad
to tell ant to jar them up, but with the hack, it's easy enough not to
bother.

-- David


 On Thu, Sep 25, 2008 at 12:26 PM, David Hall [EMAIL PROTECTED]wrote:

 On Sun, Sep 21, 2008 at 9:41 PM, David Hall [EMAIL PROTECTED]
 wrote:
  On Sun, Sep 21, 2008 at 9:35 PM, Arun C Murthy [EMAIL PROTECTED]
 wrote:
 
  On Sep 21, 2008, at 2:05 PM, David Hall wrote:
 
  (New to this list)
 
  Hi,
 
  My research group is setting up a small (20-node) cluster. All of
  these machines are linked by NFS. We have a fairly entrenched
  codebase/development cycle, and in particular we'd like to be able to
  access user $CLASSPATHs in the forked jvms run by the Map and Reduce
  tasks. However, TaskRunner.java (http://tinyurl.com/4enkg4) seems to
   disallow this by specifying its own.
 
 
  Using jars on NFS for too many tasks might hurt if you have thousands of
  tasks, causing too much load.
 
  The better solution might be to use the DistributedCache:
 
 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
 
  Specifically:
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration%29
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration%29
 
  Arun
 
  Good point.. I hadn't thought of that, but at the moment we're dealing
  with barrier-to-adoption rather than efficiency. We'll have to go back
  to PBS if we can't get users (read: picky phd students) on board. I'd
  rather avoid that scenario...
 
  In the meantime, I think I figured out a hack that I'm going to try.

 In case anyone's curious, the hack is to create a jar file with a
 manifest that has the Class-Path field set to all the directories and
 jars you want, and to put that in the lib/ folder of another jar, and
 pass that final jar in as the User Jar to a job.

 Works like a charm. :-)

 -- David

 
  Thanks!
 
  -- David
 
 
  Is there any easy way to trick hadoop into making these visible? If
  not, if I were to submit a patch that would (optionally) add
  $CLASSPATH to the forked jvms' classpath, would it be considered?
 
  Thanks,
  David Hall
 
 
 




Re: extracting input to a task from a (streaming) job?

2008-09-26 Thread Yuri Pradkin
I've created a jira describing my problems running under IsolationRunner:
https://issues.apache.org/jira/browse/HADOOP-4041

If anyone is using IsolationRunner successfully to re-run failed tasks in a
single JVM, can you please, pretty please, describe how you do that?

Thank you,

  -Yuri

On Friday 08 August 2008 10:09:48 Yuri Pradkin wrote:
 On Thursday 07 August 2008 16:43:10 John Heidemann wrote:
  On Thu, 07 Aug 2008 19:42:05 +0200, Leon Mergen wrote:
  Hello John,
  
  On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote:
   I have a large Hadoop streaming job that generally works fine,
   but a few (2-4) of the ~3000 maps and reduces have problems.
   To make matters worse, the problems are system-dependent (we run an a
   cluster with machines of slightly different OS versions).
   I'd of course like to debug these problems, but they are embedded in a
   large job.
  
   Is there a way to extract the input given to a reducer from a job,
   given the task identity?  (This would also be helpful for mappers.)
  
  I believe you should set keep.failed.tasks.files to true -- this way,
   give a task id, you can see what input files it has in ~/
  taskTracker/${taskid}/work (source:
  http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#Isolatio
  nR unner )

 IsolationRunner does not work as described in the tutorial.  After the task
 hung, I failed it via the web interface.  Then I went to the node that was
 running this task

   $ cd ...local/taskTracker/jobcache/job_200808071645_0001/work
 (this path is already different from the tutorial's)

   $ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
 Exception in thread main java.lang.NullPointerException
 at
 org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:164)

 Looking at IsolationRunner code, I see this:

 164     File workDirName = new File(lDirAlloc.getLocalPathToRead(
 165                                   TaskTracker.getJobCacheSubdir()
 166                                   + Path.SEPARATOR + taskId.getJobID()
 167                                   + Path.SEPARATOR + taskId
 168                                   + Path.SEPARATOR + "work",
 169                                   conf).toString());

 I.e. it assumes there is supposed to be a taskID subdirectory under the job
 dir, but:
  $ pwd
  ...mapred/local/taskTracker/jobcache/job_200808071645_0001
  $ ls
  jars  job.xml  work

 -- it's not there.  Any suggestions?

 Thanks,

   -Yuri
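For what it's worth, the job-side switch the tutorial is talking about can
also be set programmatically; a sketch assuming the 0.17/0.18 JobConf API
(the task-id pattern in the comment is only an example):

import org.apache.hadoop.mapred.JobConf;

public class DebugConfSetup {
  public static void enableTaskFileRetention(JobConf conf) {
    // Ask TaskTrackers to keep the intermediate files of failed tasks
    // so they can be inspected or re-run with IsolationRunner.
    conf.setKeepFailedTaskFiles(true);
    // Alternatively, keep files only for task ids matching a pattern:
    // conf.setKeepTaskFilesPattern(".*_r_000007.*");
  }
}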




too many open files error

2008-09-26 Thread Eric Zhang

Hi,
I encountered the following FileNotFoundException, resulting from a "too many
open files" error, when I tried to run a job.  The job had been run several
times before without problems.  I am confused by the exception because my
code closes all of its files, and even if it didn't, the job only has 10-20
small input/output files.  The limit on open files on my box is 1024.
Besides, the error seemed to happen even before the task was executed.  I am
using version 0.17.  I'd appreciate it if somebody could shed some light on
this issue.  BTW, the job ran OK after I restarted Hadoop.  Yes, the
hadoop-site.xml did exist in that directory.

java.lang.RuntimeException: java.io.FileNotFoundException: /home/y/conf/hadoop/hadoop-site.xml (Too many open files)
   at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:901)
   at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:804)
   at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:772)
   at org.apache.hadoop.conf.Configuration.get(Configuration.java:272)
   at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:414)
   at org.apache.hadoop.mapred.JobConf.getKeepFailedTaskFiles(JobConf.java:306)
   at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.setJobConf(TaskTracker.java:1487)
   at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:722)
   at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:716)
   at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)
Caused by: java.io.FileNotFoundException: /home/y/conf/hadoop/hadoop-site.xml (Too many open files)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:106)
   at java.io.FileInputStream.<init>(FileInputStream.java:66)
   at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
   at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
   at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
   at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:225)
   at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
   at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:832)
   ... 12 more


Sometimes it gave me this message:
java.io.IOException: Cannot run program "bash": java.io.IOException: error=24, Too many open files
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
   at org.apache.hadoop.util.Shell.run(Shell.java:134)
   at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
   at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
   at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:133)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
   ... 6 more



--
Eric Zhang
408-349-2466
Vespa Content team



Could not get block locations. Aborting... exception

2008-09-26 Thread Bryan Duxbury

Hey all.

We've been running into a very annoying problem pretty frequently
lately. We'll be running some job, for instance a distcp, and it'll
be moving along quite nicely until, all of a sudden, it sort of
freezes up. It takes a while, and then we'll get an error like this one:


attempt_200809261607_0003_m_02_0: Exception closing file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile
attempt_200809261607_0003_m_02_0: java.io.IOException: Could not get block locations. Aborting...
attempt_200809261607_0003_m_02_0:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
attempt_200809261607_0003_m_02_0:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200809261607_0003_m_02_0:   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)


At approximately the same time, we start seeing lots of these errors
in the namenode log:


2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, DFSClient_attempt_200809261607_0003_m_02_1, true, 3, 67108864) from 10.100.11.83:60056: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
   at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:952)
   at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
   at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
   at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



Eventually, the job fails because of these errors. Subsequent job  
runs also experience this problem and fail. The only way we've been  
able to recover is to restart the DFS. It doesn't happen every time,  
but it does happen often enough that I'm worried.


Does anyone have any ideas as to why this might be happening? I  
thought that https://issues.apache.org/jira/browse/HADOOP-2669 might  
be the culprit, but today we upgraded to hadoop 0.18.1 and the  
problem still happens.


Thanks,

Bryan


RE: Could not get block locations. Aborting... exception

2008-09-26 Thread Hairong Kuang
Does your failed map task open a lot of files to write? Could you please check
the log of the datanode running on the machine where the map tasks failed? Do
you see any error message containing "exceeds the limit of concurrent xcievers"?
 
Hairong



From: Bryan Duxbury [mailto:[EMAIL PROTECTED]
Sent: Fri 9/26/2008 4:36 PM
To: core-user@hadoop.apache.org
Subject: Could not get block locations. Aborting... exception



Hey all.

We've been running into a very annoying problem pretty frequently 
lately. We'll be running some job, for instance a distcp, and it'll 
be moving along quite nicely, until all of the sudden, it sort of 
freezes up. It takes a while, and then we'll get an error like this one:

attempt_200809261607_0003_m_02_0: Exception closing file /tmp/
dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile
attempt_200809261607_0003_m_02_0: java.io.IOException: Could not 
get block locations. Aborting...
attempt_200809261607_0003_m_02_0:   at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError
(DFSClient.java:2143)
attempt_200809261607_0003_m_02_0:   at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400
(DFSClient.java:1735)
attempt_200809261607_0003_m_02_0:   at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run
(DFSClient.java:1889)

At approximately the same time, we start seeing lots of these errors 
in the namenode log:

2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* 
NameSystem.startFile: failed to create file /tmp/dustin/input/
input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for 
DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 
because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/
_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, 
DFSClient_attempt_200809261607_0003_m_02_1, true, 3, 67108864) 
from 10.100.11.83:60056: error: 
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create 
file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/
part-01897.bucketfile for 
DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 
because current leaseholder is trying to recreate file.
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create 
file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/
part-01897.bucketfile for 
DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 
because current leaseholder is trying to recreate file.
 at org.apache.hadoop.dfs.FSNamesystem.startFileInternal
(FSNamesystem.java:952)
 at org.apache.hadoop.dfs.FSNamesystem.startFile
(FSNamesystem.java:903)
 at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
 at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



Eventually, the job fails because of these errors. Subsequent job 
runs also experience this problem and fail. The only way we've been 
able to recover is to restart the DFS. It doesn't happen every time, 
but it does happen often enough that I'm worried.

Does anyone have any ideas as to why this might be happening? I 
thought that https://issues.apache.org/jira/browse/HADOOP-2669 might 
be the culprit, but today we upgraded to hadoop 0.18.1 and the 
problem still happens.

Thanks,

Bryan




Re: too many open files error

2008-09-26 Thread Karl Anderson


On 26-Sep-08, at 3:09 PM, Eric Zhang wrote:


Hi,
I encountered the following FileNotFoundException, resulting from a "too
many open files" error, when I tried to run a job.  The job had been run
several times before without problems.  I am confused by the exception
because my code closes all of its files, and even if it didn't, the job
only has 10-20 small input/output files.  The limit on open files on my
box is 1024.  Besides, the error seemed to happen even before the task
was executed.  I am using version 0.17.  I'd appreciate it if somebody
could shed some light on this issue.  BTW, the job ran OK after I
restarted Hadoop.  Yes, the hadoop-site.xml did exist in that directory.


I had the same errors, including the bash one.  Running one particular  
job would cause all subsequent jobs of any kind to fail, even after  
all running jobs had completed or failed out.  This was confusing  
because the failing jobs themselves often had no relationship to the  
cause, they were just in a bad environment.


If you can't successfully run a dummy job (with the identity mapper  
and reducer, or a streaming job with cat) once you start getting  
failures, then you are probably in the same situation.
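A sketch of such a dummy job in Java against the 0.17/0.18 API (the streaming
equivalent with cat works just as well); the input and output paths are taken
from the command line:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SmokeTest {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SmokeTest.class);
    conf.setJobName("identity-smoke-test");
    // Default TextInputFormat produces LongWritable offsets and Text lines,
    // which the identity mapper/reducer pass straight through.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}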


I believe that the problem was caused by increasing the timeout, but I  
never pinned it down enough to submit a Jira issue.  It might have  
been the XML reader or something else.  I was using streaming, hadoop- 
ec2, and either 0.17.0 or 0.18.0.  It would happen just as rapidly  
after I made an ec2 image with a higher open file limit.


Eventually I figured it out by running each job in my pipeline 5 or so  
times before trying the next one, which let me see which job was  
causing the problem (because it would eventually fail itself, rather  
than hosing a later job).


Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra





Re: Failed to start datanodes

2008-09-26 Thread 叶双明
Did you configure the hostname correctly on all nodes?

2008/9/26 Jeremy Chow [EMAIL PROTECTED]

 Hi list,
  I've created my hadoop cluster following the tutorial on

 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
 .

  but failed. When I use bin/hadoop dfsadmin -report,
  it shows that there is only one datanode up:
 $ bin/hadoop dfsadmin -report
 Safe mode is ON
 Total raw bytes: 20317106176 (18.92 GB)
 Remaining raw bytes: 12607342427 (11.74 GB)
 Used raw bytes: 834834432 (796.16 MB)
 % used: 4.11%

 Total effective bytes: 0 (0 KB)
 Effective replication multiplier: Infinity
 -
 Datanodes available: 1

 Name: 192.168.3.8:50010
 State  : In Service
 Total raw bytes: 20317106176 (18.92 GB)
 Remaining raw bytes: 12607342427 (11.74 GB)
 Used raw bytes: 834834432 (796.16 MB)
 % used: 4.11%
 Last contact: Fri Sep 26 15:46:19 CST 2008

 and then I checked the logs of the datanodes that are expected to run on the
 other hosts, and I found:

 2008-09-26 15:47:54,744 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = localhost.jobui.com/127.0.0.1
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.18.1
 STARTUP_MSG:   build =
 http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
 694836;
 compiled by 'hadoopqa' on Fri Sep 12 23:29:35 UTC 2008
 /
 2008-09-26 15:48:15,896 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /192.168.3.8:54310. Already tried 0 time(s).
 2008-09-26 15:48:36,898 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /192.168.3.8:54310. Already tried 1 time(s).
 2008-09-26 15:48:57,900 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: /192.168.3.8:54310. Already tried 2 time(s).
 ...

 I use 192.168.3.8 as the namenode, and 192.168.3.7, 192.168.3.8, and
 192.168.3.9 as datanodes.
 But obviously the remote datanodes cannot start successfully.
 When I use jps on 192.168.3.7, it seems to work fine.
 $ jps
 5131 Jps
 4561 DataNode

 but the namenode cannot find it.

 can anyone give me a solution?

 thanks a lot.

 Jeremy
 --
 My research interests are distributed systems, parallel computing and
 bytecode based virtual machine.

 http://coderplay.javaeye.com




-- 
Sorry for my English!! 明
Please help me correct my English expression and error in syntax


Re: Failed to start datanodes

2008-09-26 Thread Jeremy Chow
Hey,

I've fixed it. :)  The server had a firewall turned on.


Regards,
Jeremy


Re: Could not get block locations. Aborting... exception

2008-09-26 Thread Bryan Duxbury
Well, I did find some more errors in the datanode log. Here's a  
sampling:


2008-09-26 10:43:57,287 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Block blk_-3923611845661840838_176295 is not valid.
   at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:716)
   at org.apache.hadoop.dfs.FSDataset.getLength(FSDataset.java:704)
   at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1678)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1101)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)

2008-09-26 10:56:19,325 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.EOFException: while trying to read 65557 bytes
   at org.apache.hadoop.dfs.DataNode$BlockReceiver.readToBuf(DataNode.java:2464)
   at org.apache.hadoop.dfs.DataNode$BlockReceiver.readNextPacket(DataNode.java:2508)
   at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2572)
   at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2698)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1283)

2008-09-26 10:56:19,779 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.EOFException
   at java.io.DataInputStream.readShort(DataInputStream.java:298)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1021)
   at java.lang.Thread.run(Thread.java:619)

2008-09-26 10:56:21,816 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Could not read from stream
   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
   at java.io.DataInputStream.readByte(DataInputStream.java:248)
   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:324)
   at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:345)
   at org.apache.hadoop.io.Text.readString(Text.java:410)

2008-09-26 10:56:28,380 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
   at sun.nio.ch.IOUtil.read(IOUtil.java:206)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)

2008-09-26 10:56:52,387 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Too many open files
   at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
   at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:59)
   at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
   at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
   at sun.nio.ch.Util.getTemporarySelector(Util.java:123)

The most interesting one in my eyes is the too many open files one.  
My ulimit is 1024. How much should it be? I don't think that I have  
that many files open in my mappers. They should only be operating on  
a single file at a time. I can try to run the job again and get an  
lsof if it would be interesting.


Thanks for taking the time to reply, by the way.

-Bryan


On Sep 26, 2008, at 4:48 PM, Hairong Kuang wrote:

Does your failed map task open a lot of files to write? Could you  
please check the log of the datanode running at the machine where  
the map tasks failed? Do you see any error message containing  
exceeds the limit of concurrent xcievers?


Hairong



From: Bryan Duxbury [mailto:[EMAIL PROTECTED]
Sent: Fri 9/26/2008 4:36 PM
To: core-user@hadoop.apache.org
Subject: Could not get block locations. Aborting... exception



Hey all.

We've been running into a very annoying problem pretty frequently
lately. We'll be running some job, for instance a distcp, and it'll
be moving along quite nicely, until all of the sudden, it sort of
freezes up. It takes a while, and then we'll get an error like this  
one:


attempt_200809261607_0003_m_02_0: Exception