Re: HOD questions

2008-12-19 Thread Craig Macdonald

Hi Hemanth,

While HOD does not do this automatically, please note that since you 
are bringing up a Map/Reduce cluster on the allocated nodes, you can 
submit map/reduce parameters with which to bring up the cluster when 
allocating jobs. The relevant options are 
--gridservice-mapred.server-params (or -M in shorthand). Please 
refer to
http://hadoop.apache.org/core/docs/r0.19.0/hod_user_guide.html#Options+for+Configuring+Hadoop 
for details.
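For illustration only, a minimal sketch of such an allocation (the cluster directory, node count and the particular parameter here are hypothetical; check the guide above for the exact syntax):

hod allocate -d ~/hod-clusters/test -n 4 -M mapred.tasktracker.map.tasks.maximum=4

This sets the value cluster-wide at allocation time, which is what the rest of this thread is about.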
I was aware of this, but the issue is that unless you obtain 
dedicated nodes (as above), this option is not suitable, as it isn't 
set on a per-node basis. I think it would be /fairly/ straightforward 
to add to HOD, as I detailed in my initial email, so that it does 
the correct thing out of the box.
True, I did assume you obtained dedicated nodes. It has been fairly 
simple to operate HOD in this manner, and if I understand correctly, 
it would help to solve your requirement as well.
I think it's a Maui change (or a qos directive) to obtain dedicated nodes 
- I'm looking into it presently, but I'm not yet sure of the exact 
incantation:

-W x=NACCESSPOLICY=SINGLETASK

In mixed job environments [e.g. universities] - where users also have 
non-HOD jobs, often using single CPUs - requesting dedicated nodes gives 
a job more complicated requirements, so it will take longer to reach the 
head of the queue.


According to hadoop-default.xml, the number of maps is "Typically set 
to a prime several times greater than number of available hosts." 
Say that we relax this recommendation to read "Typically set to a 
NUMBER several times greater than the number of available hosts"; it 
should then be straightforward for HOD to set it automatically?
Actually, AFAIK, the number of maps for a job is determined more or 
less exclusively by the M/R framework based on the number of splits. 
I've seen messages on this list before about how the documentation for 
this configuration item is misleading. So, this might actually not 
make a difference at all, whatever is specified.
The reason we were asking is that mapred.map.tasks is provided as a 
hint to the input split calculation. We were using this number to 
generate the number of maps. I think it's just that FileInputFormat 
doesn't exactly honour the hint, from what I can see, and Pig's 
InputFormat ignores the hint entirely.
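To make the "hint" behaviour concrete, here is a simplified, illustrative sketch (not the actual Hadoop source; class and variable names are made up) of how FileInputFormat-style split sizing treats the requested number of maps: the hint only sets a goal size, which is then clamped by the minimum split size and the block size.

public class SplitSizeSketch {
    // Mirrors computeSplitSize-style logic: max(minSplitSize, min(goalSize, blockSize)).
    static long splitSize(long totalSize, int numSplits, long minSplitSize, long blockSize) {
        long goalSize = totalSize / Math.max(numSplits, 1); // what the hint asks for
        return Math.max(minSplitSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1 GB of input, 128 MB blocks.
        long total = 1L << 30, block = 128L << 20;
        // Hint of 53 maps: goal size (~19 MB) is below the block size, so the hint is honoured.
        System.out.println(splitSize(total, 53, 1, block));
        // Hint of 4 maps: goal size (256 MB) exceeds the block size, so it is clamped to 128 MB
        // and you still get 8 maps - the hint is effectively ignored.
        System.out.println(splitSize(total, 4, 1, block));
    }
}

So a hint asking for fewer, larger-than-a-block splits is effectively ignored, consistent with the behaviour described above.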




Craig


Re: Hit a roadbump in solving truncated block issue

2008-12-19 Thread Brian Bockelman

Hey Raghu,

I never heard back from you about whether any of these fixes are ready  
to try out.  Things are getting kind of bad here.


Even at three replicas, I found one block which has all three replicas  
of length=0.  Grepping through the logs, I get things like this:


2008-12-18 22:45:04,680 WARN  
org.apache.hadoop.hdfs.server.datanode.DataNode:  
DatanodeRegistration(172.16.1.121:50010,  
storageID=DS-1732140560-172.16.1.121-50010-1228236234012,  
infoPort=50075, ipcPort=50020):Got exception while serving  
blk_7345861444716855534_7201 to /172.16.1.1:
java.io.IOException:  Offset 35307520 and length 10485760 don't match  
block blk_7345861444716855534_7201 ( blockLen 0 )
java.io.IOException:  Offset 35307520 and length 10485760 don't match  
block blk_7345861444716855534_7201 ( blockLen 0 )


On the other hand, if I look for the block scanner activity:

2008-12-08 13:59:15,616 INFO  
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification  
succeeded for blk_7345861444716855534_7201


There is indeed a zero-sized file on disk and matching *correct*  
metadata:


[r...@node121 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r--  1 root root 7 Dec  3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534_7201.meta
-rw-r--r--  1 root root 0 Dec  3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534


The metadata matches the 0-sized block, not the full one, of course.

We recently went from 2 replicas to 3 replicas on Dec 11.  On Dec 12,  
a replica was created on node191:


[r...@node191 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r--  1 root root 7 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534_7201.meta
-rw-r--r--  1 root root 0 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534


The corresponding log entries are here:

2008-12-12 08:53:09,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010
2008-12-12 08:53:17,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010 of size 0


So, the incorrectly-sized block had a new copy created, the datanode  
reported the incorrect size (!), and the namenode never deleted it  
afterward.  I unfortunately don't have the namenode logs from this  
period.
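For completeness, a quick way to look for other zero-length replicas on a datanode, in the same spirit as the find above (the data path is ours; adjust to your dfs.data.dir):

[r...@node121 ~]# find /hadoop-data/dfs/data -name 'blk_*' ! -name '*.meta' -size 0 -exec ls -lh {} \;

On the namenode side, hadoop fsck /path -files -blocks -locations should show which files reference a suspect block.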


Brian

On Dec 16, 2008, at 4:10 PM, Raghu Angadi wrote:


Brian Bockelman wrote:

Hey,
I hit a bit of a roadbump in solving the truncated block issue at  
our site: namely, some of the blocks appear perfectly valid to the  
datanode.  The block verifies, but it is still the wrong size (it  
appears that the metadata is too small too).
What's the best way to proceed?  It appears that either (a) the  
block scanner needs to report to the datanode the size of the block  
it just verified, which is possibly a scaling issue or (b) the  
metadata file needs to save the correct block size, which is a  
pretty major modification, as it requires a change of the on-disk  
format.


This should be detected by the NameNode, i.e. it should detect that this  
replica is shorter (either compared to other replicas or to the  
expected size). There are various fixes (recent or being worked on)  
in this area of the NameNode, and it is mostly covered by one of those,  
or should be soon.


Raghu.


Ideas?
Brian




Re: Datanode handling of single disk failure

2008-12-19 Thread Konstantin Shvachko


Brian Bockelman wrote:

Hello all,

I'd like to take the datanode's capability to handle multiple 
directories to a somewhat-extreme, and get feedback on how well this 
might work.


We have a few large RAID servers (12 to 48 disks) which we'd like to 
transition to Hadoop.  I'd like to mount each of the disks individually 
(i.e., /mnt/disk1, /mnt/disk2, ...) and take advantage of Hadoop's 
replication - instead of paying the overhead to set up a RAID and still 
having to pay the overhead of replication.


In my experience this is the right way to go.
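For concreteness, the corresponding setting would look roughly like the following in hadoop-site.xml (the mount paths are just the hypothetical ones from above):

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/hadoop/data,/mnt/disk2/hadoop/data,/mnt/disk3/hadoop/data</value>
</property>

The datanode spreads new blocks across the listed directories (roughly round-robin).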

However, we're a bit concerned about how well Hadoop might handle one of 
the directories disappearing from underneath it.  If a single volume, 
say, /mnt/disk1, starts returning I/O errors, is Hadoop smart enough to 
figure out that this whole volume is broken?  Or will we have to restart 
the datanode after any disk failure for it to search the directory and 
realize everything is broken?  What happens if you start up the datanode 
with a data directory that it can't write into?


In the current implementation, if at any point the datanode detects an 
unwritable or unreadable drive, it shuts itself down, logging a message 
about what went wrong and reporting the problem to the name-node.
So yes, if such a thing happens you will have to restart the data-node.
But since the cluster takes care of data-node failures by re-replicating 
lost blocks, that should not be a problem.

Is anyone running in this fashion (i.e., multiple data directories 
corresponding to different disk volumes ... even better if you're doing 
it with more than a few disks)?


We have extensive experience running 4 drives per data-node (no RAID), 
so this is not something new or untested.

Thanks,
--Konstantin


Re: Datanode handling of single disk failure

2008-12-19 Thread Brian Bockelman

Thank you Konstantin, this information will be useful.

Brian

On Dec 19, 2008, at 12:37 PM, Konstantin Shvachko wrote:



Brian Bockelman wrote:

Hello all,
I'd like to take the datanode's capability to handle multiple  
directories to a somewhat-extreme, and get feedback on how well  
this might work.
We have a few large RAID servers (12 to 48 disks) which we'd like  
to transition to Hadoop.  I'd like to mount each of the disks  
individually (i.e., /mnt/disk1, /mnt/disk2, ...) and take  
advantage of Hadoop's replication - instead of paying the overhead to  
set up a RAID and still having to pay the overhead of replication.


In my experience this is the right way to go.

However, we're a bit concerned about how well Hadoop might handle  
one of the directories disappearing from underneath it.  If a  
single volume, say, /mnt/disk1, starts returning I/O errors, is  
Hadoop smart enough to figure out that this whole volume is  
broken?  Or will we have to restart the datanode after any disk  
failure for it to search the directory and realize everything is  
broken?  What happens if you start up the datanode with a data  
directory that it can't write into?


In the current implementation, if at any point the datanode detects an  
unwritable or unreadable drive, it shuts itself down, logging a message  
about what went wrong and reporting the problem to the name-node.
So yes, if such a thing happens you will have to restart the data-node.
But since the cluster takes care of data-node failures by re-replicating  
lost blocks, that should not be a problem.

Is anyone running in this fashion (i.e., multiple data directories  
corresponding to different disk volumes ... even better if you're  
doing it with more than a few disks)?


We have extensive experience running 4 drives per data-node (no RAID), 
so this is not something new or untested.

Thanks,
--Konstantin




Re: Hit a roadbump in solving truncated block issue

2008-12-19 Thread Garhan Attebury
Actually, we do have the namenode logs for the period Brian mentioned.  
In Brian's email, he shows the log entries on node191 corresponding to  
its storing the third (new) replica of the block in question. The  
namenode log from that period shows:



2008-12-12 08:53:02,637 INFO org.apache.hadoop.hdfs.StateChange:  
BLOCK* ask 172.16.1.121:50010 to replicate  
blk_7345861444716855534_7201 to datanode(s) 172.16.1.191:50010
2008-12-12 08:53:17,127 INFO org.apache.hadoop.hdfs.StateChange:  
BLOCK* NameSystem.addStoredBlock: blockMap updated: 172.16.1.191:50010  
is added to blk_7345861444716855534_7201 size 134217728


As you can see, node191 correctly claimed it received the block of  
size 0, yet the namenode claims that node191 has the block of size  
134217728.
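(Aside: 134217728 bytes is exactly 128 MB, presumably the expected full block size for this file.)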



Looking at more of the namenode logs I found another instance where  
the namenode again replicated the block due to a node going down, and  
just as we see here with node191, this new datanode stored the block  
with size 0 while the namenode seemed to think everything was correct.




- Garhan Attebury


On Dec 19, 2008, at 11:57 AM, Brian Bockelman wrote:


Hey Raghu,

I never heard back from you about whether any of these fixes are  
ready to try out.  Things are getting kind of bad here.


Even at three replicas, I found one block which has all three  
replicas of length=0.  Grepping through the logs, I get things like  
this:


2008-12-18 22:45:04,680 WARN  
org.apache.hadoop.hdfs.server.datanode.DataNode:  
DatanodeRegistration(172.16.1.121:50010,  
storageID=DS-1732140560-172.16.1.121-50010-1228236234012,  
infoPort=50075, ipcPort=50020):Got exception while serving  
blk_7345861444716855534_7201 to /172.16.1.1:
java.io.IOException:  Offset 35307520 and length 10485760 don't  
match block blk_7345861444716855534_7201 ( blockLen 0 )
java.io.IOException:  Offset 35307520 and length 10485760 don't  
match block blk_7345861444716855534_7201 ( blockLen 0 )


On the other hand, if I look for the block scanner activity:

2008-12-08 13:59:15,616 INFO  
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner:  
Verification succeeded for blk_7345861444716855534_7201


There is indeed a zero-sized file on disk and matching *correct*  
metadata:


[r...@node121 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r--  1 root root 7 Dec  3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534_7201.meta
-rw-r--r--  1 root root 0 Dec  3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534


The metadata matches the 0-sized block, not the full one, of course.

We recently went from 2 replicas to 3 replicas on Dec 11.  On Dec  
12, a replica was created on node191:


[r...@node191 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r--  1 root root 7 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534_7201.meta
-rw-r--r--  1 root root 0 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534


The corresponding log entries are here:

2008-12-12 08:53:09,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010
2008-12-12 08:53:17,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010 of size 0


So, the incorrectly-sized block had a new copy created, the datanode  
reported the incorrect size (!), and the namenode never deleted it  
afterward.  I unfortunately don't have the namenode logs from this  
period.


Brian




Re: Failed to start TaskTracker server

2008-12-19 Thread Sagar Naik
Well, you have some process which grabs this port, so Hadoop is not able to 
bind to it.
By the time you check, there is a chance that the socket connection has died, 
but the port was occupied when the Hadoop process was attempting to bind.


Check all the processes running on the system.
Do any of the processes acquire ports?
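To see which process (if any) actually holds the port, something along these lines on the affected machine may help (standard Linux tools, not Hadoop-specific):

netstat -anp | grep 50060
lsof -i :50060

If neither shows an owner at the time of the failure, the port was probably only held transiently.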

-Sagar
ascend1 wrote:

I have made a Hadoop platform on 15 machines recently. The NameNode and DataNodes 
work properly, but when I use bin/start-mapred.sh to start the MapReduce framework, 
only 3 or 4 TaskTrackers can be started properly. All those that couldn't be 
started have the same error.
Here's the log:
 
2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG: 
/

STARTUP_MSG: Starting TaskTracker
STARTUP_MSG:   host = msra-5lcd05/172.23.213.80
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.19.0
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; 
compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
/
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking Resource 
aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed to start: 
socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: Can not 
start task tracker because java.net.BindException: Address already in use: 
JVM_Bind
 at java.net.PlainSocketImpl.socketBind(Native Method)
 at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
 at java.net.ServerSocket.bind(ServerSocket.java:319)
 at java.net.ServerSocket.init(ServerSocket.java:185)
 at org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391)
 at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
 at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
 at org.mortbay.http.SocketListener.start(SocketListener.java:203)
 at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
 at org.mortbay.util.Container.start(Container.java:72)
 at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
 at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
 at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG: 
/

SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80
/

Then I use netstat -an, but port 50060 isn't in the list, and ps -af also shows that no program 
is using 50060. The strange thing is that when I repeat bin/start-mapred.sh and bin/stop-mapred.sh 
several times, the set of machines that manage to start a TaskTracker seems random.
 
Could anybody help me solve this problem?
  




Architecture question.

2008-12-19 Thread aakash_j j_shah
Hello All,

   I am designing an architecture which should support a storage capacity of 
10 million records and 1 million updates/minute. Data persistence is not that 
important, as I will be purging this data every day.
 
  
I am familiar with memcache but not Hadoop. It would be great if I could
get some pointers from the group regarding designing this architecture.

Thanks,
Aakash.


  

Re: Architecture question.

2008-12-19 Thread Edwin Gonzalez
How large are the records?
 1 million updates/minute . . . do you mind sharing the complexity of the updates?

On Fri, Dec 19, 2008 at 8:05 PM aakash_j j_shah aakash_j_s...@yahoo.com 
wrote:
Hello All,
 
     I am designing an architecture which should support a storage capacity of 
 10 million records and 1 million updates/minute. Data persistence is not that 
 important, as I will be purging this data every day.
  
   
 I am familiar with memcache but not Hadoop. It would be great if I could
 get some pointers from the group regarding designing this architecture.
 
 Thanks,
 Aakash.
 
 
 


Re: Architecture question.

2008-12-19 Thread aakash_j j_shah
Hello Edwin,
 
  Thanks for the answer. Records are very small: the key is usually about 64 bytes 
(ASCII) and updates are for 10 integer values, so I would say that the record size 
including the key is about 104 bytes. 
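As a back-of-the-envelope check: at roughly 104 bytes per record, 10 million records is only about 1 GB of data in total, and 1 million updates/minute works out to roughly 17,000 updates/second.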

Sid.


--- On Fri, 12/19/08, Edwin Gonzalez gonza...@zenbe.com wrote:
From: Edwin Gonzalez gonza...@zenbe.com
Subject: Re: Architecture question.
To: core-user@hadoop.apache.org, aakash_j_s...@yahoo.com
Date: Friday, December 19, 2008, 5:13 PM

How large are the records?
 1 million updates/minute . . . do you mind sharing the complexity of the updates?

On Fri, Dec 19, 2008 at 8:05 PM aakash_j j_shah
aakash_j_s...@yahoo.com wrote:
Hello All,
 
     I am designing an architecture which should support a storage capacity of 10 million
records and 1 million updates/minute. Data persistence is not that important, as I will be
purging this data every day.
  
   
 I am familiar with memcache but not Hadoop. It would be great if I could
 get some pointers from the group regarding designing this architecture.
 
 Thanks,
 Aakash.
 
 
 



  

Re: Failed to start TaskTracker server

2008-12-19 Thread Rico
Well, the machines are all servers that are probably running many services, 
and I have no permission to change or modify other users' programs or 
settings. Is there any way to change 50060 to another port?


Sagar Naik wrote:
Well, you have some process which grabs this port, so Hadoop is not able 
to bind to it.
By the time you check, there is a chance that the socket connection has died, 
but the port was occupied when the Hadoop process was attempting to bind.


Check all the processes running on the system.
Do any of the processes acquire ports?

-Sagar
ascend1 wrote:
I have made a Hadoop platform on 15 machines recently. The NameNode and 
DataNodes work properly, but when I use bin/start-mapred.sh to start the 
MapReduce framework, only 3 or 4 TaskTrackers can be started 
properly. All those that couldn't be started have the same error.

Here's the log:

2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: 
STARTUP_MSG: 
/

STARTUP_MSG: Starting TaskTracker
STARTUP_MSG: host = msra-5lcd05/172.23.213.80
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 
713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008

/
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version 
Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking 
Resource aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed 
to start: socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: 
Can not start task tracker because java.net.BindException: Address 
already in use: JVM_Bind

at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
at java.net.ServerSocket.bind(ServerSocket.java:319)
at java.net.ServerSocket.init(ServerSocket.java:185)
at 
org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391)

at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
at org.mortbay.http.SocketListener.start(SocketListener.java:203)
at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
at org.mortbay.util.Container.start(Container.java:72)
at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: 
SHUTDOWN_MSG: 
/

SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80
/

Then I use netstat -an, but port 50060 isn't in the list, and ps -af 
also shows that no program is using 50060. The strange thing is that when 
I repeat bin/start-mapred.sh and bin/stop-mapred.sh several 
times, the set of machines that manage to start a TaskTracker seems random.


Could anybody help me solve this problem?








Re: Failed to start TaskTracker server

2008-12-19 Thread Sagar Naik

 - Check hadoop-default.xml; in there you will find all the ports used. 
Copy the relevant XML nodes from hadoop-default.xml to hadoop-site.xml, 
change the port values in hadoop-site.xml, and deploy it on the datanodes.
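For example, the TaskTracker HTTP port in question (50060) is controlled by a property along these lines - double-check the exact name in your hadoop-default.xml, and the alternative port below is just an example:

<property>
  <name>mapred.task.tracker.http.address</name>
  <value>0.0.0.0:50061</value>
</property>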


Rico wrote:
Well, the machines are all servers that are probably running many services, 
and I have no permission to change or modify other users' programs or 
settings. Is there any way to change 50060 to another port?


Sagar Naik wrote:
Well, you have some process which grabs this port, so Hadoop is not able 
to bind to it.
By the time you check, there is a chance that the socket connection has 
died, but the port was occupied when the Hadoop process was attempting to bind.


Check all the processes running on the system.
Do any of the processes acquire ports?

-Sagar
ascend1 wrote:
I have made a Hadoop platform on 15 machines recently. The NameNode and 
DataNodes work properly, but when I use bin/start-mapred.sh to start the 
MapReduce framework, only 3 or 4 TaskTrackers can be started 
properly. All those that couldn't be started have the same error.

Here's the log:

2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: 
STARTUP_MSG: 
/

STARTUP_MSG: Starting TaskTracker
STARTUP_MSG: host = msra-5lcd05/172.23.213.80
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 
713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008

/
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version 
Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking 
Resource aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed 
to start: socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: 
Can not start task tracker because java.net.BindException: Address 
already in use: JVM_Bind

at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
at java.net.ServerSocket.bind(ServerSocket.java:319)
at java.net.ServerSocket.init(ServerSocket.java:185)
at 
org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391) 


at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
at org.mortbay.http.SocketListener.start(SocketListener.java:203)
at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
at org.mortbay.util.Container.start(Container.java:72)
at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: 
SHUTDOWN_MSG: 
/

SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80
/

Then I use netstat -an, but port 50060 isn't in the list, and ps -af 
also shows that no program is using 50060. The strange thing is that 
when I repeat bin/start-mapred.sh and bin/stop-mapred.sh several 
times, the set of machines that manage to start a TaskTracker seems random.


Could anybody help me solve this problem?










Re: Re: Failed to start TaskTracker server

2008-12-19 Thread ascend1
I'll look for those settings and give it a try. Thanks for your help!
On 2008-12-20, Sagar Naik sn...@attributor.com wrote:
  - Check hadoop-default.xml; in there you will find all the ports used. 
Copy the relevant XML nodes from hadoop-default.xml to hadoop-site.xml, 
change the port values in hadoop-site.xml, 
and deploy it on the datanodes.


Rico wrote:
 Well, the machines are all servers that are probably running many services, 
 and I have no permission to change or modify other users' programs or 
 settings. Is there any way to change 50060 to another port?

 Sagar Naik wrote:
 Well, you have some process which grabs this port, so Hadoop is not able 
 to bind to it.
 By the time you check, there is a chance that the socket connection has 
 died, but the port was occupied when the Hadoop process was attempting to bind.

 Check all the processes running on the system.
 Do any of the processes acquire ports?

 -Sagar
 ascend1 wrote:
 I have made a Hadoop platform on 15 machines recently. The NameNode and 
 DataNodes work properly, but when I use bin/start-mapred.sh to start the 
 MapReduce framework, only 3 or 4 TaskTrackers can be started 
 properly. All those that couldn't be started have the same error.
 Here's the log:

 2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: 
 STARTUP_MSG: 
 /
 STARTUP_MSG: Starting TaskTracker
 STARTUP_MSG: host = msra-5lcd05/172.23.213.80
 STARTUP_MSG: args = []
 STARTUP_MSG: version = 0.19.0
 STARTUP_MSG: build = 
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 
 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
 /
 2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version 
 Jetty/5.1.4
 2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking 
 Resource aliases
 2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started 
 org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
 2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started 
 WebApplicationContext[/static,/static]
 2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started 
 org.mortbay.jetty.servlet.webapplicationhand...@edf389
 2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started 
 WebApplicationContext[/logs,/logs]
 2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
 org.mortbay.jetty.servlet.webapplicationhand...@17b0998
 2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
 WebApplicationContext[/,/]
 2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed 
 to start: socketlisten...@0.0.0.0:50060
 2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: 
 Can not start task tracker because java.net.BindException: Address 
 already in use: JVM_Bind
 at java.net.PlainSocketImpl.socketBind(Native Method)
 at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
 at java.net.ServerSocket.bind(ServerSocket.java:319)
 at java.net.ServerSocket.init(ServerSocket.java:185)
 at 
 org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391) 

 at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
 at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
 at org.mortbay.http.SocketListener.start(SocketListener.java:203)
 at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
 at org.mortbay.util.Container.start(Container.java:72)
 at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
 at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
 at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
 2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: 
 SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80
 /

 Then I use netstat -an, but port 50060 isn't in the list, and ps -af 
 also shows that no program is using 50060. The strange thing is that 
 when I repeat bin/start-mapred.sh and bin/stop-mapred.sh several 
 times, the set of machines that manage to start a TaskTracker seems random.

 Could anybody help me solve this problem?