Re: HOD questions
Hi Hemanth,

While HOD does not do this automatically, please note that since you are bringing up a Map/Reduce cluster on the allocated nodes, you can submit map/reduce parameters with which to bring up the cluster when allocating jobs. The relevant option is --gridservice-mapred.server-params (or -M in shorthand). Please refer to http://hadoop.apache.org/core/docs/r0.19.0/hod_user_guide.html#Options+for+Configuring+Hadoop for details.

I was aware of this, but the issue is that unless you obtain dedicated nodes (as above), this option is not suitable, as it isn't set on a per-node basis. I think it would be /fairly/ straightforward to add to HOD, as I detailed in my initial email, so that it does the correct thing out of the box.

True, I did assume you obtained dedicated nodes. It has been fairly simple to operate HOD in this manner, and if I understand correctly, it would help to solve your requirement as well.

I think it's a Maui change (or qos directive) to obtain dedicated nodes - I'm looking into it presently, but I'm not sure the exact incantation is correct: -W x=NACCESSPOLICY=SINGLETASK. For mixed job environments [e.g. universities] - where users have jobs which aren't HOD, often using single CPUs - it can mean that a job has more complicated requirements and will hence take longer to reach the head of the queue.

According to hadoop-default.xml, the number of maps is "Typically set to a prime several times greater than number of available hosts." Say that we relax this recommendation to read "Typically set to a NUMBER several times greater than number of available hosts" - then it should be straightforward for HOD to set it automatically?

Actually, AFAIK, the number of maps for a job is determined more or less exclusively by the M/R framework based on the number of splits. I've seen messages on this list before about how the documentation for this configuration item is misleading. So, this might actually not make a difference at all, whatever is specified.

The reason we were asking is that mapred.map.tasks is provided as the hint to the input split. We were using this number to generate the number of maps. I think it's just that FileInputFormat doesn't exactly honour the hint, from what I can see. Pig's InputFormat ignores the hint.

Craig
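[Editor's note: the observation above about FileInputFormat matches the split arithmetic in Hadoop 0.19 as I understand it - mapred.map.tasks only contributes a "goal" split size, which is then clamped by the block size, so the hint is only honoured when the implied goal size is smaller than a block. A standalone sketch of that arithmetic; the method names mirror FileInputFormat's, but this is an illustration, not the actual Hadoop class:]

```java
// Sketch of the split-size arithmetic used by FileInputFormat (Hadoop 0.19).
public class SplitMath {

    // The map-count hint only sets a "goal" size; the block size caps it,
    // and mapred.min.split.size floors it.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Approximate number of splits for a file of totalSize bytes,
    // given the mapred.map.tasks hint.
    static long numSplits(long totalSize, int mapHint, long minSize, long blockSize) {
        long goalSize = totalSize / (mapHint == 0 ? 1 : mapHint);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        return (totalSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long oneGB = 1L << 30;
        long blockSize = 64L << 20; // 64 MB blocks
        // A hint of 2 maps is ignored: the goal size (512 MB) exceeds the
        // block size, so the block size wins and we get 16 splits.
        System.out.println(numSplits(oneGB, 2, 1L, blockSize));  // prints 16
        // A hint of 64 maps is honoured: the goal size (16 MB) is below
        // the block size.
        System.out.println(numSplits(oneGB, 64, 1L, blockSize)); // prints 64
    }
}
```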
Re: Hit a roadbump in solving truncated block issue
Hey Raghu,

I never heard back from you about whether any of these fixes are ready to try out. Things are getting kind of bad here. Even at three replicas, I found one block which has all three replicas of length=0. Grepping through the logs, I get things like this:

2008-12-18 22:45:04,680 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.16.1.121:50010, storageID=DS-1732140560-172.16.1.121-50010-1228236234012, infoPort=50075, ipcPort=50020):Got exception while serving blk_7345861444716855534_7201 to /172.16.1.1:
java.io.IOException: Offset 35307520 and length 10485760 don't match block blk_7345861444716855534_7201 ( blockLen 0 )

On the other hand, if I look for the block scanner activity:

2008-12-08 13:59:15,616 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_7345861444716855534_7201

There is indeed a zero-sized file on disk and matching *correct* metadata:

[r...@node121 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r-- 1 root root 7 Dec 3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534_7201.meta
-rw-r--r-- 1 root root 0 Dec 3 15:44 /hadoop-data/dfs/data/current/subdir9/subdir6/blk_7345861444716855534

The metadata matches the 0-sized block, not the full one, of course. We recently went from 2 replicas to 3 replicas on Dec 11. On Dec 12, a replica was created on node191:

[r...@node191 ~]# find /hadoop-data/ -name *7345861444716855534* -exec ls -lh {} \;
-rw-r--r-- 1 root root 7 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534_7201.meta
-rw-r--r-- 1 root root 0 Dec 12 08:53 /hadoop-data/dfs/data/current/subdir40/subdir37/subdir42/blk_7345861444716855534

The corresponding log entries are here:

2008-12-12 08:53:09,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010
2008-12-12 08:53:17,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_7345861444716855534_7201 src: /172.16.1.121:47799 dest: /172.16.1.191:50010 of size 0

So, the incorrectly-sized block had a new copy created, the datanode reported the incorrect size (!), and the namenode never deleted it afterward. I unfortunately don't have the namenode logs from this period.

Brian

On Dec 16, 2008, at 4:10 PM, Raghu Angadi wrote:

Brian Bockelman wrote:
Hey, I hit a bit of a roadbump in solving the truncated block issue at our site: namely, some of the blocks appear perfectly valid to the datanode. The block verifies, but it is still the wrong size (it appears that the metadata is too small too). What's the best way to proceed? It appears that either (a) the block scanner needs to report to the datanode the size of the block it just verified, which is possibly a scaling issue, or (b) the metadata file needs to save the correct block size, which is a pretty major modification, as it requires a change of the on-disk format. Ideas? Brian

This should be detected by the NameNode, i.e. it should detect that this replica is shorter (either compared to other replicas or to the expected size). There are various fixes (recent or being worked on) in this area of the NameNode, and it is mostly covered by one of those or should be soon.

Raghu.
Re: Datanode handling of single disk failure
Brian Bockelman wrote:
Hello all, I'd like to take the datanode's capability to handle multiple directories to a somewhat-extreme, and get feedback on how well this might work. We have a few large RAID servers (12 to 48 disks) which we'd like to transition to Hadoop. I'd like to mount each of the disks individually (i.e., /mnt/disk1, /mnt/disk2, ) and take advantage of Hadoop's replication - instead of paying the overhead to set up a RAID and still having to pay the overhead of replication.

In my experience this is the right way to go.

However, we're a bit concerned about how well Hadoop might handle one of the directories disappearing from underneath it. If a single volume, say /mnt/disk1, starts returning I/O errors, is Hadoop smart enough to figure out that this whole volume is broken? Or will we have to restart the datanode after any disk failure for it to search the directory and realize everything is broken? What happens if you start up the datanode with a data directory that it can't write into?

In the current implementation, if at any point the datanode detects an unwritable or unreadable drive, it shuts itself down, logging a message about what went wrong and reporting the problem to the name-node. So yes, if such a thing happens you will have to restart the data-node. But since the cluster takes care of data-node failures by re-replicating lost blocks, that should not be a problem.

Is anyone running in this fashion (i.e., multiple data directories corresponding to different disk volumes ... even better if you're doing it with more than a few disks)?

We have a lot of experience running 4 drives per data-node (no RAID). So this is not something new or untested.

Thanks,
--Konstantin
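[Editor's note: the multiple-data-directory setup discussed above is configured through dfs.data.dir. A minimal hadoop-site.xml fragment might look like the following - the mount points are placeholders for your own layout, not paths from this thread:]

```xml
<property>
  <name>dfs.data.dir</name>
  <!-- One directory per physical disk; the datanode spreads blocks
       across them, so no RAID layer is needed underneath. -->
  <value>/mnt/disk1/hadoop-data,/mnt/disk2/hadoop-data,/mnt/disk3/hadoop-data</value>
</property>
```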
Re: Datanode handling of single disk failure
Thank you Konstantin, this information will be useful.

Brian

On Dec 19, 2008, at 12:37 PM, Konstantin Shvachko wrote:
[...]
Re: Hit a roadbump in solving truncated block issue
Actually we do have the namenode logs for the period Brian mentioned. In Brian's email, he shows the log entries on node191 corresponding to it storing the third (new) replica of the block in question. The namenode log from that period shows:

2008-12-12 08:53:02,637 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 172.16.1.121:50010 to replicate blk_7345861444716855534_7201 to datanode(s) 172.16.1.191:50010
2008-12-12 08:53:17,127 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 172.16.1.191:50010 is added to blk_7345861444716855534_7201 size 134217728

As you can see, node191 correctly claimed it received the block of size 0, yet the namenode claims that node191 has the block of size 134217728. Looking at more of the namenode logs I found another instance where the namenode again replicated the block due to a node going down, and just as we see here with node191, this new datanode stored the block with size 0 while the namenode seemed to think everything was correct.

- Garhan Attebury

On Dec 19, 2008, at 11:57 AM, Brian Bockelman wrote:

Hey Raghu, I never heard back from you about whether any of these fixes are ready to try out. Things are getting kind of bad here. Even at three replicas, I found one block which has all three replicas of length=0.
[...]
Re: Failed to start TaskTracker server
Well, you have some process which grabs this port, so Hadoop is not able to bind to it. By the time you check, there is a chance that the socket connection has died but the port was occupied when the Hadoop process was attempting to bind. Check all the processes running on the system - do any of them acquire ports?

-Sagar

ascend1 wrote:
I have set up a Hadoop platform on 15 machines recently. The NameNode and DataNodes work properly, but when I use bin/start-mapred.sh to start the MapReduce framework, only 3 or 4 TaskTrackers can be started properly. All those that couldn't be started have the same error. Here's the log:

2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG:
STARTUP_MSG: Starting TaskTracker
STARTUP_MSG: host = msra-5lcd05/172.23.213.80
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed to start: socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because
java.net.BindException: Address already in use: JVM_Bind
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
    at java.net.ServerSocket.bind(ServerSocket.java:319)
    at java.net.ServerSocket.init(ServerSocket.java:185)
    at org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391)
    at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
    at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
    at org.mortbay.http.SocketListener.start(SocketListener.java:203)
    at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
    at org.mortbay.util.Container.start(Container.java:72)
    at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
    at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80

Then I use netstat -an, but port 50060 isn't in the list. ps -af also shows that no program is using 50060. The strange part is that when I repeat bin/start-mapred.sh and bin/stop-mapred.sh several times, the list of machines that can start the TaskTracker seems random. Could anybody help me solve this problem?
Architecture question.
Hello All, I am designing an architecture which should support storage of 10 million records and 1 million updates per minute. Data persistence is not that important, as I will be purging this data every day. I am familiar with memcache but not Hadoop. It would be great if I could get some pointers from the group regarding designing this architecture. Thanks, Aakash.
Re: Architecture question.
How large are the records? 1 million updates per minute . . . do you mind sharing the complexity of the updates?

On Fri, Dec 19, 2008 at 8:05 PM, aakash_j_s...@yahoo.com wrote:
[...]
Re: Architecture question.
Hello Edwin, Thanks for the answer. The records are very small - the key is usually about 64 bytes (ascii) and updates are for 10 integer values, so I would say the record size including the key is about 104 bytes. Sid.

--- On Fri, 12/19/08, Edwin Gonzalez gonza...@zenbe.com wrote:
From: Edwin Gonzalez gonza...@zenbe.com
Subject: Re: Architecture question.
To: core-user@hadoop.apache.org, aakash_j_s...@yahoo.com
Date: Friday, December 19, 2008, 5:13 PM
[...]
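[Editor's note: the figures in this thread make for a simple back-of-the-envelope sizing. A hedged sketch - the numbers come from the messages above, and the per-second rate assumes a uniform update stream:]

```java
// Back-of-the-envelope sizing for the workload described in this thread:
// 10 million records of ~104 bytes, 1 million updates per minute.
public class Sizing {

    static long totalBytes(long records, long bytesPerRecord) {
        return records * bytesPerRecord;
    }

    static long updatesPerSecond(long updatesPerMinute) {
        return updatesPerMinute / 60; // integer average; real traffic is bursty
    }

    public static void main(String[] args) {
        long records = 10000000L;  // 10 million records
        long recordSize = 104L;    // ~64-byte key + 10 integer values
        long total = totalBytes(records, recordSize);      // ~1.04 GB of live data
        long rate = updatesPerSecond(1000000L);            // ~16,666 updates/sec
        long bandwidth = rate * recordSize;                // ~1.7 MB/sec of writes
        System.out.println(total + " bytes, " + rate + " updates/sec, "
                + bandwidth + " bytes/sec");
    }
}
```

At roughly 1 GB of live data with no durability requirement, the working set fits comfortably in memory on a single well-provisioned machine, which is why in-memory stores tend to come up for this kind of workload; HDFS/MapReduce is oriented toward batch processing of large, mostly-immutable files rather than high-rate point updates.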
Re: Failed to start TaskTracker server
Well, the machines are all servers that are probably running many services, and I have no permission to change or modify other users' programs or settings. Is there any way to change 50060 to another port?

Sagar Naik wrote:
[...]
Re: Failed to start TaskTracker server
Check hadoop-default.xml - in there you will find all the ports used. Copy the relevant xml nodes from hadoop-default.xml to hadoop-site.xml, change the port values in hadoop-site.xml, and deploy it on the datanodes.

-Sagar

Rico wrote:
[...]
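[Editor's note: a concrete example of the suggestion above. In 0.19, the port from the error (50060) is the TaskTracker's embedded HTTP server, set by mapred.task.tracker.http.address in hadoop-default.xml. A hadoop-site.xml override might look like the following - 50061 is an arbitrary stand-in for a port known to be free on your machines:]

```xml
<property>
  <name>mapred.task.tracker.http.address</name>
  <!-- Default is 0.0.0.0:50060; pick any free port. -->
  <value>0.0.0.0:50061</value>
</property>
```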
Re: Re: Failed to start TaskTracker server
I'll find it and give it a try. Thanks for your help!

On 2008-12-20, Sagar Naik sn...@attributor.com wrote:
[...]