Rescheduling of already completed map/reduce task
Hi, The job froze after the filesystem hung on a machine which had successfully completed a map task. Is there a flag to enable the re-scheduling of such a task?

Jstack of the job tracker:

SocketListener0-2 prio=10 tid=0x08916000 nid=0x4a4f runnable [0x4d05c000..0x4d05ce30]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.mortbay.util.LineInput.fill(LineInput.java:469)
        at org.mortbay.util.LineInput.fillLine(LineInput.java:547)
        at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:293)
        at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:277)
        at org.mortbay.http.HttpRequest.readHeader(HttpRequest.java:238)
        at org.mortbay.http.HttpConnection.readRequest(HttpConnection.java:861)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:907)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
   Locked ownable synchronizers:
        - None

SocketListener0-1 prio=10 tid=0x4da8c800 nid=0xeeb runnable [0x4d266000..0x4d2670b0]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.mortbay.util.LineInput.fill(LineInput.java:469)
        at org.mortbay.util.LineInput.fillLine(LineInput.java:547)
        at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:293)
        at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:277)
        at org.mortbay.http.HttpRequest.readHeader(HttpRequest.java:238)
        at org.mortbay.http.HttpConnection.readRequest(HttpConnection.java:861)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:907)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

IPC Server listener on 54311 daemon prio=10 tid=0x4df70400 nid=0xe86 runnable [0x4d9fe000..0x4d9feeb0]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked 0x54fb4320 (a sun.nio.ch.Util$1)
        - locked 0x54fb4310 (a java.util.Collections$UnmodifiableSet)
        - locked 0x54fb40b8 (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:296)
   Locked ownable synchronizers:
        - None

IPC Server Responder daemon prio=10 tid=0x4da22800 nid=0xe85 runnable [0x4db75000..0x4db75e30]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked 0x54f0 (a sun.nio.ch.Util$1)
        - locked 0x54fdce10 (a java.util.Collections$UnmodifiableSet)
        - locked 0x54fdcc18 (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.hadoop.ipc.Server$Responder.run(Server.java:455)
   Locked ownable synchronizers:
        - None

RMI TCP Accept-0 daemon prio=10 tid=0x4da13400 nid=0xe31 runnable [0x4de55000..0x4de56130]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
        - locked 0x54f6dae0 (a java.net.SocksSocketImpl)
        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
        at java.net.ServerSocket.accept(ServerSocket.java:421)
        at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
        at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
        at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
        at java.lang.Thread.run(Thread.java:619)
   Locked ownable synchronizers:
        - None

-Sagar
Multithreaded Reducer
Hi, I would like to implement a multi-threaded reducer. As per my understanding, the system does not have one because we expect the output to be sorted. However, in my case I don't need the output sorted. Can you please point me to any other issues, or would it be safe to do so? -Sagar
Re: Multithreaded Reducer
Two things - multi-threaded is preferred over multi-process. The process I'm planning is I/O-bound, so I can really take advantage of multiple threads (100 threads) - correct me if I'm wrong. Also, the next MR job in the pipeline will have an increased number of splits to process, as the number of reducer outputs (from the previous job) has increased. This leads to an increase in map-task completion time. -Sagar

Aaron Kimball wrote: Rather than implementing a multi-threaded reducer, why not simply increase the number of reducer tasks per machine via mapred.tasktracker.reduce.tasks.maximum, and increase the total number of reduce tasks per job via mapred.reduce.tasks to ensure that they're all filled. This will effectively utilize a higher number of cores. - Aaron

On Fri, Apr 10, 2009 at 11:12 AM, Sagar Naik sn...@attributor.com wrote: Hi, I would like to implement a multi-threaded reducer. As per my understanding, the system does not have one because we expect the output to be sorted. However, in my case I don't need the output sorted. Can you please point me to any other issues, or would it be safe to do so? -Sagar
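To make the idea concrete, here is a minimal sketch of a thread-pooled reduce outside the Hadoop Reducer API (class and method names are illustrative, not from Hadoop): each key group is handed to a bounded pool, which only pays off when per-group work is I/O-bound and output order does not matter, exactly the case described above.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch only: dispatch each (key, values) group to a fixed thread pool.
// Output order is NOT preserved, which is acceptable here because the
// poster explicitly does not need sorted output.
public class ParallelReduceSketch {
    public static Map<String, Integer> reduceAll(Map<String, List<Integer>> groups,
                                                 int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, Integer> out = new ConcurrentHashMap<>();
        List<Future<?>> futures = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            futures.add(pool.submit(() -> {
                // stand-in for an I/O-bound per-key reduce: sum the values
                int sum = 0;
                for (int v : e.getValue()) sum += v;
                out.put(e.getKey(), sum);
            }));
        }
        for (Future<?> f : futures) f.get(); // propagate any worker failure
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<Integer>> groups = new HashMap<>();
        groups.put("a", Arrays.asList(1, 2, 3));
        groups.put("b", Arrays.asList(10, 20));
        System.out.println(reduceAll(groups, 4)); // e.g. {a=6, b=30}
    }
}
```

Note the trade-off Aaron raises still applies: more reduce slots per machine achieves similar parallelism without custom threading, but a pool inside one reducer can go well past the slot count (e.g. 100 threads) when the work is mostly waiting on I/O.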
Re: connecting two clusters
Hi, I'm not sure if you have looked at this option, but instead of having two HDFS instances, you can have one HDFS and two map-red clusters (pointing to the same HDFS) and then do the sync mechanisms. -Sagar

Mithila Nagendra wrote: Hello Aaron, Yes it makes a lot of sense! Thank you! :) The incremental wavefront model is another option we are looking at. Currently we have two map/reduce levels; the upper level has to wait until the lower map/reduce has produced the entire result set. We want to avoid this... We were thinking of using two separate clusters so that these levels can run on them - hoping to achieve better resource utilization. We were hoping to connect the two clusters in some way so that the processes can interact - but it seems like Hadoop is limited in that sense. I was wondering how a common HDFS system can be set up for this purpose. I tried looking for material on synchronization between two map-reduce clusters - there is limited/no data available out on the Web! If we stick to the incremental wavefront model, then we could probably work with one cluster. Mithila

On Tue, Apr 7, 2009 at 7:05 PM, Aaron Kimball aa...@cloudera.com wrote: Hi Mithila, Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they begin; the inputs for the job are then fixed. So additional files that arrive in the input directory after processing has begun do not participate in the job. And HDFS does not currently support appends to files, so existing files cannot be updated. A typical way in which this sort of problem is handled is to do processing in incremental wavefronts: process A generates some data which goes in an incoming directory for process B; process B starts on a timer every so often, collects the new input files, and works on them. After it's done, it moves those inputs which it processed into a done directory. In the mean time, new files may have arrived. After another time interval, another round of process B starts.
The major limitation of this model is that it requires that your process work incrementally, or that you are emitting a small enough volume of data each time in process B that subsequent iterations can load into memory a summary table of results from previous iterations. Look into using the DistributedCache to disseminate such files. Also, why are you using two MapReduce clusters for this, as opposed to one? Is there a common HDFS cluster behind them? You'll probably get much better performance for the overall process if the output data from one job does not need to be transferred to another cluster before it is further processed. Does this model make sense? - Aaron

On Tue, Apr 7, 2009 at 1:06 AM, Mithila Nagendra mnage...@asu.edu wrote: Aaron, We hope to achieve a level of pipelining between two clusters - similar to how pipelining is done in executing RDB queries. You can look at it as the producer-consumer problem: one cluster produces some data and the other cluster consumes it. The issue that has to be dealt with here is the data exchange between the clusters - synchronized interaction between the map-reduce jobs on the two clusters is what I'm hoping to achieve. Mithila

On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball aa...@cloudera.com wrote: Clusters don't really have identities beyond the addresses of the NameNodes and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the namenodes of the source and destination clusters. The 8020 in each address assumes that they're on the default port. Hadoop provides no inter-task or inter-job synchronization primitives, on purpose (even within a cluster, the most you get in terms of synchronization is the ability to join on the status of a running job to determine that it's completed). The model is designed to be as identity-independent as possible to make it more resilient to failure. If individual jobs/tasks could lock common resources, then the intermittent failure of tasks could easily cause deadlock. Using a file as a scoreboard or other communication mechanism between multiple jobs is not something explicitly designed for, and likely to end in frustration. Can you describe the goal you're trying to accomplish? It's likely that there's another, more MapReduce-y way of looking at the job and refactoring the code to make it work more cleanly with the intended programming model. - Aaron

On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra mnage...@asu.edu wrote: Thanks! I was looking at the link sent by Philip. The copy is done with the following command:

hadoop distcp hdfs://nn1:8020/foo/bar \ hdfs://nn2:8020/bar/foo

I was wondering if nn1 and nn2 are the names of the clusters or the name of the
Re: safemode forever
It means that not all blocks have been reported. Can you check how many datanodes have reported, in the UI or via bin/hadoop dfsadmin -report? In case you have to disable safemode, check the bin/hadoop dfsadmin -safemode command; it has options to enter/leave/get. -Sagar

javateck javateck wrote: Hi, I'm wondering if anyone has a solution for the never-ending safe mode, any way to get around it? thanks, error:

org.apache.hadoop.dfs.SafeModeException: Cannot delete /mapred/system. Name node is in safe mode. The ratio of reported blocks 0.4696 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
Re: hadoop-a small doubt
Yes, you can.

Java client: copy the conf dir (the same as the one on the namenode/datanode), and the hadoop jars should be in the classpath of the client.
Non-Java client: http://wiki.apache.org/hadoop/MountableHDFS

-Sagar

deepya wrote: Hi, I am SreeDeepya doing an MTech at IIIT. I am working on a project named cost-effective and scalable storage server. I configured a small hadoop cluster with only two nodes, one namenode and one datanode. I am new to hadoop. I have a small doubt. Can a system not in the hadoop cluster access the namenode or the datanode? If yes, then can you please tell me the necessary configuration that has to be done. Thanks in advance. SreeDeepya
Re: Design issue for a problem using Map Reduce
Here is one thought: N maps and 1 reduce.

Input to map: t, w(t). Output of map: t, w(t)*w(t). I assume t is an integer. So in the case of 1 reducer, you will receive:

t0, square(w(0))
t1, square(w(1))
t2, square(w(2))
t3, square(w(3))

Note this will be a sorted series on t. In the reduce:

static prevF = 0;
reduce(t, square_w_t) {
    f = square_w_t * A + B * prevF;
    output.collect(t, f);
    prevF = f;
}

According to me, the B*F(t-1) step is inherently sequential, so all we can do is parallelize the A*w(t)*w(t) part. -Sagar

some speed wrote: Hello all, I am trying to implement a Map Reduce chain to solve a particular statistics problem. I have come to a point where I have to solve the following type of equation in Hadoop: F(t) = A*w(t)*w(t) + B*F(t-1). Given: F(0)=0; A and B are alpha and beta and their values are known. Now, w is a series of numbers (there could be a million or more numbers). So to solve the equation in terms of Map Reduce, there are basically 2 issues which I can think of: 1) How will I be able to get the value of F(t-1), since at each step I need the value from the previous iteration, and that is not possible while computing in parallel. 2) The w(t) values have to be read and applied in order as well, and again that is a problem while computing in parallel. Can someone please help me go about this problem and overcome the issues? Thanks, Sharath
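The split suggested above can be sketched in plain Java (class and method names are illustrative): the A*w(t)^2 terms are independent, so they can be computed in parallel (the "map" part), while the B*F(t-1) feedback is a single ordered scan (the "reduce" part).

```java
import java.util.Arrays;

// Sketch of the map/reduce split for F(t) = A*w(t)^2 + B*F(t-1), F(0-1 term) = 0.
public class RecurrenceSketch {
    public static double[] solve(double[] w, double a, double b) {
        // Parallelizable part ("map"): a * w(t)^2 for every t, order-independent.
        double[] sq = Arrays.stream(w).parallel().map(x -> a * x * x).toArray();
        // Inherently sequential part ("reduce"): fold in b * F(t-1) in order.
        double[] f = new double[w.length];
        double prev = 0.0;
        for (int t = 0; t < w.length; t++) {
            f[t] = sq[t] + b * prev;
            prev = f[t];
        }
        return f;
    }

    public static void main(String[] args) {
        // w = {1, 2, 3}, A = 1, B = 0.5:
        // F = {1.0, 4.5, 11.25}
        System.out.println(Arrays.toString(solve(new double[]{1, 2, 3}, 1.0, 0.5)));
    }
}
```

In a real Hadoop job the parallel stream stands in for the map tasks, and the sequential loop is what the single reducer would do, exactly as the reply above describes.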
Re: Not able to copy a file to HDFS after installing
Where is the namenode running? localhost or some other host? -Sagar

Rajshekar wrote: Hello, I am new to Hadoop and I just installed it on Ubuntu 8.04 LTS as per the guidance of a web site. I tested it and found it working fine. I tried to copy a file but it is giving some error; please help me out.

had...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1$ bin/hadoop jar hadoop-0.17.2.1-examples.jar wordcount /home/hadoop/Download\ URLs.txt download-output
09/02/02 11:18:59 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s).
09/02/02 11:19:00 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 2 time(s).
09/02/02 11:19:01 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 3 time(s).
09/02/02 11:19:02 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 4 time(s).
09/02/02 11:19:04 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 5 time(s).
09/02/02 11:19:05 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 6 time(s).
09/02/02 11:19:06 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 7 time(s).
09/02/02 11:19:07 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 8 time(s).
09/02/02 11:19:08 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 9 time(s).
09/02/02 11:19:09 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 10 time(s).
java.lang.RuntimeException: java.net.ConnectException: Connection refused
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:356)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:331)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:304)
        at org.apache.hadoop.examples.WordCount.run(WordCount.java:146)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:155)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:6
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:11
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:174)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:623)
        at org.apache.hadoop.ipc.Client.call(Client.java:546)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
        at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
        at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:17
        at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:6
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:10
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:352)
Re: My tasktrackers keep getting lost...
Can you post the output from hadoop-argus-hostname-jobtracker.out? -Sagar

jason hadoop wrote: When I was at Attributor we experienced periodic odd XFS hangs that would freeze up the Hadoop server processes, resulting in them going away. Sometimes XFS would deadlock all writes to the log file and the server would freeze trying to log a message. Can't even jstack the JVM. We never had any traction on resolving the XFS deadlocks and simply rebooted the machines when the problem occurred.

On Mon, Feb 2, 2009 at 7:09 PM, Ian Soboroff ian.sobor...@nist.gov wrote: I hope someone can help me out. I'm getting started with Hadoop, have written the first part of my project (a custom InputFormat), and am now using that to test out my cluster setup. I'm running 0.19.0. I have five dual-core Linux workstations with most of a 250GB disk available for playing, and am controlling things from my Mac Pro. (This is not the production cluster; that hasn't been assembled yet. This is just to get the code working and figure out the bumps.) My test data is about 18GB of web pages, and the test app at the moment just counts the number of web pages in each bundle file. The map jobs run just fine, but when it gets into the reduce, the TaskTrackers all get lost to the JobTracker. I can't see why, because the TaskTrackers are all still running on the slaves. Also, the jobdetails URL starts returning an HTTP 500 error, although other links from that page still work. I've tried going onto the slaves and manually restarting the tasktrackers with hadoop-daemon.sh, and also turning on job restarting in the site conf and then running stop-mapred/start-mapred. The trackers start up and try to clean up and get going again, but they then just get lost again.
Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200902021252_0002_r_05_1' from 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; resending the previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54311 caught: java.lang.NullPointerException
        at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
        at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
        at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)
2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464

And from a slave:

2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on local exception: null
        at org.apache.hadoop.ipc.Client.call(Client.java:699)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
        at
Re: Question about HDFS capacity and remaining
Hi Brian, Is it possible to publish these test results along with the configuration options? -Sagar

Brian Bockelman wrote: For what it's worth, our organization did extensive tests on many filesystems, benchmarking their performance when they are 90-95% full. Only XFS retained most of its performance when it was mostly full (ext4 was not tested)... so, if you are thinking of pushing things to the limits, that might be something worth considering. Brian

On Jan 30, 2009, at 11:18 AM, stephen mulcahy wrote: Bryan Duxbury wrote: Hm, very interesting. Didn't know about that. What's the purpose of the reservation? Just to give root preference or leave wiggle room? If it's not strictly necessary it seems like it would make sense to reduce it to essentially 0%. AFAIK it is needed for defragmentation / fsck to work properly, and your filesystem performance will degrade a lot if you reduce this to 0% (but I'd love to hear otherwise :) -stephen
Re: sudden instability in 0.18.2
Please check which nodes have these failures. I guess the new tasktrackers/machines are not configured correctly. As a result, the map-tasks die and the remaining map-tasks get pulled onto these machines. -Sagar

David J. O'Dell wrote: We've been running 0.18.2 for over a month on an 8 node cluster. Last week we added 4 more nodes to the cluster and have experienced 2 failures of the tasktrackers since then. The namenodes are running fine, but all jobs submitted will die, with this error on the tasktrackers:

2009-01-28 08:07:55,556 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: attempt_200901280756_0012_m_74_2
2009-01-28 08:07:55,682 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200901280756_0012_m_74_2 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

I tried running the tasktrackers in debug mode but the entries above are all that show up in the logs. As of now my cluster is down.
Re: tools for scrubbing HDFS data nodes?
Check out fsck:

bin/hadoop fsck <path> -files -blocks -locations

Sriram Rao wrote: By scrub I mean, have a tool that reads every block on a given data node. That way, I'd be able to find corrupted blocks proactively rather than having an app read the file and find it. Sriram

On Wed, Jan 28, 2009 at 5:57 PM, Aaron Kimball aa...@cloudera.com wrote: By scrub do you mean delete the blocks from the node? Read your conf/hadoop-site.xml file to determine where dfs.data.dir points, then for each directory in that list, just rm the directory. If you want to ensure that your data is preserved with appropriate replication levels on the rest of your cluster, you should use Hadoop's DataNode decommission feature to up-replicate the data before you blow a copy away. - Aaron

On Wed, Jan 28, 2009 at 2:10 PM, Sriram Rao srirams...@gmail.com wrote: Hi, Is there a tool that one could run on a datanode to scrub all the blocks on that node? Sriram
Re: tools for scrubbing HDFS data nodes?
In addition to the datanode itself finding corrupted blocks (as Owen mentioned), if the client finds a corrupted block, it will go to another replica. What's your replication factor? -Sagar

Sriram Rao wrote: Does this read every block of every file from all replicas and verify that the checksums are good? Sriram

On Wed, Jan 28, 2009 at 6:20 PM, Sagar Naik sn...@attributor.com wrote: Check out fsck: bin/hadoop fsck <path> -files -blocks -locations

Sriram Rao wrote: By scrub I mean, have a tool that reads every block on a given data node. That way, I'd be able to find corrupted blocks proactively rather than having an app read the file and find it. Sriram

On Wed, Jan 28, 2009 at 5:57 PM, Aaron Kimball aa...@cloudera.com wrote: By scrub do you mean delete the blocks from the node? Read your conf/hadoop-site.xml file to determine where dfs.data.dir points, then for each directory in that list, just rm the directory. If you want to ensure that your data is preserved with appropriate replication levels on the rest of your cluster, you should use Hadoop's DataNode decommission feature to up-replicate the data before you blow a copy away. - Aaron

On Wed, Jan 28, 2009 at 2:10 PM, Sriram Rao srirams...@gmail.com wrote: Hi, Is there a tool that one could run on a datanode to scrub all the blocks on that node? Sriram
Re: HDFS - millions of files in one directory?
Consider a system with 1 billion small files. The namenode will need to maintain the data structure for all those files. The system will have at least 1 block per file, and if you have the replication factor set to 3, the system will have 3 billion blocks. Now, if you try to read all these files in a job, you will be making as many as 1 billion socket connections to get these blocks. (Big brothers, correct me if I'm wrong.) Datanodes routinely check for available disk space and collect block reports, and these operations are directly dependent on the number of blocks on a datanode. Getting all the data into one file avoids all this unnecessary IO and the memory occupied on the namenode. Also, the number of maps in a map-reduce job is based on the number of blocks; in the case of multiple files, we will have a large number of map-tasks. -Sagar

Mark Kerzner wrote: Carfield, you might be right, and I may be able to combine them in one large file. What would one use for a delimiter, so that it would never be encountered in normal binary files? Performance does matter (rarely it doesn't). What are the differences in performance between using multiple files and one large file? I would guess that one file should in fact give better hardware/OS performance, because it is more predictable and allows buffering. thank you, Mark

On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim carfi...@carfield.com.hk wrote: Really? I thought any file can be combined as long as you can figure out a delimiter, and you really cannot have some delimiters? Like X? And in the worst case, or if performance is not really a matter, maybe just encode all binary to and from ASCII? On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner markkerz...@gmail.com wrote: Yes, flip suggested such a solution, but his files are text, so he could combine them all in a large text file, with each line representing the initial files. My files, however, are binary, so I do not see how I could combine them.
However, since my numbers are limited to about 1 billion files total, I should be OK to put them all in a few directories with under, say, 10,000 files each. Maybe a little balanced tree, but 3-4 levels should suffice. Thank you, Mark

On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim carfi...@carfield.com.hk wrote: Possibly simply have a file large in size instead of having a lot of small files? On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance a directory tree if I want to store millions of files, or can I put them all in the same directory? Thank you, Mark
Mapred job parallelism
Hi Guys, I was trying to set up a cluster so that two jobs can run simultaneously. The conf: number of nodes: 4 (say); mapred.tasktracker.map.tasks.maximum=2; and in the JobClient, mapred.map.tasks=4 (# of nodes). I also have a condition that each job should have only one map-task per node. In short, I created 8 map slots and set the number of mappers per job to 4, so two jobs can run simultaneously. However, I realized that if a tasktracker happens to die, I will potentially have 2 map-tasks of the same job running on one node. Setting mapred.tasktracker.map.tasks.maximum=1 in the JobClient has no effect: it is a tasktracker property and can't be changed per job. Any ideas on how to have 2 jobs running simultaneously? -Sagar
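For reference, the two configuration scopes mentioned above look roughly like this in the 0.18/0.19-era config files (property names are taken from the message; the values mirror the setup described, not a recommendation):

```xml
<!-- Tasktracker-side (hadoop-site.xml on each node): total map slots per node.
     This is read by the tasktracker at startup and cannot be overridden per job. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>

<!-- Job-side (set by each JobClient): number of map tasks requested for this job. -->
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
```

This illustrates the asymmetry the poster ran into: the per-node slot count is cluster configuration, while only the per-job task count is under the JobClient's control.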
Re: Calling a mapreduce job from inside another
You can also play with the priority of the jobs to have the innermost job finish first. -Sagar

Devaraj Das wrote: You can chain job submissions at the client. Also, you can run more than one job in parallel (if you have enough task slots). An example of chaining jobs is in src/examples/org/apache/hadoop/examples/Grep.java, where the grep-search and grep-sort jobs are chained. On 1/18/09 9:58 AM, Aditya Desai aditya3...@gmail.com wrote: Is it possible to call a mapreduce job from inside another? If yes, how? And is it possible to disable the reducer completely, that is, suspend the job immediately after the call to map has terminated? I have tried -reducer NONE. I am using the streaming API to code in Python. Regards, Aditya Desai.
Locks in hadoop
I would like to implement a locking mechanism across the hdfs cluster. I assume there is no inherent support for it, so I was going to do it with files. To my knowledge, file creation is an atomic operation, so a file-based lock should work. I need to think through all the conditions, but if someone has a better idea/solution, please share. Thanks -Sagar
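The create-as-lock pattern described above can be illustrated on the local filesystem (class name and layout are hypothetical): File.createNewFile() atomically creates the file only if it does not already exist, so exactly one contender wins. On HDFS, the assumed analogue would be a create call with overwrite disabled, which fails if the path exists; that mapping is an assumption of this sketch, not verified here.

```java
import java.io.File;
import java.io.IOException;

// Local-filesystem sketch of a file-based lock: atomic create-if-absent.
public class FileLockSketch {
    private final File lockFile;

    public FileLockSketch(File lockFile) {
        this.lockFile = lockFile;
    }

    /** Try to take the lock; returns true iff this caller created the lock file. */
    public boolean tryLock() throws IOException {
        // createNewFile() is atomic: it returns true only for the one
        // caller that actually created the file.
        return lockFile.createNewFile();
    }

    /** Release the lock by removing the lock file. */
    public void unlock() {
        lockFile.delete();
    }

    public static void main(String[] args) throws IOException {
        File f = new File(System.getProperty("java.io.tmpdir"), "hdfs-lock-demo");
        f.delete(); // start clean for the demo
        FileLockSketch lock = new FileLockSketch(f);
        System.out.println(lock.tryLock()); // true: we created the file
        System.out.println(lock.tryLock()); // false: lock already held
        lock.unlock();
    }
}
```

One condition worth thinking through, as the post says: if the lock holder crashes, the lock file is never deleted, so a practical version would also need some lease or timeout convention to reclaim stale locks.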
Namenode freeze
Hi, A datanode goes down, and then it looks like the ReplicationMonitor tries to even out the replication. While doing so, it holds the lock on FSNamesystem, and with this lock held, other threads wait on it. As a result, the namenode does not list dirs and the web UI does not respond. I would appreciate any pointers on this problem. (Hadoop 0.18.1) -Sagar

Namenode freeze stackdump:
2009-01-14 00:57:02 Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode):

"SocketListener0-4" prio=10 tid=0x2aac54008000 nid=0x644d in Object.wait() [0x4535a000..0x4535aa80]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x2aab6cb1dba0> (a org.mortbay.util.ThreadPool$PoolThread)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
        - locked <0x2aab6cb1dba0> (a org.mortbay.util.ThreadPool$PoolThread)
   Locked ownable synchronizers:
        - None

"SocketListener0-5" prio=10 tid=0x2aac54008c00 nid=0x63f1 in Object.wait() [0x4545b000..0x4545bb00]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x2aab6c2ea1a8> (a org.mortbay.util.ThreadPool$PoolThread)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
        - locked <0x2aab6c2ea1a8> (a org.mortbay.util.ThreadPool$PoolThread)
   Locked ownable synchronizers:
        - None

"Trash Emptier" daemon prio=10 tid=0x511ca400 nid=0x1fd waiting on condition [0x45259000..0x45259a00]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.fs.Trash$Emptier.run(Trash.java:219)
        at java.lang.Thread.run(Thread.java:619)
   Locked ownable synchronizers:
        - None

"org.apache.hadoop.dfs.dfsclient$leasechec...@767a9224" daemon prio=10 tid=0x51384400 nid=0x1fc sleeping [0x45158000..0x45158a80]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:792)
        at java.lang.Thread.run(Thread.java:619)
   Locked ownable synchronizers:
        - None

"IPC Server handler 44 on 54310" daemon prio=10 tid=0x2aac40183c00 nid=0x1f4 waiting for monitor entry [0x44f56000..0x44f56d80]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.dfs.FSNamesystem.blockReportProcessed(FSNamesystem.java:1880)
        - waiting to lock <0x2aaab423a530> (a org.apache.hadoop.dfs.FSNamesystem)
        at org.apache.hadoop.dfs.FSNamesystem.handleHeartbeat(FSNamesystem.java:2127)
        at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:602)
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
   Locked ownable synchronizers:
        - None

"IPC Server handler 43 on 54310" daemon prio=10 tid=0x2aac40182400 nid=0x1f3 waiting for monitor entry [0x44e55000..0x44e55a00]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:922)
        - waiting to lock <0x2aaab423a530> (a org.apache.hadoop.dfs.FSNamesystem)
        at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
        at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
   Locked ownable synchronizers:
        - None

"IPC Server handler 42 on 54310" daemon prio=10 tid=0x2aac40181000 nid=0x1f2 waiting for monitor entry [0x44d54000..0x44d54a80]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.dfs.FSNamesystem.blockReportProcessed(FSNamesystem.java:1880)
        - waiting to lock <0x2aaab423a530> (a org.apache.hadoop.dfs.FSNamesystem)
        at org.apache.hadoop.dfs.FSNamesystem.handleHeartbeat(FSNamesystem.java:2127)
        at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:602)
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
   Locked ownable synchronizers:
        - None

"IPC Server handler 41 on 54310" daemon
Re: 0.18.1 datanode pseudo deadlock problem
Hi Raghu, The periodic du and block report threads thrash the disk (a block report takes about 21 minutes on average), and I think all the datanode threads end up unable to do much and freeze.

"org.apache.hadoop.dfs.datanode$dataxcei...@f2127a" daemon prio=10 tid=0x41f06000 nid=0x7c7c waiting for monitor entry [0x43918000..0x43918f50]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.dfs.FSDataset.getFile(FSDataset.java:1158)
        - waiting to lock <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
        at org.apache.hadoop.dfs.FSDataset.validateBlockFile(FSDataset.java:1074)
        at org.apache.hadoop.dfs.FSDataset.isValidBlock(FSDataset.java:1066)
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:894)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
        at java.lang.Thread.run(Thread.java:619)
   Locked ownable synchronizers:
        - None

"org.apache.hadoop.dfs.datanode$dataxcei...@1bcee17" daemon prio=10 tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry [0x459fe000..0x459ff0d0]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473)
        - waiting to lock <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
        - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
        at java.lang.Thread.run(Thread.java:619)
   Locked ownable synchronizers:
        - None

"DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
   java.lang.Thread.State: RUNNABLE
        at java.lang.String.lastIndexOf(String.java:1628)
        at java.io.File.getName(File.java:399)
        at org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148)
        at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
        at org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
        at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511)
        - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
        at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
        at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
        at java.lang.Thread.run(Thread.java:619)

The lock <0x54e550e0> is held by another similar thread, and that thread is waiting on the FSVolumeSet lock, blocked by getBlockReport(). In fact, during this time the datanode appears as a dead node, and clients keep getting createBlockException with a timeout. We don't see this problem on other DNs with fewer blocks, so I think the 2 million files are the issue here. Please correct me if I missed something. -Sagar

Raghu Angadi wrote: The scan required for each block report is a well-known issue and it can be fixed. It was discussed multiple times (e.g. https://issues.apache.org/jira/browse/HADOOP-3232?focusedCommentId=12587795#action_12587795 ). Earlier, the inline 'du' on datanodes used to cause the same problem, and it was moved to a separate thread (HADOOP-3232). Block reports can do the same... Though 2M blocks on a DN is very large, there is no reason block reports should break things. Once we fix block reports, something else might break, but that is a different issue. Raghu.

Jason Venner wrote: The problem we are having is that datanodes periodically stall for 10-15 minutes, drop off the active list, and then come back. What is going on is that a long operation set is holding the lock on FSDataset.volumes, and all of the other block service requests stall behind this lock.
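The fix Raghu describes — moving the expensive scan out of the critical section — follows a general pattern: hold the lock only long enough to snapshot the state, then do the slow per-block work without holding it, so writers are never blocked behind a multi-minute scan. A hedged sketch in plain Java (class and method names are illustrative, not the FSDataset API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "snapshot under lock, scan outside the lock": the long
// per-block work no longer blocks threads waiting on the same monitor.
// Names are illustrative, not Hadoop's.
public class BlockReportSketch {
    private final List<String> blocks = new ArrayList<>();

    public synchronized void addBlock(String b) {    // fast path, lock held briefly
        blocks.add(b);
    }

    public List<String> buildReport() {
        List<String> snapshot;
        synchronized (this) {                        // lock only to copy the list
            snapshot = new ArrayList<>(blocks);
        }
        List<String> report = new ArrayList<>();
        for (String b : snapshot) {                  // slow scan, no lock held
            report.add(b.toUpperCase());             // stand-in for per-block disk stat
        }
        return report;
    }
}
```

The trade-off is that the report may be slightly stale with respect to blocks added during the scan, which a periodic report can tolerate.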
Re: cannot allocate memory error
Check {HADOOP_HOME}/conf/hadoop-env.sh for the export HADOOP_HEAPSIZE line; the default is 1000 MB, so I think there could be another issue. -Sagar

sagar arlekar wrote: Hello, I am new to hadoop. I am running hadoop 0.17 in a Eucalyptus cloud instance (it is a CentOS image on Xen). bin/hadoop dfs -ls / gives the following exception:
08/12/31 08:58:10 WARN fs.FileSystem: "localhost:9000" is a deprecated filesystem name. Use "hdfs://localhost:9000/" instead.
08/12/31 08:58:10 WARN fs.FileSystem: uri=hdfs://localhost:9000
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": java.io.IOException: error=12, Cannot allocate memory
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
        at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1353)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1289)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108)
        at org.apache.hadoop.fs.FsShell.init(FsShell.java:87)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1717)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1866)
Bad connection to FS. command aborted.
Running the command again gives:
bin/hadoop dfs -ls /
Error occurred during initialization of VM
Could not reserve enough space for object heap
Changing the value of the 'mapred.child.java.opts' property in hadoop-site.xml did not help. Kindly help me. What can I do to give more memory to hadoop? BTW, is there a way to search through the mail archive? I only saw the mails listed according to year and month.
Regards, Sagar
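The "Cannot allocate memory" on whoami typically means the JVM could not fork a child process because the requested heap exceeds what the instance can commit; the second failure ("Could not reserve enough space for object heap") points the same way. A hedged sketch of things to check — the values below are illustrative, not from the thread:

```shell
# See how much memory the instance actually has; small cloud instances
# often cannot back the 1000 MB default daemon heap plus a fork.
free -m

# Lower the daemon heap in conf/hadoop-env.sh (example value):
export HADOOP_HEAPSIZE=256

# Lower the per-task JVM heap via mapred.child.java.opts in
# hadoop-site.xml, e.g. -Xmx128m, then restart the daemons.
```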
Re: Threads per mapreduce job
mapred.map.multithreadedrunner.threads is the property you are looking for. Michael wrote: Hi everyone: How do I control the number of threads per mapreduce job? I am using bin/hadoop jar wordcount to run jobs, and even though I have found these settings in hadoop-default.xml and changed the values to 1: <name>mapred.tasktracker.map.tasks.maximum</name> <name>mapred.tasktracker.reduce.tasks.maximum</name> the output of the job seems to indicate otherwise. 08/12/26 18:21:12 INFO mapred.JobClient: Job Counters 08/12/26 18:21:12 INFO mapred.JobClient: Launched reduce tasks=1 08/12/26 18:21:12 INFO mapred.JobClient: Rack-local map tasks=12 08/12/26 18:21:12 INFO mapred.JobClient: Launched map tasks=17 08/12/26 18:21:12 INFO mapred.JobClient: Data-local map tasks=4 I have 2 servers running the mapreduce process and the datanode process. Thanks, Michael
Re: Failed to start TaskTracker server
Well, you have some process which grabs this port, and Hadoop is not able to bind it. By the time you check, there is a chance that the socket connection has died but the port was occupied when the Hadoop process was attempting to bind. Check all the processes running on the system. Do any of them acquire ports? -Sagar

ascend1 wrote: I have made a Hadoop platform on 15 machines recently. NameNode and DataNodes work properly, but when I use bin/start-mapred.sh to start the MapReduce framework, only 3 or 4 TaskTrackers start properly. All those that couldn't be started have the same error. Here's the log:

2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG:
/ STARTUP_MSG: Starting TaskTracker
STARTUP_MSG: host = msra-5lcd05/172.23.213.80
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 /
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed to start: socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.net.BindException: Address already in use: JVM_Bind
        at java.net.PlainSocketImpl.socketBind(Native Method)
        at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
        at java.net.ServerSocket.bind(ServerSocket.java:319)
        at java.net.ServerSocket.<init>(ServerSocket.java:185)
        at org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391)
        at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
        at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
        at org.mortbay.http.SocketListener.start(SocketListener.java:203)
        at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
        at org.mortbay.util.Container.start(Container.java:72)
        at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
        at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:894)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
/ SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80 /

Then I used netstat -an, but port 50060 isn't in the list, and ps -af also shows no program using 50060. The strange point is that when I repeat bin/start-mapred.sh and bin/stop-mapred.sh several times, the list of machines that can start the TaskTracker seems random. Could anybody help me solve this problem?
Re: Failed to start TaskTracker server
Check hadoop-default.xml; there you will find all the ports used. Copy the XML nodes from hadoop-default.xml to hadoop-site.xml, change the port values in hadoop-site.xml, and deploy it on the datanodes. Rico wrote: Well, the machines are all servers that are probably running many services, but I have no permission to change or modify other users' programs or settings. Is there any way to change 50060 to another port? Sagar Naik wrote: Well, you have some process which grabs this port, and Hadoop is not able to bind it. By the time you check, there is a chance that the socket connection has died but the port was occupied when the Hadoop process was attempting to bind. Check all the processes running on the system. Do any of them acquire ports? -Sagar ascend1 wrote: I have made a Hadoop platform on 15 machines recently. NameNode and DataNodes work properly, but when I use bin/start-mapred.sh to start the MapReduce framework, only 3 or 4 TaskTrackers start properly. All those that couldn't be started have the same error.
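netstat can miss a port that is only briefly held, so the definitive check is to try binding it directly, as the TaskTracker itself does. A small stdlib sketch of that probe — port numbers are examples (50060 is the TaskTracker HTTP default mentioned in the thread):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Probe whether a TCP port can be bound right now. A port held by a
// live listener fails the bind even with SO_REUSEADDR set (reuse only
// helps with sockets lingering in TIME_WAIT).
public class PortProbe {
    public static boolean isBindable(int port) {
        try (ServerSocket ss = new ServerSocket()) {
            ss.setReuseAddress(true);
            ss.bind(new InetSocketAddress(port));
            return true;
        } catch (IOException e) {
            return false;          // something else holds the port
        }
    }

    public static void main(String[] args) {
        System.out.println("50060 bindable: " + isBindable(50060));
    }
}
```

Running this in a loop around start-mapred.sh would show whether the port is genuinely free at the moment the TaskTracker tries to bind it.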
0.18.1 jobtracker deadlock
Hi,

Found one Java-level deadlock:
=============================
"SocketListener0-7":
  waiting to lock monitor 0x0845e1fc (object 0x54f95838, a org.apache.hadoop.mapred.JobTracker),
  which is held by "IPC Server handler 0 on 54311"
"IPC Server handler 0 on 54311":
  waiting to lock monitor 0x4d671064 (object 0x57250a60, a org.apache.hadoop.mapred.JobInProgress),
  which is held by "initJobs"
"initJobs":
  waiting to lock monitor 0x0845e1fc (object 0x54f95838, a org.apache.hadoop.mapred.JobTracker),
  which is held by "IPC Server handler 0 on 54311"

Java stack information for the threads listed above:
===================================================
"SocketListener0-7":
        at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:1826)
        - waiting to lock <0x54f95838> (a org.apache.hadoop.mapred.JobTracker)
        at org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:135)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
"IPC Server handler 0 on 54311":
        at org.apache.hadoop.mapred.JobInProgress.kill(JobInProgress.java:1451)
        - waiting to lock <0x57250a60> (a org.apache.hadoop.mapred.JobInProgress)
        at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
        - locked <0x54f95838> (a org.apache.hadoop.mapred.JobTracker)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
"initJobs":
        at org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:1015)
        - waiting to lock <0x54f95838> (a org.apache.hadoop.mapred.JobTracker)
        at org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:1656)
        - locked <0x57250a60> (a org.apache.hadoop.mapred.JobInProgress)
        at org.apache.hadoop.mapred.JobInProgress.kill(JobInProgress.java:1469)
        - locked <0x57250a60> (a org.apache.hadoop.mapred.JobInProgress)
        at org.apache.hadoop.mapred.JobTracker$JobInitThread.run(JobTracker.java:416)
        at java.lang.Thread.run(Thread.java:619)

Found 1 deadlock.

I found this condition and will try to work on it. -Sagar
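The cycle in the dump above is the classic inconsistent lock-ordering bug: the killJob path takes the JobTracker lock and then the JobInProgress lock, while the initJobs path takes them in the opposite order. The standard fix is to acquire both monitors in one global order on every path. A minimal sketch of that fix — class, field, and method names here are illustrative, not Hadoop's:

```java
// Sketch of deadlock avoidance by consistent lock ordering. If every
// code path locks `tracker` before `job`, the cyclic wait seen in the
// dump cannot form. Names are illustrative, not Hadoop's API.
public class LockOrdering {
    private final Object tracker = new Object(); // always acquired first
    private final Object job = new Object();     // always acquired second

    // Analogous to killJob(): tracker first, then job.
    public void killJob(Runnable action) {
        synchronized (tracker) {
            synchronized (job) {
                action.run();
            }
        }
    }

    // Analogous to the initJobs/garbageCollect path. The buggy version
    // locked job first and tracker second, which deadlocks against
    // killJob(); here it follows the same tracker-first order.
    public void garbageCollect(Runnable action) {
        synchronized (tracker) {
            synchronized (job) {
                action.run();
            }
        }
    }
}
```

The actual 0.18 fix would have to restructure the call paths so finalizeJob is not invoked while the JobInProgress monitor is held; the sketch only shows the ordering invariant.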
DiskUsage ('du -sk') probably hangs Datanode
I see createBlockException and "Abandoning block" quite often. When I check the datanodes, they are running, and I can browse the filesystem from datanode:50075. However, I also notice a du forked off from the DN. This 'du' runs anywhere from 6 minutes to 30 minutes, and during this time no logs are generated. The DN appears in the S1 state and the 'du' in the D state. Is it possible that the JVM has a bug or the HDD is bad? I am using /usr/java/jdk1.6.0_07/bin/java and planning to move to update 11. I started noticing this after DFS became 50% (on average) full. Please help me with some pointers. Hadoop version: 0.18.1 -Sagar
Re: DiskUsage ('du -sk') probably hangs Datanode
Brian Bockelman wrote: Hey Sagar, If the 'du' is in the D state, then that probably means bad things for your hardware. I recommend looking in dmesg and /var/log/messages for anything interesting, as well as performing a hard-drive diagnostic (it may be as simple as a SMART test) to see if there's an issue. I can't say for sure, but the 'du' is probably not hanging the datanode; it's probably a symptom of larger problems. Brian

Thanks Brian, I will start SMART tests. Please tell me what direction I should look in, in case of larger problems. -Sagar
Re: occasional createBlockException in Hadoop .18.1
Hi, Some data points on this issue: 1) du runs for 20-30 secs. 2) After some time, I don't see any activity in the datanode logs. 3) I can't even jstack the datanode (I forced it, got a DebuggerException, and double-checked the pid), and datanode:50075/stacks takes forever to respond. I can telnet to datanode:50010. I think the disk is bad or something. Please suggest some pointers to analyze this problem. -Sagar

Sagar Naik wrote: CLIENT EXCEPTION: 2008-12-14 08:41:46,919 [Thread-90] INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.50.80.133:54045 remote=/10.50.80.108:50010] 2008-12-14 08:41:46,919 [Thread-90] INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-7364265396616885025_5870078 2008-12-14 08:41:46,920 [Thread-90] INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.50.80.108:50010 DATANODE 2008-12-14 08:40:39,215 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-7364265396616885025_5870078 src: /10.50.80.133:54045 dest: /10.50.80.133:50010 . . . . . I occasionally see the datanode as a dead node. When the datanode is a dead node, I see the du forked from the datanode, and the du is seen in the D state. Any pointers to debug this would help me. -Sagar
Re: Q about storage architecture
http://hadoop.apache.org/core/docs/r0.18.2/hdfs_design.html Sirisha Akkala wrote: Hi, I would like to know if the Hadoop architecture more resembles SAN or NAS? I'm guessing it is NAS. Or does it fall under a totally different category? If so, can you please email brief information? Thanks, Sirisha.
Re: getting Configuration object in mapper
Check: mapred.task.is.map Craig Macdonald wrote: I have a related question - I have a class which is both a mapper and a reducer. How can I tell in configure() if the current task is a map or a reduce task? Parse the task id? C Owen O'Malley wrote: On Dec 4, 2008, at 9:19 PM, abhinit wrote: I have set some variables using the JobConf object, e.g. jobConf.set("Operator", operator). How can I get an instance of the Configuration/JobConf object inside a map method so that I can retrieve these variables? In your Mapper class, implement a method like: public void configure(JobConf job) { ... } This will be called with the job conf when the object is created. -- Owen
Re: Bad connection to FS. command aborted.
Check your conf in the classpath, and check whether the namenode is running; you are not able to connect to the intended namenode. -Sagar elangovan anbalahan wrote: I'm getting this error message when I am doing *bash-3.2$ bin/hadoop dfs -put urls urls*. Please let me know the resolution; I have a project submission in a few hours.
Re: Bad connection to FS. command aborted.
Which hadoop version? Command: bin/hadoop version -Sagar elangovan anbalahan wrote: I tried that but nothing happened. bash-3.2$ bin/hadoop dfs -put urll urll put: java.io.IOException: failed to create file /user/nutch/urll/.urls.crc on client 192.168.1.6 because target-length is 0, below MIN_REPLICATION (1) bash-3.2$ bin/hadoop dfs -cat urls/part-0* urls bash-3.2$ bin/hadoop dfs -ls urls Found 0 items bash-3.2$ bin/hadoop dfs -ls urll Found 0 items bash-3.2$ bin/hadoop dfs -ls Found 2 items /user/nutch/$dir /user/nutch/urlldir How do I get rid of the following error: *put: java.io.IOException: failed to create file /user/nutch/urll/.urls.crc on client 192.168.1.6 because target-length is 0, below MIN_REPLICATION (1)* On Thu, Dec 4, 2008 at 1:29 PM, Elia Mazzawi [EMAIL PROTECTED] wrote: You didn't say what the error was, but you can try this; it should do the same thing: bin/hadoop dfs -cat urls/part-0* urls elangovan anbalahan wrote: I'm getting this error message when I am doing *bash-3.2$ bin/hadoop dfs -put urls urls*. Please let me know the resolution; I have a project submission in a few hours.
Re: Hadoop datanode crashed - SIGBUS
Brian Bockelman wrote: Hardware/memory problems? I am not sure. SIGBUS is relatively rare; it sometimes indicates a hardware error in the memory system, depending on your arch. *uname -a:* Linux hdimg53 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST 2006 i686 i686 i386 GNU/Linux *top's top:* Cpu(s): 0.1% us, 1.1% sy, 0.0% ni, 98.0% id, 0.8% wa, 0.0% hi, 0.0% si Mem: 8288280k total, 1575680k used, 6712600k free, 5392k buffers Swap: 16386292k total, 68k used, 16386224k free, 522408k cached 8-core Xeon 2GHz Brian On Dec 1, 2008, at 3:00 PM, Sagar Naik wrote: A couple of the datanodes crashed with the following error. /tmp is 15% occupied. # # An unexpected error has been detected by Java Runtime Environment: # # SIGBUS (0x7) at pc=0xb4edcb6a, pid=10111, tid=1212181408 # [Too many errors, abort] Please suggest how I should go about debugging this particular problem. -Sagar Thanks to Brian -Sagar
Re: Hadoop datanode crashed - SIGBUS
None of the jobs use compression, for sure. -Sagar Brian Bockelman wrote: I'd run memcheck overnight on the nodes that caused the problem, just to be sure. Another (unlikely) possibility is that the JNI callouts for the native libraries Hadoop uses (for the compression codecs, I believe) have crashed or were set up wrong, and died fatally enough to take out the JVM. Are you using any compression? Does your job complete successfully in local mode, if the crash correlates well with a job running? Brian
Re: Hadoop datanode crashed - SIGBUS
Hi, I don't have additional information on it. If you know of any other flag that I need to turn on, please do tell me. The flags currently on are -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParallelGC -Dcom.sun.management.jmxremote. This is what is listed in the stdout (datanode.out) file. Java version: java version 1.6.0_07 Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) Server VM (build 10.0-b23, mixed mode) I will try to stress test the memory. -Sagar

Chris Collins wrote: Was there anything mentioned as part of the tombstone message about a problematic frame? What Java are you using? There are a few reasons for SIGBUS errors; one is illegal address alignment, but from Java that's very unlikely. There were some issues with the native zip library in older VMs. As Brian pointed out, sometimes this points to a hw issue. C
Re: Namenode BlocksMap on Disk
We can also try to mount the particular dir on ramfs and reduce the performance degradation. -Sagar Billy Pearson wrote: I would like to see something like this also. I run 32-bit servers, so I am limited in how much memory I can use for heap. Besides just storing to disk, I would like to see some sort of cache, like a block cache that caches parts of the BlocksMap; this would help reduce the hits to disk for lookups and still give us the ability to lower the memory requirement for the namenode. Billy Dennis Kubes [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] From time to time a message pops up on the mailing list about OOM errors for the namenode because of too many files. Most recently there was a 1.7 million file installation that was failing. I know the simple solution to this is to have a larger Java heap for the namenode. But the non-simple way would be to convert the BlocksMap for the NameNode to be stored on disk and then queried and updated for operations. This would eliminate memory problems for large file installations but might also degrade performance slightly. Questions: 1) Is there any current work to allow the namenode to store on disk versus in memory? This could be a configurable option. 2) Besides a possible slight degradation in performance, is there a reason why the BlocksMap shouldn't or couldn't be stored on disk? I am willing to put forth the work to make this happen. Just want to make sure I am not going down the wrong path to begin with. Dennis
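The ramfs suggestion can be sketched as below; the mount point and size are illustrative. tmpfs (size-capped, swappable) is usually preferred over ramfs, which has no size limit, and a RAM-backed mount does not survive a reboot, so it only suits data that can be rebuilt:

```shell
# Illustrative only: back a directory with RAM. tmpfs is capped and can
# swap under pressure; ramfs grows unbounded. Contents are lost on reboot.
mkdir -p /mnt/blocksmap
mount -t tmpfs -o size=2g tmpfs /mnt/blocksmap
```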
64 bit namenode and secondary namenode 32 bit datanode
I am trying to migrate from a 32-bit JVM to a 64-bit JVM for the namenode only. *setup* NN - 64 bit; Secondary namenode (instance 1) - 64 bit; Secondary namenode (instance 2) - 32 bit; datanode - 32 bit. From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works. But I am not sure if S-NN (instance 1 --- 64 bit) and S-NN (instance 2 --- 32 bit) will work with this setup. Also, should I be aware of any other issues in migrating over to a 64-bit namenode? Thanks in advance for all the suggestions. -Sagar
Re: 64 bit namenode and secondary namenode 32 bit datanode
lohit wrote: I might be wrong, but my assumption is that running the SNN in either 64/32 bit shouldn't matter. But I am curious how the two instances of the secondary namenode are set up; will both of them talk to the same NN and run in parallel? What are the advantages here?

I just have multiple entries in the masters file. I am not aware of image corruption (did not take a look into it). I did it for SNN redundancy. Please correct me if I am wrong. Thanks, Sagar

Wondering if there are chances of image corruption. Thanks, lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 3:58:53 PM Subject: 64 bit namenode and secondary namenode 32 bit datanode I am trying to migrate from a 32-bit JVM to a 64-bit JVM for the namenode only. *setup* NN - 64 bit; Secondary namenode (instance 1) - 64 bit; Secondary namenode (instance 2) - 32 bit; datanode - 32 bit. From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works. But I am not sure if S-NN (instance 1 --- 64 bit) and S-NN (instance 2 --- 32 bit) will work with this setup. Also, should I be aware of any other issues in migrating over to a 64-bit namenode? Thanks in advance for all the suggestions. -Sagar
Re: Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
Include the ${HADOOP}/conf/ dir in the classpath of the Java program. Alternatively, you can also try: bin/hadoop jar your_jar main_class args -Sagar

Saju K K wrote: This is in reference to the sample application in JavaWorld: http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html?page=5 bin/hadoop dfs -mkdir /opt/www/hadoop/hadoop-0.18.2/words bin/hadoop dfs -put word1 /opt/www/hadoop/hadoop-0.18.2/words bin/hadoop dfs -put word2 /opt/www/hadoop/hadoop-0.18.2/words bin/hadoop dfs -put word3 /opt/www/hadoop/hadoop-0.18.2/words bin/hadoop dfs -put word4 /opt/www/hadoop/hadoop-0.18.2/words When I browse through http://serdev40.apac.nokia.com:50075/browseDirectory.jsp I can see the files in the directory. Also, the commands below execute properly: bin/hadoop dfs -ls /opt/www/hadoop/hadoop-0.18.2/words/ bin/hadoop dfs -ls /opt/www/hadoop/hadoop-0.18.2/words/word1 bin/hadoop dfs -cat /opt/www/hadoop/hadoop-0.18.2/words/word1 But on executing this command, I am getting an error: java -Xms1024m -Xmx1024m com.nokia.tag.test.EchoOhce /opt/www/hadoop/hadoop-0.18.2/words/ result java -Xms1024m -Xmx1024m com.nokia.tag.test.EchoOhce /opt/www/hadoop/hadoop-0.18.2/words result 08/11/24 10:52:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/www/hadoop/hadoop-0.18.2/words at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:210) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026) at com.nokia.tag.test.EchoOhce.run(EchoOhce.java:123) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.nokia.tag.test.EchoOhce.main(EchoOhce.java:129) Does anybody know why there is a failure from the Java application?
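The advice above can be sketched as command lines. The paths and the main class are taken from the quoted question; the core jar name and the application jar name (your_app.jar) are assumptions for illustration and must match your install. Putting conf/ (which holds hadoop-site.xml) on the classpath lets the client resolve fs.default.name to HDFS instead of defaulting to file://, which is why the error shows a file:/ path.

```shell
HADOOP_HOME=/opt/www/hadoop/hadoop-0.18.2   # adjust to your install

# Option 1: run the class directly, with conf/ and the Hadoop jars on the classpath
java -Xms1024m -Xmx1024m \
  -cp "$HADOOP_HOME/conf:$HADOOP_HOME/hadoop-0.18.2-core.jar:your_app.jar" \
  com.nokia.tag.test.EchoOhce /opt/www/hadoop/hadoop-0.18.2/words result

# Option 2: let the hadoop launcher assemble the classpath (conf/, core jar, lib/) for you
$HADOOP_HOME/bin/hadoop jar your_app.jar com.nokia.tag.test.EchoOhce \
  /opt/www/hadoop/hadoop-0.18.2/words result
```

Option 2 is usually less fragile, since the launcher also picks up the dependency jars under lib/.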
Re: Hadoop 18.1 ls stalls
Unfortunately, I am not getting it now, because we have turned off our services (and I can't start them immediately). But I used to get retryInvocationHandler (or similar) in the stack, and an ls that took 4-5 secs had listed only 37 files. It was not the lsr option. That's what surprised me. -Sagar

Raghu Angadi wrote: Sagar Naik wrote: Thanks Raghu, *datapoints:* - So when I use the FsShell client, it gets into retry mode for the getFilesInfo() call and takes a long time. What does retry mode mean? - Also, when I do an ls operation, it takes seconds (4/5). - 1.6 million files, and the namenode heap (2400M) is mostly full (from the UI). When you say 'ls', how many entries does it return? (i.e. ls of one file, or -lsr of thousands of files, etc.) None of the IPC threads in your stack trace is doing any work.
Re: Hadoop Installation
Mithila Nagendra wrote: Hello, I'm currently a student at Arizona State University, Tempe, Arizona, pursuing my masters in Computer Science. I'm currently involved in a research project that makes use of Hadoop to run various map reduce functions. Hence I searched the web for the best way to install Hadoop on different nodes in a cluster, and I stumbled upon your website. I used the tutorial How to install Hadoop on a Linux system by Michael Noll to set up Hadoop on a UNIX system. I have a few questions related to it: 1. Does it matter that I'm installing Hadoop on UNIX and not on Linux - do I have to follow different steps? 2. The configuration for hadoop-site.xml - does it remain the same no matter what platform is being used? Do I just type the same thing out in the hadoop-site.xml file present in the Hadoop installation on the node? 3. When I try to start the daemons by executing the command conf/start-all.sh, I get an exception which says hadoop: user specified log class 'org.apache.commons.logging.impl.Log4JLogger' cannot be found or is not usable - this happens when the tasktracker is being started. What steps do I take to deal with this exception? start-all.sh is in {HADOOP_HOME}/bin/. What is your Hadoop version? I could send you a screenshot of the exception if you wish. It would be of immense help if you could provide answers for the above questions. Thank you! Looking forward to your reply. Best Regards, Mithila Nagendra
Re: Recovering NN failure when the SNN data is on another server
Take a backup of your dfs.data.dir (both on the namenode and the secondary namenode). If the secondary namenode is not running on the same machine as the namenode, copy over the fs.checkpoint.dir from the secondary onto the namenode. Start only the namenode. The importCheckpoint fails for a valid NN image; if you want to override the NN image with the SNN's image, delete the dfs.name.dir. For additional info: https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173 Please note I am not an expert - I just had a similar problem and this worked for me. -Sagar

Yossi Ittach wrote: Hi all, I apologize if the topic has already been answered - I couldn't find it. I'm trying to restart a failed NN using hadoop namenode -importCheckpoint, and the SNN is configured on another server. However, the NN keeps looking for the SNN data folder on the local server, and not on the SNN server. Any ideas? 10X! Vale et me ama Yossi
Re: Recovering NN failure when the SNN data is on another server
Let me correct myself. - Back up dfs.data.dir and dfs.name.dir on the NN and SNN. - If the secondary namenode is not running on the same machine as the namenode, copy over the fs.checkpoint.dir from the secondary onto the namenode. - If you want to override the NN image with the SNN's image, delete the dfs.name.dir (dfs.name.dir has been backed up). - Start only the namenode with -importCheckpoint. - For additional info: https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173 -Sagar

Sagar Naik wrote: Take a backup of your dfs.data.dir (both on the namenode and the secondary namenode). If the secondary namenode is not running on the same machine as the namenode, copy over the fs.checkpoint.dir from the secondary onto the namenode. Start only the namenode. The importCheckpoint fails for a valid NN image; if you want to override the NN image with the SNN's image, delete the dfs.name.dir. For additional info: https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173 Please note I am not an expert - I just had a similar problem and this worked for me. -Sagar

Yossi Ittach wrote: Hi all, I apologize if the topic has already been answered - I couldn't find it. I'm trying to restart a failed NN using hadoop namenode -importCheckpoint, and the SNN is configured on another server. However, the NN keeps looking for the SNN data folder on the local server, and not on the SNN server. Any ideas? 10X! Vale et me ama Yossi
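The corrected steps above can be sketched as a shell sequence. All the paths below are placeholders and must match the dfs.name.dir and fs.checkpoint.dir values in your hadoop-site.xml; the SNN hostname is an assumption for illustration.

```shell
# On the SNN host: copy the checkpoint dir over to the NN host
scp -r /path/to/fs.checkpoint.dir nn-host:/path/to/fs.checkpoint.dir

# On the NN host: back up, then remove, the old image so importCheckpoint
# does not refuse to overwrite a valid one
tar czf name-dir-backup.tgz /path/to/dfs.name.dir
rm -rf /path/to/dfs.name.dir

# Start only the namenode, importing the SNN checkpoint
bin/hadoop namenode -importCheckpoint
```

Run the datanodes only after the namenode has come up cleanly from the imported image.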
Re: Recovery of files in hadoop 18
Hey Lohit, Thanks for your help. I did as per your suggestion and imported from the secondary namenode. We have some corrupted files, but for some reason the namenode is still in safe mode. It has been an hour or so. The fsck report is:

Total size: 6954466496842 B (Total open files size: 543469222 B)
Total dirs: 1159
Total files: 1354155 (Files currently being written: 7673)
Total blocks (validated): 1375725 (avg. block size 5055128 B) (Total open file blocks (not validated): 50)
CORRUPT FILES: 1574
MISSING BLOCKS: 1574
MISSING SIZE: 1165735334 B
CORRUPT BLOCKS: 1574
Minimally replicated blocks: 1374151 (99.88559 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 26619 (1.9349071 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.977127
Corrupt blocks: 1574
Missing replicas: 26752 (0.65317154 %)

Do you think I should manually override safemode, delete all the corrupted files, and restart? -Sagar

lohit wrote: If you have enabled trash, they should be moved to the trash folder before being permanently deleted; restore them back (hope you have fs.trash.interval set). If not: Shut down the cluster. Take a backup of your dfs.data.dir (both on the namenode and secondary namenode). The secondary namenode should have the last updated image; try to start the namenode from that image - don't use the edits from the namenode yet. Try importCheckpoint as explained here: https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173. Start only the namenode and run fsck -files. It will throw a lot of messages saying you are missing blocks, but that's fine since you haven't started the datanodes yet. But if it shows your files, that means they haven't been deleted yet. This will give you a view of the system as of the last backup. Start the datanodes. If it's up, try running fsck and check the consistency of the system. You would lose all changes that have happened since the last checkpoint.
Hope that helps, Lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Friday, November 14, 2008 10:38:45 AM Subject: Recovery of files in hadoop 18 Hi, I accidentally deleted the root folder in our hdfs. I have stopped the hdfs. Is there any way to recover the files from the secondary namenode? Please help. -Sagar
Re: Recovery of files in hadoop 18
I had a secondary namenode running on the namenode machine. I deleted the dfs.name.dir, then ran bin/hadoop namenode -importCheckpoint, and restarted the dfs. I guess the deletion of name.dir will delete the edit logs. Can you please confirm that this will not lead to replaying the delete transactions? Thanks for the help/advice. -Sagar

lohit wrote: The NameNode would not come out of safe mode as it is still waiting for the datanodes to report those blocks which it expects. I should have added: try to get the full output of fsck - fsck path -openforwrite -files -blocks -locations. -openforwrite should tell you which files were open during the checkpoint; you might want to double check that is the case, that the files were being written at that moment. Maybe by looking at the filenames you could tell if they were part of a job which was running. For any missing block, you might also want to cross-verify on the datanode to see if it is really missing. Once you are convinced that those are the only corrupt files, which you can live with, start the datanodes. The namenode would still not come out of safemode as you have missing blocks; leave it for a while, run fsck, look around, and if everything is ok, bring the namenode out of safemode. I hope you had started this namenode with the old image and empty edits. You do not want your latest edits to be replayed, as they contain your delete transactions. Thanks, Lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Friday, November 14, 2008 12:11:46 PM Subject: Re: Recovery of files in hadoop 18 Hey Lohit, Thanks for your help. I did as per your suggestion and imported from the secondary namenode. We have some corrupted files, but for some reason the namenode is still in safe_mode. It has been an hour or so. The fsck report is : Total size:6954466496842 B (Total open files size: 543469222 B) Total dirs:1159 Total files: 1354155 (Files currently being written: 7673) Total blocks (validated): 1375725 (avg.
block size 5055128 B) (Total open file blocks (not validated): 50) CORRUPT FILES:1574 MISSING BLOCKS: 1574 MISSING SIZE: 1165735334 B CORRUPT BLOCKS: 1574 Minimally replicated blocks: 1374151 (99.88559 %) Over-replicated blocks:0 (0.0 %) Under-replicated blocks: 26619 (1.9349071 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor:3 Average block replication: 2.977127 Corrupt blocks:1574 Missing replicas: 26752 (0.65317154 %) Do you think, I should manually override the safemode and delete all the corrupted files and restart -Sagar lohit wrote: If you have enabled thrash. They should be moved to trash folder before permanently deleting them, restore them back. (hope you have that set fs.trash.interval) If not Shut down the cluster. Take backup of you dfs.data.dir (both on namenode and secondary namenode). Secondary namenode should have last updated image, try to start namenode from that image, dont use the edits from namenode yet. Try do importCheckpoint explained in here https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173. Start only namenode and run fsck -files. it will throw lot of messages saying you are missing blocks but thats fine since you havent started datanodes yet. But if it shows your files, that means they havent been deleted yet. This will give you a view of system of last backup. Start datanode If its up, try running fsck and check consistency of the sytem. you would lose all changes that has happened since the last checkpoint. Hope that helps, Lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Friday, November 14, 2008 10:38:45 AM Subject: Recovery of files in hadoop 18 Hi, I accidentally deleted the root folder in our hdfs. I have stopped the hdfs Is there any way to recover the files from secondary namenode Pl help -Sagar
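The full fsck invocation lohit suggests above looks like the following (run against a live namenode; `/` checks the whole namespace, or pass a narrower path - note the flag is spelled -locations):

```shell
# Full consistency report: list each file, its blocks, the datanodes holding
# them, and any files that were open for write when the image was checkpointed
bin/hadoop fsck / -openforwrite -files -blocks -locations
```

The -openforwrite entries are the likely source of the "Files currently being written" count in the report quoted above.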
Re: HDFS from non-hadoop Program
Can you make sure the files in the Hadoop conf dir are on the classpath of the Java program? -Sagar

Wasim Bari wrote: Hello, I am trying to access HDFS from a non-Hadoop program using Java. When I try to get the Configuration, it shows an exception both in DEBUG mode and the normal one: org.apache.hadoop.conf.Configuration: java.io.IOException: config() at org.apache.hadoop.conf.Configuration.init(Configuration.java:156) With the same configuration files, when I try to access from a single standalone program, it runs perfectly fine. Some people posted the same issue before, but no solution was posted. Has anyone found a solution? Thanks, wasim
Missing blocks from bin/hadoop text but fsck is all right
Hi, We have a strange problem getting out some of our files. bin/hadoop dfs -text dir/* gives me missing block exceptions: 08/11/04 10:45:09 [main] INFO dfs.DFSClient: Could not obtain block blk_6488385702283300787_1247408 from any node: java.io.IOException: No live nodes contain current block 08/11/04 10:45:12 [main] INFO dfs.DFSClient: Could not obtain block blk_6488385702283300787_1247408 from any node: java.io.IOException: No live nodes contain current block 08/11/04 10:45:15 [main] INFO dfs.DFSClient: Could not obtain block blk_6488385702283300787_1247408 from any node: java.io.IOException: No live nodes contain current block 08/11/04 10:45:18 [main] WARN dfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_6488385702283300787_1247408 file=some_filepath-1 at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1369) at java.io.DataInputStream.readShort(DataInputStream.java:295) at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:396) at org.apache.hadoop.fs.FsShell.access$1(FsShell.java:394) at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419) at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865) at org.apache.hadoop.fs.FsShell.text(FsShell.java:421) at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847) But when I do a bin/hadoop dfs -text some_filepath-1, I do get all the data. The fsck on the parent of this file revealed no problems.
jstack on FSshell revealed nothin much /Debugger attached successfully. Server compiler detected. JVM version is 10.0-b19 Deadlock Detection: No deadlocks found. Thread 3358: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$LeaseChecker.run() @bci=124, line=792 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=619 (Interpreted frame) Thread 3357: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=62, line=397 (Interpreted frame) - org.apache.hadoop.ipc.Client$Connection.run() @bci=63, line=440 (Interpreted frame) Thread 3342: (state = BLOCKED) Thread 3341: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 (Interpreted frame) - java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Interpreted frame) - java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Interpreted frame) Thread 3340: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame) - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Interpreted frame) Thread 3330: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(org.apache.hadoop.dfs.LocatedBlock) @bci=181, line=1470 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(long) @bci=133, line=1312 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(byte[], int, int) @bci=61, line=1417 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.read() @bci=7, line=1369 (Compiled frame) - java.io.DataInputStream.readShort() @bci=4, line=295 (Compiled frame) - org.apache.hadoop.fs.FsShell.forMagic(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=7, 
line=396 (Interpreted frame) - org.apache.hadoop.fs.FsShell.access$1(org.apache.hadoop.fs.FsShell, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=3, line=394 (Interpreted frame) - org.apache.hadoop.fs.FsShell$2.process(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=28, line=419 (Interpreted frame) - org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=40, line=1865 (Interpreted frame) - org.apache.hadoop.fs.FsShell.text(java.lang.String) @bci=26, line=421 (Interpreted frame) - org.apache.hadoop.fs.FsShell.doall(java.lang.String, java.lang.String[], int) @bci=246, line=1532 (Interpreted frame) - org.apache.hadoop.fs.FsShell.run(java.lang.String[]) @bci=586, line=1730 (Interpreted frame) - org.apache.hadoop.util.ToolRunner.run(org.apache.hadoop.conf.Configuration, org.apache.hadoop.util.Tool, java.lang.String[]) @bci=38, line=65 (Interpreted frame) -
Re: Missing blocks from bin/hadoop text but fsck is all right
Hi, We were hitting file descriptor limits :). Increased the limit and the problem got solved. Thanks, Jason. -Sagar

Sagar Naik wrote: Hi, We have a strange problem getting out some of our files. bin/hadoop dfs -text dir/* gives me missing block exceptions: 08/11/04 10:45:09 [main] INFO dfs.DFSClient: Could not obtain block blk_6488385702283300787_1247408 from any node: java.io.IOException: No live nodes contain current block 08/11/04 10:45:12 [main] INFO dfs.DFSClient: Could not obtain block blk_6488385702283300787_1247408 from any node: java.io.IOException: No live nodes contain current block 08/11/04 10:45:15 [main] INFO dfs.DFSClient: Could not obtain block blk_6488385702283300787_1247408 from any node: java.io.IOException: No live nodes contain current block 08/11/04 10:45:18 [main] WARN dfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_6488385702283300787_1247408 file=some_filepath-1 at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1369) at java.io.DataInputStream.readShort(DataInputStream.java:295) at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:396) at org.apache.hadoop.fs.FsShell.access$1(FsShell.java:394) at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419) at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865) at org.apache.hadoop.fs.FsShell.text(FsShell.java:421) at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847) But when I do a bin/hadoop dfs -text some_filepath-1,
I do get all the data the fsck on this parent of this file revealed no problems. jstack on FSshell revealed nothin much /Debugger attached successfully. Server compiler detected. JVM version is 10.0-b19 Deadlock Detection: No deadlocks found. Thread 3358: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$LeaseChecker.run() @bci=124, line=792 (Interpreted frame) - java.lang.Thread.run() @bci=11, line=619 (Interpreted frame) Thread 3357: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=62, line=397 (Interpreted frame) - org.apache.hadoop.ipc.Client$Connection.run() @bci=63, line=440 (Interpreted frame) Thread 3342: (state = BLOCKED) Thread 3341: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 (Interpreted frame) - java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Interpreted frame) - java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Interpreted frame) Thread 3340: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame) - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Interpreted frame) Thread 3330: (state = BLOCKED) - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(org.apache.hadoop.dfs.LocatedBlock) @bci=181, line=1470 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(long) @bci=133, line=1312 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(byte[], int, int) @bci=61, line=1417 (Interpreted frame) - org.apache.hadoop.dfs.DFSClient$DFSInputStream.read() @bci=7, line=1369 (Compiled frame) - java.io.DataInputStream.readShort() @bci=4, line=295 (Compiled frame) - 
org.apache.hadoop.fs.FsShell.forMagic(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=7, line=396 (Interpreted frame) - org.apache.hadoop.fs.FsShell.access$1(org.apache.hadoop.fs.FsShell, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=3, line=394 (Interpreted frame) - org.apache.hadoop.fs.FsShell$2.process(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=28, line=419 (Interpreted frame) - org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=40, line=1865 (Interpreted frame) - org.apache.hadoop.fs.FsShell.text(java.lang.String) @bci=26, line=421 (Interpreted frame) - org.apache.hadoop.fs.FsShell.doall(java.lang.String, java.lang.String[], int) @bci=246, line=1532 (Interpreted frame) - org.apache.hadoop.fs.FsShell.run(java.lang.String[]) @bci=586, line=1730 (Interpreted frame) - org.apache.hadoop.util.ToolRunner.run
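For reference, the per-process file descriptor limit mentioned above can be inspected and raised from the shell. The 65536 value below is just an example; pick a limit appropriate for your datanode and client workload.

```shell
# Show the current soft limit on open file descriptors for this shell
ulimit -n

# Raise it for the current session (needs a sufficient hard limit or root).
# For Hadoop daemons, set it persistently in the daemon init script or in
# /etc/security/limits.conf, then restart the daemons:
# ulimit -n 65536
```
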
Re: namenode failure
Please check your classpath entries. It looks like the hadoop-core jars in use before you shut down the cluster and after you changed hadoop-env.sh are different. -Sagar

Songting Chen wrote: Hi, I modified the classpath in hadoop-env.sh on the namenode and datanodes before shutting down the cluster. Then the problem appears: I cannot stop the Hadoop cluster at all. stop-all.sh shows no datanode/namenode, while all the Java processes are running. So I manually killed the Java processes. Now the namenode seems to be corrupted and always stays in safe mode, while the datanodes complain with the following weird error: 2008-10-27 17:28:44,141 FATAL org.apache.hadoop.dfs.DataNode: Incompatible build versions: namenode BV = ; datanode BV = 694836 2008-10-27 17:28:44,244 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible build versions: namenode BV = ; datanode BV = 694836 at org.apache.hadoop.dfs.DataNode.handshake(DataNode.java:403) at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:250) at org.apache.hadoop.dfs.DataNode.init(DataNode.java:190) at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2987) at org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2942) at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2950) at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3072) My question is how to recover from such a failure. And I guess the correct practice for changing the CLASSPATH is to shut down the cluster, apply the change, and restart the cluster. Thanks, -Songting
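One quick way to confirm a build mismatch like the "Incompatible build versions" error above is to compare what each node actually reports (assuming bin/hadoop is the installed launcher on every node):

```shell
# Run on the namenode and on each datanode; the version and build/revision
# lines printed must match across the whole cluster
bin/hadoop version
```

If the output differs between nodes, align the installations (and any hadoop-env.sh classpath overrides) before restarting.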
Hadoop 0.16: Task failures
Hi, We are using Hadoop 0.16, and on our heavy-IO job we are seeing a lot of these exceptions - more than 50% of tasks fail :(. There are two causes in the logs: a) Task task_200810092310_0003_m_20_0 failed to report status for 600 seconds. Killing! b) java.io.IOException: Could not get block locations. Aborting... at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1824) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571) Tasktracker log: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out 2008-10-10 05:50:10,485 INFO org.apache.hadoop.fs.DFSClient: Abandoning block blk_-5660296346325180487 . .. . Parent Died. Datanode log: 2008-10-10 00:00:23,066 INFO org.apache.hadoop.dfs.DataNode: PacketResponder blk_6562287961399683551 1 Exception java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.DataOutputStream.writeLong(DataOutputStream.java:207) at org.apache.hadoop.dfs.DataNode$PacketResponder.run(DataNode.java:1823) at java.lang.Thread.run(Thread.java:619) 2008-10-10 00:00:23,067 ERROR org.apache.hadoop.dfs.DataNode: /localhost ip/:50010:DataXceiver: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263) at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938) at java.lang.Thread.run(Thread.java:619) 2008-10-10 00:53:53,790 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_-3482274249842371655 java.net.SocketException: Connection reset 2008-10-10
00:53:53,791 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_-3482274249842371655 received exception java.net.SocketException: Connection reset 2008-10-10 00:53:53,791 ERROR org.apache.hadoop.dfs.DataNode: /localhost ip/:50010:DataXceiver: java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInputStream.java:168) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263) at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938) at java.lang.Thread.run(Thread.java:619) Any pointer would help us a lot -Sagar
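If the maps in (a) legitimately need more than 600 seconds between progress reports, the timeout can be raised in hadoop-site.xml. The fragment below is a sketch - mapred.task.timeout is in milliseconds, and the 1800000 value is just an example; the better fix is usually to call Reporter.progress() or setStatus() periodically from long-running record processing so the task keeps reporting.

```xml
<property>
  <name>mapred.task.timeout</name>
  <!-- default is 600000 ms (600 s); raise for legitimately slow maps -->
  <value>1800000</value>
</property>
```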
Re: Getting started questions
Dennis Kubes wrote: John Howland wrote: I've been reading up on Hadoop for a while now and I'm excited that I'm finally getting my feet wet with the examples + my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it. 1. I'm processing document collections, with the number of documents ranging from 10,000 - 10,000,000. What is the best way to store this data for effective processing? AFAIK hadoop doesn't do well with, although it can handle, a large number of small files. So it would be better to read in the documents and store them in SequenceFile or MapFile format. This would be similar to the way the Fetcher works in Nutch. 10M documents in a sequence/map file on DFS is comparatively small and can be handled efficiently. - The bodies of the documents usually range from 1K-100KB in size, but some outliers can be as big as 4-5GB. I would say store your document objects as Text objects, not sure if Text has a max size. I think it does but not sure what that is. If it does you can always store as a BytesWritable which is just an array of bytes. But you are going to have memory issues reading in and writing out that large of a record. - I will also need to store some metadata for each document which I figure could be stored as JSON or XML. - I'll typically filter on the metadata and then doing standard operations on the bodies, like word frequency and searching. It is possible to create an OutputFormat that writes out multiple files. You could also use a MapWritable as the value to store the document and associated metadata. Is there a canned FileInputFormat that makes sense? Should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM A writable is read into RAM so even treating it like a stream doesn't get around that. One thing you might want to consider is to tar up say X documents at a time and store that as a file in DFS. You would have many of these files. 
Then have an index that has the offsets of the files and their keys (document ids). That index can be passed as input into a MR job that can then go to DFS and stream out the file as you need it. The job will be slower because you are doing it this way but it is a solution to handling such large documents as streams. all at once? Am I right in thinking that I should treat each document as a record and map across them, or do I need to be more creative in what I'm mapping across? 2. Some of the tasks I want to run are pure map operations (no reduction), where I'm calculating new metadata fields on each document. To end up with a good result set, I'll need to copy the entire input record + new fields into another set of output files. Is there a better way? I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies) and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes. If you don't specify a reducer, the IdentityReducer is run which simply passes through output. One can set number of reducers to zero and reduce phase will not take place. 3. I'm sure this is not a new idea, but I haven't seen anything regarding it... I'll need to run several MR jobs as a pipeline... is there any way for the map tasks in a subsequent stage to begin processing data from previous stage's reduce task before that reducer has fully finished? Yup, just use FileOutputFormat.getOutputPath(previousJobConf); Dennis Whatever insight folks could lend me would be a big help in crossing the chasm from the Word Count and associated examples to something more real. A whole heap of thanks in advance, John
Re: Aborting Map Function
Chaman Singh Verma wrote: Hello, I am developing an application with MapReduce, and in it, whenever some MapTask condition is met, I would like to broadcast to all the other MapTasks to abort their work. I am not quite sure whether such broadcast functionality currently exists in Hadoop MapReduce. Could someone give some hints? Although extending this functionality may be easy, since all the slaves periodically ping the master, I was just thinking of piggybacking one bit of information from the slave to the master, and the master may send this information to all the slaves in the next round. Any suggestions on this approach? Thanks. With Regards - Chaman Singh Verma, Poona, India

One possible solution could be to use Counters (http://hadoop.apache.org/core/docs/r0.16.2/api/org/apache/hadoop/mapred/Counters.html). Though it is advisable to look into the details of its implementation, and see if it can be used as a multi-process shared variable.