Re: "Could not obtain block" error
Thanks Raghu. But both the block file and the .meta file are zero-sized files! Thanks, Murali On 10/30/08 12:16 AM, "Raghu Angadi" <[EMAIL PROTECTED]> wrote: > > One workaround for you is to go to the datanode and remove the .crc > file for this block (find /datanodedir -name blk_5994030096182059653\*). > Be careful not to remove the block file itself.
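Raghu's workaround boils down to: locate every file belonging to the block id under the datanode's data directory, check what you found, and delete only the .crc file. The locate-and-inspect step can be sketched in plain Java (the directory layout and block id here are only illustrative; the real datanode layout is what `find /datanodedir -name 'blk_<id>*'` walks):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class FindBlockFiles {
    // Recursively collect files whose names start with "blk_<id>",
    // mirroring: find /datanodedir -name 'blk_<id>*'
    static List<File> findBlock(File dir, String blockId) {
        List<File> hits = new ArrayList<File>();
        File[] entries = dir.listFiles();
        if (entries == null) return hits;
        for (File f : entries) {
            if (f.isDirectory()) hits.addAll(findBlock(f, blockId));
            else if (f.getName().startsWith("blk_" + blockId)) hits.add(f);
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Simulated datanode directory with a block file and its .meta file.
        File dir = Files.createTempDirectory("dn").toFile();
        new File(dir, "blk_5994030096182059653").createNewFile();
        new File(dir, "blk_5994030096182059653_1001.meta").createNewFile();
        for (File f : findBlock(dir, "5994030096182059653")) {
            // Printing name and length makes zero-sized files obvious.
            System.out.println(f.getName() + " " + f.length());
        }
    }
}
```

Inspecting lengths before deleting anything is the point: if the block file and .meta file are both zero-sized, as in Murali's case, removing the .crc file alone will not recover the block.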
Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values
I uploaded a patch that does a secondary sort. Take a look at: https://issues.apache.org/jira/browse/HADOOP-4545 It reads input with two numbers per line, such as:

-1 -4
-3 23
5 10
-1 -2
-1 300
-1 10
4 1
4 2
4 10
4 -1
4 -10
10 20
10 30
10 25

and produces output like (with 2 reduces):

part-0:
4 -10
4 -1
4 1
4 2
4 10
10 20
10 25
10 30

part-1:
-3 23
-1 -4
-1 -2
-1 10
-1 300
5 10

-- Owen
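For readers wondering what the value-grouping comparator buys you: conceptually, records are sorted by the composite (key, value) pair, while reduce groups are formed by the key alone, so each reduce() call sees its values already in sorted order. A cluster-free sketch of that idea in plain Java (this is only an illustration of the mechanism, not the patch's actual code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {
    // Each record is an (int key, int value) pair, as in the example input.
    static Map<Integer, List<Integer>> run(int[][] records) {
        // 1. Sort by the composite (key, value) -- what the sort
        //    comparator achieves in the real job.
        Arrays.sort(records, new Comparator<int[]>() {
            public int compare(int[] a, int[] b) {
                return a[0] != b[0] ? Integer.compare(a[0], b[0])
                                    : Integer.compare(a[1], b[1]);
            }
        });
        // 2. Group by key only -- the value-grouping comparator -- so one
        //    reduce() call sees all values of a key, already sorted.
        Map<Integer, List<Integer>> grouped = new LinkedHashMap<Integer, List<Integer>>();
        for (int[] r : records) {
            if (!grouped.containsKey(r[0])) grouped.put(r[0], new ArrayList<Integer>());
            grouped.get(r[0]).add(r[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        int[][] input = { {4, 10}, {4, -1}, {4, -10}, {10, 30}, {10, 20}, {4, 1} };
        System.out.println(run(input));
        // prints {4=[-10, -1, 1, 10], 10=[20, 30]}
    }
}
```

In the real job the partitioner must also hash on the key alone, so that all composite keys sharing a natural key land on the same reduce, which is why part-0 and part-1 above each hold complete key groups.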
Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?
Zhengguo 'Mike' SUN wrote: Hi, Peeyush, I guess I didn't make myself clear. I am trying to run a Hadoop pipes job with a combination of Java classes and C++ classes. So the command I am using is like: hadoop pipes -conf myconf.xml -inputformat MyInputFormat.class -input in -output out And it threw ClassNotFoundException for my InputSplit class. As I understand "hadoop jar" is used to run a jar file, which is not my case. And there is a -jar option in "hadoop pipes". But, unfortunately, it is not working for me. So the question I want to ask is how to include customized Java classes, such as MyInputSplit, in a pipes job? You are right. -jar option also doesn't add the jar file to classpath on the client-side. You can use -libjars option with 0.19. Then, the command looks like hadoop pipes -conf myconf.xml -libjars -inputformat MyInputFormat.class -input in -output out I don't see a way to do this in 0.17.*, one way could be you add it explicitly to the classpath on client-side, and add it through the option -jar for the job. Thanks, Amareshwari Thanks, Mike From: Peeyush Bishnoi <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org; core-user@hadoop.apache.org Sent: Wednesday, October 29, 2008 12:52:18 PM Subject: RE: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job? Hello Zhengguo , Yes , -libjars is the new feature in Hadoop. This feature has been available from Hadoop-0.17.x , but it is more stable from hadoop 0.18.x example to use -libjars... hadoop jar -libjars ... Thanks , --- Peeyush -Original Message- From: Zhengguo 'Mike' SUN [mailto:[EMAIL PROTECTED] Sent: Wed 10/29/2008 9:22 AM To: core-user@hadoop.apache.org Subject: Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job? Hi, Amareshwari, Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I see is -jar, which didn't work for me. And besides passing them as jar file, is there any other ways to do that? 
Thanks Mike From: Amareshwari Sriramadasu <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Tuesday, October 28, 2008 11:58:33 PM Subject: Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job? Hi, How are you passing your classes to the pipes job? If you are passing them as a jar file, you can use -libjars option. From branch 0.19, the libjar files are added to the client classpath also. Thanks Amareshwari Zhengguo 'Mike' SUN wrote: Hi, I implemented customized classes for InputFormat, InputSplit and RecordReader in Java and was trying to use them in a C++ pipes job. The customized InputFormat class could be included using the -inputformat option, but it threw ClassNotFoundException for my customized InputSplit class. It seemed the classpath has not been correctly set. Is there any way that let me include my customized classes in a pipes job?
Re: TaskTrackers disengaging from JobTracker
> > I wrote a patch to address the NPE in JobTracker.killJob() and compiled > it against TRUNK. I've put this on the cluster and it's now been holding > steady for the last hour or so.. so that plus whatever other differences > there are between 18.1 and TRUNK may have fixed things. (I'll submit the > patch to the JIRA as soon as it finishes cranking against the JUnit tests) > Aaron, I don't think this is a solution to the problem you are seeing. The IPC handlers are tolerant to exceptions. In particular, they must not die in the event of an exception during RPC processing. Could you please get a stack trace of the JobTracker threads (without your patch) when the TTs are unable to talk to it. Access the URL http://<jobtracker-host>:<http-port>/stacks That will tell us what the handlers are up to. > - Aaron > > > Devaraj Das wrote: >> >> On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote: >> >>> The system load and memory consumption on the JT are both very close to >>> "idle" states -- it's not overworked, I don't think >>> >>> I may have an idea of the problem, though.
Digging back up a ways into the >>> JT logs, I see this: >>> >>> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server >>> handler 4 on 9001, call killJob(job_200810290855_0025) from >>> 10.1.143.245:48253: error: java.io.IOException: >>> java.lang.NullPointerException >>> java.io.IOException: java.lang.NullPointerException >>> at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843) >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45>>> ) >>> at >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl >>> .j >>> ava:37) >>> at java.lang.reflect.Method.invoke(Method.java:599) >>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) >>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) >>> >>> >>> >>> This exception is then repeated for all the IPC server handlers. So I think >>> the problem is that all the handler threads are dying one by one due to this >>> NPE. >>> >> >> This should not happen. IPC handler catches Throwable and handles that. >> Could you give more details like the kind of jobs (long/short) you are >> running, how many tasks they have, etc. >> >>> This something I can fix myself, or is a patch available? >>> >>> - Aaron >>> >>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: >>> It's possible that the JobTracker is under duress and unable to respond to the TaskTrackers... what do the JobTracker logs say? Arun On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote: Hi all, > I'm working with a 30 node Hadoop cluster that has just started > demonstrating some weird behavior. It's run without incident for a few > weeks.. and now: > > The cluster will run smoothly for 90--120 minutes or so, handling jobs > continually during this time. Then suddenly it will be the case that all > 29 > TaskTrackers will get disconnected from the JobTracker. 
All the tracker > daemon processes are still running on each machine; but the JobTracker > will > say "0 nodes available" on the web status screen. Restarting MapReduce > fixes > this for another 90--120 minutes. > > This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, > but > we're running on 0.18.1. > > I found this in a TaskTracker log: > > 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught > exception: java.io.IOException: Call failed on local exception > at java.lang.Throwable.(Throwable.java:67) > at org.apache.hadoop.ipc.Client.call(Client.java:718) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) > at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) > at > > >> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045> >> >> >> ) > at > org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) > at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) > at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) > at sun.nio.ch.IOUtil.read(IOUtil.java:207) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) > at > > org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream > .j > ava:55) > at > > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:14 > 0) > at > org.apache.hadoop.net.SocketInputStream.read(Sock
Re: TaskTrackers disengaging from JobTracker
Just as I wrote that, Murphy's law struck :) This did not fix the issue after all. I think the problem is occurring because a huge amount of network bandwidth is being consumed by the jobs. What settings (timeouts, thread counts, etc), if any, ought I dial up to correct for this? Thanks, - Aaron Aaron Kimball wrote: It's a cluster being used for a university course; there are 30 students all running code which (to be polite) probably tests the limits of Hadoop's failure recovery logic. :) The current assignment is PageRank over Wikipedia; a 20 GB input corpus. Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50 reduce tasks. I wrote a patch to address the NPE in JobTracker.killJob() and compiled it against TRUNK. I've put this on the cluster and it's now been holding steady for the last hour or so.. so that plus whatever other differences there are between 18.1 and TRUNK may have fixed things. (I'll submit the patch to the JIRA as soon as it finishes cranking against the JUnit tests) - Aaron Devaraj Das wrote: On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote: The system load and memory consumption on the JT are both very close to "idle" states -- it's not overworked, I don't think I may have an idea of the problem, though. 
Digging back up a ways into the JT logs, I see this: 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: error: java.io.IOException: java.lang.NullPointerException java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j ava:37) at java.lang.reflect.Method.invoke(Method.java:599) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) This exception is then repeated for all the IPC server handlers. So I think the problem is that all the handler threads are dying one by one due to this NPE. This should not happen. IPC handler catches Throwable and handles that. Could you give more details like the kind of jobs (long/short) you are running, how many tasks they have, etc. This something I can fix myself, or is a patch available? - Aaron On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: It's possible that the JobTracker is under duress and unable to respond to the TaskTrackers... what do the JobTracker logs say? Arun On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote: Hi all, I'm working with a 30 node Hadoop cluster that has just started demonstrating some weird behavior. It's run without incident for a few weeks.. and now: The cluster will run smoothly for 90--120 minutes or so, handling jobs continually during this time. Then suddenly it will be the case that all 29 TaskTrackers will get disconnected from the JobTracker. All the tracker daemon processes are still running on each machine; but the JobTracker will say "0 nodes available" on the web status screen. 
Restarting MapReduce fixes this for another 90--120 minutes. This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but we're running on 0.18.1. I found this in a TaskTracker log: 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception at java.lang.Throwable.(Throwable.java:67) at org.apache.hadoop.ipc.Client.call(Client.java:718) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045>>> ) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) at sun.nio.ch.IOUtil.read(IOUtil.java:207) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.j ava:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) at java.io.FilterInputStream.read(FilterInputStream.java:127) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.re
Re: TaskTrackers disengaging from JobTracker
It's a cluster being used for a university course; there are 30 students all running code which (to be polite) probably tests the limits of Hadoop's failure recovery logic. :) The current assignment is PageRank over Wikipedia; a 20 GB input corpus. Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50 reduce tasks. I wrote a patch to address the NPE in JobTracker.killJob() and compiled it against TRUNK. I've put this on the cluster and it's now been holding steady for the last hour or so.. so that plus whatever other differences there are between 18.1 and TRUNK may have fixed things. (I'll submit the patch to the JIRA as soon as it finishes cranking against the JUnit tests) - Aaron Devaraj Das wrote: On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote: The system load and memory consumption on the JT are both very close to "idle" states -- it's not overworked, I don't think I may have an idea of the problem, though. Digging back up a ways into the JT logs, I see this: 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: error: java.io.IOException: java.lang.NullPointerException java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j ava:37) at java.lang.reflect.Method.invoke(Method.java:599) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) This exception is then repeated for all the IPC server handlers. So I think the problem is that all the handler threads are dying one by one due to this NPE. This should not happen. IPC handler catches Throwable and handles that. 
Could you give more details like the kind of jobs (long/short) you are running, how many tasks they have, etc. This something I can fix myself, or is a patch available? - Aaron On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: It's possible that the JobTracker is under duress and unable to respond to the TaskTrackers... what do the JobTracker logs say? Arun On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote: Hi all, I'm working with a 30 node Hadoop cluster that has just started demonstrating some weird behavior. It's run without incident for a few weeks.. and now: The cluster will run smoothly for 90--120 minutes or so, handling jobs continually during this time. Then suddenly it will be the case that all 29 TaskTrackers will get disconnected from the JobTracker. All the tracker daemon processes are still running on each machine; but the JobTracker will say "0 nodes available" on the web status screen. Restarting MapReduce fixes this for another 90--120 minutes. This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but we're running on 0.18.1. 
I found this in a TaskTracker log: 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception at java.lang.Throwable.(Throwable.java:67) at org.apache.hadoop.ipc.Client.call(Client.java:718) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045>>> ) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) at sun.nio.ch.IOUtil.read(IOUtil.java:207) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.j ava:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) at java.io.FilterInputStream.read(FilterInputStream.java:127) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272>>> ) at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) at java.io.BufferedInputStream.read(BufferedInputStream.java:248) at java.io.DataInputStream.readInt(DataInputStream.java:381) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499) at org.apache.hadoop.ipc.Client$Connection.run(Client.java
Re: TaskTrackers disengaging from JobTracker
On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote: > The system load and memory consumption on the JT are both very close to > "idle" states -- it's not overworked, I don't think > > I may have an idea of the problem, though. Digging back up a ways into the > JT logs, I see this: > > 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 4 on 9001, call killJob(job_200810290855_0025) from > 10.1.143.245:48253: error: java.io.IOException: > java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j > ava:37) > at java.lang.reflect.Method.invoke(Method.java:599) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) > > > > This exception is then repeated for all the IPC server handlers. So I think > the problem is that all the handler threads are dying one by one due to this > NPE. > This should not happen. IPC handler catches Throwable and handles that. Could you give more details like the kind of jobs (long/short) you are running, how many tasks they have, etc. > This something I can fix myself, or is a patch available? > > - Aaron > > On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: > >> It's possible that the JobTracker is under duress and unable to respond to >> the TaskTrackers... what do the JobTracker logs say? >> >> Arun >> >> >> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote: >> >> Hi all, >>> >>> I'm working with a 30 node Hadoop cluster that has just started >>> demonstrating some weird behavior. It's run without incident for a few >>> weeks.. 
and now: >>> >>> The cluster will run smoothly for 90--120 minutes or so, handling jobs >>> continually during this time. Then suddenly it will be the case that all >>> 29 >>> TaskTrackers will get disconnected from the JobTracker. All the tracker >>> daemon processes are still running on each machine; but the JobTracker >>> will >>> say "0 nodes available" on the web status screen. Restarting MapReduce >>> fixes >>> this for another 90--120 minutes. >>> >>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, >>> but >>> we're running on 0.18.1. >>> >>> I found this in a TaskTracker log: >>> >>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught >>> exception: java.io.IOException: Call failed on local exception >>> at java.lang.Throwable.(Throwable.java:67) >>> at org.apache.hadoop.ipc.Client.call(Client.java:718) >>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) >>> at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) >>> at >>> >>> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045>>> ) >>> at >>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) >>> at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) >>> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) >>> Caused by: java.io.IOException: Connection reset by peer >>> at sun.nio.ch.FileDispatcher.read0(Native Method) >>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) >>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) >>> at sun.nio.ch.IOUtil.read(IOUtil.java:207) >>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) >>> at >>> >>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.j >>> ava:55) >>> at >>> >>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140) >>> at >>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) >>> at >>> 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) >>> at java.io.FilterInputStream.read(FilterInputStream.java:127) >>> at >>> >>> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272>>> ) >>> at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) >>> at java.io.BufferedInputStream.read(BufferedInputStream.java:248) >>> at java.io.DataInputStream.readInt(DataInputStream.java:381) >>> at >>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499) >>> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441) >>> >>> >>> As well as a few of these warnings: >>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON >>> THREADS >>> ((40-40+0)<1) on [EMAIL PROTECTED]:50060 >>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF >>> THREADS: [EMAIL PROTECTED]:50060 >>> >>> >>> >>> The NameNode and DataNodes are completely fine. Can't be a DNS issue, >>> because all DNS is served
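The tuning question raised earlier in this thread (which timeouts and thread counts to dial up under heavy network load) has no single answer; a hedged starting point, using 0.18-era property names, would look like the fragment below. The values are illustrative only, not tuned recommendations:

```xml
<!-- hadoop-site.xml: illustrative values, assuming 0.18.x property names -->
<property>
  <name>mapred.job.tracker.handler.count</name>
  <value>20</value> <!-- more IPC handler threads on the JobTracker (default 10) -->
</property>
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>20</value> <!-- let TaskTrackers retry the heartbeat connection longer (default 10) -->
</property>
<property>
  <name>tasktracker.http.threads</name>
  <value>80</value> <!-- Jetty threads; the "LOW ON THREADS ((40-40+0)<1)" warning points at the default of 40 -->
</property>
```

The last setting targets the OUT OF THREADS warnings quoted above, which come from the TaskTracker's embedded HTTP server that serves map output to reducers.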
Debugging / Logging in Hadoop?
I'm curious what the best method for debugging and logging in Hadoop is. I put together a small cluster today and a simple application to process log files. While it worked well, I had trouble getting logging information out. Is there any way to attach a debugger, or get log4j to write a log file? I tried setting up a Logger in the class I used for the map/reduce, but I had no luck. Thanks.
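On the logging part of the question: inside a task, anything written to stdout/stderr or through Hadoop's commons-logging Log ends up in the per-task logs, browsable from the task details page of the web UI. The file-logging pattern itself, outside a cluster, looks like the sketch below; it uses the JDK's own logging purely as a stand-in for a log4j FileAppender, so the class and handler names are not Hadoop APIs:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class MapperLogging {
    // One logger per class, the same convention log4j/commons-logging use.
    private static final Logger LOG = Logger.getLogger(MapperLogging.class.getName());

    public static void main(String[] args) throws IOException {
        // Send log records to a file, as a log4j FileAppender would.
        String logFile = Files.createTempFile("task", ".log").toString();
        FileHandler handler = new FileHandler(logFile);
        handler.setFormatter(new SimpleFormatter());
        LOG.addHandler(handler);

        LOG.info("map() processing a record");   // lands in the log file
        handler.close();

        System.out.println(Files.readAllLines(Paths.get(logFile)));
    }
}
```

Inside a real task you would not configure appenders yourself: the TaskTracker's log4j setup routes the task's Log output to the userlogs directory for you, which is usually why a hand-built Logger appears to produce nothing.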
Re: Integration with compute cluster
Hi, You want to store your logs in HDFS (by copying them from your production machines, presumably) and then write custom MapReduce jobs that know how to process, correlate data in the logs, and output data in some format that suits you. What you do with that output is then up to you. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: shahab mehmandoust <[EMAIL PROTECTED]> > To: core-user@hadoop.apache.org > Sent: Wednesday, October 29, 2008 7:29:35 PM > Subject: Integration with compute cluster > > Hi, > > We have one prod server with web logs and a db server. We want to correlate > the data in the logs and the db. With a hadoop implementation (for scaling > up later), do we need to transfer the data to a machine (designated as the > compute cluster: http://hadoop.apache.org/core/images/architecture.gif), run > map/reduce there, and then transfer the output elsewhere for our analysis? > > I'm confused about the compute cluster; does it encompass the data sources > (here the prod server and the db)? > > Thanks, > Shahab
Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?
Hi, Peeyush, I guess I didn't make myself clear. I am trying to run a Hadoop pipes job with a combination of Java classes and C++ classes. So the command I am using is like: hadoop pipes -conf myconf.xml -inputformat MyInputFormat.class -input in -output out And it threw ClassNotFoundException for my InputSplit class. As I understand "hadoop jar" is used to run a jar file, which is not my case. And there is a -jar option in "hadoop pipes". But, unfortunately, it is not working for me. So the question I want to ask is how to include customized Java classes, such as MyInputSplit, in a pipes job? Thanks, Mike From: Peeyush Bishnoi <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org; core-user@hadoop.apache.org Sent: Wednesday, October 29, 2008 12:52:18 PM Subject: RE: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job? Hello Zhengguo , Yes , -libjars is the new feature in Hadoop. This feature has been available from Hadoop-0.17.x , but it is more stable from hadoop 0.18.x example to use -libjars... hadoop jar -libjars ... Thanks , --- Peeyush -Original Message- From: Zhengguo 'Mike' SUN [mailto:[EMAIL PROTECTED] Sent: Wed 10/29/2008 9:22 AM To: core-user@hadoop.apache.org Subject: Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job? Hi, Amareshwari, Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I see is -jar, which didn't work for me. And besides passing them as jar file, is there any other ways to do that? Thanks Mike From: Amareshwari Sriramadasu <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Tuesday, October 28, 2008 11:58:33 PM Subject: Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job? Hi, How are you passing your classes to the pipes job? If you are passing them as a jar file, you can use -libjars option. From branch 0.19, the libjar files are added to the client classpath also. 
Thanks Amareshwari Zhengguo 'Mike' SUN wrote: > Hi, > > I implemented customized classes for InputFormat, InputSplit and RecordReader > in Java and was trying to use them in a C++ pipes job. The customized > InputFormat class could be included using the -inputformat option, but it > threw ClassNotFoundException for my customized InputSplit class. It seemed > the classpath has not been correctly set. Is there any way that let me > include my customized classes in a pipes job? > > > > >
Re: Datanode not detecting full disk
Hi Raghu, Each DN machine has 3 partitions, e.g.:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              20G  8.0G   11G  44% /
/dev/sda3             1.4T  756G  508G  60% /data
tmpfs                 3.9G     0  3.9G   0% /dev/shm

All of the paths in hadoop-site.xml point to /data, which is the partition that filled up to 100% (I deleted a bunch of files from HDFS since then). So I guess the question is whether the DN looks at just the partition its data directory is on, or all partitions when it determines disk usage. -- Stefan > From: Raghu Angadi <[EMAIL PROTECTED]> > Reply-To: > Date: Wed, 29 Oct 2008 11:57:07 -0700 > To: > Subject: Re: Datanode not detecting full disk > > Stefan Will wrote: >> Hi Jeff, >> >> Yeah, it looks like I'm running into the issues described in the bug. I'm >> running 0.18.1 on CentOS 5 by the way. Measuring available disk space >> appears to be harder than I thought ... and here I was under the impression >> the percentage in df was a pretty clear indicator of how full the disk is >> ;-) >> >> How did you guys solve/work around this ? > > How many partitions do you have? If it is just one and NameNode thinks > it has space though 'available' in df shows very less or no space, then > you need to file a jira. There should be no case where DN reports more > space than what 'available' field in 'df' shows. > > But if you have more partitions and only some of them are full, then it > is a different issue.. which should still be fixed. > > Raghu. > >> -- Stefan >> >> >>> From: Jeff Hammerbacher <[EMAIL PROTECTED]> >>> Reply-To: >>> Date: Mon, 27 Oct 2008 12:40:08 -0700 >>> To: >>> Subject: Re: Datanode not detecting full disk >>> >>> Hey Stefan, >>> >>> We used to have trouble with this issue at Facebook. What version are >>> you running? You might get more information on this ticket: >>> https://issues.apache.org/jira/browse/HADOOP-2991.
>>> Regards, >>> Jeff >>> >>> On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote: Each of my datanodes has a system and a data partition, with dfs.data.dir pointed to the data partition. The data partition just filled up to 100% on all of my nodes (as evident via df), but the NameNode web UI still shows them only 88-94% full (interestingly, the numbers differ even though the machines are configured identically). I thought the datanodes used df to determine free space? How is the storage utilization determined? -- Stefan
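On the "which partition counts" question: free space is a per-filesystem notion, i.e. the `df` row for whichever partition contains the configured directory, and each dfs.data.dir is measured independently. The underlying per-partition query can be sketched in plain Java (a conceptual sketch only; the datanode of this era actually shells out to `df`/`du` rather than using these calls):

```java
import java.io.File;

public class PartitionSpace {
    // Report capacity and usable space for the filesystem containing 'path',
    // analogous to one row of `df` output for that mount point.
    static String report(File path) {
        long total = path.getTotalSpace();    // like df's Size column
        long usable = path.getUsableSpace();  // like df's Avail column
        return String.format("%s: total=%d bytes, usable=%d bytes", path, total, usable);
    }

    public static void main(String[] args) {
        // Each configured dfs.data.dir would be checked independently;
        // here we just report on the current directory's filesystem.
        System.out.println(report(new File(".")));
    }
}
```

So a DN whose data directory sits on /data sees only /data's numbers, not the root partition's, and mismatches between `df` and what the NameNode shows are the subject of HADOOP-2991 mentioned above.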
Integration with compute cluster
Hi, We have one prod server with web logs and a db server. We want to correlate the data in the logs and the db. With a hadoop implementation (for scaling up later), do we need to transfer the data to a machine (designated as the compute cluster: http://hadoop.apache.org/core/images/architecture.gif), run map/reduce there, and then transfer the output elsewhere for our analysis? I'm confused about the compute cluster; does it encompass the data sources (here the prod server and the db)? Thanks, Shahab
Any examples using Hadoop Pipes with binary SequenceFiles?
Hi folks; I'm interested in reading binary data, running it through some C++ code, and writing the result as binary data. It looks like SequenceFiles and Pipes are the way to do it, but I can't find any examples or docs beyond the API specification. Can someone point me to an example where this is done? Thanks, -Joel
Re: SecondaryNameNode on separate machine
SecondaryNameNode uses the http protocol to transfer the image and the edits from the primary name-node and vice versa. So the secondary does not access local files on the primary directly. The primary NN should know the secondary's http address. And the secondary NN needs to know both fs.default.name and dfs.http.address of the primary. In general we usually create one configuration file hadoop-site.xml and copy it to all other machines. So you don't need to set up different values for all servers. Regards, --Konstantin Tomislav Poljak wrote: Hi, I'm not clear on how SecondaryNameNode communicates with the NameNode (if deployed on a separate machine). Does SecondaryNameNode use a direct connection (over some port and protocol) or is it enough for SecondaryNameNode to have access to data which NameNode writes locally on disk? Tomislav On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote: I think a lot of the confusion comes from this thread: http://www.nabble.com/NameNode-failover-procedure-td11711842.html Particularly because the wiki was updated with wrong information, not maliciously I'm sure. This information is now gone for good. Otis, your solution is pretty much like the one given by Dhruba Borthakur and augmented by Konstantin Shvachko later in the thread but I never did it myself. One thing should be clear though, the NN is and will remain a SPOF (just like HBase's Master) as long as a distributed manager service (like Zookeeper) is not plugged into Hadoop to help with failover. J-D On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic < [EMAIL PROTECTED]> wrote: Hi, So what is the "recipe" for avoiding NN SPOF using only what comes with Hadoop? From what I can tell, I think one has to do the following two things: 1) configure primary NN to save namespace and xa logs to multiple dirs, one of which is actually on a remotely mounted disk, so that the data actually lives on a separate disk on a separate box.
This saves namespace and xa logs on multiple boxes in case of primary NN hardware failure. 2) configure the secondary NN to periodically merge fsimage+edits and create the fsimage checkpoint. This really is a second NN process running on another box. It sounds like this secondary NN has to somehow have access to the fsimage & edits files from the primary NN server. http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode does not describe the best practice around that - the recommended way to give the secondary NN access to the primary NN's fsimage and edits files. Should one mount a disk from the primary NN box to the secondary NN box to get access to those files? Or is there a simpler way? In any case, this checkpoint is just a merge of fsimage+edits files and again is there in case the box with the primary NN dies. That's what's described on http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode more or less. Is this sufficient, or are there other things one has to do to eliminate NN SPOF? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Jean-Daniel Cryans <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Tuesday, October 28, 2008 8:14:44 PM Subject: Re: SecondaryNameNode on separate machine Tomislav. Contrary to popular belief the secondary namenode does not provide failover, it's only used to do what is described here : http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode So the term "secondary" does not mean "a second one" but is more like "a second part of". J-D On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote: Hi, I'm trying to implement NameNode failover (or at least NameNode local data backup), but it is hard since there is no official documentation.
Pages on this subject are created, but still empty: http://wiki.apache.org/hadoop/NameNodeFailover http://wiki.apache.org/hadoop/SecondaryNameNode I have been browsing the web and the hadoop mailing list to see how this should be implemented, but I got even more confused. People are asking whether we even need SecondaryNameNode at all (since NameNode can write local data to multiple locations, so one of those locations can be a mounted disk from another machine). I think I understand the motivation for SecondaryNameNode (to create a snapshot of NameNode data every n seconds/hours), but setting up (deploying and running) SecondaryNameNode on a different machine than the NameNode is not as trivial as I expected. First I found that if I need to run SecondaryNameNode on a machine other than the NameNode, I should change the masters file on the NameNode (change localhost to the SecondaryNameNode host) and set some properties in hadoop-site.xml on the SecondaryNameNode (fs.default.name, fs.checkpoint.dir, fs.checkpoint.period etc.) This was enough to start SecondaryNameNode when starting NameNode with bin/start-dfs.sh, but it didn't create an image on the SecondaryNameNode. Then I found that I need to set dfs
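For reference, the properties named in this thread can be collected into a hadoop-site.xml sketch for the SecondaryNameNode machine. This is a hypothetical fragment: the hostnames, ports, and checkpoint path below are placeholders, not values taken from the original messages.

```xml
<!-- hadoop-site.xml on the SecondaryNameNode host (sketch; values are placeholders) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>   <!-- the primary NN -->
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/hadoop/checkpoint</value>
  </property>
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>   <!-- seconds between checkpoints -->
  </property>
  <!-- on the primary NN's hadoop-site.xml, so the secondary can fetch
       the image and edits over HTTP: -->
  <property>
    <name>dfs.http.address</name>
    <value>namenode-host:50070</value>
  </property>
</configuration>
```

Since Konstantin suggests keeping a single hadoop-site.xml copied to all machines, these properties can simply live in that shared file.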
Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values
David, You can address this by using these two settings on your JobConf, i.e. conf.setOutputValueGroupingComparator(YourKeyComparator.class); conf.setOutputKeyComparatorClass(YourKeyAndValueComparator.class); Both classes should extend WritableComparator. The OutputValueGroupingComparator is the one that determines the order in which the keys are passed to your reduce function. The OutputKeyComparatorClass determines the order in which the values are returned from your iterator in your reduce function. Hope that helps, Tony On Wed, Oct 29, 2008 at 10:59 AM, David M. Coe <[EMAIL PROTECTED]> wrote: > Would the input using this method be sorted before the reducer? I have > implemented this and only the keycomparatorclass is called. This gives > the effect that if I output the data here it is sorted. However, it > sorts comparing both the right and the left as you suggest, so the > reducer is given unique right-left instead of being given right that > happen to be sorted using the left. > > What I get: > > text file -> > map: -> 0 0 -> reducer >0 1 -> reducer >8 0 -> reducer >8 1 -> reducer > > What I'd like: > > text file -> > map: *** > -> 0 0 \ > -> 0 1 | -> reducer > -> 0 8 / > *** > -> 8 0 \ -> reducer > -> 8 1 / > *** > -> 123 3 -> reducer > > What is the best way to do this? The keys must be secondary sorted > before the reduce, but I cannot think of a way to do this. > > Thank you. > > > > Owen O'Malley wrote: > > > > On Oct 28, 2008, at 7:53 AM, David M. Coe wrote: > > > >> My mapper is Mapper<LongWritable, Text, IntWritable, IntWritable> and my > >> reducer is the identity.
I configure the program using: > >> > >> conf.setOutputKeyClass(IntWritable.class); > >> conf.setOutputValueClass(IntWritable.class); > >> > >> conf.setMapperClass(MapClass.class); > >> conf.setReducerClass(IdentityReducer.class); > >> > >> conf.setOutputKeyComparatorClass(IntWritable.Comparator.class); > >> conf.setOutputValueGroupingComparator(IntWritable.Comparator.class); > > > > The problem is that your map needs to look like: > > > > class IntPair implements Writable { > > private int left; > > private int right; > > public void set(int left, int right) { ... } > > public int getLeft() {...} > > public int getRight() {...} > > } > > > > your Mapper should be Mapper<LongWritable, Text, IntPair, IntWritable> > > and should emit > > > > IntPair key = new IntPair(); > > IntWritable value = new IntWritable(); > > ... > > key.set(keyValue, valueValue); > > value.set(valueValue); > > output.collect(key, value); > > > > Your sort comparator should compare both left and right in the pair. > > The grouping comparator should only look at left in the pair. > > > > Your Reducer should be Reducer<IntPair, IntWritable, IntWritable, IntWritable> > > > > output.collect(key.getLeft(), value); > > > > Is that clearer? > > > > -- Owen > >
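Owen's recipe can be illustrated without a Hadoop cluster at all. The sketch below (plain Java, hypothetical class and variable names, no Hadoop dependency) simulates what the framework does with the two comparators: the sort comparator orders pairs by (left, right), while the grouping comparator looks only at left and so decides where one reduce call ends and the next begins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    // What setOutputKeyComparatorClass provides: the full (left, right) order.
    static final Comparator<int[]> SORT =
            Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]);
    // What setOutputValueGroupingComparator provides: left only.
    static final Comparator<int[]> GROUP = Comparator.comparingInt(p -> p[0]);

    public static void main(String[] args) {
        List<int[]> pairs = new ArrayList<>(Arrays.asList(
                new int[]{8, 1}, new int[]{0, 8}, new int[]{8, 0},
                new int[]{0, 0}, new int[]{0, 1}));
        pairs.sort(SORT); // the shuffle sorts with the full comparator

        // One "reduce call" per run of pairs the grouping comparator considers
        // equal; within a run, the values arrive already sorted by right.
        int i = 0;
        while (i < pairs.size()) {
            int j = i;
            while (j < pairs.size() && GROUP.compare(pairs.get(i), pairs.get(j)) == 0) {
                j++;
            }
            StringBuilder values = new StringBuilder();
            for (int k = i; k < j; k++) {
                values.append(values.length() == 0 ? "" : " ").append(pairs.get(k)[1]);
            }
            System.out.println("reduce(" + pairs.get(i)[0] + "): " + values);
            i = j;
        }
        // prints:
        // reduce(0): 0 1 8
        // reduce(8): 0 1
    }
}
```

This is exactly David's "What I'd like" picture: each distinct left value yields one reduce call, with the right values sorted inside it.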
Re: SecondaryNameNode on separate machine
Hi, I'm not clear on how SecondaryNameNode communicates with NameNode (if deployed on a separate machine). Does SecondaryNameNode use a direct connection (over some port and protocol) or is it enough for SecondaryNameNode to have access to data which NameNode writes locally on disk? Tomislav On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote: > I think a lot of the confusion comes from this thread : > http://www.nabble.com/NameNode-failover-procedure-td11711842.html > > Particularly because the wiki was updated with wrong information, not > maliciously I'm sure. This information is now gone for good. > > Otis, your solution is pretty much like the one given by Dhruba Borthakur > and augmented by Konstantin Shvachko later in the thread but I never did it > myself. > > One thing should be clear though, the NN is and will remain a SPOF (just > like HBase's Master) as long as a distributed manager service (like > Zookeeper) is not plugged into Hadoop to help with failover. > > J-D > > On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic < > [EMAIL PROTECTED]> wrote: > > > Hi, > > So what is the "recipe" for avoiding NN SPOF using only what comes with > > Hadoop? > > > > From what I can tell, I think one has to do the following two things: > > > > 1) configure primary NN to save namespace and xa logs to multiple dirs, one > > of which is actually on a remotely mounted disk, so that the data actually > > lives on a separate disk on a separate box. This saves namespace and xa > > logs on multiple boxes in case of primary NN hardware failure. > > > > 2) configure secondary NN to periodically merge fsimage+edits and create > > the fsimage checkpoint. This really is a second NN process running on > > another box. It sounds like this secondary NN has to somehow have access to > > fsimage & edits files from the primary NN server.
> > http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode does > > not describe the best practice around that - the recommended way to > > give secondary NN access to primary NN's fsimage and edits files. Should > > one mount a disk from the primary NN box to the secondary NN box to get > > access to those files? Or is there a simpler way? > > In any case, this checkpoint is just a merge of fsimage+edits files and > > again is there in case the box with the primary NN dies. That's what's > > described on > > http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode more > > or less. > > > > Is this sufficient, or are there other things one has to do to eliminate NN > > SPOF? > > > > > > Thanks, > > Otis > > -- > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > > > > > - Original Message > > > From: Jean-Daniel Cryans <[EMAIL PROTECTED]> > > > To: core-user@hadoop.apache.org > > > Sent: Tuesday, October 28, 2008 8:14:44 PM > > > Subject: Re: SecondaryNameNode on separate machine > > > > > > Tomislav. > > > > > > Contrary to popular belief the secondary namenode does not provide > > failover, > > > it's only used to do what is described here : > > > > > http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode > > > > > > So the term "secondary" does not mean "a second one" but is more like "a > > > second part of". > > > > > > J-D > > > > > > On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote: > > > > > > > Hi, > > > > I'm trying to implement NameNode failover (or at least NameNode local > > > > data backup), but it is hard since there is no official documentation.
> > > > Pages on this subject are created, but still empty: > > > > > > > > http://wiki.apache.org/hadoop/NameNodeFailover > > > > http://wiki.apache.org/hadoop/SecondaryNameNode > > > > > > > > I have been browsing the web and hadoop mailing list to see how this > > > > should be implemented, but I got even more confused. People are asking > > > > do we even need SecondaryNameNode etc. (since NameNode can write local > > > > data to multiple locations, so one of those locations can be a mounted > > > > disk from other machine). I think I understand the motivation for > > > > SecondaryNameNode (to create a snapshot of NameNode data every n > > > > seconds/hours), but setting (deploying and running) SecondaryNameNode > > on > > > > different machine than NameNode is not as trivial as I expected. First > > I > > > > found that if I need to run SecondaryNameNode on other machine than > > > > NameNode I should change masters file on NameNode (change localhost to > > > > SecondaryNameNode host) and set some properties in hadoop-site.xml on > > > > SecondaryNameNode (fs.default.name, fs.checkpoint.dir, > > > > fs.checkpoint.period etc.) > > > > > > > > This was enough to start SecondaryNameNode when starting NameNode with > > > > bin/start-dfs.sh , but it didn't create image on SecondaryNameNode. > > Then > > > > I found that I need to set dfs.http.address on NameNode address (so now > > > > I have NameNode
Re: TaskTrackers disengaging from JobTracker
Could the version of Java being used matter? I just realized this cluster runs IBM Java, not Sun: java version "1.6.0" Java(TM) SE Runtime Environment (build pxi3260sr2-20080818_01(SR2)) IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 Linux x86-32 jvmxi3260-20080816_22093 (JIT enabled, AOT enabled) J9VM - 20080816_022093_lHdSMr JIT - r9_20080721_1330ifx2 GC - 20080724_AA) JCL - 20080808_02 - Aaron On Wed, Oct 29, 2008 at 2:43 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote: > The system load and memory consumption on the JT are both very close to > "idle" states -- it's not overworked, I don't think > > I may have an idea of the problem, though. Digging back up a ways into the > JT logs, I see this: > > 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: > error: java.io.IOException: java.lang.NullPointerException > > java.io.IOException: java.lang.NullPointerException > at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) > at java.lang.reflect.Method.invoke(Method.java:599) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) > > > > This exception is then repeated for all the IPC server handlers. So I think > the problem is that all the handler threads are dying one by one due to this > NPE. > > Is this something I can fix myself, or is a patch available? > > - Aaron > > > On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: > >> It's possible that the JobTracker is under duress and unable to respond to >> the TaskTrackers... what do the JobTracker logs say?
>> >> Arun >> >> >> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote: >> >> Hi all, >>> >>> I'm working with a 30 node Hadoop cluster that has just started >>> demonstrating some weird behavior. It's run without incident for a few >>> weeks.. and now: >>> >>> The cluster will run smoothly for 90--120 minutes or so, handling jobs >>> continually during this time. Then suddenly it will be the case that all >>> 29 >>> TaskTrackers will get disconnected from the JobTracker. All the tracker >>> daemon processes are still running on each machine; but the JobTracker >>> will >>> say "0 nodes available" on the web status screen. Restarting MapReduce >>> fixes >>> this for another 90--120 minutes. >>> >>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, >>> but >>> we're running on 0.18.1. >>> >>> I found this in a TaskTracker log: >>> >>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: >>> Caught >>> exception: java.io.IOException: Call failed on local exception >>> at java.lang.Throwable.(Throwable.java:67) >>> at org.apache.hadoop.ipc.Client.call(Client.java:718) >>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) >>> at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) >>> at >>> >>> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045) >>> at >>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) >>> at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) >>> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) >>> Caused by: java.io.IOException: Connection reset by peer >>> at sun.nio.ch.FileDispatcher.read0(Native Method) >>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) >>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) >>> at sun.nio.ch.IOUtil.read(IOUtil.java:207) >>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) >>> at >>> >>> 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) >>> at >>> >>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140) >>> at >>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) >>> at >>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) >>> at java.io.FilterInputStream.read(FilterInputStream.java:127) >>> at >>> >>> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272) >>> at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) >>> at java.io.BufferedInputStream.read(BufferedInputStream.java:248) >>> at java.io.DataInputStream.readInt(DataInputStream.java:381) >>> at >>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499) >>> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441) >>> >>> >>> As well as a few of these warnings: >>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON >>> THREADS >>> ((40-40+0)<1) on [EMAIL PROTECTE
Re: TaskTrackers disengaging from JobTracker
The system load and memory consumption on the JT are both very close to "idle" states -- it's not overworked, I don't think. I may have an idea of the problem, though. Digging back up a ways into the JT logs, I see this: 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: error: java.io.IOException: java.lang.NullPointerException java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:599) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) This exception is then repeated for all the IPC server handlers. So I think the problem is that all the handler threads are dying one by one due to this NPE. Is this something I can fix myself, or is a patch available? - Aaron On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote: > It's possible that the JobTracker is under duress and unable to respond to > the TaskTrackers... what do the JobTracker logs say? > > Arun > > > On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote: > > Hi all, >> >> I'm working with a 30 node Hadoop cluster that has just started >> demonstrating some weird behavior. It's run without incident for a few >> weeks.. and now: >> >> The cluster will run smoothly for 90--120 minutes or so, handling jobs >> continually during this time. Then suddenly it will be the case that all >> 29 >> TaskTrackers will get disconnected from the JobTracker. All the tracker >> daemon processes are still running on each machine; but the JobTracker >> will >> say "0 nodes available" on the web status screen.
Restarting MapReduce >> fixes >> this for another 90--120 minutes. >> >> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, >> but >> we're running on 0.18.1. >> >> I found this in a TaskTracker log: >> >> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught >> exception: java.io.IOException: Call failed on local exception >> at java.lang.Throwable.(Throwable.java:67) >> at org.apache.hadoop.ipc.Client.call(Client.java:718) >> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) >> at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) >> at >> >> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045) >> at >> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) >> at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) >> at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) >> Caused by: java.io.IOException: Connection reset by peer >> at sun.nio.ch.FileDispatcher.read0(Native Method) >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) >> at sun.nio.ch.IOUtil.read(IOUtil.java:207) >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) >> at >> >> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) >> at >> >> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140) >> at >> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) >> at >> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) >> at java.io.FilterInputStream.read(FilterInputStream.java:127) >> at >> >> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272) >> at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) >> at java.io.BufferedInputStream.read(BufferedInputStream.java:248) >> at java.io.DataInputStream.readInt(DataInputStream.java:381) >> at >> 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499) >> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441) >> >> >> As well as a few of these warnings: >> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON >> THREADS >> ((40-40+0)<1) on [EMAIL PROTECTED]:50060 >> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF >> THREADS: [EMAIL PROTECTED]:50060 >> >> >> >> The NameNode and DataNodes are completely fine. Can't be a DNS issue, >> because all DNS is served through /etc/hosts files. NameNode and >> JobTracker >> are on the same machine. >> >> Any help is appreciated >> Thanks >> - Aaron Kimball >> > >
Re: TaskTrackers disengaging from JobTracker
It's possible that the JobTracker is under duress and unable to respond to the TaskTrackers... what do the JobTracker logs say? Arun On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote: Hi all, I'm working with a 30 node Hadoop cluster that has just started demonstrating some weird behavior. It's run without incident for a few weeks.. and now: The cluster will run smoothly for 90--120 minutes or so, handling jobs continually during this time. Then suddenly it will be the case that all 29 TaskTrackers will get disconnected from the JobTracker. All the tracker daemon processes are still running on each machine; but the JobTracker will say "0 nodes available" on the web status screen. Restarting MapReduce fixes this for another 90--120 minutes. This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but we're running on 0.18.1. I found this in a TaskTracker log: 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception at java.lang.Throwable.<init>(Throwable.java:67) at org.apache.hadoop.ipc.Client.call(Client.java:718) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) at sun.nio.ch.IOUtil.read(IOUtil.java:207) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) at java.io.FilterInputStream.read(FilterInputStream.java:127) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272) at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) at java.io.BufferedInputStream.read(BufferedInputStream.java:248) at java.io.DataInputStream.readInt(DataInputStream.java:381) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441) As well as a few of these warnings: 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060 The NameNode and DataNodes are completely fine. Can't be a DNS issue, because all DNS is served through /etc/hosts files. NameNode and JobTracker are on the same machine. Any help is appreciated Thanks - Aaron Kimball
TaskTrackers disengaging from JobTracker
Hi all, I'm working with a 30 node Hadoop cluster that has just started demonstrating some weird behavior. It's run without incident for a few weeks.. and now: The cluster will run smoothly for 90--120 minutes or so, handling jobs continually during this time. Then suddenly it will be the case that all 29 TaskTrackers will get disconnected from the JobTracker. All the tracker daemon processes are still running on each machine; but the JobTracker will say "0 nodes available" on the web status screen. Restarting MapReduce fixes this for another 90--120 minutes. This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but we're running on 0.18.1. I found this in a TaskTracker log: 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception at java.lang.Throwable.(Throwable.java:67) at org.apache.hadoop.ipc.Client.call(Client.java:718) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234) at sun.nio.ch.IOUtil.read(IOUtil.java:207) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150) at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123) at java.io.FilterInputStream.read(FilterInputStream.java:127) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272) at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) at java.io.BufferedInputStream.read(BufferedInputStream.java:248) at java.io.DataInputStream.readInt(DataInputStream.java:381) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441) As well as a few of these warnings: 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060 The NameNode and DataNodes are completely fine. Can't be a DNS issue, because all DNS is served through /etc/hosts files. NameNode and JobTracker are on the same machine. Any help is appreciated Thanks - Aaron Kimball
Re: nagios to monitor hadoop datanodes!
All I have to say is wow! I never tried jconsole before. I have hadoop_trunk checked out and the JMX has all kinds of great information. I am going to look at how I can get JMX, cacti, and Hadoop working together. Just as an FYI, there are separate ENV variables for each daemon now. If you override HADOOP_OPTS you get a port conflict. It should be like this: export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=10001" Thanks Brian.
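A sketch of what that per-daemon setup might look like in conf/hadoop-env.sh. Only the namenode port appears in Brian's message; the datanode port here is an illustrative assumption, the point being that each daemon needs its own port.

```shell
# Shared JMX flags; each daemon then gets a distinct port, which is why
# overriding the single HADOOP_OPTS variable causes a port conflict.
JMX_COMMON="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
export HADOOP_NAMENODE_OPTS="$JMX_COMMON -Dcom.sun.management.jmxremote.port=10001"
export HADOOP_DATANODE_OPTS="$JMX_COMMON -Dcom.sun.management.jmxremote.port=10002"  # port is illustrative

# sanity check: the two daemons advertise different JMX ports
echo "$HADOOP_NAMENODE_OPTS" | grep -o 'port=[0-9]*'
echo "$HADOOP_DATANODE_OPTS" | grep -o 'port=[0-9]*'
```

With this in place, jconsole can be pointed at each daemon's port separately.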
Re: Datanode not detecting full disk
Stefan Will wrote: Hi Jeff, Yeah, it looks like I'm running into the issues described in the bug. I'm running 0.18.1 on CentOS 5 by the way. Measuring available disk space appears to be harder than I thought ... and here I was under the impression the percentage in df was a pretty clear indicator of how full the disk is ;-) How did you guys solve/work around this ? How many partitions do you have? If it is just one and NameNode thinks it has space even though 'available' in df shows very little or no space, then you need to file a jira. There should be no case where DN reports more space than what the 'available' field in 'df' shows. But if you have more partitions and only some of them are full, then it is a different issue.. which should still be fixed. Raghu. -- Stefan From: Jeff Hammerbacher <[EMAIL PROTECTED]> Reply-To: Date: Mon, 27 Oct 2008 12:40:08 -0700 To: Subject: Re: Datanode not detecting full disk Hey Stefan, We used to have trouble with this issue at Facebook. What version are you running? You might get more information on this ticket: https://issues.apache.org/jira/browse/HADOOP-2991. Regards, Jeff On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote: Each of my datanodes has a system and a data partition, with dfs.data.dir pointed to the data partition. The data partition just filled up to 100% on all of my nodes (as evident via df), but the NameNode web ui still shows them only 88-94% full (interestingly, the numbers differ even though the machines are configured identically). I thought the datanodes used df to determine free space ? How is the storage utilization determined ? -- Stefan
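To see what df itself reports per partition (the number the datanode ought to be tracking), a quick sketch like this works; the directory in the loop is a placeholder to be replaced with your actual dfs.data.dir entries:

```shell
# Print the "Available" column of df for each data directory. With one
# partition per entry, a full partition shows up immediately even when the
# NameNode web UI still claims free space.
for dir in /tmp; do   # placeholder; substitute your dfs.data.dir paths
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    printf '%s available: %s KB\n' "$dir" "$avail_kb"
done
```

Comparing these numbers against the NameNode UI per node is a simple way to spot the discrepancy Stefan describes.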
Re: "Could not obtain block" error
If you have only one copy of the block and it is corrupted, the Namenode itself cannot correct it. Of course, DFSClient should not print errors in an infinite loop. I think there was an old bug where the .crc file got overwritten by a 0-length file. One workaround for you is to go to the datanode and remove the .crc file for this block (find /datanodedir -name blk_5994030096182059653\*). Be careful not to remove the block file itself. Longer term fix: upgrade to a more recent version. Raghu. murali krishna wrote: Hi, When I try to read one of the files from dfs, I get the following error in an infinite loop (using 0.15.3): “08/10/28 23:43:15 INFO fs.DFSClient: Could not obtain block blk_5994030096182059653 from any node: java.io.IOException: No live nodes contain current block” Fsck showed that the file is HEALTHY but under replicated (1 instead of the configured 2). I checked the datanode log where the only replica exists for that block and I can see repeated errors while serving that block. 2008-10-22 23:55:39,378 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_5994030096182059653 to 68.142.212.228:50010 got java.io.EOFException at java.io.DataInputStream.readShort(DataInputStream.java:298) at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1061) at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1446) at java.lang.Thread.run(Thread.java:619) Any idea what is going on and how can I fix this ? Thanks, Murali
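Raghu's find command can be extended into a small sketch for spotting the bad case Murali later reported (zero-length block files). The directory below is a temporary stand-in for the real dfs.data.dir, created only so the sketch is self-contained; the block name is the one from this thread.

```shell
datadir=$(mktemp -d)   # stand-in for the real dfs.data.dir
# simulate the symptom: an empty block file next to a healthy one
touch "$datadir/blk_5994030096182059653"
echo "payload" > "$datadir/blk_1111111111111111111"

# all files belonging to the suspect block (Raghu's command, against $datadir)
find "$datadir" -name 'blk_5994030096182059653*'
# any zero-length block files -- candidates for the corruption described above
find "$datadir" -name 'blk_*' -size 0
rm -r "$datadir"
```

On a real datanode, running the second find across the data directory lists every empty block file, which matches the symptom of both the block and .meta files being 0-sized.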
RE: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?
Hello Zhengguo, Yes, -libjars is a new feature in Hadoop. It has been available since Hadoop 0.17.x, but it is more stable from Hadoop 0.18.x. Example use of -libjars: hadoop jar -libjars ... Thanks, --- Peeyush -Original Message- From: Zhengguo 'Mike' SUN [mailto:[EMAIL PROTECTED] Sent: Wed 10/29/2008 9:22 AM To: core-user@hadoop.apache.org Subject: Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job? Hi, Amareshwari, Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I see is -jar, which didn't work for me. And besides passing them as a jar file, are there any other ways to do that?
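For reference, a hypothetical fully spelled-out form of the pipes invocation being discussed. The jar name myclasses.jar is an assumption; the other arguments come from Mike's original command. The sketch only builds and prints the command line, since actually running it needs a live cluster.

```shell
# Hypothetical command line: ship the jar with the custom Java classes
# (MyInputFormat, MyInputSplit, ...) via -libjars so the client can load them.
JOB_JAR=myclasses.jar   # assumed name; contains the custom InputFormat/InputSplit
CMD="hadoop pipes -conf myconf.xml -libjars $JOB_JAR -inputformat MyInputFormat.class -input in -output out"
echo "$CMD"
```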
Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?
Hi, Amareshwari,

Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I see is -jar, which didn't work for me. And besides passing them as a jar file, are there any other ways to do that?

Thanks,
Mike

From: Amareshwari Sriramadasu <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, October 28, 2008 11:58:33 PM
Subject: Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

Hi,

How are you passing your classes to the pipes job? If you are passing them as a jar file, you can use the -libjars option. From branch 0.19, the libjar files are added to the client classpath also.

Thanks,
Amareshwari

Zhengguo 'Mike' SUN wrote:
> Hi,
>
> I implemented customized classes for InputFormat, InputSplit and RecordReader
> in Java and was trying to use them in a C++ pipes job. The customized
> InputFormat class could be included using the -inputformat option, but it
> threw ClassNotFoundException for my customized InputSplit class. It seemed
> the classpath has not been correctly set. Is there any way to let me
> include my customized classes in a pipes job?
Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values
Would the input using this method be sorted before the reducer? I have implemented this and only the key comparator class is called. This gives the effect that if I output the data here it is sorted. However, it sorts comparing both the right and the left as you suggest, so the reducer is given unique right-left pairs instead of being given each key with its values sorted.

What I get:

text file -> map:
  -> 0 0 -> reducer
  -> 0 1 -> reducer
  -> 8 0 -> reducer
  -> 8 1 -> reducer

What I'd like:

text file -> map:
  *** -> 0 0 \
      -> 0 1 | -> reducer
      -> 0 8 /
  *** -> 8 0 \ -> reducer
      -> 8 1 /
  *** -> 123 3 -> reducer

What is the best way to do this? The keys must be secondary sorted before the reduce, but I cannot think of a way to do this. Thank you.

Owen O'Malley wrote:
>
> On Oct 28, 2008, at 7:53 AM, David M. Coe wrote:
>
>> My mapper is Mapper<LongWritable, Text, IntWritable, IntWritable> and my
>> reducer is the identity. I configure the program using:
>>
>> conf.setOutputKeyClass(IntWritable.class);
>> conf.setOutputValueClass(IntWritable.class);
>>
>> conf.setMapperClass(MapClass.class);
>> conf.setReducerClass(IdentityReducer.class);
>>
>> conf.setOutputKeyComparatorClass(IntWritable.Comparator.class);
>> conf.setOutputValueGroupingComparator(IntWritable.Comparator.class);
>
> The problem is that your map needs to look like:
>
> class IntPair implements Writable {
>   private int left;
>   private int right;
>   public void set(int left, int right) { ... }
>   public int getLeft() {...}
>   public int getRight() {...}
> }
>
> your Mapper should be Mapper<LongWritable, Text, IntPair, IntWritable>
> and should emit
>
> IntPair key = new IntPair();
> IntWritable value = new IntWritable();
> ...
> key.set(keyValue, valueValue);
> value.set(valueValue);
> output.collect(key, value);
>
> Your sort comparator should compare both left and right in the pair.
> The grouping comparator should only look at left in the pair.
>
> Your Reducer should be Reducer<IntPair, IntWritable, IntWritable, IntWritable>
>
> output.collect(key.getLeft(), value);
>
> Is that clearer?
>
> -- Owen
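Owen's recipe can be seen end-to-end without a cluster. The sketch below is a plain-Java stand-in (no Hadoop classes; names like IntPair and runReducePhase are local to this demo): the sort comparator orders on (left, right) while the grouping comparator starts a new reduce group only when left changes, which is exactly the split between setOutputKeyComparatorClass and setOutputValueGroupingComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortDemo {
    static class IntPair {
        final int left, right;
        IntPair(int left, int right) { this.left = left; this.right = right; }
    }

    // Sort comparator: orders on both fields, so within each key the
    // values reach the reducer already sorted.
    static final Comparator<IntPair> SORT =
        Comparator.<IntPair>comparingInt(p -> p.left).thenComparingInt(p -> p.right);

    // Grouping comparator: looks at left only, so all (left, *) pairs
    // land in one reduce() call.
    static final Comparator<IntPair> GROUP =
        Comparator.<IntPair>comparingInt(p -> p.left);

    static String runReducePhase(List<IntPair> mapOutput) {
        List<IntPair> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);                      // what the shuffle/sort does
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sorted.size(); i++) {
            // a new reduce() call starts whenever the grouping
            // comparator sees a different key
            if (i == 0 || GROUP.compare(sorted.get(i - 1), sorted.get(i)) != 0) {
                if (i > 0) out.append('\n');
                out.append(sorted.get(i).left).append(':');
            }
            out.append(' ').append(sorted.get(i).right);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<IntPair> mapOutput = Arrays.asList(
            new IntPair(8, 1), new IntPair(0, 8), new IntPair(8, 0),
            new IntPair(0, 0), new IntPair(0, 1));
        System.out.println(runReducePhase(mapOutput));
        // prints:
        // 0: 0 1 8
        // 8: 0 1
    }
}
```

This matches the shape David asked for: one reducer call per left key, with the right values sorted inside it.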
Re: How does an offline Datanode come back up ?
Norbert Burger wrote:
> Along these lines, I'm curious what "management tools" folks are using to
> ensure cluster availability (i.e., auto-restart failed datanodes/namenodes).
> Are you using a custom cron script, or maybe something more complex
> (Ganglia, Nagios, Puppet, etc.)?

We use SmartFrog, http://smartfrog.org/ , to do this kind of thing, not just because it comes from our organisation, but because it gives us the ability to manage other parts of the system at the same time.

To get SF deploying Hadoop in a way I'm happy with, I have had to make a fair few changes to the lifecycle of the "services" - things like namenode, datanode, jobtracker and tasktracker. Most of the changes are in HADOOP-3628, though I need to push through another iteration of this [1]. Even with the changes I'm worried about race conditions and shutdown, as the existing code assumes that every node starts in its own process - which is what I recommend for production. We gave a talk on this topic in August at the Hadoop UK event [2].

None of this stuff is in a public release yet, but I may cut one next week which includes an unsupported 0.20-alpha-patched version of Hadoop in an RPM. This RPM can be pushed out to the machines through your RPM publish mechanism of choice; when the SmartFrog daemon comes up, it deploys whatever it has been told to, or it announces to the world that it is unpurposed and gets told what to deploy by someone it trusts.

Failure handling is still interesting. With a language like SmartFrog you can declare how failures should be handled; we have various workflowy containers to do things like:
- retry and restart
- kill and report upwards (default)
- roll back the whole deployment and restart

For things like task trackers, such loss is best handled by killing and restarting. But the filesystem is much more temperamental - and it is FS and HDD failures that create the most stress in any project. That, and accidental deletions of the entire dataset.
A node in the cluster that is only a tasktracker is disposable: for any problems you may as well flip the power switch and have the PXE reboot bring it back to a blank state. Datanode failures, though, are an issue. If the data on the node is replicated in >1 place, I'd decommission the node and do the same thing. If the data isn't adequately replicated yet, you want to get the stuff off it first. And if you think it's a physical HDD problem, it's time to stop using that particular disk.

I think everyone is still learning the main failure modes of a cluster, and still deciding how to react.

[1] https://issues.apache.org/jira/browse/HADOOP-3628
[2] http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf

> Thanks, Norbert
>
> On 10/28/08, Steve Loughran <[EMAIL PROTECTED]> wrote:
>> wmitchell wrote:
>>> Hi All,
>>>
>>> I've been working through Michael Noll's multi-node cluster setup example
>>> (Running_Hadoop_On_Ubuntu_Linux) for Hadoop and I have a working setup.
>>> On my slave machine -- which is currently running a datanode -- I killed
>>> the process in an effort to simulate some sort of failure on the slave
>>> machine's datanode. I had assumed that the namenode would have been
>>> polling its datanodes and thus attempted to bring up any node that goes
>>> down. On looking at my slave machine it seems that the datanode process
>>> is still down (I've checked jps).
>>
>> That's up to you or your management tools. The namenode knows that the
>> datanode is unreachable, but doesn't know how to go about reconnecting it
>> to the network. Which, given there are many causes of "down", sort of
>> makes sense. The switch failing, the HDDs dying or the process crashing
>> all look the same: no datanode heartbeats.

--
Steve Loughran http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/
Re: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1
We'll try it out...

On Oct 28, 2008, at 3:00 PM, Arun C Murthy wrote:

> On Oct 27, 2008, at 7:05 PM, Grant Ingersoll wrote:
>
>> Hi,
>>
>> Over in Mahout (lucene.a.o/mahout), we are seeing an oddity with some of
>> our clustering code and Hadoop 0.18.1. The thread in context is at:
>> http://mahout.markmail.org/message/vcyvlz2met7fnthr
>>
>> The problem seems to occur when going from 0.17.2 to 0.18.1. In the user
>> logs, we are seeing the following exception:
>>
>> 2008-10-27 21:18:37,014 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 5011 bytes
>> 2008-10-27 21:18:37,033 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810272112_0011_r_00_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)
>> Caused by: java.lang.NumberFormatException: For input string: "["
>
> If you are sure that this isn't caused by your application-logic, you
> could try running with http://issues.apache.org/jira/browse/HADOOP-4277 .
> That bug caused many a ship to sail in large circles, hopelessly.
> Arun

>>         at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
>>         at java.lang.Double.parseDouble(Double.java:510)
>>         at org.apache.mahout.matrix.DenseVector.decodeFormat(DenseVector.java:60)
>>         at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:256)
>>         at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:38)
>>         at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:31)
>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2174)
>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2134)
>>
>> And in the main output log (from running bin/hadoop jar mahout/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job) we see:
>>
>> 08/10/27 21:18:41 INFO mapred.JobClient: Task Id : attempt_200810272112_0011_r_00_0, Status : FAILED
>> java.io.IOException: attempt_200810272112_0011_r_00_0The reduce copier failed
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
>>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>>
>> If I run this exact same job on 0.17.2 it all runs fine. I suppose either
>> a bug was introduced in 0.18.1 or a bug was fixed that we were relying on.
>> Looking at the release notes between the fixes, nothing in particular
>> struck me as related. If it helps, I can provide the instructions for how
>> to run the example in question (they need to be written up anyway!)
>>
>> I see some related things at
>> http://hadoop.markmail.org/search/?q=Merge+of+the+inmemory+files+threw+an+exception
>> , but those are older, it seems, so not sure what to make of them.
>>
>> Thanks,
>> Grant
>>
>> --
>> Grant Ingersoll
>> Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Re: SecondaryNameNode on separate machine
I think a lot of the confusion comes from this thread:
http://www.nabble.com/NameNode-failover-procedure-td11711842.html

Particularly because the wiki was updated with wrong information, not maliciously I'm sure. This information is now gone for good.

Otis, your solution is pretty much like the one given by Dhruba Borthakur and augmented by Konstantin Shvachko later in the thread, but I never did it myself. One thing should be clear though: the NN is and will remain a SPOF (just like HBase's Master) as long as a distributed manager service (like ZooKeeper) is not plugged into Hadoop to help with failover.

J-D

On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Hi,
>
> So what is the "recipe" for avoiding NN SPOF using only what comes with
> Hadoop?
>
> From what I can tell, I think one has to do the following two things:
>
> 1) configure the primary NN to save the namespace and xa logs to multiple
> dirs, one of which is actually on a remotely mounted disk, so that the data
> actually lives on a separate disk on a separate box. This saves the
> namespace and xa logs on multiple boxes in case of primary NN hardware
> failure.
>
> 2) configure the secondary NN to periodically merge fsimage+edits and
> create the fsimage checkpoint. This really is a second NN process running
> on another box. It sounds like this secondary NN has to somehow have access
> to the fsimage & edits files from the primary NN server.
> http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> does not describe the best practice around that - the recommended way to
> give the secondary NN access to the primary NN's fsimage and edits files.
> Should one mount a disk from the primary NN box to the secondary NN box to
> get access to those files? Or is there a simpler way?
>
> In any case, this checkpoint is just a merge of the fsimage+edits files
> and again is there in case the box with the primary NN dies.
> That's what's described on
> http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> more or less.
>
> Is this sufficient, or are there other things one has to do to eliminate
> the NN SPOF?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> > To: core-user@hadoop.apache.org
> > Sent: Tuesday, October 28, 2008 8:14:44 PM
> > Subject: Re: SecondaryNameNode on separate machine
> >
> > Tomislav,
> >
> > Contrary to popular belief, the secondary namenode does not provide
> > failover; it's only used to do what is described here:
> > http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> >
> > So the term "secondary" does not mean "a second one" but is more like "a
> > second part of".
> >
> > J-D
> >
> > On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote:
> >
> > > Hi,
> > > I'm trying to implement NameNode failover (or at least NameNode local
> > > data backup), but it is hard since there is no official documentation.
> > > Pages on this subject have been created, but are still empty:
> > >
> > > http://wiki.apache.org/hadoop/NameNodeFailover
> > > http://wiki.apache.org/hadoop/SecondaryNameNode
> > >
> > > I have been browsing the web and the hadoop mailing list to see how
> > > this should be implemented, but I got even more confused. People are
> > > asking whether we even need the SecondaryNameNode etc. (since the
> > > NameNode can write local data to multiple locations, so one of those
> > > locations can be a mounted disk from another machine). I think I
> > > understand the motivation for the SecondaryNameNode (to create a
> > > snapshot of the NameNode data every n seconds/hours), but setting up
> > > (deploying and running) the SecondaryNameNode on a different machine
> > > than the NameNode is not as trivial as I expected.
> > > First I found that if I need to run the SecondaryNameNode on a machine
> > > other than the NameNode, I should change the masters file on the
> > > NameNode (change localhost to the SecondaryNameNode host) and set some
> > > properties in hadoop-site.xml on the SecondaryNameNode (fs.default.name,
> > > fs.checkpoint.dir, fs.checkpoint.period etc.)
> > >
> > > This was enough to start the SecondaryNameNode when starting the
> > > NameNode with bin/start-dfs.sh, but it didn't create an image on the
> > > SecondaryNameNode. Then I found that I need to set dfs.http.address to
> > > the NameNode address (so now I have the NameNode address in both
> > > fs.default.name and dfs.http.address).
> > >
> > > Now I get the following exception:
> > >
> > > 2008-10-28 09:18:00,098 ERROR NameNode.Secondary - Exception in
> > > doCheckpoint:
> > > 2008-10-28 09:18:00,098 ERROR NameNode.Secondary -
> > > java.net.SocketException: Unexpected end of file from server
> > >
> > > My questions are the following:
> > > How do I resolve this problem (this exception)?
> > > Do I need an additional property in the SecondaryNameNode's
> > > hadoop-site.xml or the NameNode's hadoop-site.xml?
> > >
> > > How should NameNode failover work i
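For reference, the secondary-namenode settings Tomislav describes would look roughly like this in the secondary machine's hadoop-site.xml. This is a sketch only: the host names, ports and paths are illustrative assumptions, and dfs.http.address must match the primary NN's actual HTTP address.

```xml
<!-- hadoop-site.xml on the SecondaryNameNode machine (0.18.x-era names).
     All hosts, ports and paths below are examples, not prescriptions. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
  <property>
    <!-- HTTP address of the PRIMARY namenode, which the secondary
         uses to fetch the fsimage and edits files -->
    <name>dfs.http.address</name>
    <value>namenode-host:50070</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/hadoop/dfs/namesecondary</value>
  </property>
  <property>
    <!-- seconds between checkpoints -->
    <name>fs.checkpoint.period</name>
    <value>3600</value>
  </property>
</configuration>
```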
Re: How does an offline Datanode come back up ?
Someone on the list is looking at monitoring Hadoop with Nagios. Nagios can be configured with an event_handler; in the past I have written event handlers to do operations like this: if the service is down, use an SSH key and restart it.

Since you have an SSH key on your master node, you should be able to have a centralized node restarter running from the master's cron. Maybe an interesting argument for running a separate Nagios instance as your hadoop user! In any case you can also run a cron job on each slave, as suggested above. The thing about all systems like this is that you have to remember to shut them down when you actually want the service down for maintenance etc.

We run Nagios and Cacti, so I would like to develop check scripts for these services. I am going to get an SVN repo together; if anyone is interested in contributing, let me know.
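A minimal slave-side cron watchdog along the lines suggested above might look like the sketch below. It is a sketch under assumptions: the jps-based check and the hadoop-daemon.sh start command match a stock 0.17/0.18 install, but HADOOP_HOME, the script path and the crontab schedule are placeholders you must adapt.

```shell
#!/bin/sh
# check-datanode.sh -- restart the local datanode if it is not running.
# Suggested crontab entry (every 5 minutes; path is a placeholder):
#   */5 * * * * /usr/local/hadoop/bin/check-datanode.sh

is_running() {    # $1 = daemon class name; reads `jps` output on stdin
    grep -q "$1"
}

# Only attempt the real check/restart where a Hadoop install exists.
if command -v jps >/dev/null 2>&1; then
    if ! jps | is_running DataNode; then
        echo "DataNode is down, restarting"
        "${HADOOP_HOME:-/usr/local/hadoop}/bin/hadoop-daemon.sh" start datanode \
            || echo "restart failed -- check HADOOP_HOME"
    fi
fi
```

Remember to disable the cron entry before intentionally stopping the daemon, or the watchdog will bring it right back.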
Re: Ideal number of mappers and reducers; any physical limits?
> I doubt that it is stored as an explicit matrix. Each page would probably
> have a big table (or file) entry and would have a list of links including
> link text.

Oh.. probably, and some random walk on the link graph.

On Wed, Oct 29, 2008 at 2:12 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> On Tue, Oct 28, 2008 at 5:15 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
>
>> ...
>> On a single machine, as far as we know, a graph can be stored as a linked
>> list or a matrix.
>
> Since the matrix is normally very sparse for large graphs, these two
> approaches are pretty similar.
>
>> ... So, I guess Google's web graph will be stored as a matrix in a
>> BigTable.
>
> I doubt that it is stored as an explicit matrix. Each page would probably
> have a big table (or file) entry and would have a list of links including
> link text.
>
>> Have you seen my 2D block algorithm post? --
>> http://blog.udanax.org/2008/10/parallel-matrix-multiply-on-hadoop.html
>
> I have now. Block decomposition for multiplies almost always applies only
> to dense matrix operations. For most sparse matrix representations,
> extracting a block is only efficient if it is full width or height. For
> very sparse matrix operations, the savings due to reuse of intermediate
> results are completely dominated by the I/O cost, so block decompositions
> are much less helpful.
>
> In many cases, it isn't even very helpful to send around entire rows, and
> sending individual elements is about as efficient.
>
>> FYI, Hama (http://incubator.apache.org/hama/) will handle graph
>> algorithms since it is related to adjacency matrices and topological
>> algebra. And I think a 2000-node hadoop/hbase cluster is big enough if
>> sequential/random read/write speed is improved 800%. :-)
>
> I think that a 5 node cluster is big enough without any improvement in
> read/write speed.
>
> Of course, it depends on the size of the problem.
> I was only working with a matrix with a few tens of billions of non-zero
> values.

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
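Ted's point about sparse storage can be made concrete. The sketch below (plain Java; the names are invented for this demo) stores a link graph the way he describes, one adjacency list per page rather than an explicit matrix, and runs one random-walk step over it. Only stored entries are ever touched, which is also why extracting an arbitrary block from such a representation is awkward.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseGraphDemo {
    // One row per page: page -> (linked page -> edge weight).
    // This is the adjacency-list view of a very sparse matrix.
    static final Map<Integer, Map<Integer, Double>> graph = new HashMap<>();

    static void link(int from, int to, double weight) {
        graph.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
    }

    // One random-walk step: y = x * A, touching only stored entries.
    static Map<Integer, Double> step(Map<Integer, Double> x) {
        Map<Integer, Double> y = new HashMap<>();
        for (Map.Entry<Integer, Double> page : x.entrySet()) {
            Map<Integer, Double> row =
                graph.getOrDefault(page.getKey(), Map.of());
            for (Map.Entry<Integer, Double> edge : row.entrySet())
                y.merge(edge.getKey(),
                        page.getValue() * edge.getValue(), Double::sum);
        }
        return y;
    }

    public static void main(String[] args) {
        link(1, 2, 0.5); link(1, 3, 0.5); link(2, 3, 1.0);
        // all probability mass starts on page 1
        System.out.println(step(Map.of(1, 1.0)));
    }
}
```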
Re: Datanode not detecting full disk
Hey Stefan,

It's always fun when seemingly trivial problems turn out to be nontrivial. As for the solution: if I recall correctly (someone from Facebook please hop in here), we just jacked up dfs.datanode.du.reserved to a sizable amount, like 2 GB or something.

Regards,
Jeff

On Tue, Oct 28, 2008 at 11:31 PM, Stefan Will <[EMAIL PROTECTED]> wrote:

> Hi Jeff,
>
> Yeah, it looks like I'm running into the issues described in the bug. I'm
> running 0.18.1 on CentOS 5, by the way. Measuring available disk space
> appears to be harder than I thought ... and here I was under the impression
> that the percentage in df was a pretty clear indicator of how full the disk
> is ;-)
>
> How did you guys solve/work around this?
>
> -- Stefan
>
>> From: Jeff Hammerbacher <[EMAIL PROTECTED]>
>> Date: Mon, 27 Oct 2008 12:40:08 -0700
>> Subject: Re: Datanode not detecting full disk
>>
>> Hey Stefan,
>>
>> We used to have trouble with this issue at Facebook. What version are
>> you running? You might get more information on this ticket:
>> https://issues.apache.org/jira/browse/HADOOP-2991
>>
>> Regards,
>> Jeff
>>
>> On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote:
>>> Each of my datanodes has a system and a data partition, with
>>> dfs.data.dir pointed to the data partition. The data partition just
>>> filled up to 100% on all of my nodes (as evident via df), but the
>>> NameNode web UI still shows them only 88-94% full (interestingly, the
>>> numbers differ even though the machines are configured identically). I
>>> thought the datanodes used df to determine free space? How is the
>>> storage utilization determined?
>>>
>>> -- Stefan
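The workaround Jeff describes is a one-property change in hadoop-site.xml on each datanode. A sketch, with the 2 GB figure taken from his recollection rather than any official recommendation:

```xml
<property>
  <!-- bytes per volume kept free for non-DFS use; here ~2 GB -->
  <name>dfs.datanode.du.reserved</name>
  <value>2147483648</value>
</property>
```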
"Could not obtain block" error
Hi,

When I try to read one of the files from dfs, I get the following error in an infinite loop (using 0.15.3):

"08/10/28 23:43:15 INFO fs.DFSClient: Could not obtain block blk_5994030096182059653 from any node: java.io.IOException: No live nodes contain current block"

Fsck showed that the file is HEALTHY but under-replicated (1 instead of the configured 2). I checked the datanode log where the only replica for that block exists, and I can see repeated errors while serving that block:

2008-10-22 23:55:39,378 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_5994030096182059653 to 68.142.212.228:50010 got java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1061)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1446)
        at java.lang.Thread.run(Thread.java:619)

Any idea what is going on and how can I fix this?

Thanks,
Murali