Re: "Could not obtain block" error

2008-10-29 Thread Murali Krishna
Thanks Raghu,

But both the block file and the .meta file are 0-sized files!!

Thanks,
Murali


On 10/30/08 12:16 AM, "Raghu Angadi" <[EMAIL PROTECTED]> wrote:

> 
> One work around for you is to go to the datanode and remove the .crc
> file for this block (find /datanodedir -name blk_5994030096182059653\*).
> Be careful not to remove the block file itself.



Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-29 Thread Owen O'Malley

I uploaded a patch that does a secondary sort. Take a look at:

https://issues.apache.org/jira/browse/HADOOP-4545

It reads input with two numbers per line, such as:

-1 -4
-3 23
5 10
-1 -2
-1 300
-1 10
4 1
4 2
4 10
4 -1
4 -10
10 20
10 30
10 25

And produces output like (with 2 reduces):

part-0:

4   -10
4   -1
4   1
4   2
4   10

10  20
10  25
10  30

part-1:

-3  23

-1  -4
-1  -2
-1  10
-1  300

5   10

-- Owen


Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-29 Thread Amareshwari Sriramadasu

Zhengguo 'Mike' SUN wrote:

Hi, Peeyush,

I guess I didn't make myself clear. I am trying to run a Hadoop pipes job with 
a combination of Java classes and C++ classes. So the command I am using is 
like:

hadoop pipes -conf myconf.xml -inputformat MyInputFormat.class -input in 
-output out

And it threw ClassNotFoundException for my InputSplit class.
As I understand "hadoop jar" is used to run a jar file, which is not my case. And there 
is a -jar option in "hadoop pipes". But, unfortunately, it is not working for me. So the 
question I want to ask is how to include customized Java classes, such as MyInputSplit, in a pipes 
job?

  
You are right. The -jar option also doesn't add the jar file to the classpath on
the client side. You can use the -libjars option with 0.19. Then the command
looks like:


hadoop pipes -conf myconf.xml -libjars <your jar file> -inputformat 
MyInputFormat.class -input in -output out

I don't see a way to do this in 0.17.*; one workaround could be to add it 
explicitly to the classpath on the client side and pass it through the 
-jar option for the job.

Thanks,
Amareshwari

Thanks,
Mike





From: Peeyush Bishnoi <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Sent: Wednesday, October 29, 2008 12:52:18 PM
Subject: RE: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hello Zhengguo ,

Yes, -libjars is a new feature in Hadoop. It has been available from Hadoop 0.17.x, but it is more stable from Hadoop 0.18.x onward.


example to use -libjars:

hadoop jar <your jar> -libjars <comma-separated jars> ...


Thanks ,

---
Peeyush


-Original Message-
From: Zhengguo 'Mike' SUN [mailto:[EMAIL PROTECTED]
Sent: Wed 10/29/2008 9:22 AM
To: core-user@hadoop.apache.org
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi, Amareshwari,

Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I 
see is -jar, which didn't work for me. And besides passing them as a jar file, 
are there any other ways to do that?

Thanks
Mike



From: Amareshwari Sriramadasu <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, October 28, 2008 11:58:33 PM
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi,

How are you passing your classes to the pipes job? If you are passing 
them as a jar file, you can use -libjars option. From branch 0.19, the 
libjar files are added to the client classpath also.


Thanks
Amareshwari
Zhengguo 'Mike' SUN wrote:
  

Hi,

I implemented customized classes for InputFormat, InputSplit and RecordReader 
in Java and was trying to use them in a C++ pipes job. The customized 
InputFormat class could be included using the -inputformat option, but it threw 
ClassNotFoundException for my customized InputSplit class. It seemed the 
classpath was not correctly set. Is there any way to include my 
customized classes in a pipes job?


Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Devaraj Das
> 
> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
> it against TRUNK. I've put this on the cluster and it's now been holding
> steady for the last hour or so.. so that plus whatever other differences
> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
> patch to the JIRA as soon as it finishes cranking against the JUnit tests)
> 

Aaron, I don't think this is a solution to the problem you are seeing. The
IPC handlers are tolerant of exceptions; in particular, they must not die in
the event of an exception during RPC processing. Could you please get a
stack trace of the JobTracker threads (without your patch) when the TTs are
unable to talk to it? Access the URL http://<jobtracker-host>:<http-port>/stacks
and that will tell us what the handlers are up to.

> - Aaron
> 
> 
> Devaraj Das wrote:
>> 
>> On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:
>> 
>>> The system load and memory consumption on the JT are both very close to
>>> "idle" states -- it's not overworked, I don't think
>>> 
>>> I may have an idea of the problem, though. Digging back up a ways into the
>>> JT logs, I see this:
>>> 
>>> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 4 on 9001, call killJob(job_200810290855_0025) from
>>> 10.1.143.245:48253: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>> at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>> at java.lang.reflect.Method.invoke(Method.java:599)
>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>>> 
>>> 
>>> 
>>> This exception is then repeated for all the IPC server handlers. So I think
>>> the problem is that all the handler threads are dying one by one due to this
>>> NPE.
>>> 
>> 
>> This should not happen. IPC handler catches Throwable and handles that.
>> Could you give more details like the kind of jobs (long/short) you are
>> running, how many tasks they have, etc.
>> 
>>> This something I can fix myself, or is a patch available?
>>> 
>>> - Aaron
>>> 
>>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>>> 
 It's possible that the JobTracker is under duress and unable to respond to
 the TaskTrackers... what do the JobTracker logs say?
 
 Arun
 
 
 On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
 
  Hi all,
> I'm working with a 30 node Hadoop cluster that has just started
> demonstrating some weird behavior. It's run without incident for a few
> weeks.. and now:
> 
> The cluster will run smoothly for 90--120 minutes or so, handling jobs
> continually during this time. Then suddenly it will be the case that all
> 29
> TaskTrackers will get disconnected from the JobTracker. All the tracker
> daemon processes are still running on each machine; but the JobTracker
> will
> say "0 nodes available" on the web status screen. Restarting MapReduce
> fixes
> this for another 90--120 minutes.
> 
> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
> but
> we're running on 0.18.1.
> 
> I found this in a TaskTracker log:
> 
> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
> exception: java.io.IOException: Call failed on local exception
>   at java.lang.Throwable.(Throwable.java:67)
>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
> Caused by: java.io.IOException: Connection reset by peer
>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>   at org.apache.hadoop.net.SocketInputStream.read(Sock

Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
Just as I wrote that, Murphy's law struck :) This did not fix the issue 
after all.


I think the problem is occurring because a huge amount of network 
bandwidth is being consumed by the jobs. What settings (timeouts, thread 
counts, etc.), if any, ought I to dial up to correct for this?


Thanks,
- Aaron

Aaron Kimball wrote:
It's a cluster being used for a university course; there are 30 students 
all running code which (to be polite) probably tests the limits of 
Hadoop's failure recovery logic. :)


The current assignment is PageRank over Wikipedia; a 20 GB input corpus. 
Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50 
reduce tasks.


I wrote a patch to address the NPE in JobTracker.killJob() and compiled 
it against TRUNK. I've put this on the cluster and it's now been holding 
steady for the last hour or so.. so that plus whatever other differences 
there are between 18.1 and TRUNK may have fixed things. (I'll submit the 
patch to the JIRA as soon as it finishes cranking against the JUnit tests)


- Aaron


Devaraj Das wrote:


On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:


The system load and memory consumption on the JT are both very close to
"idle" states -- it's not overworked, I don't think

I may have an idea of the problem, though. Digging back up a ways 
into the

JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 9001, call killJob(job_200810290855_0025) from
10.1.143.245:48253: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I 
think
the problem is that all the handler threads are dying one by one due 
to this

NPE.



This should not happen. IPC handler catches Throwable and handles that.
Could you give more details like the kind of jobs (long/short) you are
running, how many tasks they have, etc.


This something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> 
wrote:


It's possible that the JobTracker is under duress and unable to 
respond to

the TaskTrackers... what do the JobTracker logs say?

Arun


On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

 Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case 
that all

29
TaskTrackers will get disconnected from the JobTracker. All the 
tracker

daemon processes are still running on each machine; but the JobTracker
will
say "0 nodes available" on the web status screen. Restarting MapReduce
fixes
this for another 90--120 minutes.

This looks similar to 
https://issues.apache.org/jira/browse/HADOOP-1763,

but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: 
Caught

exception: java.io.IOException: Call failed on local exception
  at java.lang.Throwable.(Throwable.java:67)
  at org.apache.hadoop.ipc.Client.call(Client.java:718)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
  at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
  at sun.nio.ch.IOUtil.read(IOUtil.java:207)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
  at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
  at java.io.FilterInputStream.read(FilterInputStream.java:127)
  at org.apache.hadoop.ipc.Client$Connection$PingInputStream.re

Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
It's a cluster being used for a university course; there are 30 students 
all running code which (to be polite) probably tests the limits of 
Hadoop's failure recovery logic. :)


The current assignment is PageRank over Wikipedia; a 20 GB input corpus. 
Individual jobs run ~5--15 minutes in length, using 300 map tasks and 50 
reduce tasks.


I wrote a patch to address the NPE in JobTracker.killJob() and compiled 
it against TRUNK. I've put this on the cluster and it's now been holding 
steady for the last hour or so.. so that plus whatever other differences 
there are between 18.1 and TRUNK may have fixed things. (I'll submit the 
patch to the JIRA as soon as it finishes cranking against the JUnit tests)


- Aaron


Devaraj Das wrote:


On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:


The system load and memory consumption on the JT are both very close to
"idle" states -- it's not overworked, I don't think

I may have an idea of the problem, though. Digging back up a ways into the
JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 9001, call killJob(job_200810290855_0025) from
10.1.143.245:48253: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I think
the problem is that all the handler threads are dying one by one due to this
NPE.



This should not happen. IPC handler catches Throwable and handles that.
Could you give more details like the kind of jobs (long/short) you are
running, how many tasks they have, etc.


This something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:


It's possible that the JobTracker is under duress and unable to respond to
the TaskTrackers... what do the JobTracker logs say?

Arun


On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:

 Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case that all
29
TaskTrackers will get disconnected from the JobTracker. All the tracker
daemon processes are still running on each machine; but the JobTracker
will
say "0 nodes available" on the web status screen. Restarting MapReduce
fixes
this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call failed on local exception
  at java.lang.Throwable.(Throwable.java:67)
  at org.apache.hadoop.ipc.Client.call(Client.java:718)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
  at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
  at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
  at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
  at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
  at sun.nio.ch.IOUtil.read(IOUtil.java:207)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
  at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
  at java.io.FilterInputStream.read(FilterInputStream.java:127)
  at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
  at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
  at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
  at java.io.DataInputStream.readInt(DataInputStream.java:381)
  at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
  at org.apache.hadoop.ipc.Client$Connection.run(Client.java

Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Devaraj Das


On 10/30/08 3:13 AM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:

> The system load and memory consumption on the JT are both very close to
> "idle" states -- it's not overworked, I don't think
> 
> I may have an idea of the problem, though. Digging back up a ways into the
> JT logs, I see this:
> 
> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 4 on 9001, call killJob(job_200810290855_0025) from
> 10.1.143.245:48253: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> at java.lang.reflect.Method.invoke(Method.java:599)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
> 
> 
> 
> This exception is then repeated for all the IPC server handlers. So I think
> the problem is that all the handler threads are dying one by one due to this
> NPE.
> 

This should not happen. IPC handler catches Throwable and handles that.
Could you give more details like the kind of jobs (long/short) you are
running, how many tasks they have, etc.

> This something I can fix myself, or is a patch available?
> 
> - Aaron
> 
> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> 
>> It's possible that the JobTracker is under duress and unable to respond to
>> the TaskTrackers... what do the JobTracker logs say?
>> 
>> Arun
>> 
>> 
>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>> 
>>  Hi all,
>>> 
>>> I'm working with a 30 node Hadoop cluster that has just started
>>> demonstrating some weird behavior. It's run without incident for a few
>>> weeks.. and now:
>>> 
>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>> continually during this time. Then suddenly it will be the case that all
>>> 29
>>> TaskTrackers will get disconnected from the JobTracker. All the tracker
>>> daemon processes are still running on each machine; but the JobTracker
>>> will
>>> say "0 nodes available" on the web status screen. Restarting MapReduce
>>> fixes
>>> this for another 90--120 minutes.
>>> 
>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>> but
>>> we're running on 0.18.1.
>>> 
>>> I found this in a TaskTracker log:
>>> 
>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
>>> exception: java.io.IOException: Call failed on local exception
>>>   at java.lang.Throwable.(Throwable.java:67)
>>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>> Caused by: java.io.IOException: Connection reset by peer
>>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>> 
>>> 
>>> As well as a few of these warnings:
>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>> THREADS
>>> ((40-40+0)<1) on [EMAIL PROTECTED]:50060
>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>>> THREADS: [EMAIL PROTECTED]:50060
>>> 
>>> 
>>> 
>>> The NameNode and DataNodes are completely fine. Can't be a DNS issue,
>>> because all DNS is served 

Debugging / Logging in Hadoop?

2008-10-29 Thread Scott Whitecross
I'm curious what the best method is for debugging and logging in  
Hadoop.  I put together a small cluster today and a simple application  
to process log files.  While it worked well, I had trouble trying to  
get logging information out.  Is there any way to attach a debugger,  
or to get log4j to write a log file?  I tried setting up a Logger in the  
class I used for the map/reduce, but I had no luck.
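
(For reference, a minimal sketch of the usual approach, using the commons-logging API that Hadoop itself uses; the class and message names below are made up. Log output from a task typically ends up under the tasktracker's userlogs directory and is also reachable per-task from the job's web UI.)

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);
  private static final IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Goes to the task's log (userlogs/<task-id>/syslog on the tasktracker).
    LOG.info("processing line at offset " + key);
    // Short status strings also show up per-task in the web UI.
    reporter.setStatus("at offset " + key);
    output.collect(new Text(value.toString()), ONE);
  }
}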


Thanks.




Re: Integration with compute cluster

2008-10-29 Thread Otis Gospodnetic
Hi,

You want to store your logs in HDFS (by copying them from your production 
machines, presumably) and then write custom MapReduce jobs that know how to 
process and correlate the data in the logs and output it in some format that 
suits you.  What you do with that output is then up to you.
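
(One common way to get the logs into HDFS is bin/hadoop fs -put, or programmatically with the FileSystem API; a minimal sketch follows, with purely hypothetical paths.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogUploader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    // Copy one day's web logs from the prod box into HDFS for later MapReduce jobs.
    fs.copyFromLocalFile(new Path("/var/log/httpd/access.log.2008-10-29"),
                         new Path("/logs/web/2008-10-29/access.log"));
    fs.close();
  }
}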


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: shahab mehmandoust <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, October 29, 2008 7:29:35 PM
> Subject: Integration with compute cluster
> 
> Hi,
> 
> We have one prod server with web logs and a db server.  We want to correlate
> the data in the logs and the db.  With a hadoop implementation (for scaling
> up later), do we need to transfer the data to a machine (designated as the
> compute cluster: http://hadoop.apache.org/core/images/architecture.gif), run
> map/reduce there, and then transfer the output elsewhere for our analysis?
> 
> I'm confused about the compute cluster; does it encompass the data sources
> (here the prod server and the db)?
> 
> Thanks,
> Shahab



Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-29 Thread Zhengguo 'Mike' SUN
Hi, Peeyush,

I guess I didn't make myself clear. I am trying to run a Hadoop pipes job with 
a combination of Java classes and C++ classes. So the command I am using is 
like:

hadoop pipes -conf myconf.xml -inputformat MyInputFormat.class -input in 
-output out

And it threw ClassNotFoundException for my InputSplit class.
As I understand "hadoop jar" is used to run a jar file, which is not my case. 
And there is a -jar option in "hadoop pipes". But, unfortunately, it is not 
working for me. So the question I want to ask is how to include customized Java 
classes, such as MyInputSplit, in a pipes job?

Thanks,
Mike





From: Peeyush Bishnoi <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Sent: Wednesday, October 29, 2008 12:52:18 PM
Subject: RE: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hello Zhengguo ,

Yes, -libjars is a new feature in Hadoop. It has been available 
from Hadoop 0.17.x, but it is more stable from Hadoop 0.18.x onward.

example to use -libjars:

hadoop jar <your jar> -libjars <comma-separated jars> ...


Thanks ,

---
Peeyush


-Original Message-
From: Zhengguo 'Mike' SUN [mailto:[EMAIL PROTECTED]
Sent: Wed 10/29/2008 9:22 AM
To: core-user@hadoop.apache.org
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi, Amareshwari,

Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I 
see is -jar, which didn't work for me. And besides passing them as a jar file, 
are there any other ways to do that?

Thanks
Mike



From: Amareshwari Sriramadasu <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, October 28, 2008 11:58:33 PM
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi,

How are you passing your classes to the pipes job? If you are passing 
them as a jar file, you can use -libjars option. From branch 0.19, the 
libjar files are added to the client classpath also.

Thanks
Amareshwari
Zhengguo 'Mike' SUN wrote:
> Hi,
>
> I implemented customized classes for InputFormat, InputSplit and RecordReader 
> in Java and was trying to use them in a C++ pipes job. The customized 
> InputFormat class could be included using the -inputformat option, but it 
> threw ClassNotFoundException for my customized InputSplit class. It seemed 
> the classpath was not correctly set. Is there any way to include my 
> customized classes in a pipes job?
>
>
>

Re: Datanode not detecting full disk

2008-10-29 Thread Stefan Will
Hi Raghu,

Each DN machine has 3 partitions, e.g.:

FilesystemSize  Used Avail Use% Mounted on
/dev/sda1  20G  8.0G   11G  44% /
/dev/sda3 1.4T  756G  508G  60% /data
tmpfs 3.9G 0  3.9G   0% /dev/shm

All of the paths in hadoop-site.xml point to /data, which is the partition
that filled up to 100% (I deleted a bunch of files from HDFS since then). So
I guess the question is whether the DN looks at just the partition its data
directory is on, or all partitions when it determines disk usage.

-- Stefan


> From: Raghu Angadi <[EMAIL PROTECTED]>
> Reply-To: 
> Date: Wed, 29 Oct 2008 11:57:07 -0700
> To: 
> Subject: Re: Datanode not detecting full disk
> 
> Stefan Will wrote:
>> Hi Jeff,
>> 
>> Yeah, it looks like I'm running into the issues described in the bug. I'm
>> running 0.18.1 on CentOS 5 by the way. Measuring available disk space
>> appears to be harder than I thought ... and here I was under the impression
>> the percentage in df was a pretty clear indicator of how full the disk is
>> ;-)
>> 
>> How did you guys solve/work around this ?
> 
> How many partitions do you have? If it is just one and NameNode thinks
> it has space though 'available' in df shows very little or no space, then
> you need to file a jira. There should be no case where DN reports more
> space than what 'available' field in 'df' shows.
> 
> But if you have more partitions and only some of them are full, then it
> is a different issue.. which should still be fixed.
> 
> Raghu.
> 
>> -- Stefan
>> 
>>  
>>> From: Jeff Hammerbacher <[EMAIL PROTECTED]>
>>> Reply-To: 
>>> Date: Mon, 27 Oct 2008 12:40:08 -0700
>>> To: 
>>> Subject: Re: Datanode not detecting full disk
>>> 
>>> Hey Stefan,
>>> 
>>> We used to have trouble with this issue at Facebook. What version are
>>> you running? You might get more information on this ticket:
>>> https://issues.apache.org/jira/browse/HADOOP-2991.
>>> 
>>> Regards,
>>> Jeff
>>> 
>>> On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote:
 Each of my datanodes has  a system and a data partition, with dfs.data.dir
 pointed to the data partition. The data partition just filled up to 100% on
 all of my nodes (as evident via df), but the NameNode web ui still shows
 them only 88-94% full (interestingly, the numbers differ even though the
 machines are configured identically). I thought the datanodes used df to
 determine free space ? How is the storage utilization determined ?
 
 -- Stefan
 
>> 
>> 




Integration with compute cluster

2008-10-29 Thread shahab mehmandoust
Hi,

We have one prod server with web logs and a db server.  We want to correlate
the data in the logs and the db.  With a hadoop implementation (for scaling
up later), do we need to transfer the data to a machine (designated as the
compute cluster: http://hadoop.apache.org/core/images/architecture.gif), run
map/reduce there, and then transfer the output elsewhere for our analysis?

I'm confused about the compute cluster; does it encompass the data sources
(here the prod server and the db)?

Thanks,
Shahab


Any examples using Hadoop Pipes with binary SequenceFiles?

2008-10-29 Thread Joel Welling
Hi folks;
  I'm interested in reading binary data, running it through some C++
code, and writing the result as binary data.  It looks like
SequenceFiles and Pipes are the way to do it, but I can't find any
examples or docs beyond the API specification.  Can someone point me to
an example where this is done?
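
(Not a full Pipes example, but as a hedged starting point: on the Java side you could write your binary records into a SequenceFile of BytesWritable and point the pipes job at it via SequenceFileInputFormat. A minimal sketch, with a made-up output path and record contents:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class WriteBinarySequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/user/joel/binary-input.seq");   // hypothetical path
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, NullWritable.class, BytesWritable.class);
    try {
      byte[] record = new byte[] { 0x01, 0x02, 0x03 };    // stand-in for real binary data
      writer.append(NullWritable.get(), new BytesWritable(record));
    } finally {
      writer.close();
    }
  }
}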

Thanks,
-Joel



Re: SecondaryNameNode on separate machine

2008-10-29 Thread Konstantin Shvachko

The SecondaryNameNode uses the http protocol to transfer the image and the edits
from the primary name-node and vice versa.
So the secondary does not access local files on the primary directly.
The primary NN should know the secondary's http address,
and the secondary NN needs to know both fs.default.name and dfs.http.address of 
the primary.

In general we usually create one configuration file, hadoop-site.xml,
and copy it to all the machines, so you don't need to set up different
values for each server.
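
(To make the property names concrete, here is a hedged sketch, written as Java only for illustration; in practice these entries go into the shared hadoop-site.xml, and the host name, ports and paths below are just example values.)

import org.apache.hadoop.conf.Configuration;

public class SecondaryNameNodeConfSketch {
  // Example values only; substitute the primary name-node's real host, ports and paths.
  public static Configuration example() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://primary-nn:9000");     // how the secondary finds the primary
    conf.set("dfs.http.address", "primary-nn:50070");          // primary's http address for image/edits transfer
    conf.set("fs.checkpoint.dir", "/data/hadoop/checkpoint");  // where the secondary stores its checkpoint
    conf.set("fs.checkpoint.period", "3600");                  // seconds between checkpoints
    return conf;
  }
}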

Regards,
--Konstantin

Tomislav Poljak wrote:

Hi,
I'm not clear on how the SecondaryNameNode communicates with the NameNode
(if deployed on a separate machine). Does the SecondaryNameNode use a direct
connection (over some port and protocol), or is it enough for the
SecondaryNameNode to have access to the data which the NameNode writes locally
to disk?

Tomislav

On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote:

I think a lot of the confusion comes from this thread :
http://www.nabble.com/NameNode-failover-procedure-td11711842.html

Particularly because the wiki was updated with wrong information, not
maliciously I'm sure. This information is now gone for good.

Otis, your solution is pretty much like the one given by Dhruba Borthakur
and augmented by Konstantin Shvachko later in the thread but I never did it
myself.

One thing should be clear though, the NN is and will remain a SPOF (just
like HBase's Master) as long as a distributed manager service (like
Zookeeper) is not plugged into Hadoop to help with failover.

J-D

On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:


Hi,
So what is the "recipe" for avoiding NN SPOF using only what comes with
Hadoop?

From what I can tell, I think one has to do the following two things:

1) configure primary NN to save namespace and xa logs to multiple dirs, one
of which is actually on a remotely mounted disk, so that the data actually
lives on a separate disk on a separate box.  This saves namespace and xa
logs on multiple boxes in case of primary NN hardware failure.

2) configure secondary NN to periodically merge fsimage+edits and create
the fsimage checkpoint.  This really is a second NN process running on
another box.  It sounds like this secondary NN has to somehow have access to
fsimage & edits files from the primary NN server.
http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
does not describe the best practise around that - the recommended way to
give secondary NN access to primary NN's fsimage and edits files.  Should
one mount a disk from the primary NN box to the secondary NN box to get
access to those files?  Or is there a simpler way?
In any case, this checkpoint is just a merge of fsimage+edits files and
again is there in case the box with the primary NN dies.  That's what's
described on
http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
more or less.

Is this sufficient, or are there other things one has to do to eliminate NN
SPOF?


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 

From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, October 28, 2008 8:14:44 PM
Subject: Re: SecondaryNameNode on separate machine

Tomislav.

Contrary to popular belief the secondary namenode does not provide

failover,

it's only used to do what is described here :


http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode

So the term "secondary" does not mean "a second one" but is more like "a
second part of".

J-D

On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote:


Hi,
I'm trying to implement NameNode failover (or at least NameNode local
data backup), but it is hard since there is no official documentation.
Pages on this subject are created, but still empty:

http://wiki.apache.org/hadoop/NameNodeFailover
http://wiki.apache.org/hadoop/SecondaryNameNode

I have been browsing the web and hadoop mailing list to see how this
should be implemented, but I got even more confused. People are asking
do we even need SecondaryNameNode etc. (since NameNode can write local
data to multiple locations, so one of those locations can be a mounted
disk from other machine). I think I understand the motivation for
SecondaryNameNode (to create a snapshot of NameNode data every n
seconds/hours), but setting (deploying and running) SecondaryNameNode

on

different machine than NameNode is not as trivial as I expected. First

I

found that if I need to run SecondaryNameNode on other machine than
NameNode I should change masters file on NameNode (change localhost to
SecondaryNameNode host) and set some properties in hadoop-site.xml on
SecondaryNameNode (fs.default.name, fs.checkpoint.dir,
fs.checkpoint.period etc.)

This was enough to start SecondaryNameNode when starting NameNode with
bin/start-dfs.sh , but it didn't create image on SecondaryNameNode.

Then

I found that I need to set dfs

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-29 Thread Tony Pitluga
David,

You can address this by using these two settings on your JobConf, e.g.:

conf.setOutputValueGroupingComparator(YourKeyComparator.class);
conf.setOutputKeyComparatorClass(YourKeyAndValueComparator.class);

Both classes should extend WritableComparator. The
OutputValueGroupingComparator is the one that will sort the order that the
keys are passed to your reduce function. The OutputKeyComparatorClass will
sort the order that the values are returned from your iterator in your
reduce function.
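
For illustration, a minimal self-contained sketch of what this could look like for an int/int case like David's (all class and method names below are made up for this example, not taken from Owen's HADOOP-4545 patch):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;

// Composite key carrying the original (key, value) pair of ints.
public class IntPair implements WritableComparable {
  private int left;
  private int right;

  public void set(int left, int right) { this.left = left; this.right = right; }
  public int getLeft() { return left; }
  public int getRight() { return right; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(left);
    out.writeInt(right);
  }

  public void readFields(DataInput in) throws IOException {
    left = in.readInt();
    right = in.readInt();
  }

  public int compareTo(Object o) {
    IntPair other = (IntPair) o;
    if (left != other.left) {
      return left < other.left ? -1 : 1;
    }
    return right < other.right ? -1 : (right == other.right ? 0 : 1);
  }

  // Sort comparator: orders by left then right, so each group's values
  // arrive at the reducer already sorted.
  public static class FullComparator extends WritableComparator {
    public FullComparator() { super(IntPair.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return ((IntPair) a).compareTo(b);
    }
  }

  // Grouping comparator: compares only left, so all (left, *) keys are
  // handed to a single reduce() call.
  public static class FirstComparator extends WritableComparator {
    public FirstComparator() { super(IntPair.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      int l1 = ((IntPair) a).getLeft();
      int l2 = ((IntPair) b).getLeft();
      return l1 < l2 ? -1 : (l1 == l2 ? 0 : 1);
    }
  }

  // JobConf wiring:
  public static void configure(JobConf conf) {
    conf.setOutputKeyComparatorClass(FullComparator.class);
    conf.setOutputValueGroupingComparator(FirstComparator.class);
  }
}

With more than one reduce, you would also want a Partitioner that partitions on the left value only, so that all records sharing a left value end up at the same reducer.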

Hope that helps,
Tony

On Wed, Oct 29, 2008 at 10:59 AM, David M. Coe <[EMAIL PROTECTED]>wrote:

> Would the input using this method be sorted before the reducer?  I have
> implemented this and only the keycomparatorclass is called.  This gives
> the effect that if I output the data here it is sorted.  However; it
> sorts comparing both the right and the left as you suggest so the
> reducer is given unique right-left instead of being given right that
> happen to be sorted using the left.
>
> What I get:
>
> text file ->
> map: -> 0 0 -> reducer
>0 1 -> reducer
>8 0 -> reducer
>8 1 -> reducer
>
> What I'd like:
>
> text file ->
> map: ***
> -> 0 0  \
> -> 0 1  | -> reducer
> -> 0 8  /
> ***
> -> 8 0  \ -> reducer
> -> 8 1  /
> ***
> -> 123 3  -> reducer
>
> What is the best way to do this?  The keys must be secondary sorted
> before the reduce, but I cannot think of a way to do this.
>
> Thank you.
>
>
>
> Owen O'Malley wrote:
> >
> > On Oct 28, 2008, at 7:53 AM, David M. Coe wrote:
> >
> >> My mapper is Mapper and my
> >> reducer is the identity.  I configure the program using:
> >>
> >> conf.setOutputKeyClass(IntWritable.class);
> >> conf.setOutputValueClass(IntWritable.class);
> >>
> >> conf.setMapperClass(MapClass.class);
> >> conf.setReducerClass(IdentityReducer.class);
> >>
> >> conf.setOutputKeyComparatorClass(IntWritable.Comparator.class);
> >> conf.setOutputValueGroupingComparator(IntWritable.Comparator.class);
> >
> > The problem is that your map needs to look like:
> >
> > class IntPair implements Writable {
> >   private int left;
> >   private int right;
> >   public void set(int left, int right) { ... }
> >   public int getLeft() {...}
> >   public int getRight() {...}
> > }
> >
> > your Mapper should be Mapper<LongWritable, Text, IntPair, IntWritable>
> > and should emit
> >
> > IntPair key = new IntPair();
> > IntWritable value = new IntWritable();
> > ...
> > key.set(keyValue, valueValue);
> > value.set(valueValue);
> > output.collect(key, value);
> >
> > Your sort comparator should compare both left and right in the pair.
> > The grouping comparator should only look at left in the pair.
> >
> > Your Reducer should be Reducer<IntPair, IntWritable, IntWritable, IntWritable>
> >
> > output.collect(key.getLeft(), value);
> >
> > Is that clearer?
> >
> > -- Owen
>
>


Re: SecondaryNameNode on separate machine

2008-10-29 Thread Tomislav Poljak
Hi,
I'm not clear on how the SecondaryNameNode communicates with the NameNode
(if deployed on a separate machine). Does the SecondaryNameNode use a direct
connection (over some port and protocol), or is it enough for the
SecondaryNameNode to have access to the data which the NameNode writes locally
to disk?

Tomislav

On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote:
> I think a lot of the confusion comes from this thread :
> http://www.nabble.com/NameNode-failover-procedure-td11711842.html
> 
> Particularly because the wiki was updated with wrong information, not
> maliciously I'm sure. This information is now gone for good.
> 
> Otis, your solution is pretty much like the one given by Dhruba Borthakur
> and augmented by Konstantin Shvachko later in the thread but I never did it
> myself.
> 
> One thing should be clear though, the NN is and will remain a SPOF (just
> like HBase's Master) as long as a distributed manager service (like
> Zookeeper) is not plugged into Hadoop to help with failover.
> 
> J-D
> 
> On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > So what is the "recipe" for avoiding NN SPOF using only what comes with
> > Hadoop?
> >
> > From what I can tell, I think one has to do the following two things:
> >
> > 1) configure primary NN to save namespace and xa logs to multiple dirs, one
> > of which is actually on a remotely mounted disk, so that the data actually
> > lives on a separate disk on a separate box.  This saves namespace and xa
> > logs on multiple boxes in case of primary NN hardware failure.
> >
> > 2) configure secondary NN to periodically merge fsimage+edits and create
> > the fsimage checkpoint.  This really is a second NN process running on
> > another box.  It sounds like this secondary NN has to somehow have access to
> > fsimage & edits files from the primary NN server.
> > http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> > does not describe the best practise around that - the recommended way to
> > give secondary NN access to primary NN's fsimage and edits files.  Should
> > one mount a disk from the primary NN box to the secondary NN box to get
> > access to those files?  Or is there a simpler way?
> > In any case, this checkpoint is just a merge of fsimage+edits files and
> > again is there in case the box with the primary NN dies.  That's what's
> > described on
> > http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> > more or less.
> >
> > Is this sufficient, or are there other things one has to do to eliminate NN
> > SPOF?
> >
> >
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > - Original Message 
> > > From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> > > To: core-user@hadoop.apache.org
> > > Sent: Tuesday, October 28, 2008 8:14:44 PM
> > > Subject: Re: SecondaryNameNode on separate machine
> > >
> > > Tomislav.
> > >
> > > Contrary to popular belief the secondary namenode does not provide
> > failover,
> > > it's only used to do what is described here :
> > >
> > http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> > >
> > > So the term "secondary" does not mean "a second one" but is more like "a
> > > second part of".
> > >
> > > J-D
> > >
> > > On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote:
> > >
> > > > Hi,
> > > > I'm trying to implement NameNode failover (or at least NameNode local
> > > > data backup), but it is hard since there is no official documentation.
> > > > Pages on this subject are created, but still empty:
> > > >
> > > > http://wiki.apache.org/hadoop/NameNodeFailover
> > > > http://wiki.apache.org/hadoop/SecondaryNameNode
> > > >
> > > > I have been browsing the web and hadoop mailing list to see how this
> > > > should be implemented, but I got even more confused. People are asking
> > > > do we even need SecondaryNameNode etc. (since NameNode can write local
> > > > data to multiple locations, so one of those locations can be a mounted
> > > > disk from other machine). I think I understand the motivation for
> > > > SecondaryNameNode (to create a snapshot of NameNode data every n
> > > > seconds/hours), but setting (deploying and running) SecondaryNameNode
> > on
> > > > different machine than NameNode is not as trivial as I expected. First
> > I
> > > > found that if I need to run SecondaryNameNode on other machine than
> > > > NameNode I should change masters file on NameNode (change localhost to
> > > > SecondaryNameNode host) and set some properties in hadoop-site.xml on
> > > > SecondaryNameNode (fs.default.name, fs.checkpoint.dir,
> > > > fs.checkpoint.period etc.)
> > > >
> > > > This was enough to start SecondaryNameNode when starting NameNode with
> > > > bin/start-dfs.sh , but it didn't create image on SecondaryNameNode.
> > Then
> > > > I found that I need to set dfs.http.address on NameNode address (so now
> > > > I have NameNode

Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
Could the version of Java being used matter? I just realized this cluster
runs IBM Java, not Sun:

java version "1.6.0"
Java(TM) SE Runtime Environment (build pxi3260sr2-20080818_01(SR2))
IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 Linux x86-32 jvmxi3260-20080816_22093
(JIT enabled, AOT enabled)
J9VM - 20080816_022093_lHdSMr
JIT  - r9_20080721_1330ifx2
GC   - 20080724_AA)
JCL  - 20080808_02

- Aaron

On Wed, Oct 29, 2008 at 2:43 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote:

> The system load and memory consumption on the JT are both very close to
> "idle" states -- it's not overworked, I don't think
>
> I may have an idea of the problem, though. Digging back up a ways into the
> JT logs, I see this:
>
> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 4 on 9001, call killJob(job_200810290855_0025) from 10.1.143.245:48253: 
> error: java.io.IOException: java.lang.NullPointerException
>
> java.io.IOException: java.lang.NullPointerException
>   at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>   at java.lang.reflect.Method.invoke(Method.java:599)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>
>
>
> This exception is then repeated for all the IPC server handlers. So I think
> the problem is that all the handler threads are dying one by one due to this
> NPE.
>
> This something I can fix myself, or is a patch available?
>
> - Aaron
>
>
> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
>> It's possible that the JobTracker is under duress and unable to respond to
>> the TaskTrackers... what do the JobTracker logs say?
>>
>> Arun
>>
>>
>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>
>>  Hi all,
>>>
>>> I'm working with a 30 node Hadoop cluster that has just started
>>> demonstrating some weird behavior. It's run without incident for a few
>>> weeks.. and now:
>>>
>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>> continually during this time. Then suddenly it will be the case that all
>>> 29
>>> TaskTrackers will get disconnected from the JobTracker. All the tracker
>>> daemon processes are still running on each machine; but the JobTracker
>>> will
>>> say "0 nodes available" on the web status screen. Restarting MapReduce
>>> fixes
>>> this for another 90--120 minutes.
>>>
>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>> but
>>> we're running on 0.18.1.
>>>
>>> I found this in a TaskTracker log:
>>>
>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker:
>>> Caught
>>> exception: java.io.IOException: Call failed on local exception
>>>   at java.lang.Throwable.(Throwable.java:67)
>>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>   at
>>>
>>> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>   at
>>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>> Caused by: java.io.IOException: Connection reset by peer
>>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>   at
>>>
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>   at
>>>
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>   at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>   at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>   at
>>>
>>> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>   at
>>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>
>>>
>>> As well as a few of these warnings:
>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>> THREADS
>>> ((40-40+0)<1) on [EMAIL PROTECTE

Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
The system load and memory consumption on the JT are both very close to
"idle" states -- it's not overworked, I don't think

I may have an idea of the problem, though. Digging back up a ways into the
JT logs, I see this:

2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 9001, call killJob(job_200810290855_0025) from
10.1.143.245:48253: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:599)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



This exception is then repeated for all the IPC server handlers. So I think
the problem is that all the handler threads are dying one by one due to this
NPE.

Is this something I can fix myself, or is a patch available?

- Aaron

On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> It's possible that the JobTracker is under duress and unable to respond to
> the TaskTrackers... what do the JobTracker logs say?
>
> Arun
>
>
> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>
>  Hi all,
>>
>> I'm working with a 30 node Hadoop cluster that has just started
>> demonstrating some weird behavior. It's run without incident for a few
>> weeks.. and now:
>>
>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>> continually during this time. Then suddenly it will be the case that all
>> 29
>> TaskTrackers will get disconnected from the JobTracker. All the tracker
>> daemon processes are still running on each machine; but the JobTracker
>> will
>> say "0 nodes available" on the web status screen. Restarting MapReduce
>> fixes
>> this for another 90--120 minutes.
>>
>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>> but
>> we're running on 0.18.1.
>>
>> I found this in a TaskTracker log:
>>
>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
>> exception: java.io.IOException: Call failed on local exception
>>   at java.lang.Throwable.(Throwable.java:67)
>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>   at
>>
>> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>   at
>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>> Caused by: java.io.IOException: Connection reset by peer
>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>   at
>>
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>   at
>>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>   at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>   at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>   at
>>
>> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>   at
>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>
>>
>> As well as a few of these warnings:
>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>> THREADS
>> ((40-40+0)<1) on [EMAIL PROTECTED]:50060
>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>> THREADS: [EMAIL PROTECTED]:50060
>>
>>
>>
>> The NameNode and DataNodes are completely fine. Can't be a DNS issue,
>> because all DNS is served through /etc/hosts files. NameNode and
>> JobTracker
>> are on the same machine.
>>
>> Any help is appreciated
>> Thanks
>> - Aaron Kimball
>>
>
>


Re: TaskTrackers disengaging from JobTracker

2008-10-29 Thread Arun C Murthy
It's possible that the JobTracker is under duress and unable to  
respond to the TaskTrackers... what do the JobTracker logs say?


Arun

On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:


Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case that  
all 29
TaskTrackers will get disconnected from the JobTracker. All the  
tracker
daemon processes are still running on each machine; but the  
JobTracker will
say "0 nodes available" on the web status screen. Restarting  
MapReduce fixes

this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763 
, but

we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker:  
Caught

exception: java.io.IOException: Call failed on local exception
   at java.lang.Throwable.(Throwable.java:67)
   at org.apache.hadoop.ipc.Client.call(Client.java:718)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
   at sun.nio.ch.FileDispatcher.read0(Native Method)
   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
   at
org.apache.hadoop.net.SocketInputStream 
$Reader.performIO(SocketInputStream.java:55)

   at
org 
.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java: 
140)

   at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java: 
150)

   at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java: 
123)

   at java.io.FilterInputStream.read(FilterInputStream.java:127)
   at
org.apache.hadoop.ipc.Client$Connection 
$PingInputStream.read(Client.java:272)

   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
   at java.io.DataInputStream.readInt(DataInputStream.java:381)
   at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java: 
499)

   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON  
THREADS

((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
THREADS: [EMAIL PROTECTED]:50060



The NameNode and DataNodes are completely fine. Can't be a DNS issue,
because all DNS is served through /etc/hosts files. NameNode and  
JobTracker

are on the same machine.

Any help is appreciated
Thanks
- Aaron Kimball




TaskTrackers disengaging from JobTracker

2008-10-29 Thread Aaron Kimball
Hi all,

I'm working with a 30 node Hadoop cluster that has just started
demonstrating some weird behavior. It's run without incident for a few
weeks.. and now:

The cluster will run smoothly for 90--120 minutes or so, handling jobs
continually during this time. Then suddenly it will be the case that all 29
TaskTrackers will get disconnected from the JobTracker. All the tracker
daemon processes are still running on each machine; but the JobTracker will
say "0 nodes available" on the web status screen. Restarting MapReduce fixes
this for another 90--120 minutes.

This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763, but
we're running on 0.18.1.

I found this in a TaskTracker log:

2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call failed on local exception
at java.lang.Throwable.<init>(Throwable.java:67)
at org.apache.hadoop.ipc.Client.call(Client.java:718)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
at sun.nio.ch.IOUtil.read(IOUtil.java:207)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
at java.io.FilterInputStream.read(FilterInputStream.java:127)
at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
at java.io.DataInputStream.readInt(DataInputStream.java:381)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)


As well as a few of these warnings:
2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on [EMAIL PROTECTED]:50060
2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF THREADS: [EMAIL PROTECTED]:50060



The NameNode and DataNodes are completely fine. Can't be a DNS issue,
because all DNS is served through /etc/hosts files. NameNode and JobTracker
are on the same machine.

Any help is appreciated
Thanks
- Aaron Kimball


Re: nagios to monitor hadoop datanodes!

2008-10-29 Thread Edward Capriolo
All I have to say is wow! I never tried jconsole before. I have
hadoop_trunk checked out and the JMX interface exposes all kinds of great
information. I am going to look at how I can get JMX, Cacti, and Hadoop
working together.

Just as an FYI, there are now separate env variables for each daemon. If you
override HADOOP_OPTS itself you get a port conflict. It should be like this:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=10001"

Thanks Brian.
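
For anyone who wants to poll these values outside of jconsole, here is a minimal client sketch in plain Java (my own illustration: the host name and port are placeholders matching the -Dcom.sun.management.jmxremote.port setting above, and the exact MBean domain names vary between Hadoop versions):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: connect to a Hadoop daemon's remote JMX port and dump the names of
// all registered MBeans. The Hadoop beans show up alongside the standard JVM
// ones; filter the output for the daemon you care about.
public class HadoopJmxDump {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "namenode.example.com"; // placeholder
        String port = args.length > 1 ? args[1] : "10001";                // matches jmxremote.port
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            Set names = mbsc.queryNames(null, null); // all MBeans, no filter
            for (Object name : names) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}

A Nagios or Cacti check would do essentially the same thing, but read a single attribute and compare it against a threshold instead of printing everything.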


Re: Datanode not detecting full disk

2008-10-29 Thread Raghu Angadi

Stefan Will wrote:

Hi Jeff,

Yeah, it looks like I'm running into the issues described in the bug. I'm
running 0.18.1 on CentOS 5 by the way. Measuring available disk space
appears to be harder than I thought ... and here I was under the impression
the percentage in df was a pretty clear indicator of how full the disk is
;-)

How did you guys solve/work around this ?


How many partitions do you have? If it is just one and the NameNode thinks 
it has space even though 'available' in df shows very little or no space, then 
you need to file a jira. There should be no case where the DN reports more 
space than what the 'available' field in 'df' shows.


But if you have more partitions and only some of them are full, then it 
is a different issue, which should still be fixed.


Raghu.


-- Stefan

 

From: Jeff Hammerbacher <[EMAIL PROTECTED]>
Reply-To: 
Date: Mon, 27 Oct 2008 12:40:08 -0700
To: 
Subject: Re: Datanode not detecting full disk

Hey Stefan,

We used to have trouble with this issue at Facebook. What version are
you running? You might get more information on this ticket:
https://issues.apache.org/jira/browse/HADOOP-2991.

Regards,
Jeff

On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote:

Each of my datanodes has  a system and a data partition, with dfs.data.dir
pointed to the data partition. The data partition just filled up to 100% on
all of my nodes (as evident via df), but the NameNode web ui still shows
them only 88-94% full (interestingly, the numbers differ even though the
machines are configured identically). I thought the datanodes used df to
determine free space ? How is the storage utilization determined ?

-- Stefan








Re: "Could not obtain block" error

2008-10-29 Thread Raghu Angadi


If you have only one copy of the block and it is mostly corrupted, the Namenode 
itself cannot correct it. Of course, the DFSClient should not print the error 
in an infinite loop.


I think there was an old bug where the crc file got overwritten by a 0-length 
file.


One workaround for you is to go to the datanode and remove the .crc 
file for this block (find /datanodedir -name blk_5994030096182059653\*). 
Be careful not to remove the block file itself.


Longer term fix: upgrade to a more recent version.

Raghu.

murali krishna wrote:

Hi,
When I try to read one of the file from dfs, I get the following error in an 
infinite loop (using 0.15.3)

“08/10/28 23:43:15 INFO fs.DFSClient: Could not obtain block 
blk_5994030096182059653 from any node:  java.io.IOException: No live nodes 
contain current block”

Fsck showed that the file is HEALTHY but under-replicated (1 instead of the 
configured 2). I checked the datanode log where the only replica exists for 
that block, and I can see repeated errors while serving that block.
2008-10-22 23:55:39,378 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer 
blk_5994030096182059653 to 68.142.212.228:50010 got java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:298)
at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1061)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1446)
at java.lang.Thread.run(Thread.java:619)

Any idea what is going on and how can I fix this ?

Thanks,
Murali
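
For anyone applying the workaround above, here is a small plain-Java sketch (the directory and block id are just the values from this thread; adjust them for your datanode) that lists every file belonging to the block, so you can double-check which one is the checksum/meta file before deleting anything:

import java.io.File;

// Sketch: the Java equivalent of
//   find /datanodedir -name blk_5994030096182059653\*
// It flags .crc/.meta files separately from the block file itself, which must
// not be removed.
public class FindBlockFiles {
    public static void main(String[] args) {
        File dataDir = new File(args.length > 0 ? args[0] : "/datanodedir");
        String blockId = args.length > 1 ? args[1] : "blk_5994030096182059653";
        walk(dataDir, blockId);
    }

    private static void walk(File dir, String blockId) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return;                      // not a directory, or unreadable
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                walk(f, blockId);
            } else if (f.getName().contains(blockId)) {
                boolean checksum = f.getName().endsWith(".crc") || f.getName().endsWith(".meta");
                System.out.println((checksum ? "checksum/meta: " : "block (keep!): ")
                    + f.getAbsolutePath());
            }
        }
    }
}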





RE: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-29 Thread Peeyush Bishnoi
Hello Zhengguo ,

Yes , -libjars is the new feature in Hadoop. This feature has been available 
from Hadoop-0.17.x , but it is more stable from hadoop 0.18.x 

example to use -libjars...

hadoop jar -libjars  ...


Thanks ,

---
Peeyush


-Original Message-
From: Zhengguo 'Mike' SUN [mailto:[EMAIL PROTECTED]
Sent: Wed 10/29/2008 9:22 AM
To: core-user@hadoop.apache.org
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?
 
Hi, Amareshwari,

Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I 
see is -jar, which didn't work for me. And besides passing them as jar file, is 
there any other ways to do that?

Thanks
Mike



From: Amareshwari Sriramadasu <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, October 28, 2008 11:58:33 PM
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi,

How are you passing your classes to the pipes job? If you are passing 
them as a jar file, you can use -libjars option. From branch 0.19, the 
libjar files are added to the client classpath also.

Thanks
Amareshwari
Zhengguo 'Mike' SUN wrote:
> Hi,
>
> I implemented customized classes for InputFormat, InputSplit and RecordReader 
> in Java and was trying to use them in a C++ pipes job. The customized 
> InputFormat class could be included using the -inputformat option, but it 
> threw ClassNotFoundException for my customized InputSplit class. It seemed 
> the classpath has not been correctly set. Is there any way that let me 
> include my customized classes in a pipes job?
>
>
>
>  
>  


  



Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-29 Thread Zhengguo 'Mike' SUN
Hi, Amareshwari,

Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I 
see is -jar, which didn't work for me. And besides passing them as jar file, is 
there any other ways to do that?

Thanks
Mike



From: Amareshwari Sriramadasu <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, October 28, 2008 11:58:33 PM
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi,

How are you passing your classes to the pipes job? If you are passing 
them as a jar file, you can use -libjars option. From branch 0.19, the 
libjar files are added to the client classpath also.

Thanks
Amareshwari
Zhengguo 'Mike' SUN wrote:
> Hi,
>
> I implemented customized classes for InputFormat, InputSplit and RecordReader 
> in Java and was trying to use them in a C++ pipes job. The customized 
> InputFormat class could be included using the -inputformat option, but it 
> threw ClassNotFoundException for my customized InputSplit class. It seemed 
> the classpath has not been correctly set. Is there any way that let me 
> include my customized classes in a pipes job?
>
>
>
>  
>  


  

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-29 Thread David M. Coe
Would the input using this method be sorted before the reducer?  I have
implemented this and only the key comparator class is called.  This gives
the effect that if I output the data here it is sorted.  However, it
sorts comparing both the right and the left as you suggest, so the
reducer is given each unique right-left pair as its own group instead of being
given one group per key whose values happen to be sorted using the left.

What I get:

text file ->
map: -> 0 0 -> reducer
0 1 -> reducer
8 0 -> reducer
8 1 -> reducer

What I'd like:

text file ->
map: ***
 -> 0 0  \
 -> 0 1  | -> reducer
 -> 0 8  /
 ***
 -> 8 0  \ -> reducer
 -> 8 1  /
 ***
 -> 123 3  -> reducer

What is the best way to do this?  The keys must be secondary sorted
before the reduce, but I cannot think of a way to do this.

Thank you.



Owen O'Malley wrote:
> 
> On Oct 28, 2008, at 7:53 AM, David M. Coe wrote:
> 
>> My mapper is Mapper and my
>> reducer is the identity.  I configure the program using:
>>
>> conf.setOutputKeyClass(IntWritable.class);
>> conf.setOutputValueClass(IntWritable.class);
>>
>> conf.setMapperClass(MapClass.class);
>> conf.setReducerClass(IdentityReducer.class);
>>
>> conf.setOutputKeyComparatorClass(IntWritable.Comparator.class);
>> conf.setOutputValueGroupingComparator(IntWritable.Comparator.class);
> 
> The problem is that your map needs to look like:
> 
> class IntPair implements Writable {
>   private int left;
>   private int right;
>   public void set(int left, int right) { ... }
>   public int getLeft() {...}
>   public int getRight() {...}
> }
> 
> your Mapper should be Mapper<..., IntPair, IntWritable>
> and should emit
> 
> IntPair key = new IntPair();
> IntWritable value = new IntWritable();
> ...
> key.set(keyValue, valueValue);
> value.set(valueValue);
> output.collect(key, value);
> 
> Your sort comparator should compare both left and right in the pair.
> The grouping comparator should only look at left in the pair.
> 
> Your Reducer should be Reducer<IntPair, IntWritable, IntWritable, IntWritable>
> 
> output.collect(key.getLeft(), value);
> 
> Is that clearer?
> 
> -- Owen
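
For reference, here is a self-contained sketch of the pattern Owen describes; the class and comparator names are illustrative rather than taken from his patch, and the job would wire them in with something like conf.setMapOutputKeyClass(IntPair.class), conf.setOutputKeyComparatorClass(IntPair.SortComparator.class) and conf.setOutputValueGroupingComparator(IntPair.GroupComparator.class):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Composite key for a secondary sort: the sort comparator orders on
// (left, right) while the grouping comparator looks only at left, so each
// reduce call sees one left value with its values arriving in right order.
public class IntPair implements WritableComparable {
    private int left;
    private int right;

    public void set(int left, int right) { this.left = left; this.right = right; }
    public int getLeft() { return left; }
    public int getRight() { return right; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(left);
        out.writeInt(right);
    }

    public void readFields(DataInput in) throws IOException {
        left = in.readInt();
        right = in.readInt();
    }

    public int compareTo(Object o) {
        IntPair other = (IntPair) o;
        if (left != other.left) {
            return left < other.left ? -1 : 1;
        }
        return right < other.right ? -1 : (right == other.right ? 0 : 1);
    }

    public int hashCode() { return left * 157 + right; }

    public boolean equals(Object o) {
        if (!(o instanceof IntPair)) return false;
        IntPair other = (IntPair) o;
        return left == other.left && right == other.right;
    }

    /** Sorts on both ints; used as the output key comparator. */
    public static class SortComparator extends WritableComparator {
        public SortComparator() { super(IntPair.class); }
        public int compare(WritableComparable a, WritableComparable b) {
            return a.compareTo(b);
        }
    }

    /** Groups on left only; used as the value grouping comparator. */
    public static class GroupComparator extends WritableComparator {
        public GroupComparator() { super(IntPair.class); }
        public int compare(WritableComparable a, WritableComparable b) {
            int l1 = ((IntPair) a).getLeft();
            int l2 = ((IntPair) b).getLeft();
            return l1 < l2 ? -1 : (l1 == l2 ? 0 : 1);
        }
    }
}

In the reducer, key.getLeft() would be wrapped in an IntWritable before being passed to output.collect, as in Owen's example above.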



Re: How does an offline Datanode come back up ?

2008-10-29 Thread Steve Loughran

Norbert Burger wrote:

Along these lines, I'm curious what "management tools" folks are using to
ensure cluster availability (i.e., auto-restart failed datanodes/namenodes).

Are you using a custom cron script, or maybe something more complex
(Ganglia, Nagios, puppet, etc.)?



We use SmartFrog, http://smartfrog.org/ , to do this kind of thing, not 
just because it comes from our organisation, but because it gives us the 
ability to manage other parts of the system at the same time.


To get SF deploying Hadoop in a way I'm happy with, I have had to make a 
fair few changes to the lifecycle of the "services" -things like 
namenode, datanode, jobtracker and task tracker. Most of the changes are 
in HADOOP-3628, though I need
to push through another iteration of this [1]. Even with the changes I'm 
worried about race conditions and shutdown, as the existing code assumes 
that every node starts in its own process -which is what I recommend for 
production. We gave a talk on this topic in august at the Hadoop UK 
event [2]


None of this stuff is in a public release yet, but I may cut one next 
week which includes an unsupported 0.20-alpha-patched version of Hadoop 
in an RPM. This RPM can be pushed out to the machines through your RPM 
publish mechanism of choice; when the SmartFrog daemon comes up it 
deploys whatever it has been told to, or it announces to the world it is 
unpurposed and gets told to deploy whatever someone it trusts talks to.


Failure handling is still interesting. With a language like SmartFrog you
can declare how failures can be handled; we have various workflowy 
containers to do things like

 -retry and restart
 -kill and report upwards (default)
 -roll back the whole deployment and restart
For things like task trackers and such like, such loss is best handled 
by killing and restarting. But the filesystem is much more temperamental 
-and it is FS and HDD failures that create the most stress in any 
project. That and the accidental deletions of the entire dataset. A node 
in the cluster that is only a tasktracker is disposable: any problems 
you may as well flip the power switch and have the PXE reboot bring it 
back to a blank state. Datanode failures, though, that's an issue. If 
the data on the node is replicated in >1 place, I'd decomission the node 
and do the same thing. If the data isn't adequately replicated yet, you 
want to get the stuff off it first. And if you think its a physical HDD 
problem, time to stop using that particular disk.


I think everyone is still learning the main failure modes of a cluster, 
and still deciding how to react.


[1] https://issues.apache.org/jira/browse/HADOOP-3628
[2] 
http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf




Thanks,
Norbert

On 10/28/08, Steve Loughran <[EMAIL PROTECTED]> wrote:

wmitchell wrote:


Hi All,

I've been working through Michael Noll's multi-node cluster setup example
(Running_Hadoop_On_Ubuntu_Linux) for hadoop and I have a working setup. I
then, on my slave machine -- which is currently running a datanode -- killed
the process in an effort to simulate some sort of failure on the slave
machine's datanode. I had assumed that the namenode would have been polling
its datanodes and thus attempted to bring up any node that goes down. On
looking at my slave machine it seems that the datanode process is still down
(I've checked jps).



That's up to you or your management tools. The namenode knows that the
datanode is unreachable, but doesn't know how to go about reconnecting it to
the network. Which, given there are many causes of "down", sort of makes
sense. The switch failing, the hdds dying or the process crashing, all look
the same: no datanode heartbeats.






--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1

2008-10-29 Thread Grant Ingersoll

We'll try it out...

On Oct 28, 2008, at 3:00 PM, Arun C Murthy wrote:



On Oct 27, 2008, at 7:05 PM, Grant Ingersoll wrote:


Hi,

Over in Mahout (lucene.a.o/mahout), we are seeing an oddity with  
some of our clustering code and Hadoop 0.18.1.  The thread in  
context is at:  http://mahout.markmail.org/message/vcyvlz2met7fnthr


The problem seems to occur when going from 0.17.2 to 0.18.1.  In  
the user logs, we are seeing the following exception:
2008-10-27 21:18:37,014 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 5011 bytes
2008-10-27 21:18:37,033 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810272112_0011_r_00_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)

Caused by: java.lang.NumberFormatException: For input string: "["


If you are sure that this isn't caused by your application-logic,  
you could try running with http://issues.apache.org/jira/browse/HADOOP-4277.


That bug caused many a ship to sail in large circles, hopelessly.

Arun



  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
  at java.lang.Double.parseDouble(Double.java:510)
  at org.apache.mahout.matrix.DenseVector.decodeFormat(DenseVector.java:60)
  at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:256)
  at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:38)
  at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:31)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2174)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2134)


And in the main output log (from running bin/hadoop jar mahout/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job) we see:
08/10/27 21:18:41 INFO mapred.JobClient: Task Id : attempt_200810272112_0011_r_00_0, Status : FAILED
java.io.IOException: attempt_200810272112_0011_r_00_0The reduce copier failed

  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


If I run this exact same job on 0.17.2 it all runs fine.  I suppose  
either a bug was introduced in 0.18.1 or a bug was fixed that we  
were relying on.  Looking at the release notes between the fixes,  
nothing in particular struck me as related.  If it helps, I can  
provide the instructions for how to run the example in question  
(they need to be written up anyway!)



I see some related things at http://hadoop.markmail.org/search/?q=Merge+of+the+inmemory+files+threw+an+exception, but those are older, it seems, so not sure what to make of them.


Thanks,
Grant




--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: SecondaryNameNode on separate machine

2008-10-29 Thread Jean-Daniel Cryans
I think a lot of the confusion comes from this thread :
http://www.nabble.com/NameNode-failover-procedure-td11711842.html

Particularly because the wiki was updated with wrong information, not
maliciously I'm sure. This information is now gone for good.

Otis, your solution is pretty much like the one given by Dhruba Borthakur
and augmented by Konstantin Shvachko later in the thread but I never did it
myself.

One thing should be clear though, the NN is and will remain a SPOF (just
like HBase's Master) as long as a distributed manager service (like
Zookeeper) is not plugged into Hadoop to help with failover.

J-D

On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> Hi,
> So what is the "recipe" for avoiding NN SPOF using only what comes with
> Hadoop?
>
> From what I can tell, I think one has to do the following two things:
>
> 1) configure primary NN to save namespace and xa logs to multiple dirs, one
> of which is actually on a remotely mounted disk, so that the data actually
> lives on a separate disk on a separate box.  This saves namespace and xa
> logs on multiple boxes in case of primary NN hardware failure.
>
> 2) configure secondary NN to periodically merge fsimage+edits and create
> the fsimage checkpoint.  This really is a second NN process running on
> another box.  It sounds like this secondary NN has to somehow have access to
> fsimage & edits files from the primary NN server.
> http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> does not describe the best practise around that - the recommended way to
> give secondary NN access to primary NN's fsimage and edits files.  Should
> one mount a disk from the primary NN box to the secondary NN box to get
> access to those files?  Or is there a simpler way?
> In any case, this checkpoint is just a merge of fsimage+edits files and
> again is there in case the box with the primary NN dies.  That's what's
> described on
> http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> more or less.
>
> Is this sufficient, or are there other things one has to do to eliminate NN
> SPOF?
>
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> > To: core-user@hadoop.apache.org
> > Sent: Tuesday, October 28, 2008 8:14:44 PM
> > Subject: Re: SecondaryNameNode on separate machine
> >
> > Tomislav.
> >
> > Contrary to popular belief the secondary namenode does not provide
> failover,
> > it's only used to do what is described here :
> >
> http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> >
> > So the term "secondary" does not mean "a second one" but is more like "a
> > second part of".
> >
> > J-D
> >
> > On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote:
> >
> > > Hi,
> > > I'm trying to implement NameNode failover (or at least NameNode local
> > > data backup), but it is hard since there is no official documentation.
> > > Pages on this subject are created, but still empty:
> > >
> > > http://wiki.apache.org/hadoop/NameNodeFailover
> > > http://wiki.apache.org/hadoop/SecondaryNameNode
> > >
> > > I have been browsing the web and hadoop mailing list to see how this
> > > should be implemented, but I got even more confused. People are asking
> > > do we even need SecondaryNameNode etc. (since NameNode can write local
> > > data to multiple locations, so one of those locations can be a mounted
> > > disk from other machine). I think I understand the motivation for
> > > SecondaryNameNode (to create a snapshot of NameNode data every n
> > > seconds/hours), but setting (deploying and running) SecondaryNameNode
> on
> > > different machine than NameNode is not as trivial as I expected. First
> I
> > > found that if I need to run SecondaryNameNode on other machine than
> > > NameNode I should change masters file on NameNode (change localhost to
> > > SecondaryNameNode host) and set some properties in hadoop-site.xml on
> > > SecondaryNameNode (fs.default.name, fs.checkpoint.dir,
> > > fs.checkpoint.period etc.)
> > >
> > > This was enough to start SecondaryNameNode when starting NameNode with
> > > bin/start-dfs.sh , but it didn't create image on SecondaryNameNode.
> Then
> > > I found that I need to set dfs.http.address on NameNode address (so now
> > > I have NameNode address in both fs.default.name and dfs.http.address).
> > >
> > > Now I get following exception:
> > >
> > > 2008-10-28 09:18:00,098 ERROR NameNode.Secondary - Exception in
> > > doCheckpoint:
> > > 2008-10-28 09:18:00,098 ERROR NameNode.Secondary -
> > > java.net.SocketException: Unexpected end of file from server
> > >
> > > My questions are following:
> > > How to resolve this problem (this exception)?
> > > Do I need additional property in SecondaryNameNode's hadoop-site.xml or
> > > NameNode's hadoop-site.xml?
> > >
> > > How should NameNode failover work i
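
To keep the settings mentioned in this thread in one place, here is a small sketch using org.apache.hadoop.conf.Configuration (host names and paths are placeholders; in practice these values would normally live in hadoop-site.xml on the secondary machine rather than be set in code):

import org.apache.hadoop.conf.Configuration;

// Sketch of the properties discussed above for running the SecondaryNameNode
// on a separate box. All values are placeholders.
public class SecondaryNameNodeSettings {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // The primary NameNode's filesystem URI.
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
        // The primary NameNode's HTTP address; the secondary fetches fsimage
        // and edits from it over HTTP.
        conf.set("dfs.http.address", "namenode.example.com:50070");
        // Local directory on the secondary machine for checkpoint data.
        conf.set("fs.checkpoint.dir", "/hadoop/dfs/namesecondary");
        // How often to checkpoint, in seconds.
        conf.set("fs.checkpoint.period", "3600");
        return conf;
    }
}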

Re: How does an offline Datanode come back up ?

2008-10-29 Thread Edward Capriolo
Someone on the list is looking at monitoring hadoop features with
Nagios. Nagios can be configured with an event_handler. In the past I
have written event handlers to do operations like this: if a check reports
the service down, use an SSH key and restart it.

However, since you already have an SSH key on your master node, you should
be able to have a centralized node restarter running from the master
cron. Maybe an interesting argument to run a separate nagios as your
hadoop user!

In any case you can also run a cronjob on each slave as suggested above.

The thing about all systems like this is you have to remember to shut
them down when you actually want the service down for maintenance etc.

We run Nagios and Cacti, so I would like to develop check scripts for
these services. I am going to get an SVN repo together; if anyone is
interested in contributing, let me know.
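
As a rough illustration of the centralized restarter idea above, here is a sketch that could be run from cron on the master (my own example: the port, install path, and the passwordless-ssh assumption are not from this thread and would need adjusting for a real cluster):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: probe each slave's DataNode data-transfer port (50010 by default)
// and restart the daemon over ssh if the port does not answer. Assumes ssh
// keys are set up and Hadoop is installed under /usr/local/hadoop.
public class DataNodeWatchdog {
    private static final int DATANODE_PORT = 50010;
    private static final String RESTART_CMD =
        "/usr/local/hadoop/bin/hadoop-daemon.sh start datanode";

    public static void main(String[] args) throws Exception {
        for (String host : args) {        // pass slave host names as arguments
            if (!isListening(host, DATANODE_PORT)) {
                System.out.println(host + ": datanode port closed, restarting");
                new ProcessBuilder("ssh", host, RESTART_CMD).start().waitFor();
            }
        }
    }

    private static boolean isListening(String host, int port) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), 5000);
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            try { socket.close(); } catch (IOException ignored) {}
        }
    }
}

As noted above, remember to disable the cron entry when you actually want the daemon down for maintenance.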


Re: Ideal number of mappers and reducers; any physical limits?

2008-10-29 Thread Edward J. Yoon
> I doubt that it is stored as an explicit matrix.  Each page would probably
> have a big table (or file) entry and would have a list of links including
> link text.

Oh.. Probably, and some random walk on the link graph.

On Wed, Oct 29, 2008 at 2:12 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> On Tue, Oct 28, 2008 at 5:15 PM, Edward J. Yoon <[EMAIL PROTECTED]>wrote:
>
>> ...
>> In single machine, as far as we
>> know graph can be stored to linked list or matrix.
>>
>
> Since the matrix is normally very sparse for large graphs, these two
> approaches are pretty similar.
>
>
>> ... So, I guess google's web graph will be stored as a matrix in a
>> bigTable.
>>
>
> I doubt that it is stored as an explicit matrix.  Each page would probably
> have a big table (or file) entry and would have a list of links including
> link text.
>
>
> Have you seen my 2D block algorithm post?? --
>> http://blog.udanax.org/2008/10/parallel-matrix-multiply-on-hadoop.html
>>
>
> I have now.  Block decomposition for multiplies almost always applies only
> to dense matrix operations.  For most sparse matrix representations
> extracting a block is only efficient if it is full width or height.  For
> very sparse matrix operations, the savings due to reuse of intermediate
> results are completely dominated by the I/O cost so block decompositions are
> much less helpful.
>
> In many cases, it isn't even very helpful to send around entire rows and
> sending individual elements is about as efficient.
>
> FYI, Hama (http://incubator.apache.org/hama/) will be handled graph
>> algorithms since it is a related with adjacency matrix and topological
>> algebra. And I think 2000 node hadoop/hbase cluster is big enough if a
>> sequential/random read/write speed will be improved 800%. :-)
>>
>
> I think that a 5 node cluster is big enough without any improvement in
> read/write speed.
>
> Of course, it depends on the size of the problem.  I was only working with a
> matrix with a few tens of billions of non-zero values.
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
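
As a concrete picture of the sparsity point above, here is a small sketch (my own illustration, not from the thread) of a row stored as parallel index/value arrays; a row-times-vector product touches only the non-zero entries, so for very sparse data the arithmetic that a block decomposition would let you reuse is tiny compared with the cost of moving the entries around:

import java.util.Arrays;

// A sparse row kept as sorted column indices plus their values. The work in
// dot() is proportional to the number of non-zeros, not the row length.
public class SparseRow {
    private final int[] indices;
    private final double[] values;

    public SparseRow(int[] indices, double[] values) {
        this.indices = indices.clone();
        this.values = values.clone();
    }

    public double dot(double[] dense) {
        double sum = 0.0;
        for (int i = 0; i < indices.length; i++) {
            sum += values[i] * dense[indices[i]];
        }
        return sum;
    }

    public int nonZeros() { return indices.length; }

    public static void main(String[] args) {
        // A row of a 1,000,000-column link matrix with only three non-zeros.
        SparseRow row = new SparseRow(new int[] {3, 70, 999999},
                                      new double[] {0.5, 1.0, 0.25});
        double[] dense = new double[1000000];
        Arrays.fill(dense, 1.0);
        System.out.println(row.nonZeros() + " non-zeros, dot = " + row.dot(dense));
    }
}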


Re: Datanode not detecting full disk

2008-10-29 Thread Jeff Hammerbacher
Hey Stefan,

It's always fun when seemingly trivial problems turn out to be
nontrivial. As for the solution: if I recall correctly (someone from
Facebook please hop in here), we just jacked up
dfs.datanode.du.reserved to a sizable amount, like 2 GB or something.

Regards,
Jeff

On Tue, Oct 28, 2008 at 11:31 PM, Stefan Will <[EMAIL PROTECTED]> wrote:
> Hi Jeff,
>
> Yeah, it looks like I'm running into the issues described in the bug. I'm
> running 0.18.1 on CentOS 5 by the way. Measuring available disk space
> appears to be harder than I thought ... and here I was under the impression
> the percentage in df was a pretty clear indicator of how full the disk is
> ;-)
>
> How did you guys solve/work around this ?
>
> -- Stefan
>
>
>> From: Jeff Hammerbacher <[EMAIL PROTECTED]>
>> Reply-To: 
>> Date: Mon, 27 Oct 2008 12:40:08 -0700
>> To: 
>> Subject: Re: Datanode not detecting full disk
>>
>> Hey Stefan,
>>
>> We used to have trouble with this issue at Facebook. What version are
>> you running? You might get more information on this ticket:
>> https://issues.apache.org/jira/browse/HADOOP-2991.
>>
>> Regards,
>> Jeff
>>
>> On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote:
>>> Each of my datanodes has  a system and a data partition, with dfs.data.dir
>>> pointed to the data partition. The data partition just filled up to 100% on
>>> all of my nodes (as evident via df), but the NameNode web ui still shows
>>> them only 88-94% full (interestingly, the numbers differ even though the
>>> machines are configured identically). I thought the datanodes used df to
>>> determine free space ? How is the storage utilization determined ?
>>>
>>> -- Stefan
>>>
>
>
>
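
For what it's worth, a tiny plain-Java sketch of the arithmetic behind that workaround (the 2 GB figure is Jeff's example; the exact formula inside the datanode differs between versions, so treat this only as an illustration):

import java.io.File;

// Sketch: how a reserved-space cushion keeps the last couple of GB of a
// partition from being reported as usable. Requires Java 6 for getUsableSpace().
public class ReservedSpaceCheck {
    public static void main(String[] args) {
        File dataDir = new File(args.length > 0 ? args[0] : "/data/hadoop/dfs");
        long reserved = 2L * 1024 * 1024 * 1024;        // e.g. dfs.datanode.du.reserved = 2 GB

        long usable = dataDir.getUsableSpace();          // roughly df's "available" column
        long remaining = Math.max(0L, usable - reserved);

        System.out.println("df available : " + usable + " bytes");
        System.out.println("reserved      : " + reserved + " bytes");
        System.out.println("reported free : " + remaining + " bytes");
    }
}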


"Could not obtain block" error

2008-10-29 Thread murali krishna
Hi,
When I try to read one of the file from dfs, I get the following error in an 
infinite loop (using 0.15.3)

“08/10/28 23:43:15 INFO fs.DFSClient: Could not obtain block 
blk_5994030096182059653 from any node:  java.io.IOException: No live nodes 
contain current block”

Fsck showed that the file is HEALTHY but under-replicated (1 instead of the 
configured 2). I checked the datanode log where the only replica exists for 
that block, and I can see repeated errors while serving that block.
2008-10-22 23:55:39,378 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer 
blk_5994030096182059653 to 68.142.212.228:50010 got java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:298)
at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1061)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1446)
at java.lang.Thread.run(Thread.java:619)

Any idea what is going on and how can I fix this ?

Thanks,
Murali