Re: Socket closed Exception

2009-04-01 Thread lohit

Thanks Koji, Raghu.
This seemed to solve our problem; we haven't seen it happen in the past two days.
What is the typical value of ipc.client.idlethreshold on big clusters?
Does the default value of 4000 suffice?
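For anyone tuning this later, an override in hadoop-site.xml would look
like the sketch below (the 8000 is only a placeholder, not a
recommendation):

  <property>
    <name>ipc.client.idlethreshold</name>
    <value>8000</value>
  </property>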

Lohit



----- Original Message -----
From: Koji Noguchi 
To: core-user@hadoop.apache.org
Sent: Monday, March 30, 2009 9:30:04 AM
Subject: RE: Socket closed Exception

Lohit, 

You're right. We saw "java.net.SocketTimeoutException: timed out
waiting for rpc response" and not a "Socket closed" exception.

If you're getting the "closed" exception, then I don't remember seeing
that problem on our clusters.

Our users often report a "Socket closed" exception as a problem, but in
most cases those failures are due to jobs failing for completely
different reasons, combined with a race condition between 1) the
JobTracker removing the directory/killing tasks and 2) the tasks failing
with a closed exception before they get killed.

Koji



-----Original Message-----
From: lohit [mailto:lohit...@yahoo.com] 
Sent: Monday, March 30, 2009 8:51 AM
To: core-user@hadoop.apache.org
Subject: Re: Socket closed Exception


Thanks Koji. 
If I look at the code, the NameNode (RPC server) seems to tear down
idle connections. Did you see a "Socket closed" exception instead of
"timed out waiting for socket"?
We seem to hit the "Socket closed" exception where clients do not time
out, but get back a socket closed exception when they make RPC calls
for create/open/getFileInfo.
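
As a stopgap we may wrap these calls in a small retry. A rough sketch
(the retry count and backoff are made up, and catching SocketException
at this level is an assumption about how the error surfaces):

  import java.io.IOException;
  import java.net.SocketException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RetryOpen {
    // Retry an open() a few times when the RPC fails with
    // "Socket closed"; the numbers are illustrative, not tuned.
    public static FSDataInputStream openWithRetry(FileSystem fs, Path path)
        throws IOException, InterruptedException {
      final int maxAttempts = 3;
      for (int attempt = 1; ; attempt++) {
        try {
          return fs.open(path);
        } catch (SocketException e) {
          if (attempt >= maxAttempts) {
            throw e;                        // give up after a few tries
          }
          Thread.sleep(1000L * attempt);    // simple linear backoff
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FSDataInputStream in = openWithRetry(fs, new Path(args[0]));
      in.close();
      fs.close();
    }
  }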

I will give this a try. Thanks again,
Lohit



----- Original Message -----
From: Koji Noguchi 
To: core-user@hadoop.apache.org
Sent: Sunday, March 29, 2009 11:44:29 PM
Subject: RE: Socket closed Exception

Hi Lohit,

My initial guess would be
https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were
using the max idle time of 1 hour due to this bug instead of the
configured value of a few seconds.
Thus each client kept the connection up much longer than we expected.
(Not sure if this applies to your 0.15 cluster, but it sounds similar to
what we observed.)
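
The client-side knob we had configured to a few seconds is
'ipc.client.connection.maxidletime' (in milliseconds). In hadoop-site.xml
that looks like the sketch below; the 10000 (10 seconds) is illustrative:

  <property>
    <name>ipc.client.connection.maxidletime</name>
    <value>10000</value>
  </property>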

This worked until the NameNode started hitting the connection limit set
by 'ipc.client.idlethreshold':

  <property>
    <name>ipc.client.idlethreshold</name>
    <value>4000</value>
    <description>Defines the threshold number of connections after which
     connections will be inspected for idleness.</description>
  </property>

When inspecting for idleness, the NameNode uses

  <property>
    <name>ipc.client.maxidletime</name>
    <value>120000</value>
    <description>Defines the maximum idle time for a connected client
     after which it may be disconnected.</description>
  </property>

As a result, many connections got disconnected at once.
Clients only see the timeouts when they try to reuse those sockets the
next time, and then wait for 1 minute. That's why the failures are not
at exactly the same time, but *almost* the same time.
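
Conceptually, the server-side culling behaves like the rough sketch
below (this is not Hadoop's actual code; the class and field names are
made up for illustration):

  import java.util.Iterator;
  import java.util.List;

  // Illustrative reaper: once the connection count crosses the idle
  // threshold, close every connection idle longer than the max idle time.
  class IdleReaper {
    static final int IDLE_THRESHOLD = 4000;    // ipc.client.idlethreshold
    static final long MAX_IDLE_MS = 120000L;   // ipc.client.maxidletime

    static void cull(List<Conn> connections, long now) {
      if (connections.size() <= IDLE_THRESHOLD) {
        return;                                // under threshold: do nothing
      }
      for (Iterator<Conn> it = connections.iterator(); it.hasNext(); ) {
        Conn c = it.next();
        if (now - c.lastActivity > MAX_IDLE_MS) {
          c.close();                           // many idle sockets can drop at once
          it.remove();
        }
      }
    }

    static class Conn {
      long lastActivity;
      void close() { /* close the underlying socket */ }
    }
  }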


# If this solves your problem, Raghu should get the credit.
  He spent so many hours solving this mystery for us. :)


Koji


-----Original Message-----
From: lohit [mailto:lohit...@yahoo.com] 
Sent: Sunday, March 29, 2009 11:56 AM
To: core-user@hadoop.apache.org
Subject: Socket closed Exception


Recently we are seeing a lot of "Socket closed" exceptions in our
cluster. Many tasks' open/create/getFileInfo calls get back a
"SocketException" with the message "Socket closed". We see many tasks
fail with the same error at around the same time. There are no warning
or info messages in the NameNode/TaskTracker/Task logs. (This is on
HDFS 0.15.) Are there cases where the NameNode closes sockets due to
heavy load or contention for resources of any kind?

Thanks,
Lohit



Re: Socket closed Exception

2009-03-30 Thread lohit

Thanks Raghu. Is that log at DEBUG level? I do not see any socket close
messages in the NameNode log at WARN/INFO level.
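If it is only logged at DEBUG, something like this in
conf/log4j.properties should surface it (assuming the close is logged
under the org.apache.hadoop.ipc package):

  log4j.logger.org.apache.hadoop.ipc=DEBUG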
Lohit



----- Original Message -----
From: Raghu Angadi 
To: core-user@hadoop.apache.org
Sent: Monday, March 30, 2009 12:08:19 PM
Subject: Re: Socket closed Exception


If it is the NameNode, then there is probably a log message about
closing the socket around that time.

Raghu.

lohit wrote:
> Recently we are seeing a lot of "Socket closed" exceptions in our cluster. Many
> tasks' open/create/getFileInfo calls get back a "SocketException" with the message
> "Socket closed". We see many tasks fail with the same error at around the same
> time. There are no warning or info messages in the NameNode/TaskTracker/Task
> logs. (This is on HDFS 0.15.) Are there cases where the NameNode closes sockets
> due to heavy load or contention for resources of any kind?
> 
> Thanks,
> Lohit
> 

