Re: Socket closed Exception
Thanks Koji, Raghu. This seemed to solve our problem; we haven't seen it happen in the past 2 days. What is the typical value of ipc.client.idlethreshold on big clusters? Does the default value of 4000 suffice?

Lohit
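For reference, both knobs are set in the cluster configuration (hadoop-site.xml overriding hadoop-default.xml). A sketch of what raising them might look like; the values below are purely illustrative, not recommendations, and the units for maxidletime should be checked against the defaults shipped with your version:

```xml
<!-- hadoop-site.xml: IPC idle-connection tuning (illustrative values only) -->
<property>
  <name>ipc.client.idlethreshold</name>
  <!-- inspect connections for idleness only once this many are open -->
  <value>8000</value>
</property>
<property>
  <name>ipc.client.maxidletime</name>
  <!-- maximum idle time before a connected client may be disconnected -->
  <value>60000</value>
</property>
```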
Re: Socket closed Exception
Thanks Raghu. Is that logged at DEBUG level? I do not see any socket-close message in the NameNode log at WARN/INFO level.

Lohit
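If the NameNode logs nothing at INFO, the close may only show up at DEBUG. Assuming the standard log4j.properties setup Hadoop ships with, the IPC layer alone can be turned up without flooding the whole log (the package name is an assumption; verify it against your version's source tree):

```properties
# log4j.properties: DEBUG for the Hadoop IPC layer only (package name assumed)
log4j.logger.org.apache.hadoop.ipc=DEBUG
```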
Re: Socket closed Exception
If it is the NameNode, then there is probably a log message about closing the socket around that time.

Raghu.

lohit wrote:
> Recently we are seeing a lot of 'Socket closed' exceptions in our cluster. Many tasks' open/create/getFileInfo calls get back a 'SocketException' with the message 'Socket closed'. We see many tasks fail with the same error at around the same time. There are no warning or info messages in the NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.) Are there cases where the NameNode closes sockets due to heavy load or contention for resources of any kind?
>
> Thanks,
> Lohit
RE: Socket closed Exception
Lohit,

You're right. We saw "java.net.SocketTimeoutException: timed out waiting for rpc response" and not the "Socket closed" exception. If you're getting the "closed" exception, then I don't remember seeing that problem on our clusters.

Our users often report the "Socket closed" exception as a problem, but in most cases those failures are due to jobs failing for completely different reasons, combined with a race condition between 1) the JobTracker removing the directory/killing tasks and 2) the tasks failing with the closed exception before they get killed.

Koji
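The "Socket closed" message Koji describes is exactly what a blocked read reports when another thread closes the socket underneath it, which is what happens when a task is torn down mid-RPC. A minimal standalone demonstration with plain JDK sockets (not Hadoop code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketCloseRace {
    // Returns the exception thrown by a read() that was interrupted by
    // another thread closing the socket, or null if the read completed.
    static IOException readWhileClosing() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("localhost", server.getLocalPort());
            Socket accepted = server.accept();  // keep the peer open so read() blocks
            // Close the client socket from another thread after a short delay,
            // like a task being killed while an RPC is in flight.
            Thread closer = new Thread(() -> {
                try { Thread.sleep(200); client.close(); } catch (Exception ignored) {}
            });
            closer.start();
            try {
                InputStream in = client.getInputStream();
                in.read();  // blocks: the peer never writes anything
                return null;
            } catch (IOException e) {
                return e;   // typically SocketException("Socket closed")
            } finally {
                closer.join();
                accepted.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        IOException e = readWhileClosing();
        System.out.println(e == null ? "read completed" : e.getClass().getName());
    }
}
```

On a typical JDK the blocked read() throws a java.net.SocketException rather than returning, which matches the message the tasks report.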
Re: Socket closed Exception
Thanks Koji. If I look at the code, the NameNode (RPC server) seems to tear down idle connections. Did you see the 'Socket closed' exception instead of 'timed out waiting for socket'? We seem to hit the 'Socket closed' exception where clients do not time out, but get back a socket closed exception when they do an RPC for create/open/getFileInfo. I will give this a try.

Thanks again,
Lohit
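Since the 0.15 client evidently surfaces the stale connection as a hard failure on the next create/open/getFileInfo, one application-level workaround is to retry the call once so that a fresh connection is established. A sketch; `withRetry` is a hypothetical helper, not part of the Hadoop client API:

```java
import java.net.SocketException;
import java.util.concurrent.Callable;

public class RpcRetry {
    // Retries a call that may fail with a transient SocketException,
    // e.g. when the server closed an idle connection just before reuse.
    // (Hypothetical helper; not part of the Hadoop 0.15 client API.)
    static <T> T withRetry(Callable<T> call, int maxAttempts) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts < 1");
        SocketException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (SocketException e) {
                last = e;  // stale connection: retrying re-establishes the socket
            }
        }
        throw last;
    }
}
```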
RE: Socket closed Exception
Hi Lohit,

My initial guess would be https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were using the max idle time of 1 hour due to this bug instead of the configured value of a few seconds, so each client kept the connection up much longer than we expected. (Not sure if this applies to your 0.15 cluster, but it sounds similar to what we observed.)

This worked until the namenode started hitting the max limit of 'ipc.client.idlethreshold':

  ipc.client.idlethreshold (4000): defines the threshold number of connections after which connections will be inspected for idleness.

When inspecting for idleness, the namenode uses:

  ipc.client.maxidletime (12): defines the maximum idle time for a connected client after which it may be disconnected.

As a result, many connections got disconnected at once. Clients only see the timeouts when they try to re-use those sockets the next time and wait for 1 minute. That's why the failures are not at exactly the same time, but *almost* the same time.

# If this solves your problem, Raghu should get the credit. He spent many hours solving this mystery for us. :)

Koji
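The mechanism Koji describes, a connection-count threshold that arms the scan plus a per-connection idle timeout, can be modeled in a few lines. This is a toy model of the behavior, not Hadoop's actual IPC server code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of server-side idle-connection scanning: once the connection
// count exceeds a threshold (cf. ipc.client.idlethreshold), any connection
// idle longer than maxIdleMs (cf. ipc.client.maxidletime) is dropped.
public class IdleScanner {
    // connection id -> timestamp of last activity
    final Map<String, Long> lastActivity = new HashMap<>();

    void touch(String conn, long now) { lastActivity.put(conn, now); }

    // Returns the connections that would be disconnected at time 'now'.
    List<String> scan(long now, int threshold, long maxIdleMs) {
        List<String> toClose = new ArrayList<>();
        if (lastActivity.size() <= threshold) return toClose;  // below threshold: no inspection
        for (Map.Entry<String, Long> e : lastActivity.entrySet()) {
            if (now - e.getValue() > maxIdleMs) toClose.add(e.getKey());
        }
        for (String c : toClose) lastActivity.remove(c);
        return toClose;
    }
}
```

Note how crossing the threshold disconnects every idle connection in one sweep, which is why the client-side failures cluster at *almost* the same time.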