[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561788&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561788 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 06/Mar/21 13:26
Start Date: 06/Mar/21 13:26
Worklog Time Spent: 10m

Work Description: iwasakims merged pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 561788)
Time Spent: 9h 10m (was: 9h)

> Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid
> potential hang
>
> Key: HADOOP-17552
> URL: https://issues.apache.org/jira/browse/HADOOP-17552
> Project: Hadoop Common
> Issue Type: Bug
> Components: ipc
> Affects Versions: 3.2.2
> Reporter: Haoze Wu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 9h 10m
> Remaining Estimate: 0h
>
> We are doing systematic fault injection testing in Hadoop 3.2.2, and when
> we run a client (e.g., `bin/hdfs dfs -ls /`) against our HDFS cluster
> (1 NameNode, 2 DataNodes), the client gets stuck forever. After some
> investigation, we believe it is a bug in `hadoop.ipc.Client`: the read
> method of `hadoop.ipc.Client$Connection$PingInputStream` keeps swallowing
> `java.net.SocketTimeoutException` due to the mistaken usage of the
> `rpcTimeout` configuration in the `handleTimeout` method.
>
> *Reproduction*
> Start HDFS with the default configuration. Then execute a client (we used
> the command `bin/hdfs dfs -ls /` in the terminal). While HDFS is trying to
> accept the client's socket, inject a socket error (java.net.SocketException
> or java.io.IOException), specifically at line 1402 (line 1403 or 1404 will
> also work).
> We prepare the scripts for reproduction in a gist
> ([https://gist.github.com/functioner/08bcd86491b8ff32860eafda8c140e24]).
>
> *Diagnosis*
> When the NameNode tries to accept a client's socket, there are basically
> 4 steps:
> # accept the socket (line 1400)
> # configure the socket (lines 1402-1404)
> # make the socket a Reader (after line 1404)
> # swallow the possible IOException at line 1350
> {code:java}
> //hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
>     public void run() {
>       while (running) {
>         SelectionKey key = null;
>         try {
>           getSelector().select();
>           Iterator<SelectionKey> iter = getSelector().selectedKeys().iterator();
>           while (iter.hasNext()) {
>             key = iter.next();
>             iter.remove();
>             try {
>               if (key.isValid()) {
>                 if (key.isAcceptable())
>                   doAccept(key);
>               }
>             } catch (IOException e) { // line 1350
>             }
>             key = null;
>           }
>         } catch (OutOfMemoryError e) {
>           // ...
>         } catch (Exception e) {
>           // ...
>         }
>       }
>     }
>
>     void doAccept(SelectionKey key) throws InterruptedException, IOException,
>         OutOfMemoryError {
>       ServerSocketChannel server = (ServerSocketChannel) key.channel();
>       SocketChannel channel;
>       while ((channel = server.accept()) != null) {  // line 1400
>         channel.configureBlocking(false);            // line 1402
>         channel.socket().setTcpNoDelay(tcpNoDelay);  // line 1403
>         channel.socket().setKeepAlive(true);         // line 1404
>
>         Reader reader = getReader();
>         Connection c = connectionManager.register(channel,
>             this.listenPort, this.isOnAuxiliaryPort);
>         // If the connectionManager can't take it, close the connection.
>         if (c == null) {
>           if (channel.isOpen()) {
>             IOUtils.cleanup(null, channel);
>           }
>           connectionManager.droppedConnections.getAndIncrement();
>           continue;
>         }
>         key.attach(c); // so closeCurrentConnection can get the object
>         reader.addConnection(c);
>       }
>     }
> {code}
> When a SocketException occurs in line 1402 (or 1403 or 1404), the
> server.accept() in line 1400 has already finished, so we expect the
> following behavior:
> # The server (NameNode) accepts this connection but it will
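The client-side hang described in the diagnosis comes down to a retry decision: on a `SocketTimeoutException`, `PingInputStream` either rethrows (surfacing the failure) or swallows the exception and retries the read. Below is a minimal, self-contained sketch of that decision. The method name `rethrow` and its parameters are hypothetical simplifications, not Hadoop's actual code; it only illustrates why the old default `ipc.client.rpc-timeout.ms = 0` with `ipc.client.ping = true` retries forever.

```java
// Hypothetical simplification of the handleTimeout decision described
// above; NOT the actual Hadoop source.
public class HandleTimeoutSketch {

    /**
     * Decide whether a SocketTimeoutException should be rethrown to the
     * caller or swallowed (and the read retried after a ping). With
     * rpcTimeout == 0 (the old default) and pings enabled, this never
     * returns true, so a dead connection makes the client retry forever.
     */
    static boolean rethrow(int rpcTimeoutMs, boolean doPing, int waitedMs) {
        // Rethrow only when pings are disabled, or a positive rpcTimeout
        // has fully elapsed.
        return !doPing || (rpcTimeoutMs > 0 && waitedMs >= rpcTimeoutMs);
    }

    public static void main(String[] args) {
        // Old default: rpcTimeout = 0, ping enabled -> swallowed even
        // after an hour of waiting.
        System.out.println(rethrow(0, true, 3_600_000));      // false
        // New default: rpcTimeout = 120000 -> rethrown after 120 s.
        System.out.println(rethrow(120_000, true, 120_000));  // true
    }
}
```

Under this model, raising the default to 120000 ms does not change the ping behavior; it only guarantees the retry loop eventually gives up.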
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561713 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 06/Mar/21 04:48
Start Date: 06/Mar/21 04:48
Worklog Time Spent: 10m

Work Description: iwasakims commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-791872984

@functioner You should address the checkstyle warning. I think we don't need the comment.

./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java:61:
public static final int IPC_CLIENT_RPC_TIMEOUT_DEFAULT = 120000; // 120 seconds:
Line is longer than 80 characters (found 81). [LineLength]

Issue Time Tracking
---
Worklog Id: (was: 561713)
Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561507&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561507 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 05/Mar/21 18:10
Start Date: 05/Mar/21 18:10
Worklog Time Spent: 10m

Work Description: functioner commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-791591405

Are we ready to merge? @ferhui @iwasakims

Issue Time Tracking
---
Worklog Id: (was: 561507)
Time Spent: 8h 50m (was: 8h 40m)
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561300&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561300 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 05/Mar/21 06:09
Start Date: 05/Mar/21 06:09
Worklog Time Spent: 10m

Work Description: ferhui commented on a change in pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#discussion_r588054312

## File path: hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java
## @@ -1456,6 +1456,7 @@ public void run() {
   @Test
   public void testClientGetTimeout() throws IOException {
     Configuration config = new Configuration();
+    conf.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);

Review comment:
config.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);

Issue Time Tracking
---
Worklog Id: (was: 561300)
Time Spent: 8h 40m (was: 8.5h)
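The one-character review nit above (`conf` vs. `config`) is worth spelling out: the override was applied to a variable named `conf`, while the test asserts against the `Configuration` named `config`, so the override never reaches the object under test. The sketch below is a self-contained illustration of that pitfall, using plain maps as a hypothetical stand-in for Hadoop's `Configuration` API (the key string matches the thread; everything else is illustrative).

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the pitfall flagged in the review above: the override
// lands on one object (`conf`) while the code under test reads another
// (`config`). Plain maps stand in for Hadoop's Configuration.
public class WrongObjectOverride {

    /** Read the timeout key, falling back to the default (here 120000). */
    static int getTimeout(Map<String, Integer> config, int defaultValue) {
        return config.getOrDefault("ipc.client.rpc-timeout.ms", defaultValue);
    }

    public static void main(String[] args) {
        Map<String, Integer> conf = new HashMap<>();
        Map<String, Integer> config = new HashMap<>();

        conf.put("ipc.client.rpc-timeout.ms", 0);   // override on the wrong object

        // The object under test still sees the new default:
        System.out.println(getTimeout(config, 120_000)); // 120000

        config.put("ipc.client.rpc-timeout.ms", 0);      // the intended fix
        System.out.println(getTimeout(config, 120_000)); // 0
    }
}
```

Presumably a field named `conf` existed elsewhere in the test class, which would explain why the original patch compiled yet left the timeout at its default.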
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561265&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561265 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 05/Mar/21 02:51
Start Date: 05/Mar/21 02:51
Worklog Time Spent: 10m

Work Description: functioner commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-79227

> @functioner As @iwasakims said, you can add
> `conf.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);`
> before
> `assertEquals(Client.getTimeout(config), -1);`

It seems it doesn't work. The obtained timeout is still 120000. Any idea?

Issue Time Tracking
---
Worklog Id: (was: 561265)
Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560764&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560764 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 04/Mar/21 01:08
Start Date: 04/Mar/21 01:08
Worklog Time Spent: 10m

Work Description: ferhui commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-790202955

@functioner As @iwasakims said, you can add
`conf.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);`
before
`assertEquals(Client.getTimeout(config), -1);`

Issue Time Tracking
---
Worklog Id: (was: 560764)
Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560418&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560418 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 03/Mar/21 11:50
Start Date: 03/Mar/21 11:50
Worklog Time Spent: 10m

Work Description: iwasakims commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789658781

@functioner The `TestIPC#testClientGetTimeout` tests the deprecated `Client#getTimeout`, which was used before `ipc.client.rpc-timeout.ms` and `Client#getRpcTimeout` were introduced. Based on the context, testClientGetTimeout should check the value of `Client#getTimeout` when `ipc.client.rpc-timeout.ms` is set to 0 (-1 is expected if ipc.client.ping is true, the default).

Issue Time Tracking
---
Worklog Id: (was: 560418)
Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560396&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560396 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 03/Mar/21 11:24
Start Date: 03/Mar/21 11:24
Worklog Time Spent: 10m

Work Description: ferhui commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789644619

@functioner That's OK

Issue Time Tracking
---
Worklog Id: (was: 560396)
Time Spent: 8h (was: 7h 50m)
While HDFS is trying to > accept the client’s socket, inject a socket error (java.net.SocketException > or java.io.IOException), specifically at line 1402 (line 1403 or 1404 will > also work). > We prepare the scripts for reproduction in a gist > ([https://gist.github.com/functioner/08bcd86491b8ff32860eafda8c140e24]). > *Diagnosis* > When the NameNode tries to accept a client’s socket, basically there are > 4 steps: > # accept the socket (line 1400) > # configure the socket (line 1402-1404) > # make the socket a Reader (after line 1404) > # swallow the possible IOException in line 1350 > {code:java} > //hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java > public void run() { > while (running) { > SelectionKey key = null; > try { > getSelector().select(); > Iterator iter = > getSelector().selectedKeys().iterator(); > while (iter.hasNext()) { > key = iter.next(); > iter.remove(); > try { > if (key.isValid()) { > if (key.isAcceptable()) > doAccept(key); > } > } catch (IOException e) { // line 1350 > } > key = null; > } > } catch (OutOfMemoryError e) { > // ... > } catch (Exception e) { > // ... > } > } > } > void doAccept(SelectionKey key) throws InterruptedException, IOException, > OutOfMemoryError { > ServerSocketChannel server = (ServerSocketChannel) key.channel(); > SocketChannel channel; > while ((channel = server.accept()) != null) { // line 1400 > channel.configureBlocking(false); // line 1402 > channel.socket().setTcpNoDelay(tcpNoDelay); // line 1403 > channel.socket().setKeepAlive(true); // line 1404 > > Reader reader = getReader(); > Connection c = connectionManager.register(channel, > this.listenPort, this.isOnAuxiliaryPort); > // If the connectionManager can't take it, close the connection. 
    if (c == null) {
      if (channel.isOpen()) {
        IOUtils.cleanup(null, channel);
      }
      connectionManager.droppedConnections.getAndIncrement();
      continue;
    }
    key.attach(c); // so closeCurrentConnection can get the object
    reader.addConnection(c);
  }
}
{code}
> When a SocketException occurs at line 1402 (or 1403, or 1404), the server.accept() at line 1400 has already finished, so we expect the following behavior:
> # The server
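The client-side half of the hang reported in this issue is the `handleTimeout` retry logic in `PingInputStream`: when pings are enabled and `rpcTimeout` is 0, a `SocketTimeoutException` is swallowed and the read is retried indefinitely. The following is a hypothetical model of that decision (names and signature assumed for illustration; this is not the actual Hadoop source):

```java
// Sketch of the retry decision described in the report: with the old
// default rpcTimeout=0 and ping enabled, every read timeout is retried,
// so a dead connection hangs the client forever. With rpcTimeout=120000,
// the timeout is eventually propagated.
public class HandleTimeoutSketch {
    // Returns true when the caller should swallow the SocketTimeoutException
    // and retry the read; false when it should rethrow and give up.
    static boolean shouldRetry(int rpcTimeout, boolean doPing, int waitedMs) {
        if (!doPing || (rpcTimeout > 0 && waitedMs >= rpcTimeout)) {
            return false; // propagate the timeout to the caller
        }
        return true;      // ping and retry: an infinite loop when rpcTimeout == 0
    }

    public static void main(String[] args) {
        // Old default (rpcTimeout=0, ping on): retry even after an hour.
        System.out.println(shouldRetry(0, true, 3_600_000));   // true
        // New default (rpcTimeout=120000): give up after 120 seconds.
        System.out.println(shouldRetry(120_000, true, 120_000)); // false
    }
}
```

Under this model, no wait is ever long enough to break the loop when `rpcTimeout` is 0, which is why changing the default to a positive value bounds the hang.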
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560360=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560360 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 03/Mar/21 09:53
Start Date: 03/Mar/21 09:53
Worklog Time Spent: 10m

Work Description: functioner commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789587539

> @functioner According to CI results, TestIPC#testClientGetTimeout fails. It is related, please check.

It fails at line 1459: https://github.com/apache/hadoop/blob/b4985c1ef277bcf51eec981385c56218ac41f09e/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java#L1456-L1460

`Client.getTimeout` is: https://github.com/apache/hadoop/blob/b4985c1ef277bcf51eec981385c56218ac41f09e/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L237-L258

Before we change the default rpcTimeout: rpcTimeout is 0, so it doesn't return at line 251. `CommonConfigurationKeys.IPC_CLIENT_PING_DEFAULT` is true, so it doesn't return at line 255 either. Finally, it returns -1 at line 257 and the test case passes.

After we change the default to rpcTimeout=120000: it returns at line 251, and the test fails because 120000 is not -1.

Conclusion: this test is essentially checking the default value of rpcTimeout. Since we modified that value, we should also modify the test to `assertThat(Client.getTimeout(config)).isEqualTo(120000)`. What do you think? @ferhui @iwasakims

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
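The decision flow described in the comment above can be condensed into a few lines. This is a hypothetical re-implementation mirroring the line 251/255/257 behavior as described, not the actual `Client.java` source; the parameter names are assumptions:

```java
// Sketch of Client.getTimeout's described behavior: a positive rpcTimeout
// wins; otherwise, with ping disabled, the ping interval serves as the
// timeout; otherwise -1 means "no timeout at all".
public class GetTimeoutSketch {
    static int getTimeout(int rpcTimeout, boolean pingEnabled, int pingInterval) {
        if (rpcTimeout > 0) {
            return rpcTimeout;      // corresponds to the return at line 251
        }
        if (!pingEnabled) {
            return pingInterval;    // corresponds to the return at line 255
        }
        return -1;                  // corresponds to the return at line 257
    }

    public static void main(String[] args) {
        // Old defaults (rpcTimeout=0, ping on): -1, so the old assertion held.
        System.out.println(getTimeout(0, true, 60_000));       // -1
        // New default rpcTimeout=120000: the test must now expect 120000.
        System.out.println(getTimeout(120_000, true, 60_000)); // 120000
    }
}
```

This makes it clear why the test failure is expected rather than a regression: the test pins the default, and the default changed.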
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 560360) Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang
[ https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560359=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560359 ]

ASF GitHub Bot logged work on HADOOP-17552:
---
Author: ASF GitHub Bot
Created on: 03/Mar/21 09:25
Start Date: 03/Mar/21 09:25
Worklog Time Spent: 10m

Work Description: ferhui commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789570037

@functioner According to CI results, TestIPC#testClientGetTimeout fails. It is related, please check.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 560359) Time Spent: 7h 40m (was: 7.5h)