[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561788&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561788
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 06/Mar/21 13:26
Start Date: 06/Mar/21 13:26
Worklog Time Spent: 10m 
  Work Description: iwasakims merged pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 561788)
Time Spent: 9h 10m  (was: 9h)

> Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid 
> potential hang
> 
>
> Key: HADOOP-17552
> URL: https://issues.apache.org/jira/browse/HADOOP-17552
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 3.2.2
>Reporter: Haoze Wu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
>     We are doing some systematic fault injection testing in Hadoop-3.2.2 and 
> when we try to run a client (e.g., `bin/hdfs dfs -ls /`) to our HDFS cluster 
> (1 NameNode, 2 DataNodes), the client gets stuck forever. After some 
> investigation, we believe that it’s a bug in `hadoop.ipc.Client` because the 
> read method of `hadoop.ipc.Client$Connection$PingInputStream` keeps 
> swallowing `java.net.SocketTimeoutException` due to the mistaken usage of the 
> `rpcTimeout` configuration in the `handleTimeout` method.
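>     To illustrate, here is a simplified sketch of the logic in question, 
> paraphrased from the description above (the actual code in 
> `hadoop.ipc.Client` may differ in details):
> {code:java}
> // Client$Connection$PingInputStream (simplified sketch).
> // rpcTimeout comes from ipc.client.rpc-timeout.ms (0 by default
> // before this change); waiting accumulates the time spent so far.
> private void handleTimeout(SocketTimeoutException e, int waiting)
>     throws IOException {
>   // With rpcTimeout == 0, "0 < rpcTimeout" is never true, so the
>   // SocketTimeoutException is swallowed and a ping is sent forever.
>   if (shouldCloseConnection.get() || !running.get() ||
>       (0 < rpcTimeout && rpcTimeout <= waiting)) {
>     throw e;
>   } else {
>     sendPing();
>   }
> }
> {code}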
> *Reproduction*
>     Start HDFS with the default configuration. Then execute a client (we used 
> the command `bin/hdfs dfs -ls /` in the terminal). While HDFS is trying to 
> accept the client’s socket, inject a socket error (java.net.SocketException 
> or java.io.IOException), specifically at line 1402 (line 1403 or 1404 will 
> also work).
>     We prepare the scripts for reproduction in a gist 
> ([https://gist.github.com/functioner/08bcd86491b8ff32860eafda8c140e24]).
> *Diagnosis*
>     When the NameNode tries to accept a client’s socket, basically there are 
> 4 steps:
>  # accept the socket (line 1400)
>  # configure the socket (line 1402-1404)
>  # make the socket a Reader (after line 1404)
>  # swallow the possible IOException in line 1350
> {code:java}
> //hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
> public void run() {
>   while (running) {
>     SelectionKey key = null;
>     try {
>       getSelector().select();
>       Iterator<SelectionKey> iter = getSelector().selectedKeys().iterator();
>       while (iter.hasNext()) {
>         key = iter.next();
>         iter.remove();
>         try {
>           if (key.isValid()) {
>             if (key.isAcceptable())
>               doAccept(key);
>           }
>         } catch (IOException e) { // line 1350
>         }
>         key = null;
>       }
>     } catch (OutOfMemoryError e) {
>       // ...
>     } catch (Exception e) {
>       // ...
>     }
>   }
> }
>
> void doAccept(SelectionKey key) throws InterruptedException, IOException,
>     OutOfMemoryError {
>   ServerSocketChannel server = (ServerSocketChannel) key.channel();
>   SocketChannel channel;
>   while ((channel = server.accept()) != null) {   // line 1400
>     channel.configureBlocking(false);             // line 1402
>     channel.socket().setTcpNoDelay(tcpNoDelay);   // line 1403
>     channel.socket().setKeepAlive(true);          // line 1404
>
>     Reader reader = getReader();
>     Connection c = connectionManager.register(channel,
>         this.listenPort, this.isOnAuxiliaryPort);
>     // If the connectionManager can't take it, close the connection.
>     if (c == null) {
>       if (channel.isOpen()) {
>         IOUtils.cleanup(null, channel);
>       }
>       connectionManager.droppedConnections.getAndIncrement();
>       continue;
>     }
>     key.attach(c);  // so closeCurrentConnection can get the object
>     reader.addConnection(c);
>   }
> }
> {code}
>     When a SocketException occurs in line 1402 (or 1403 or 1404), the 
> server.accept() in line 1400 has finished, so we expect the following 
> behavior:
>  # The server (NameNode) accepts this connection but it will 

[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561713
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 06/Mar/21 04:48
Start Date: 06/Mar/21 04:48
Worklog Time Spent: 10m 
  Work Description: iwasakims commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-791872984


   @functioner You should address the checkstyle warning. I think we don't need 
the `// 120 seconds` comment.
   
   
./hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java:61:
  public static final int IPC_CLIENT_RPC_TIMEOUT_DEFAULT = 120000; // 120 
seconds: Line is longer than 80 characters (found 81). [LineLength]
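
   (Dropping the trailing comment, as suggested above, brings the declaration 
well under the 80-column limit; sketched for illustration:)

   public static final int IPC_CLIENT_RPC_TIMEOUT_DEFAULT = 120000;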





Issue Time Tracking
---

Worklog Id: (was: 561713)
Time Spent: 9h  (was: 8h 50m)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561507&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561507
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 05/Mar/21 18:10
Start Date: 05/Mar/21 18:10
Worklog Time Spent: 10m 
  Work Description: functioner commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-791591405


   Are we ready to merge? @ferhui @iwasakims 





Issue Time Tracking
---

Worklog Id: (was: 561507)
Time Spent: 8h 50m  (was: 8h 40m)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561300&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561300
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 05/Mar/21 06:09
Start Date: 05/Mar/21 06:09
Worklog Time Spent: 10m 
  Work Description: ferhui commented on a change in pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#discussion_r588054312



##
File path: 
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java
##
@@ -1456,6 +1456,7 @@ public void run() {
   @Test
   public void testClientGetTimeout() throws IOException {
 Configuration config = new Configuration();
+conf.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);

Review comment:
   config.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);
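
   (For context: the diff adds the line using `conf`, but the test's local 
variable is named `config`, so the review suggestion fixes that name mismatch. 
A sketch of the resulting test setup, reusing the assertion quoted elsewhere 
in this thread:)

   Configuration config = new Configuration();
   config.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);
   assertEquals(Client.getTimeout(config), -1);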







Issue Time Tracking
---

Worklog Id: (was: 561300)
Time Spent: 8h 40m  (was: 8.5h)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=561265&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-561265
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 05/Mar/21 02:51
Start Date: 05/Mar/21 02:51
Worklog Time Spent: 10m 
  Work Description: functioner commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-79227


   > @functioner As @iwasakims said, you can add
   > `conf.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);`
   > before
   > ` assertEquals(Client.getTimeout(config), -1);`
   
   It doesn't seem to work. The obtained timeout is still 120000. Any idea?





Issue Time Tracking
---

Worklog Id: (was: 561265)
Time Spent: 8.5h  (was: 8h 20m)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560764&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560764
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 04/Mar/21 01:08
Start Date: 04/Mar/21 01:08
Worklog Time Spent: 10m 
  Work Description: ferhui commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-790202955


   @functioner As @iwasakims said, you can add 
   `conf.setInt(CommonConfigurationKeys.IPC_CLIENT_RPC_TIMEOUT_KEY, 0);`
   before
   `assertEquals(Client.getTimeout(config), -1);`





Issue Time Tracking
---

Worklog Id: (was: 560764)
Time Spent: 8h 20m  (was: 8h 10m)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560418&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560418
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 03/Mar/21 11:50
Start Date: 03/Mar/21 11:50
Worklog Time Spent: 10m 
  Work Description: iwasakims commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789658781


   @functioner The `TestIPC#testClientGetTimeout` tests the deprecated 
`Client#getTimeout`, which was used before `ipc.client.rpc-timeout.ms` and 
`Client#getRpcTimeout` were introduced. Based on that context, 
testClientGetTimeout should check the value of `Client#getTimeout` when 
`ipc.client.rpc-timeout.ms` is set to 0 (-1 is expected when ipc.client.ping 
is true, the default).





Issue Time Tracking
---

Worklog Id: (was: 560418)
Time Spent: 8h 10m  (was: 8h)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560396&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560396
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 03/Mar/21 11:24
Start Date: 03/Mar/21 11:24
Worklog Time Spent: 10m 
  Work Description: ferhui commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789644619


   @functioner That's OK





Issue Time Tracking
---

Worklog Id: (was: 560396)
Time Spent: 8h  (was: 7h 50m)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560360
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 03/Mar/21 09:53
Start Date: 03/Mar/21 09:53
Worklog Time Spent: 10m 
  Work Description: functioner commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789587539


   > @functioner According to CI results, TestIPC#testClientGetTimeout fails. 
It is related, please check.
   
   It fails at line 1459:
   
https://github.com/apache/hadoop/blob/b4985c1ef277bcf51eec981385c56218ac41f09e/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java#L1456-L1460
   
   `Client.getTimeout` is:
   
https://github.com/apache/hadoop/blob/b4985c1ef277bcf51eec981385c56218ac41f09e/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L237-L258
   
   Before we change the default rpcTimeout:
   rpcTimeout is 0, so it won't return at line 251.
   `CommonConfigurationKeys.IPC_CLIENT_PING_DEFAULT` is true, so it won't 
return at line 255 either.
   Finally, it returns -1 at line 257, and the test case passes.
   
   After we change the default to rpcTimeout=120000:
   It returns at line 251, and the test fails because 120000 is not -1.
   
   Conclusion:
   This test is essentially checking the default value of rpcTimeout.
   Since we modified this value, we should also modify this test to 
`assertThat(Client.getTimeout(config)).isEqualTo(120000)`.
   What do you think? @ferhui @iwasakims 
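   
   For reference, a sketch of `Client.getTimeout` consistent with the 
walkthrough above (simplified; the permalink above has the exact code):
   
   @Deprecated
   public static final int getTimeout(Configuration conf) {
     int timeout = getRpcTimeout(conf);
     if (timeout > 0) {
       return timeout;                // line 251: 120000 after this change
     }
     if (!conf.getBoolean(CommonConfigurationKeys.IPC_CLIENT_PING_KEY,
         CommonConfigurationKeys.IPC_CLIENT_PING_DEFAULT)) {
       return getPingInterval(conf);  // line 255
     }
     return -1;                       // line 257: the old default result
   }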





Issue Time Tracking
---

Worklog Id: (was: 560360)
Time Spent: 7h 50m  (was: 7h 40m)


[jira] [Work logged] (HADOOP-17552) Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang

2021-03-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17552?focusedWorklogId=560359&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-560359
 ]

ASF GitHub Bot logged work on HADOOP-17552:
---

Author: ASF GitHub Bot
Created on: 03/Mar/21 09:25
Start Date: 03/Mar/21 09:25
Worklog Time Spent: 10m 
  Work Description: ferhui commented on pull request #2727:
URL: https://github.com/apache/hadoop/pull/2727#issuecomment-789570037


   @functioner According to CI results, TestIPC#testClientGetTimeout fails. It 
is related, please check.





Issue Time Tracking
---

Worklog Id: (was: 560359)
Time Spent: 7h 40m  (was: 7.5h)
