[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825139#comment-17825139
 ] 

ASF GitHub Bot commented on HDFS-15413:
---

haiyang1987 commented on PR #5829:
URL: https://github.com/apache/hadoop/pull/5829#issuecomment-1987535597

   > > Hi @Neilxzn Any progress here? Thanks.
   > > this PR is still necessary, there are some similar problems in our environment~
   > 
   > @haiyang1987 Our online environment (70 PB EC Data cluster, spark + hive olap) has already applied this patch. So far, everything is running normally.
   
   Noted, thanks for your work ~




> DFSStripedInputStream throws exception when datanodes close idle connections
> 
>
> Key: HDFS-15413
> URL: https://issues.apache.org/jira/browse/HDFS-15413
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding, hdfs-client
> Affects Versions: 3.1.3
> Environment: - Hadoop 3.1.3
> - erasure coding with ISA-L and RS-3-2-1024k scheme
> - running in kubernetes
> - dfs.client.socket-timeout = 1
> - dfs.datanode.socket.write.timeout = 1
> Reporter: Andrey Elenskiy
> Priority: Critical
>  Labels: pull-request-available
> Attachments: out.log
>
>
> We've run into an issue with compactions failing in HBase when erasure coding 
> is enabled on a table directory. After digging further I was able to narrow 
> it down to the seek + read logic and to reproduce the issue with the HDFS 
> client only:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FSDataInputStream;
>
> public class ReaderRaw {
>     public static void main(final String[] args) throws Exception {
>         Path p = new Path(args[0]);
>         int bufLen = Integer.parseInt(args[1]);
>         int sleepDuration = Integer.parseInt(args[2]);
>         int countBeforeSleep = Integer.parseInt(args[3]);
>         int countAfterSleep = Integer.parseInt(args[4]);
>         Configuration conf = new Configuration();
>         FSDataInputStream istream = FileSystem.get(conf).open(p);
>         byte[] buf = new byte[bufLen];
>         int readTotal = 0;
>         int count = 0;
>         try {
>             while (true) {
>                 istream.seek(readTotal);
>                 int bytesRemaining = bufLen;
>                 int bufOffset = 0;
>                 while (bytesRemaining > 0) {
>                     int nread = istream.read(buf, 0, bufLen);
>                     if (nread < 0) {
>                         throw new Exception("nread is less than zero");
>                     }
>                     readTotal += nread;
>                     bufOffset += nread;
>                     bytesRemaining -= nread;
>                 }
>                 count++;
>                 if (count == countBeforeSleep) {
>                     System.out.println("sleeping for " + sleepDuration + " milliseconds");
>                     Thread.sleep(sleepDuration);
>                     System.out.println("resuming");
>                 }
>                 if (count == countBeforeSleep + countAfterSleep) {
>                     System.out.println("done");
>                     break;
>                 }
>             }
>         } catch (Exception e) {
>             System.out.println("exception on read " + count + " read total " + readTotal);
>             throw e;
>         }
>     }
> }
> {code}
> The issue appears to be due to the fact that datanodes close the connection 
> of the EC client if it doesn't fetch the next packet for longer than 
> dfs.client.socket-timeout. The EC client doesn't retry and instead assumes 
> that those datanodes went away, resulting in a "missing blocks" exception.
> I was able to consistently reproduce with the following arguments:
> {noformat}
> bufLen = 100 (just below 1MB which is the size of the stripe) 
> sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000)
> countBeforeSleep = 1
> countAfterSleep = 7
> {noformat}
> I've attached the entire log output of running the snippet above against an 
> erasure-coded file with the RS-3-2-1024k policy. And here are the logs from 
> the datanodes disconnecting the client:
> datanode 1:
> {noformat}
> 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped 
> reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver 
> error processing READ_BLOCK operation  src: /10.128.23.40:53748 dst: 
> /10.128.14.46:9866); java.net.SocketTimeoutException: 1 millis timeout 
> while waiting for channel to be ready for write. ch : 
> java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 
> remote=/10.128.23.40:53748]
> {noformat}
> datanode 2:
> {noformat}
> 2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped 
> reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver 
> error processing READ_BLOCK operation  src: /10.128.23.40:48772 dst: 
> /10.128.9.42:9866); java.net.SocketTimeoutException: 1 millis timeout 
> while waiting for channel to be ready for write. ch : 
> java.nio.channels.SocketChannel[connected local=/10.128.9.42:9866 
> remote=/10.128.23.40:48772]
> {noformat}
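
Since the striped client gives up once the datanodes drop the idle connections, one way a
caller could work around this without patching the client is to retry at the application
level: re-open the file and seek back to the last good offset, so the new stream builds
fresh datanode connections. A minimal sketch of that idea; the helper name, retry count,
and exception handling are illustrative assumptions, not part of HDFS:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Application-level workaround sketch: re-open and seek back when a striped read
// fails after the datanodes have closed the idle connections.
public class RetryingStripedRead {
    // Assumes maxAttempts >= 1; returns the number of bytes read, or -1 at EOF.
    static int readWithReopen(FileSystem fs, Path p, long offset, byte[] buf, int maxAttempts)
            throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // A fresh open creates a new striped input stream and new block readers,
            // so it does not depend on connections the datanodes already closed.
            try (FSDataInputStream in = fs.open(p)) {
                in.seek(offset);
                return in.read(buf, 0, buf.length);
            } catch (IOException e) {
                last = e; // e.g. the "missing blocks" failure described above
            }
        }
        throw last;
    }
}
{code}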

[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825137#comment-17825137
 ] 

ASF GitHub Bot commented on HDFS-15413:
---

Neilxzn commented on PR #5829:
URL: https://github.com/apache/hadoop/pull/5829#issuecomment-1987533616

   Rebased it onto branch trunk. And the new test 
   `org.apache.hadoop.hdfs.TestDFSStripedInputStreamWithTimeout` passed in my 
   local env.
   https://github.com/apache/hadoop/assets/10757009/6bedc732-cbe7-403b-8c6a-b5e78a33527f
   





[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825135#comment-17825135
 ] 

ASF GitHub Bot commented on HDFS-15413:
---

Neilxzn commented on PR #5829:
URL: https://github.com/apache/hadoop/pull/5829#issuecomment-1987532047

   > Hi @Neilxzn Any progress here? Thanks.
   > 
   > this PR is still necessary, there are some similar problems in our environment~
   
   @haiyang1987 Our online environment (70 PB EC Data cluster, spark + hive olap) has already applied this patch. So far, everything is running normally.





[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825133#comment-17825133
 ] 

ASF GitHub Bot commented on HDFS-15413:
---

Neilxzn opened a new pull request, #5829:
URL: https://github.com/apache/hadoop/pull/5829

   
   
   ### Description of PR
   https://issues.apache.org/jira/browse/HDFS-15413
   
   Offers a patch to fix HDFS-15413. This patch adds a 
   dfs.client.read.striped.datanode.max.attempts config to allow users to adjust 
   the number of DataNode retries, to solve the problem of DataNode timeouts when 
   reading EC files.
   
   ### How was this patch tested?
   No test added; just tested in our cluster.
   
   ### For code changes:
   Adds the dfs.client.read.striped.datanode.max.attempts config (see the sketch below).
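
   For reference, a minimal sketch of how a client could opt into the proposed setting.
   The property name is taken from this PR description; the value, the class name, and
   the read loop are illustrative assumptions, and the property only takes effect on a
   client that carries this patch.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StripedReadWithRetries {
    public static void main(final String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical value: allow a few attempts per datanode before the striped
        // reader gives up on it (honored only by a client with this patch applied).
        conf.setInt("dfs.client.read.striped.datanode.max.attempts", 3);
        try (FSDataInputStream in = FileSystem.get(conf).open(new Path(args[0]))) {
            byte[] buf = new byte[64 * 1024];
            // Drain the file; with the patch, idle-closed datanode connections are
            // retried internally instead of surfacing as a read failure.
            while (in.read(buf, 0, buf.length) > 0) {
                // keep reading until EOF
            }
        }
    }
}
{code}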
   
   





[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825126#comment-17825126
 ] 

ASF GitHub Bot commented on HDFS-15413:
---

hadoop-yetus commented on PR #5829:
URL: https://github.com/apache/hadoop/pull/5829#issuecomment-1987499017

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m  0s |  |  Docker mode activated.  |
   | -1 :x: |  patch  |   0m 19s |  |  
https://github.com/apache/hadoop/pull/5829 does not apply to trunk. Rebase 
required? Wrong Branch? See 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.  
|
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | GITHUB PR | https://github.com/apache/hadoop/pull/5829 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/5/console |
   | versions | git=2.34.1 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825125#comment-17825125
 ] 

ASF GitHub Bot commented on HDFS-15413:
---

Neilxzn closed pull request #5829:  HDFS-15413. add 
dfs.client.read.striped.datanode.max.attempts to fix read ecfile timeout
URL: https://github.com/apache/hadoop/pull/5829




[jira] [Commented] (HDFS-17401) EC: Excess internal block may not be able to be deleted correctly when it's stored in fallback storage

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825071#comment-17825071
 ] 

ASF GitHub Bot commented on HDFS-17401:
---

hadoop-yetus commented on PR #6597:
URL: https://github.com/apache/hadoop/pull/6597#issuecomment-1987280103

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 56s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m 16s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   1m 16s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  8s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 22s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  8s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 17s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 53s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 10s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 14s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  6s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  0s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  35m 23s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 274m 37s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6597/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 47s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 413m 26s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.tools.TestDFSAdmin |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6597/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6597 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 7644248f5fef 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / a5e2fe4348b687735a40ebe3a2b5327fab773bc0 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6597/4/testReport/ |
   | Max. process+thread count | 3144 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 

[jira] [Commented] (HDFS-17397) Choose another DN as soon as possible, when encountering network issues

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825037#comment-17825037
 ] 

ASF GitHub Bot commented on HDFS-17397:
---

xleoken commented on PR #6591:
URL: https://github.com/apache/hadoop/pull/6591#issuecomment-1987200534

   cc @Hexiaoqiao please take a review, thanks




> Choose another DN as soon as possible, when encountering network issues
> ---
>
> Key: HDFS-17397
> URL: https://issues.apache.org/jira/browse/HDFS-17397
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xleoken
>Priority: Minor
>  Labels: pull-request-available
> Attachments: hadoop.png
>
>
> Choose another DN as soon as possible, when encountering network issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17391) Adjust the checkpoint io buffer size to the chunk size

2024-03-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825036#comment-17825036
 ] 

ASF GitHub Bot commented on HDFS-17391:
---

hadoop-yetus commented on PR #6594:
URL: https://github.com/apache/hadoop/pull/6594#issuecomment-1987196694

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 19s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  31m 40s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 40s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 36s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 43s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 40s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  5s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 48s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 12s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 37s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   0m 37s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 4 unchanged - 4 
fixed = 4 total (was 8)  |
   | +1 :green_heart: |  mvnsite  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  0s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 45s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 21s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 195m 32s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6594/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 29s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 281m 36s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.tools.TestDFSAdmin |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6594/7/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6594 |
   | JIRA Issue | HDFS-17391 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 27eb09e86f1e 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 38cef3e92d24e95edc9ae79b8b9e4c991381b724 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6594/7/testReport/