[jira] [Commented] (HDFS-16262) Async refresh of cached locations in DFSInputStream

2021-12-09 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456727#comment-17456727
 ] 

Bryan Beaudreault commented on HDFS-16262:
--

I would love to get a review here if someone has time. This PR has been 
deployed to more than 8000 servers across 45 clusters in our environment; it 
has been working great for us, and I'd like to close the loop if possible.

> Async refresh of cached locations in DFSInputStream
> ---
>
> Key: HDFS-16262
> URL: https://issues.apache.org/jira/browse/HDFS-16262
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> HDFS-15119 added the ability to invalidate cached block locations in 
> DFSInputStream. As written, the feature will affect all DFSInputStreams 
> regardless of whether they need it or not. The invalidation also only applies 
> on the next request, so the next request will pay the cost of calling 
> openInfo before reading the data.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). I'd like 
> to utilize the idea started in HDFS-15119 in order to update DFSInputStreams 
> after blocks have been moved to local hosts.
> I was considering using the feature as is, but some of our clusters are quite 
> large and I'm concerned about the impact on the namenode:
>  * We have some clusters with over 350k StoreFiles, so that'd be 350k 
> DFSInputStreams. With such a large number and very active usage, having the 
> refresh be in-line makes it too hard to ensure we don't DDOS the NameNode.
>  * Currently we need to pay the price of openInfo the next time a 
> DFSInputStream is invoked. Moving that async would minimize the latency hit. 
> Also, some StoreFiles might be far less frequently accessed, so they may live 
> on for a long time before ever refreshing. We'd like to be able to know that 
> all DFSInputStreams are refreshed by a given time.
>  * We may have 350k files, but only a small percentage of them are ever 
> non-local at a given time. Refreshing only if necessary will save a lot of 
> work.
> In order to make this as painless to end users as possible, I'd like to:
>  * Update the implementation to utilize an async thread for managing 
> refreshes. This will give more control over rate limiting across all 
> DFSInputStreams in a DFSClient, and also ensure that all DFSInputStreams are 
> refreshed.
>  * Only refresh files which are lacking a local replica or have known 
> deadNodes to be cleaned up
>  
>  






[jira] [Commented] (HDFS-16261) Configurable grace period around invalidation of replaced blocks

2021-12-09 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456729#comment-17456729
 ] 

Bryan Beaudreault commented on HDFS-16261:
--

I would love to get a review here if someone has time. This PR has been 
deployed to more than 100 active production namenodes (plus a similar number 
of standby namenodes), serving clusters of hundreds of nodes and hundreds of 
thousands of blocks. It has been working great for us.

> Configurable grace period around invalidation of replaced blocks
> 
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warnings in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.






[jira] [Comment Edited] (HDFS-16262) Async refresh of cached locations in DFSInputStream

2021-12-09 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456727#comment-17456727
 ] 

Bryan Beaudreault edited comment on HDFS-16262 at 12/9/21, 8:57 PM:


I would love to get a review here if someone has time. This PR has been 
deployed to more than 8000 servers across 45 clusters in our production 
environments; it has been working great for us, and I'd like to close the loop 
if possible.


was (Author: bbeaudreault):
I would love to get a review here if someone has time. This PR has been 
deployed to more than 8000 servers across 45 clusters in our environment, has 
been working great for us and I'd like to close the loop if possible.

> Async refresh of cached locations in DFSInputStream
> ---
>
> Key: HDFS-16262
> URL: https://issues.apache.org/jira/browse/HDFS-16262
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> HDFS-15119 added the ability to invalidate cached block locations in 
> DFSInputStream. As written, the feature will affect all DFSInputStreams 
> regardless of whether they need it or not. The invalidation also only applies 
> on the next request, so the next request will pay the cost of calling 
> openInfo before reading the data.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). I'd like 
> to utilize the idea started in HDFS-15119 in order to update DFSInputStreams 
> after blocks have been moved to local hosts.
> I was considering using the feature as is, but some of our clusters are quite 
> large and I'm concerned about the impact on the namenode:
>  * We have some clusters with over 350k StoreFiles, so that'd be 350k 
> DFSInputStreams. With such a large number and very active usage, having the 
> refresh be in-line makes it too hard to ensure we don't DDOS the NameNode.
>  * Currently we need to pay the price of openInfo the next time a 
> DFSInputStream is invoked. Moving that async would minimize the latency hit. 
> Also, some StoreFiles might be far less frequently accessed, so they may live 
> on for a long time before ever refreshing. We'd like to be able to know that 
> all DFSInputStreams are refreshed by a given time.
>  * We may have 350k files, but only a small percentage of them are ever 
> non-local at a given time. Refreshing only if necessary will save a lot of 
> work.
> In order to make this as painless to end users as possible, I'd like to:
>  * Update the implementation to utilize an async thread for managing 
> refreshes. This will give more control over rate limiting across all 
> DFSInputStreams in a DFSClient, and also ensure that all DFSInputStreams are 
> refreshed.
>  * Only refresh files which are lacking a local replica or have known 
> deadNodes to be cleaned up
>  
>  






[jira] [Commented] (HDFS-16261) Configurable grace period around invalidation of replaced blocks

2021-12-11 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457637#comment-17457637
 ] 

Bryan Beaudreault commented on HDFS-16261:
--

[~hexiaoqiao] thank you very much for the feedback! I am happy to try a 
different approach if it makes sense.

I saw two problems with the DataNode side:
 # It's much more operationally complicated to change configurations on the 
DataNode side, since there may be hundreds or thousands of DataNodes, and 
restarting DataNodes causes pain for low-latency clients (like HBase).
 # The code on the DataNode side is hard to integrate a deferral process into.

I like the idea of hooking BlockSender, but unfortunately that does not work 
for my use-case. I am not just trying to handle in-progress streams; I'm also 
trying to avoid ReplicaNotFoundExceptions for new requests, which cause long 
tail latency spikes for us. This is meant to pair with HDFS-16262, which will 
allow a DFSInputStream to refresh its block locations before the grace period 
expires and avoid hitting any ReplicaNotFoundExceptions. Avoiding 
ReplicaNotFoundExceptions is an important goal of this issue.

I could get around problem 1 above by having the NameNode send a grace period 
along with DNA_INVALIDATE. That way the configuration still lives on the 
NameNode, but the DataNode is responsible for honoring it.

Before I investigate that approach, can you help me better understand your 
concern with the NameNode side? I'm not sure what added costs there are here; 
the number of PendingDeletion blocks should be very small in comparison to the 
total block capacity served by the NameNode. Note that this grace period 
applies only to _replaced_ blocks, not deleted blocks.

Thank you again, I look forward to your input.

> Configurable grace period around invalidation of replaced blocks
> 
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warnings in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.






[jira] [Created] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations

2021-08-05 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HDFS-16155:


 Summary: Allow configurable exponential backoff in DFSInputStream 
refetchLocations
 Key: HDFS-16155
 URL: https://issues.apache.org/jira/browse/HDFS-16155
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: dfsclient
Reporter: Bryan Beaudreault


The retry policy in 
[DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040]
 was first written many years ago. It allows configuration of the base time 
window, but subsequent retries double in an un-configurable way. This retry 
strategy makes sense in some clusters as it's very conservative and will avoid 
DDOSing the namenode in certain systemic failure modes – for example, if a  
file is being read by a large hadoop job and the underlying blocks are moved by 
the balancer. In this case, enough datanodes would be added to the deadNodes 
list and all hadoop tasks would simultaneously try to refetch the blocks. The 
3s doubling with random factor helps break up that stampeding herd.

However, not all cluster use-cases are created equal, so there are other cases 
where a more aggressive initial backoff is preferred. For example in a 
low-latency single reader scenario. In this case, if the balancer moves enough 
blocks, the reader hits this 3s backoff which is way too long for a low latency 
use-case.

One could configure the window very low (10ms), but then you can hit other 
systemic failure modes which would result in readers DDOSing the namenode 
again. For example, if blocks went missing due to truly dead datanodes. In this 
case, many readers might be refetching locations for different files with retry 
backoffs like 10ms, 20ms, 40ms, etc. It takes a while to backoff enough to 
avoid impacting the namenode with that strategy.

I suggest adding a configurable multiplier to the backoff strategy so that 
operators can tune this as they see fit for their use-case. In the above low 
latency case, one could set the base very low (say 2ms) and the multiplier very 
high (say 50). This gives an aggressive first retry that very quickly backs off.
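
To make the comparison concrete, here is a minimal, self-contained sketch of 
the proposed behavior. It is not the actual DFSInputStream retry code; the 
class name, constructor arguments, and jitter factor are illustrative 
assumptions, and only the base/multiplier idea matches the proposal above.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

public class RefetchBackoffSketch {
  private final double baseWindowMs; // today's configurable base time window
  private final double multiplier;   // the new knob proposed in this issue

  public RefetchBackoffSketch(double baseWindowMs, double multiplier) {
    this.baseWindowMs = baseWindowMs;
    this.multiplier = multiplier;
  }

  /** Wait time in ms before retry attempt {@code failures} (0-based), with jitter. */
  public double waitTimeMs(int failures) {
    double window = baseWindowMs * Math.pow(multiplier, failures);
    // Random factor to break up the stampeding herd described above.
    return window * (0.5 + 0.5 * ThreadLocalRandom.current().nextDouble());
  }

  public static void main(String[] args) {
    RefetchBackoffSketch conservative = new RefetchBackoffSketch(3000, 2); // ~3s, 6s, 12s
    RefetchBackoffSketch lowLatency = new RefetchBackoffSketch(2, 50);     // ~2ms, 100ms, 5s
    for (int i = 0; i < 3; i++) {
      System.out.printf("retry %d: conservative=%.0fms lowLatency=%.0fms%n",
          i, conservative.waitTimeMs(i), lowLatency.waitTimeMs(i));
    }
  }
}
{code}

With base=2ms and multiplier=50 the first retry is nearly immediate, yet by the 
third attempt the client is already backing off on the order of seconds.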






[jira] [Commented] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations

2021-08-05 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394236#comment-17394236
 ] 

Bryan Beaudreault commented on HDFS-16155:
--

I've submitted the linked PR to solve this Jira, but I can't make myself the 
assignee. I would appreciate some eyes on the PR if anyone has time. It's a 
small patch which would help a lot in low-latency use-cases.

> Allow configurable exponential backoff in DFSInputStream refetchLocations
> -
>
> Key: HDFS-16155
> URL: https://issues.apache.org/jira/browse/HDFS-16155
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: Bryan Beaudreault
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The retry policy in 
> [DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040]
>  was first written many years ago. It allows configuration of the base time 
> window, but subsequent retries double in an un-configurable way. This retry 
> strategy makes sense in some clusters as it's very conservative and will 
> avoid DDOSing the namenode in certain systemic failure modes – for example, 
> if a  file is being read by a large hadoop job and the underlying blocks are 
> moved by the balancer. In this case, enough datanodes would be added to the 
> deadNodes list and all hadoop tasks would simultaneously try to refetch the 
> blocks. The 3s doubling with random factor helps break up that stampeding 
> herd.
> However, not all cluster use-cases are created equal, so there are other 
> cases where a more aggressive initial backoff is preferred. For example in a 
> low-latency single reader scenario. In this case, if the balancer moves 
> enough blocks, the reader hits this 3s backoff which is way too long for a 
> low latency use-case.
> One could configure the window very low (10ms), but then you can hit 
> other systemic failure modes which would result in readers DDOSing the 
> namenode again. For example, if blocks went missing due to truly dead 
> datanodes. In this case, many readers might be refetching locations for 
> different files with retry backoffs like 10ms, 20ms, 40ms, etc. It takes a 
> while to backoff enough to avoid impacting the namenode with that strategy.
> I suggest adding a configurable multiplier to the backoff strategy so that 
> operators can tune this as they see fit for their use-case. In the above low 
> latency case, one could set the base very low (say 2ms) and the multiplier 
> very high (say 50). This gives an aggressive first retry that very quickly 
> backs off.






[jira] [Commented] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations

2021-08-05 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394237#comment-17394237
 ] 

Bryan Beaudreault commented on HDFS-16155:
--

If this PR is approved, I'd appreciate it being included in the Hadoop 3.2 
release branch. I tried applying my patch directly and there are some minor 
conflicts related to imports. I would be happy to submit a follow-up PR for 
that branch if desired.

> Allow configurable exponential backoff in DFSInputStream refetchLocations
> -
>
> Key: HDFS-16155
> URL: https://issues.apache.org/jira/browse/HDFS-16155
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: Bryan Beaudreault
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The retry policy in 
> [DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040]
>  was first written many years ago. It allows configuration of the base time 
> window, but subsequent retries double in an un-configurable way. This retry 
> strategy makes sense in some clusters as it's very conservative and will 
> avoid DDOSing the namenode in certain systemic failure modes – for example, 
> if a  file is being read by a large hadoop job and the underlying blocks are 
> moved by the balancer. In this case, enough datanodes would be added to the 
> deadNodes list and all hadoop tasks would simultaneously try to refetch the 
> blocks. The 3s doubling with random factor helps break up that stampeding 
> herd.
> However, not all cluster use-cases are created equal, so there are other 
> cases where a more aggressive initial backoff is preferred. For example in a 
> low-latency single reader scenario. In this case, if the balancer moves 
> enough blocks, the reader hits this 3s backoff which is way too long for a 
> low latency use-case.
> One could configure the window very low (10ms), but then you can hit 
> other systemic failure modes which would result in readers DDOSing the 
> namenode again. For example, if blocks went missing due to truly dead 
> datanodes. In this case, many readers might be refetching locations for 
> different files with retry backoffs like 10ms, 20ms, 40ms, etc. It takes a 
> while to backoff enough to avoid impacting the namenode with that strategy.
> I suggest adding a configurable multiplier to the backoff strategy so that 
> operators can tune this as they see fit for their use-case. In the above low 
> latency case, one could set the base very low (say 2ms) and the multiplier 
> very high (say 50). This gives an aggressive first retry that very quickly 
> backs off.






[jira] [Commented] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations

2021-08-06 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394719#comment-17394719
 ] 

Bryan Beaudreault commented on HDFS-16155:
--

I'm not sure why the above comment was not edited or a new comment created, 
but the build is now passing in the PR: https://github.com/apache/hadoop/pull/3271

> Allow configurable exponential backoff in DFSInputStream refetchLocations
> -
>
> Key: HDFS-16155
> URL: https://issues.apache.org/jira/browse/HDFS-16155
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: Bryan Beaudreault
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The retry policy in 
> [DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040]
>  was first written many years ago. It allows configuration of the base time 
> window, but subsequent retries double in an un-configurable way. This retry 
> strategy makes sense in some clusters as it's very conservative and will 
> avoid DDOSing the namenode in certain systemic failure modes – for example, 
> if a  file is being read by a large hadoop job and the underlying blocks are 
> moved by the balancer. In this case, enough datanodes would be added to the 
> deadNodes list and all hadoop tasks would simultaneously try to refetch the 
> blocks. The 3s doubling with random factor helps break up that stampeding 
> herd.
> However, not all cluster use-cases are created equal, so there are other 
> cases where a more aggressive initial backoff is preferred. For example in a 
> low-latency single reader scenario. In this case, if the balancer moves 
> enough blocks, the reader hits this 3s backoff which is way too long for a 
> low latency use-case.
> One could configure the window very low (10ms), but then you can hit 
> other systemic failure modes which would result in readers DDOSing the 
> namenode again. For example, if blocks went missing due to truly dead 
> datanodes. In this case, many readers might be refetching locations for 
> different files with retry backoffs like 10ms, 20ms, 40ms, etc. It takes a 
> while to backoff enough to avoid impacting the namenode with that strategy.
> I suggest adding a configurable multiplier to the backoff strategy so that 
> operators can tune this as they see fit for their use-case. In the above low 
> latency case, one could set the base very low (say 2ms) and the multiplier 
> very high (say 50). This gives an aggressive first retry that very quickly 
> backs off.






[jira] [Commented] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations

2021-08-06 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394872#comment-17394872
 ] 

Bryan Beaudreault commented on HDFS-16155:
--

[~brahma] any chance this could be included in the 3.2.3 release you are 
working on? The implementation in the PR is relatively straightforward and 
small.

> Allow configurable exponential backoff in DFSInputStream refetchLocations
> -
>
> Key: HDFS-16155
> URL: https://issues.apache.org/jira/browse/HDFS-16155
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: Bryan Beaudreault
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The retry policy in 
> [DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040]
>  was first written many years ago. It allows configuration of the base time 
> window, but subsequent retries double in an un-configurable way. This retry 
> strategy makes sense in some clusters as it's very conservative and will 
> avoid DDOSing the namenode in certain systemic failure modes – for example, 
> if a  file is being read by a large hadoop job and the underlying blocks are 
> moved by the balancer. In this case, enough datanodes would be added to the 
> deadNodes list and all hadoop tasks would simultaneously try to refetch the 
> blocks. The 3s doubling with random factor helps break up that stampeding 
> herd.
> However, not all cluster use-cases are created equal, so there are other 
> cases where a more aggressive initial backoff is preferred. For example in a 
> low-latency single reader scenario. In this case, if the balancer moves 
> enough blocks, the reader hits this 3s backoff which is way too long for a 
> low latency use-case.
> One could configure the window very low (10ms), but then you can hit 
> other systemic failure modes which would result in readers DDOSing the 
> namenode again. For example, if blocks went missing due to truly dead 
> datanodes. In this case, many readers might be refetching locations for 
> different files with retry backoffs like 10ms, 20ms, 40ms, etc. It takes a 
> while to backoff enough to avoid impacting the namenode with that strategy.
> I suggest adding a configurable multiplier to the backoff strategy so that 
> operators can tune this as they see fit for their use-case. In the above low 
> latency case, one could set the base very low (say 2ms) and the multiplier 
> very high (say 50). This gives an aggressive first retry that very quickly 
> backs off.






[jira] [Commented] (HDFS-16155) Allow configurable exponential backoff in DFSInputStream refetchLocations

2021-08-23 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403140#comment-17403140
 ] 

Bryan Beaudreault commented on HDFS-16155:
--

[~hexiaoqiao] any chance you could review this?

> Allow configurable exponential backoff in DFSInputStream refetchLocations
> -
>
> Key: HDFS-16155
> URL: https://issues.apache.org/jira/browse/HDFS-16155
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The retry policy in 
> [DFSInputStream#refetchLocations|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1018-L1040]
>  was first written many years ago. It allows configuration of the base time 
> window, but subsequent retries double in an un-configurable way. This retry 
> strategy makes sense in some clusters as it's very conservative and will 
> avoid DDOSing the namenode in certain systemic failure modes – for example, 
> if a  file is being read by a large hadoop job and the underlying blocks are 
> moved by the balancer. In this case, enough datanodes would be added to the 
> deadNodes list and all hadoop tasks would simultaneously try to refetch the 
> blocks. The 3s doubling with random factor helps break up that stampeding 
> herd.
> However, not all cluster use-cases are created equal, so there are other 
> cases where a more aggressive initial backoff is preferred. For example in a 
> low-latency single reader scenario. In this case, if the balancer moves 
> enough blocks, the reader hits this 3s backoff which is way too long for a 
> low latency use-case.
> One could configure the window very low (10ms), but then you can hit 
> other systemic failure modes which would result in readers DDOSing the 
> namenode again. For example, if blocks went missing due to truly dead 
> datanodes. In this case, many readers might be refetching locations for 
> different files with retry backoffs like 10ms, 20ms, 40ms, etc. It takes a 
> while to backoff enough to avoid impacting the namenode with that strategy.
> I suggest adding a configurable multiplier to the backoff strategy so that 
> operators can tune this as they see fit for their use-case. In the above low 
> latency case, one could set the base very low (say 2ms) and the multiplier 
> very high (say 50). This gives an aggressive first retry that very quickly 
> backs off.






[jira] [Commented] (HDFS-15910) Replace bzero with explicit_bzero for better safety

2021-09-22 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418875#comment-17418875
 ] 

Bryan Beaudreault commented on HDFS-15910:
--

FYI, this breaks the ability to build Hadoop 3.3 on CentOS 6. I realize the 
BUILDING docs only give guidance for CentOS 8, but prior to this I was able to 
build Hadoop on CentOS 6. The fix in HDFS-15977 has not been backported to 
branch-3.3, so this remains broken for 3.3 for now.

> Replace bzero with explicit_bzero for better safety
> ---
>
> Key: HDFS-15910
> URL: https://issues.apache.org/jira/browse/HDFS-15910
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Affects Versions: 3.2.2
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It is better to always use explicit_bzero since it guarantees that the buffer 
> will be cleared irrespective of the compiler optimizations - 
> https://man7.org/linux/man-pages/man3/bzero.3.html.






[jira] [Commented] (HDFS-15977) Call explicit_bzero only if it is available

2021-09-22 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418876#comment-17418876
 ] 

Bryan Beaudreault commented on HDFS-15977:
--

Any chance this can be backported to branch-3.3 for the next 3.3.x release? 
As of HDFS-15910, Hadoop 3.3 cannot be built on older CentOS versions.

> Call explicit_bzero only if it is available
> ---
>
> Key: HDFS-15977
> URL: https://issues.apache.org/jira/browse/HDFS-15977
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> CentOS/RHEL 7 has glibc 2.17, and it does not support explicit_bzero. Now I 
> don't want to drop support for CentOS/RHEL 7, and we should call 
> explicit_bzero only if it is available. 






[jira] [Commented] (HDFS-15977) Call explicit_bzero only if it is available

2021-09-24 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419741#comment-17419741
 ] 

Bryan Beaudreault commented on HDFS-15977:
--

Thank you!

> Call explicit_bzero only if it is available
> ---
>
> Key: HDFS-15977
> URL: https://issues.apache.org/jira/browse/HDFS-15977
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> CentOS/RHEL 7 has glibc 2.17, and it does not support explicit_bzero. Now I 
> don't want to drop support for CentOS/RHEL 7, and we should call 
> explicit_bzero only if it is available. 






[jira] [Commented] (HDFS-15910) Replace bzero with explicit_bzero for better safety

2021-09-24 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419988#comment-17419988
 ] 

Bryan Beaudreault commented on HDFS-15910:
--

The linked issue has been backported to branch-3.3, so this is no longer an 
issue for 3.3.2+.

> Replace bzero with explicit_bzero for better safety
> ---
>
> Key: HDFS-15910
> URL: https://issues.apache.org/jira/browse/HDFS-15910
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Affects Versions: 3.2.2
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It is better to always use explicit_bzero since it guarantees that the buffer 
> will be cleared irrespective of the compiler optimizations - 
> https://man7.org/linux/man-pages/man3/bzero.3.html.






[jira] [Commented] (HDFS-15977) Call explicit_bzero only if it is available

2021-09-24 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419987#comment-17419987
 ] 

Bryan Beaudreault commented on HDFS-15977:
--

Just to close the loop: with that fix in place I was again able to build 
Hadoop 3.3.1 on CentOS 6 with no other changes. Thanks again.

> Call explicit_bzero only if it is available
> ---
>
> Key: HDFS-15977
> URL: https://issues.apache.org/jira/browse/HDFS-15977
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> CentOS/RHEL 7 has glibc 2.17, and it does not support explicit_bzero. Now I 
> don't want to drop support for CentOS/RHEL 7, and we should call 
> explicit_bzero only if it is available. 






[jira] [Commented] (HDFS-15119) Allow expiration of cached locations in DFSInputStream

2021-09-28 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421490#comment-17421490
 ] 

Bryan Beaudreault commented on HDFS-15119:
--

Did anything come of the benchmarks [~ahussein]?

Reading through, I agree it would be nice to have a mechanism for refreshing 
block locations. But in a low-latency use-case like HBase, ideally that would 
happen in the background, not in the critical path of a request. Alternatively, 
as mentioned above, one could refresh the locations only in response to certain 
exceptions.

> Allow expiration of cached locations in DFSInputStream
> --
>
> Key: HDFS-15119
> URL: https://issues.apache.org/jira/browse/HDFS-15119
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-15119-branch-2.10.003.patch, HDFS-15119.001.patch, 
> HDFS-15119.002.patch, HDFS-15119.003.patch
>
>
> Staleness and other transient conditions can affect reads for a long time 
> since the block locations may not be re-fetched. It makes sense to make 
> cached locations expire.
> For example, we may not take advantage of local-reads since the nodes are 
> blacklisted and have not been updated.






[jira] [Created] (HDFS-16261) Configurable grace period around deletion of invalidated blocks

2021-10-06 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HDFS-16261:


 Summary: Configurable grace period around deletion of invalidated 
blocks
 Key: HDFS-16261
 URL: https://issues.apache.org/jira/browse/HDFS-16261
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault


When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
NameNode and the NameNode instructs the old host to invalidate the block 
using DNA_INVALIDATE. As it stands today, this invalidation is async but tends 
to happen relatively quickly.

I'm working on a feature for HBase which enables efficient healing of locality 
through Balancer-style low level block moves. One issue is that HBase tends to 
keep open long running DFSInputStreams and moving blocks from under them causes 
lots of warnings in the RegionServer and increases long tail latencies due to the 
necessary retries in the DFSClient.

One way I'd like to fix this is to provide a configurable grace period on async 
invalidations. This would give the DFSClient enough time to refresh block 
locations before hitting any errors.
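
Purely for illustration, the operator-facing shape of such a knob might look 
like the sketch below. The property name is hypothetical (whatever key an 
eventual patch defines would replace it); only the REPLACE_BLOCK/DNA_INVALIDATE 
flow described above is assumed.

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ReplacedBlockGracePeriodExample {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // HYPOTHETICAL property, not an existing HDFS key: keep a replaced block
    // readable on its old DataNode for 10 minutes after DNA_INVALIDATE would
    // normally be acted on, so clients have time to refresh cached locations.
    conf.setLong("dfs.namenode.replaced-blocks.invalidation-grace-period.ms",
        TimeUnit.MINUTES.toMillis(10));
  }
}
{code}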






[jira] [Updated] (HDFS-16261) Configurable grace period around deletion of invalidated blocks

2021-10-06 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-16261:
-
Description: 
When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
NameNode and the NameNode instructs the old host to invalidate the block 
using DNA_INVALIDATE. As it stands today, this invalidation is async but tends 
to happen relatively quickly.

I'm working on a feature for HBase which enables efficient healing of locality 
through Balancer-style low level block moves (HBASE-26250). One issue is that 
HBase tends to keep open long running DFSInputStreams and moving blocks from 
under them causes lots of warnings in the RegionServer and increases long tail 
latencies due to the necessary retries in the DFSClient.

One way I'd like to fix this is to provide a configurable grace period on async 
invalidations. This would give the DFSClient enough time to refresh block 
locations before hitting any errors.

  was:
When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
NameNode and the NameNode instructs the old host to invalidate the block 
using DNA_INVALIDATE. As it stands today, this invalidation is async but tends 
to happen relatively quickly.

I'm working on a feature for HBase which enables efficient healing of locality 
through Balancer-style low level block moves. One issue is that HBase tends to 
keep open long running DFSInputStreams and moving blocks from under them causes 
lots of warns in the RegionServer and increases long tail latencies due to the 
necessary retries in the DFSClient.

One way I'd like to fix this is to provide a configurable grace period on async 
invalidations. This would give the DFSClient enough time to refresh block 
locations before hitting any errors.


> Configurable grace period around deletion of invalidated blocks
> ---
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warnings in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.






[jira] [Created] (HDFS-16262) Async refresh of cached locations in DFSInputStream

2021-10-06 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HDFS-16262:


 Summary: Async refresh of cached locations in DFSInputStream
 Key: HDFS-16262
 URL: https://issues.apache.org/jira/browse/HDFS-16262
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault


HDFS-15119 added the ability to invalidate cached block locations in 
DFSInputStream. As written, the feature will affect all DFSInputStreams 
regardless of whether they need it or not. The invalidation also only applies 
on the next request, so the next request will pay the cost of calling openInfo 
before reading the data.

I'm working on a feature for HBase which enables efficient healing of locality 
through Balancer-style low level block moves (HBASE-26250). I'd like to utilize 
the idea started in HDFS-15119 in order to update DFSInputStreams after blocks 
have been moved to local hosts.

I was considering using the feature as is, but some of our clusters are quite 
large and I'm concerned about the impact on the namenode:
 * We have some clusters with over 350k StoreFiles, so that'd be 350k 
DFSInputStreams. With such a large number and very active usage, having the 
refresh be in-line makes it too hard to ensure we don't DDOS the NameNode.
 * Currently we need to pay the price of openInfo the next time a 
DFSInputStream is invoked. Moving that async would minimize the latency hit. 
Also, some StoreFiles might be far less frequently accessed, so they may live 
on for a long time before ever refreshing. We'd like to be able to know that 
all DFSInputStreams are refreshed by a given time.
 * We may have 350k files, but only a small percentage of them are ever 
non-local at a given time. Refreshing only if necessary will save a lot of work.

In order to make this as painless to end users as possible, I'd like to:
 * Update the implementation to utilize an async thread for managing refreshes. 
This will give more control over rate limiting across all DFSInputStreams in a 
DFSClient, and also ensure that all DFSInputStreams are refreshed.
 * Only refresh files which are lacking a local replica or have known deadNodes 
to be cleaned up

 

 






[jira] [Commented] (HDFS-16262) Async refresh of cached locations in DFSInputStream

2021-10-07 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425784#comment-17425784
 ] 

Bryan Beaudreault commented on HDFS-16262:
--

PR submitted: [https://github.com/apache/hadoop/pull/3527]

I've had it running in one of our test clusters (Hadoop 3.3), under load and 
with block moves occurring. I had it tuned to a short interval of 10s just to 
put it in an extreme condition. It works really well.

[~kihwal] [~ahussein] just wanted to tag you both because you worked on the 
original issue. Thanks for the inspiration; I tried to implement this in a way 
that is backwards compatible with your original intention.
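
For anyone skimming without opening the PR, here is a rough, simplified sketch 
of the kind of background refresh loop described in this issue. The names are 
hypothetical and this is not the PR's actual code; it only illustrates the two 
points from the description: one shared thread per client, and skipping streams 
that do not need a refresh.

{code:java}
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LocationRefresherSketch {
  /** Minimal view of a registered stream; in HDFS this would be DFSInputStream. */
  public interface RefreshableStream {
    boolean needsRefresh();                           // no local replica, or has deadNodes
    void refreshBlockLocations() throws IOException;  // openInfo-style re-fetch
  }

  private final Set<RefreshableStream> streams = ConcurrentHashMap.newKeySet();
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  public LocationRefresherSketch(long intervalMs) {
    // A single shared thread naturally rate-limits NameNode calls across all
    // registered streams, and guarantees every stream is visited each interval.
    executor.scheduleWithFixedDelay(this::refreshAll,
        intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  public void register(RefreshableStream s)   { streams.add(s); }
  public void unregister(RefreshableStream s) { streams.remove(s); }

  private void refreshAll() {
    for (RefreshableStream s : streams) {
      if (!s.needsRefresh()) {
        continue; // the (usually large) fully-local majority costs nothing
      }
      try {
        s.refreshBlockLocations();
      } catch (IOException e) {
        // best effort: a failed refresh is simply retried on the next tick
      }
    }
  }

  public void shutdown() {
    executor.shutdownNow();
  }
}
{code}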

> Async refresh of cached locations in DFSInputStream
> ---
>
> Key: HDFS-16262
> URL: https://issues.apache.org/jira/browse/HDFS-16262
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> HDFS-15119 added the ability to invalidate cached block locations in 
> DFSInputStream. As written, the feature will affect all DFSInputStreams 
> regardless of whether they need it or not. The invalidation also only applies 
> on the next request, so the next request will pay the cost of calling 
> openInfo before reading the data.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). I'd like 
> to utilize the idea started in HDFS-15119 in order to update DFSInputStreams 
> after blocks have been moved to local hosts.
> I was considering using the feature as is, but some of our clusters are quite 
> large and I'm concerned about the impact on the namenode:
>  * We have some clusters with over 350k StoreFiles, so that'd be 350k 
> DFSInputStreams. With such a large number and very active usage, having the 
> refresh be in-line makes it too hard to ensure we don't DDOS the NameNode.
>  * Currently we need to pay the price of openInfo the next time a 
> DFSInputStream is invoked. Moving that async would minimize the latency hit. 
> Also, some StoreFiles might be far less frequently accessed, so they may live 
> on for a long time before ever refreshing. We'd like to be able to know that 
> all DFSInputStreams are refreshed by a given time.
>  * We may have 350k files, but only a small percentage of them are ever 
> non-local at a given time. Refreshing only if necessary will save a lot of 
> work.
> In order to make this as painless to end users as possible, I'd like to:
>  * Update the implementation to utilize an async thread for managing 
> refreshes. This will give more control over rate limiting across all 
> DFSInputStreams in a DFSClient, and also ensure that all DFSInputStreams are 
> refreshed.
>  * Only refresh files which are lacking a local replica or have known 
> deadNodes to be cleaned up
>  
>  






[jira] [Commented] (HDFS-16261) Configurable grace period around deletion of invalidated blocks

2021-10-07 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425805#comment-17425805
 ] 

Bryan Beaudreault commented on HDFS-16261:
--

I'm looking at this now. I don't have much experience in this area, but am 
looking into two high-level possibilities: handling this on the NameNode or 
handling it in the DataNode.
h2. Handling in the NameNode

When a DataNode receives a block, it notifies the namenode via 
notifyNamenodeReceivedBlock. This sends a RECEIVED_BLOCK to the namenode along 
with a "delHint", which tells the namenode to invalidate that block on the old 
host.

Tracing that delHint through a bunch of layers on the NameNode side, you 
eventually land in BlockManager.processExtraRedundancyBlock, which eventually 
lands in processChosenExcessRedundancy. processChosenExcessRedundancy adds the 
block to an excessRedundancyMap and to a nodeToBlocks map in InvalidateBlocks.

There is a RedundancyChore which periodically checks InvalidateBlocks, pulling 
a configurable number of blocks and adding them to the DatanodeDescriptor's 
invalidateBlocks map. One quick option might be to configure the 
RedundancyChore via dfs.namenode.redundancy.interval.seconds, though that's 
not exactly what we want, which is a per-block grace period.

Next time a DataNode sends a heartbeat, the Namenode processes various state 
for that datanode and sends back a series of commands. Here the NameNode pulls 
a configurable number of blocks from the DatanodeDescriptor's invalidateBlocks 
and sends them to the DataNode as part of a DNA_INVALIDATE command.

If we were to handle this in the NameNode, we could potentially hook in a 
couple places:
 * When adding to nodeToBlocks, we could include a timestamp. The 
RedundancyChore could only add blocks to the Descriptor's invalidateBlocks map 
if older than a threshold.
 * When adding to Descriptor's invalidateBlocks, we could add a timestamp. When 
processing heartbeats, we could only send blocks via DNA_INVALIDATE which have 
been in invalidateBlocks for more than a threshold
 * As mentioned above, we could try tuning 
dfs.namenode.redundancy.interval.seconds, though that isn't perfect because a 
block could be added right before the chore runs and thus get immediately 
invalidated.

h2. Handling in the DataNode

When a DataNode gets a request for a block, it looks that up in its 
FsDatasetImpl volumeMap. If the block does not exist, a 
ReplicaNotFoundException is thrown.

The DataNode receives the list of blocks to invalidate from the DNA_INVALIDATE 
command, which is processed by BPOfferService. This is immediately handed off 
to FsDatasetImpl.invalidate, which validates the request and immediately 
removes the block from volumeMap. At this point, the data still exists on disk 
but requests for the block would throw a ReplicaNotFoundException per above.

Once removed from volumeMap, the deletion of data is handled by the 
FsDatasetAsyncDiskService. The processing is done async, but is immediately 
handed off to a ThreadPoolExecutor which should execute fairly quickly.

A couple options:
 * Defer the call to FsDatasetImpl.invalidate, at the highest level. This could 
be passed off to a thread pool to be executed after a delay. In this case, the 
block would remain in the volumeMap until the task is executed.
 * Execute invalidate immediately, but defer the data deletion. We're already 
using a thread pool here, so it might be easier to execute after a delay. It's 
worth noting that there are other actions taken around the volumeMap removal. 
We'd need to verify whether those need to be synchronized with removal from 
volumeMap. In this case we'd need to either:
 ** relocate the volumeMap.remove call to within the FsDatasetAsyncDiskService. 
This seems like somewhat of a leaky abstraction.
 ** Add a pendingDeletion map and add to that when removing from volumeMap. The 
FsDatasetAsyncDiskService would remove from pendingDeletion once completed. 
We'd need to update our block fetch code to check volumeMap _or_ 
pendingDeletion. This separation might give us opportunities in the future, 
such as including a flag in the response that instructs the DFSClient "this 
block may go away soon".

 

I'm doing more investigation and specifically want to look into what would 
happen if the handling service died before invalidating blocks. I'm assuming 
this is already handled since this process is very async already, but it will 
be good to know. I also want to think a bit more about the pros and cons of 
each option above, and do some experimenting with the easiest option of tuning 
the redundancy chore. I'll report back when I have more information, and I'm 
also open to other opinions or suggestions.
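
As a strawman for the first DataNode-side option above (defer the call to 
FsDatasetImpl.invalidate), the mechanics could be as simple as the sketch 
below. The wrapper class is hypothetical and deliberately glosses over the 
locking, restart, and volumeMap bookkeeping concerns discussed above.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DeferredInvalidatorSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long gracePeriodMs;

  public DeferredInvalidatorSketch(long gracePeriodMs) {
    this.gracePeriodMs = gracePeriodMs;
  }

  /**
   * Instead of invalidating as soon as DNA_INVALIDATE is processed, schedule
   * the existing invalidation path to run after the grace period. Until the
   * deferred task runs, the replica stays in volumeMap and remains readable.
   */
  public void invalidateLater(Runnable existingInvalidateCall) {
    scheduler.schedule(existingInvalidateCall, gracePeriodMs, TimeUnit.MILLISECONDS);
  }

  public void shutdown() {
    scheduler.shutdownNow();
  }
}
{code}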

> Configurable grace period around deletion of invalidated blocks
> ---
>
> Key: HDFS-16261

[jira] [Commented] (HDFS-16261) Configurable grace period around deletion of invalidated blocks

2021-10-07 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425866#comment-17425866
 ] 

Bryan Beaudreault commented on HDFS-16261:
--

I've verified that setting "dfs.namenode.redundancy.interval.seconds" to, for 
example, 5 minutes and setting the DFSClient block location refresh to 10 
seconds (https://issues.apache.org/jira/browse/HDFS-16262) results in zero 
ReplicaNotFoundExceptions, even when the primary replicas of all blocks are 
shuffled to different hosts. Enabling debug logging of the refresh thread, I 
can see that while blocks are being shuffled the refresh thread triggers for 
files whose blocks have moved, and once all block moves are finished it 
settles down to 0 blocks refreshed.

I'm going to dig more into the above comment tomorrow, but wanted to test the 
simple change just to prove the concept. That appears to have been a success.
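
For reference, the test setup described above boils down to two settings, 
shown here programmatically. dfs.namenode.redundancy.interval.seconds is a 
standard NameNode key; the client-side refresh key below is the one introduced 
around HDFS-15119/HDFS-16262 as I recall it, so verify the exact name against 
the version in use.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class RefreshTestSettings {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // NameNode: only hand out invalidation (and other redundancy) work every
    // 5 minutes, giving clients a window before replicas disappear.
    conf.setLong("dfs.namenode.redundancy.interval.seconds", 300);
    // Client: refresh cached block locations every 10 seconds.
    conf.setLong("dfs.client.refresh.read-block-locations.ms", 10_000L);
  }
}
{code}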

> Configurable grace period around deletion of invalidated blocks
> ---
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warnings in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.






[jira] [Commented] (HDFS-16261) Configurable grace period around deletion of invalidated blocks

2021-10-08 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426391#comment-17426391
 ] 

Bryan Beaudreault commented on HDFS-16261:
--

Despite how well tuning "dfs.namenode.redundancy.interval.seconds" has worked, 
I don't think that's a good long term option because the RedundancyMonitor also 
handles some processing of reconstruction and misplaced blocks. I don't want to 
mess with those processes.

For now I decided to go the route of encoding an insertion timestamp in the 
NameNode's InvalidateBlocks nodeToBlocks map. This felt like the easiest 
approach, since it's just a minor change to the existing system for handing out 
block invalidation commands.

I'll be testing this out in a test cluster shortly.

In the meantime, I've spent some time looking into how the NameNode handles 
crash recovery when blocks might have been awaiting deletion. The only handling 
is that on NameNode startup it will find all over-replicated blocks and try to 
reduce the replication to what is expected. This means invalidating blocks 
again, but not necessarily the ones we had originally chosen.

This definitely seems like a downside of this approach. It basically means that 
we may mess up locality again after the NameNode restarts, since it very well 
may decide to keep the block we originally invalidated. We'd need to re-run the 
process of moving blocks back to how we want them, which could be automatically 
handled but may temporarily degrade latencies a bit. Another negative aspect of 
this approach, which I realized during the investigation, is that if a client 
calls DFSClient.getLocatedBlocks while a block is pending deletion, the result 
will include the to-be-deleted replica until it's been fully purged.

I think implementing this in the DataNode instead would avoid both of those 
downsides. On the flip side, if a DataNode restarted while a block was pending 
deletion, when it started back up again the block would no longer be available. 
This seems like a totally reasonable failure mode.

For now I'm going to do some testing of the NameNode side to see how it works 
in practice, but will also look into what a DataNode side implementation would 
look like.

> Configurable grace period around deletion of invalidated blocks
> ---
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warns in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16261) Configurable grace period around invalidation of replaced blocks

2021-10-21 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-16261:
-
Summary: Configurable grace period around invalidation of replaced blocks  
(was: Configurable grace period around deletion of invalidated blocks)

> Configurable grace period around invalidation of replaced blocks
> 
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warns in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16261) Configurable grace period around invalidation of replaced blocks

2021-10-27 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434941#comment-17434941
 ] 

Bryan Beaudreault commented on HDFS-16261:
--

Just to close the loop on that last comment, it did not make sense to implement 
this in the DataNode. It was theoretically possible, but it became very messy due 
to how access to the volumeMap is managed there. I also think implementing it in 
the NameNode is more in line with the general HDFS architecture, where the 
NameNode owns the state around block locations and pending work.

I've had the grace period running on a couple internal clusters for a couple 
weeks now. I just submitted a PR for review.

> Configurable grace period around invalidation of replaced blocks
> 
>
> Key: HDFS-16261
> URL: https://issues.apache.org/jira/browse/HDFS-16261
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a block is moved with REPLACE_BLOCK, the new location is recorded in the 
> NameNode and the NameNode instructs the old host to invalidate the block 
> using DNA_INVALIDATE. As it stands today, this invalidation is async but 
> tends to happen relatively quickly.
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low level block moves (HBASE-26250). One 
> issue is that HBase tends to keep open long running DFSInputStreams and 
> moving blocks from under them causes lots of warns in the RegionServer and 
> increases long tail latencies due to the necessary retries in the DFSClient.
> One way I'd like to fix this is to provide a configurable grace period on 
> async invalidations. This would give the DFSClient enough time to refresh 
> block locations before hitting any errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16262) Async refresh of cached locations in DFSInputStream

2023-01-25 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680883#comment-17680883
 ] 

Bryan Beaudreault commented on HDFS-16262:
--

Hey [~mccormickt12]. It's not exactly that it was ignored/forgotten, but I 
think the problems are sort of orthogonal. I'm not sure it would have made 
sense to solve for ignoredNodes in the context of this JIRA.

This Jira addresses a problem caused by persistent state. A DFSInputStream caches 
block locations and dead nodes for the lifetime of the stream. While a stream 
is open, the underlying replicas of blocks may have changed. If so, the cached 
block locations may have datanodes in the wrong order to make use of the best 
locality. Also, dead nodes may build up in a way that inevitably leads to a 
BlockMissingException even if there are no real problems:
 * File has 4 blocks
 * Block A has locations 1, 2, 3
 * Block B has locations 3, 4, 5
 * Block C has locations 6, 7, 8
 * Block D has locations 3, 1, 6
 * Opening this file involves fetching these block locations
 * After opening the file, the block locations are all shuffled around in the 
cluster out of band
 * Reading this file involves reading these blocks in sequence:
 ** First we read block A, but it no longer exists at location 1. Add 1 to 
deadNodes; it's then found at location 2, so success.
 ** Same problem for B: it doesn't exist at its first location 3. Add 3 to 
deadNodes, and find it at location 4.
 ** Block C doesn't exist at 6; add 6 to deadNodes and find it at 7.
 ** Now we get to D, but all 3 replicas (3, 1, 6) are in deadNodes – they were 
never dead, they just no longer hold the replicas above. They do hold the 
replicas for D, but the input stream doesn't care: it logs "Could not obtain 
block" and does an expensive refreshLocations call.
 * The above refreshLocations call increments the global failures counter for this 
stream by 1
 * Let's say your file actually has 100s of blocks, or that you're often 
preading in ways that re-request data from blocks over time. If enough 
replicas are moving around frequently enough, you very quickly exceed the 
failure count and trigger a BlockMissingException (see the sketch after this list).
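
To make that failure mode concrete, here's a tiny self-contained sketch of the 
bookkeeping, using plain Java collections in place of the stream's cached 
locations and deadNodes. This is an illustration of the scenario above, not the 
actual DFSInputStream code:
{code:java}
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: why block D becomes "unobtainable" even though every
// datanode is healthy. The replicas simply moved after the stream was opened.
public class DeadNodesBuildup {
  public static void main(String[] args) {
    Set<String> deadNodes = new HashSet<>(); // accumulated for the stream's lifetime

    // Block locations cached at open time...
    Map<String, List<String>> cached = new LinkedHashMap<>();
    cached.put("A", List.of("dn1", "dn2", "dn3"));
    cached.put("B", List.of("dn3", "dn4", "dn5"));
    cached.put("C", List.of("dn6", "dn7", "dn8"));
    cached.put("D", List.of("dn3", "dn1", "dn6"));

    // ...versus where the replicas actually live after the out-of-band moves.
    Map<String, Set<String>> actual = Map.of(
        "A", Set.of("dn2", "dn3", "dn9"),
        "B", Set.of("dn4", "dn5", "dn9"),
        "C", Set.of("dn7", "dn8", "dn9"),
        "D", Set.of("dn3", "dn1", "dn6")); // D never moved at all

    for (String block : cached.keySet()) {
      boolean success = false;
      for (String dn : cached.get(block)) {
        if (deadNodes.contains(dn)) {
          continue; // the stream never retries a node it marked dead
        }
        if (actual.get(block).contains(dn)) {
          System.out.println("block " + block + ": read from " + dn);
          success = true;
          break;
        }
        deadNodes.add(dn); // replica not found there -> marked dead for the stream
      }
      if (!success) {
        System.out.println("block " + block
            + ": \"Could not obtain block\" -> expensive refetchLocations()");
      }
    }
  }
}
{code}
Running it, blocks A, B and C each succeed on their second location while adding 
a perfectly healthy node to deadNodes, and D then fails with all of its replicas 
filtered out.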

For us the openInfo in refetchLocations would cause clear latency spikes, and 
the BlockMissingExceptions were worse. So the goal of this Jira was to prevent 
that from happening by updating the locations when we know they had changed – 
hopefully before they are requested. Once locations are updated, it makes 
little sense to keep the same dead nodes list.

The difference with ignoredNodes is it's in the context of a single request. 
Every request starts off with no nodes ignored. The idea with ignoredNodes I 
think is that since you're looping and kicking off hedged requests, you want to 
make sure you don't submit a request to the same node twice. So on each loop 
you add the last node to ignoredNodes.

It's hard to believe ignoredNodes is a problem unless there's really a problem 
with those nodes. I imagine there would be other logs leading up to the error 
which might indicate why ignoredNodes had to grow beyond 1. After the first 
hedge, it looks like further hedge requests are only kicked off if that first 
hedge throws an InterruptedException in getFirstToComplete.

I suppose it's possible ignoredNodes is interacting poorly with deadNodes. 
Imagine my above example, except block B doesn't fail, so deadNodes is 1, 6. A 
request for D comes in and has to hedge the request to 3... So 3 is in 
ignoredNodes and 1/6 are in deadNodes – refetchLocations required.

If that's true, I think this Jira should help the situation since it would 
clear out deadNodes periodically. I don't think this Jira can clear out 
ignoredNodes because it's not global state that we can gain access to; 
ignoredNodes is allocated in hedgedFetchBlockByteRange itself.

Looking at your Jira, the line numbers in the stack trace are off, so it's 
unclear whether your install has this Jira available. I'm also not 100% sure of 
your approach to clearing ignoredNodes. Based on my understanding, ignoredNodes 
should really only contain nodes that actively have hedged requests pending 
within the current request context. Clearing the list could cause the next loop 
to submit a request to a node that's already serving one, so you end up 
with 2 pending futures to the same node. I could be wrong though; I haven't 
spent a lot of time studying the code recently. I'd be looking for other log 
indicators leading up to the exception you saw, to see what precipitated the 
additions to ignoredNodes to the point that all locations were ignored for a 
single request.

 

> Async refresh of cached locations in DFSInputStream
> ---
>
> Key: HDFS-16262
> URL: https://issues.apache.org/jira/browse/HDFS-16262
> Project: Hadoop HDFS
>  Issue Type: Improve

[jira] [Created] (HDFS-17179) DFSInputStream should report CorruptMetaHeaderException as corruptBlock to NameNode

2023-09-05 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HDFS-17179:


 Summary: DFSInputStream should report CorruptMetaHeaderException 
as corruptBlock to NameNode
 Key: HDFS-17179
 URL: https://issues.apache.org/jira/browse/HDFS-17179
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault


We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block meta file header is corrupt
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:129) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1160) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1132) ~[hadoop-hdfs-client-3.3.1.jar:?]
at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1128) ~[hadoop-hdfs-client-3.3.1.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.hadoop.util.InvalidChecksumSizeException: The value -75 does not map to a valid checksum Type
at org.apache.hadoop.util.DataChecksum.mapByteToChecksumType(DataChecksum.java:190) ~[hadoop-common-3.3.1.jar:?]
at org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:159) ~[hadoop-common-3.3.1.jar:?]
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:131) ~[hadoop-hdfs-client-3.3.1.jar:?]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17179) DFSInputStream should report CorruptMetaHeaderException as corruptBlock to NameNode

2023-09-05 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-17179:
-
Description: 
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1160)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1132) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1128) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.hadoop.util.InvalidChecksumSizeException: The value -75 
does not map to a valid checksum Type
at 
org.apache.hadoop.util.DataChecksum.mapByteToChecksumType(DataChecksum.java:190)
 ~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:159) 
~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:131)
 ~[hadoop-hdfs-client-3.3.1.jar:?] {code}

  was:
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corruptat 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]   at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
 ~[hadoop-hdfs-client-3.3.1.jar:?]   at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
 ~[hadoop-hdfs-client-3.3.1.jar:?]  at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
 ~[hadoop-hdfs-client-3.3.1.jar:?]   at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
 ~[hadoop-hdfs-client-3.3.1.jar:?]   at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
 ~[hadoop-hdfs-client-3.3.1.jar:?]at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockR

[jira] [Updated] (HDFS-17179) DFSInputStream should report CorruptMetaHeaderException as corruptBlock to NameNode

2023-09-05 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-17179:
-
Description: 
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1160)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1132) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1128) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.hadoop.util.InvalidChecksumSizeException: The value -75 
does not map to a valid checksum Type
at 
org.apache.hadoop.util.DataChecksum.mapByteToChecksumType(DataChecksum.java:190)
 ~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:159) 
~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:131)
 ~[hadoop-hdfs-client-3.3.1.jar:?] {code}

  was:
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.im

[jira] [Commented] (HDFS-17179) DFSInputStream should report CorruptMetaHeaderException as corruptBlock to NameNode

2023-09-05 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762160#comment-17762160
 ] 

Bryan Beaudreault commented on HDFS-17179:
--

This is only a problem with ShortCircuitReads. With a normal read, it results 
in an Op.READ_BLOCK request to the DataNode. When the DataNode processes this 
request, it will similarly try to call BlockMetadataHeader.readHeader and will 
encounter a CorruptMetaHeaderException. It will handle that exception and 
report the block to the namenode 
[here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L4083].
 Here's an example stacktrace of that:
{code:java}
2023-09-02 21:44:04,414 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
xx:50010:DataXceiver error processing READ_BLOCK operation  src: 
/xx:4556 dst: /xx:50010
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
        at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:191)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:147)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readDataChecksum(BlockMetadataHeader.java:100)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:335)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:596)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:292)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.util.InvalidChecksumSizeException: The value 84 
does not map to a valid checksum Type
        at 
org.apache.hadoop.util.DataChecksum.mapByteToChecksumType(DataChecksum.java:190)
        at 
org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:177)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:189)
        ... 8 more
2023-09-02 21:44:04,415 WARN 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
BP-1284711166-xx-1647379306520:blk_1403927728_330193265 on /mnt/hdfs/data 
{code}
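
Roughly, the handling pattern on the remote-read path looks like the sketch 
below. Names here are hypothetical placeholders; the real logic lives in the 
DataXceiver/DataNode code linked above:
{code:java}
// Sketch only, with hypothetical names. On a remote read the DataNode builds
// the block sender itself, so it can catch the corrupt-meta-header failure and
// report the replica to the NameNode, which then re-replicates from a good copy.
interface BadBlockReporter {
  void reportBadBlock(String blockId);
}

class ReadBlockOp {
  private final BadBlockReporter reporter;

  ReadBlockOp(BadBlockReporter reporter) {
    this.reporter = reporter;
  }

  void readBlock(String blockId) throws Exception {
    try {
      openBlockSender(blockId); // reads the checksum header from the meta file
    } catch (Exception corruptMetaHeader) {
      reporter.reportBadBlock(blockId); // the short-circuit path has no equivalent today
      throw corruptMetaHeader;
    }
  }

  private void openBlockSender(String blockId) throws Exception {
    // placeholder for BlockSender construction / BlockMetadataHeader.readHeader
  }
}
{code}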

> DFSInputStream should report CorruptMetaHeaderException as corruptBlock to 
> NameNode
> ---
>
> Key: HDFS-17179
> URL: https://issues.apache.org/jira/browse/HDFS-17179
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> We've been running into some data corruption issues recently. When a 
> ChecksumException is thrown, DFSInputStream correctly reports the block to 
> the NameNode which triggers deletion and re-replication of the replica. It's 
> also possible that we fail to even read the meta header for constructing the 
> checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
> by DFSInputStream. We should handle this similarly to ChecksumException. See 
> stacktrace:
>  
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
> meta file header is corrupt
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) 
> ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org

[jira] [Updated] (HDFS-17179) DFSInputStream should report CorruptMetaHeaderException as corruptBlock to NameNode

2023-09-05 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-17179:
-
Description: 
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: 
BlockReaderFactory(fileName=/hbase/data/default/table-c1/5a76502c2c7be37b2d92057baa8a3d81/0/24ddc16e2d824a3bb9bf242ad950a589,
 block=BP-154245500-xx-1657570070866:blk_1362550389_288818622): error creating ShortCircuitReplica.
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1160)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1132) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1128) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.hadoop.util.InvalidChecksumSizeException: The value -75 
does not map to a valid checksum Type
at 
org.apache.hadoop.util.DataChecksum.mapByteToChecksumType(DataChecksum.java:190)
 ~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:159) 
~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:131)
 ~[hadoop-hdfs-client-3.3.1.jar:?] 
WARN org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: 
ShortCircuitCache(0x4da4703d): failed to load 
1362550389_BP-154245500-xxx-1657570070866{code}

  was:
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.cl

[jira] [Updated] (HDFS-17179) DFSInputStream should report CorruptMetaHeaderException as corruptBlock to NameNode

2023-09-05 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-17179:
-
Description: 
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: 
BlockReaderFactory(fileName=/hbase/data/default/table-c1/5a76502c2c7be37b2d92057baa8a3d81/0/24ddc16e2d824a3bb9bf242ad950a589,
 block=BP-154245500-xx-1657570070866:blk_1362550389_288818622): error 
creating ShortCircuitReplica.
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1160)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1132) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1128) 
~[hadoop-hdfs-client-3.3.1.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.hadoop.util.InvalidChecksumSizeException: The value -75 
does not map to a valid checksum Type
at 
org.apache.hadoop.util.DataChecksum.mapByteToChecksumType(DataChecksum.java:190)
 ~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:159) 
~[hadoop-common-3.3.1.jar:?]
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:131)
 ~[hadoop-hdfs-client-3.3.1.jar:?] 
WARN org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: 
ShortCircuitCache(0x4da4703d): failed to load 
1362550389_BP-154245500-xxx-1657570070866{code}

  was:
We've been running into some data corruption issues recently. When a 
ChecksumException is thrown, DFSInputStream correctly reports the block to the 
NameNode which triggers deletion and re-replication of the replica. It's also 
possible that we fail to even read the meta header for constructing the 
checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
by DFSInputStream. We should handle this similarly to ChecksumException. See 
stacktrace:

 
{code:java}
WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: 
BlockReaderFactory(fileName=/hbase/data/default/table-c1/5a76502c2c7be37b2d92057baa8a3d81/0/24ddc16e2d824a3bb9bf242ad950a589,
 block=BP-154245500-xx-1657570070866:blk_1362550389_288818622): error creating ShortCircuitReplica.
org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
meta file header is corrupt
at 
org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
at 
org.apache.hadoop.hd

[jira] [Commented] (HDFS-17179) DFSInputStream should report CorruptMetaHeaderException as corruptBlock to NameNode

2023-09-06 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762377#comment-17762377
 ] 

Bryan Beaudreault commented on HDFS-17179:
--

Digging further into this, I’m realizing that DFSInputStream cannot handle this 
exception. It’s caught way before reaching there, and the stacktrace I shared 
is printed as part of a WARN. The BlockReaderLocal will fail to build, and we 
will fall back to BlockReaderRemote, which will contact the DataNode with 
Op.READ_BLOCK. The DataNode will handle this exception if checksums are 
enabled. If checksums are not enabled, it won’t even try to load the 
MetaHeader. 



I think the bug here, if any, is that BlockReaderLocal should work similarly to 
the DataNode: the MetaHeader is only used for checksums, so only read it if 
checksums are enabled. This is easier said than done because of how 
BlockReaderLocal uses checksum metadata throughout the code for determining 
read sizes, etc., even if checksums are disabled. It's probably doable but will 
require some refactoring to do it safely. 
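
To illustrate the shape of that refactor, a minimal sketch with hypothetical 
names only; the real change would sit inside BlockReaderLocal/ShortCircuitReplica 
and needs the care described above:
{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;

// Sketch only, hypothetical names: defer parsing the block meta header until a
// checksum is actually needed, mirroring what the DataNode does for READ_BLOCK.
class LazyMetaHeader {
  private final FileChannel metaChannel;
  private final boolean verifyChecksum;

  LazyMetaHeader(FileChannel metaChannel, boolean verifyChecksum) {
    this.metaChannel = metaChannel;
    this.verifyChecksum = verifyChecksum;
  }

  /**
   * When checksums are disabled we never touch the header, so a corrupt meta
   * file cannot fail a read that didn't need it in the first place.
   */
  void ensureChecksumHeaderLoaded() throws IOException {
    if (!verifyChecksum) {
      return;
    }
    // In the real code this is BlockMetadataHeader.preadHeader(metaChannel),
    // which is the call that throws CorruptMetaHeaderException today.
    parseHeader(metaChannel);
  }

  private void parseHeader(FileChannel ch) throws IOException {
    // placeholder for the actual header parsing
  }
}
{code}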

> DFSInputStream should report CorruptMetaHeaderException as corruptBlock to 
> NameNode
> ---
>
> Key: HDFS-17179
> URL: https://issues.apache.org/jira/browse/HDFS-17179
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> We've been running into some data corruption issues recently. When a 
> ChecksumException is thrown, DFSInputStream correctly reports the block to 
> the NameNode which triggers deletion and re-replication of the replica. It's 
> also possible that we fail to even read the meta header for constructing the 
> checksum. This gets thrown as CorruptMetaHeaderException which is not handled 
> by DFSInputStream. We should handle this similarly to ChecksumException. See 
> stacktrace:
>  
> {code:java}
> WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: 
> BlockReaderFactory(fileName=/hbase/data/default/table-c1/5a76502c2c7be37b2d92057baa8a3d81/0/24ddc16e2d824a3bb9bf242ad950a589,
>  block=BP-154245500-xx-1657570070866:blk_1362550389_288818622): error 
> creating ShortCircuitReplica.
> org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The block 
> meta file header is corrupt
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.preadHeader(BlockMetadataHeader.java:133)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.(ShortCircuitReplica.java:129)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:618)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) 
> ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1160)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1132) 
> ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1128) 
> ~[hadoop-hdfs-client-3.3.1.jar:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>   at java.lang.Thread.run(Thread.java:829) ~[?:?]
> Caused by: org.apache.hadoop.util.InvalidChecksumSizeException: The value -75 
> does not map to a valid checksum Type
>   at 
> org.apache.hadoop.util.DataChecksum.mapByteToChecksumType(DataChecksum.java:190)
>  ~[hadoop-common-3.3.1.jar:?]
>  

[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2023-11-02 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782303#comment-17782303
 ] 

Bryan Beaudreault commented on HDFS-15413:
--

I was also having this issue when trying out Erasure Coding for HBase 
storefiles. Similarly, compactions were failing but other requests were fine. I 
applied [~max2049]'s patch to my cluster, and it resolved the issue (after 
setting max attempts to 2).

It would be great to get this reviewed and merged.

> DFSStripedInputStream throws exception when datanodes close idle connections
> 
>
> Key: HDFS-15413
> URL: https://issues.apache.org/jira/browse/HDFS-15413
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding, hdfs-client
>Affects Versions: 3.1.3
> Environment: - Hadoop 3.1.3
> - erasure coding with ISA-L and RS-3-2-1024k scheme
> - running in kubernetes
> - dfs.client.socket-timeout = 1
> - dfs.datanode.socket.write.timeout = 1
>Reporter: Andrey Elenskiy
>Priority: Critical
>  Labels: pull-request-available
> Attachments: out.log
>
>
> We've run into an issue with compactions failing in HBase when erasure coding 
> is enabled on a table directory. After digging further I was able to narrow 
> it down to a seek + read logic and able to reproduce the issue with hdfs 
> client only:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FSDataInputStream;
> public class ReaderRaw {
> public static void main(final String[] args) throws Exception {
> Path p = new Path(args[0]);
> int bufLen = Integer.parseInt(args[1]);
> int sleepDuration = Integer.parseInt(args[2]);
> int countBeforeSleep = Integer.parseInt(args[3]);
> int countAfterSleep = Integer.parseInt(args[4]);
> Configuration conf = new Configuration();
> FSDataInputStream istream = FileSystem.get(conf).open(p);
> byte[] buf = new byte[bufLen];
> int readTotal = 0;
> int count = 0;
> try {
>   while (true) {
> istream.seek(readTotal);
> int bytesRemaining = bufLen;
> int bufOffset = 0;
> while (bytesRemaining > 0) {
>   int nread = istream.read(buf, 0, bufLen);
>   if (nread < 0) {
>   throw new Exception("nread is less than zero");
>   }
>   readTotal += nread;
>   bufOffset += nread;
>   bytesRemaining -= nread;
> }
> count++;
> if (count == countBeforeSleep) {
> System.out.println("sleeping for " + sleepDuration + " 
> milliseconds");
> Thread.sleep(sleepDuration);
> System.out.println("resuming");
> }
> if (count == countBeforeSleep + countAfterSleep) {
> System.out.println("done");
> break;
> }
>   }
> } catch (Exception e) {
> System.out.println("exception on read " + count + " read total " 
> + readTotal);
> throw e;
> }
> }
> }
> {code}
> The issue appears to be due to the fact that datanodes close the connection 
> of EC client if it doesn't fetch next packet for longer than 
> dfs.client.socket-timeout. The EC client doesn't retry and instead assumes 
> that those datanodes went away resulting in "missing blocks" exception.
> I was able to consistently reproduce with the following arguments:
> {noformat}
> bufLen = 100 (just below 1MB which is the size of the stripe) 
> sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000)
> countBeforeSleep = 1
> countAfterSleep = 7
> {noformat}
> I've attached the entire log output of running the snippet above against 
> erasure coded file with RS-3-2-1024k policy. And here are the logs from 
> datanodes of disconnecting the client:
> datanode 1:
> {noformat}
> 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped 
> reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver 
> error processing READ_BLOCK operation  src: /10.128.23.40:53748 dst: 
> /10.128.14.46:9866); java.net.SocketTimeoutException: 1 millis timeout 
> while waiting for channel to be ready for write. ch : 
> java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 
> remote=/10.128.23.40:53748]
> {noformat}
> datanode 2:
> {noformat}
> 2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped 
> reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver 
> error processing READ_BLOCK operation  src: /10.1

[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections

2023-11-06 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783449#comment-17783449
 ] 

Bryan Beaudreault commented on HDFS-15413:
--

[~qinyuren] thank you for the suggestion! In our usage of EC with HBase, this 
happens even with very little load, because HBase will read a chunk of data for 
compaction and then write it out again over time. The write may get rate limited 
due to throughput limits, which can cause it to pause before reading more data. 
That pause exceeds the timeout, and just one retry gets it to re-connect.

> DFSStripedInputStream throws exception when datanodes close idle connections
> 
>
> Key: HDFS-15413
> URL: https://issues.apache.org/jira/browse/HDFS-15413
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding, hdfs-client
>Affects Versions: 3.1.3
> Environment: - Hadoop 3.1.3
> - erasure coding with ISA-L and RS-3-2-1024k scheme
> - running in kubernetes
> - dfs.client.socket-timeout = 1
> - dfs.datanode.socket.write.timeout = 1
>Reporter: Andrey Elenskiy
>Priority: Critical
>  Labels: pull-request-available
> Attachments: out.log
>
>
> We've run into an issue with compactions failing in HBase when erasure coding 
> is enabled on a table directory. After digging further I was able to narrow 
> it down to a seek + read logic and able to reproduce the issue with hdfs 
> client only:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FSDataInputStream;
> public class ReaderRaw {
> public static void main(final String[] args) throws Exception {
> Path p = new Path(args[0]);
> int bufLen = Integer.parseInt(args[1]);
> int sleepDuration = Integer.parseInt(args[2]);
> int countBeforeSleep = Integer.parseInt(args[3]);
> int countAfterSleep = Integer.parseInt(args[4]);
> Configuration conf = new Configuration();
> FSDataInputStream istream = FileSystem.get(conf).open(p);
> byte[] buf = new byte[bufLen];
> int readTotal = 0;
> int count = 0;
> try {
>   while (true) {
> istream.seek(readTotal);
> int bytesRemaining = bufLen;
> int bufOffset = 0;
> while (bytesRemaining > 0) {
>   int nread = istream.read(buf, 0, bufLen);
>   if (nread < 0) {
>   throw new Exception("nread is less than zero");
>   }
>   readTotal += nread;
>   bufOffset += nread;
>   bytesRemaining -= nread;
> }
> count++;
> if (count == countBeforeSleep) {
> System.out.println("sleeping for " + sleepDuration + " 
> milliseconds");
> Thread.sleep(sleepDuration);
> System.out.println("resuming");
> }
> if (count == countBeforeSleep + countAfterSleep) {
> System.out.println("done");
> break;
> }
>   }
> } catch (Exception e) {
> System.out.println("exception on read " + count + " read total " 
> + readTotal);
> throw e;
> }
> }
> }
> {code}
> The issue appears to be due to the fact that datanodes close the connection 
> of EC client if it doesn't fetch next packet for longer than 
> dfs.client.socket-timeout. The EC client doesn't retry and instead assumes 
> that those datanodes went away resulting in "missing blocks" exception.
> I was able to consistently reproduce with the following arguments:
> {noformat}
> bufLen = 100 (just below 1MB which is the size of the stripe) 
> sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000)
> countBeforeSleep = 1
> countAfterSleep = 7
> {noformat}
> I've attached the entire log output of running the snippet above against 
> erasure coded file with RS-3-2-1024k policy. And here are the logs from 
> datanodes of disconnecting the client:
> datanode 1:
> {noformat}
> 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped 
> reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver 
> error processing READ_BLOCK operation  src: /10.128.23.40:53748 dst: 
> /10.128.14.46:9866); java.net.SocketTimeoutException: 1 millis timeout 
> while waiting for channel to be ready for write. ch : 
> java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 
> remote=/10.128.23.40:53748]
> {noformat}
> datanode 2:
> {noformat}
> 2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped 
> reading, disconnecting it (datanode-v11-1-hadoop.hadoo

[jira] [Created] (HDFS-17262) Transfer rate metric warning log is too verbose

2023-11-21 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HDFS-17262:


 Summary: Transfer rate metric warning log is too verbose
 Key: HDFS-17262
 URL: https://issues.apache.org/jira/browse/HDFS-17262
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Bryan Beaudreault


HDFS-16917 added a LOG.warn when the passed duration is 0. The unit for duration is 
millis, and it's very possible for a read to take less than a millisecond over a 
local TCP connection. We are seeing this spam multiple times per millisecond. 
There's another report on the PR for HDFS-16917.

Please downgrade the log to DEBUG or remove it.
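
Something along these lines is what I'm suggesting (illustrative only; the names 
here are hypothetical, not the actual HDFS-16917 code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of the requested change: don't WARN on every
// sub-millisecond transfer, since those are normal for local reads.
class TransferRateLogging {
  private static final Logger LOG = LoggerFactory.getLogger(TransferRateLogging.class);

  static long transferRate(long bytes, long durationMs) {
    if (durationMs <= 0) {
      LOG.debug("Transfer of {} bytes finished in under 1 ms; clamping duration to 1 ms", bytes);
      durationMs = 1;
    }
    // bytes per millisecond, numerically about KB/s
    return bytes / durationMs;
  }
}
{code}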



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17262) Transfer rate metric warning log is too verbose

2023-11-21 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788504#comment-17788504
 ] 

Bryan Beaudreault commented on HDFS-17262:
--

cc [~rdingankar] 

> Transfer rate metric warning log is too verbose
> ---
>
> Key: HDFS-17262
> URL: https://issues.apache.org/jira/browse/HDFS-17262
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Priority: Major
>
> HDFS-16917 added a LOG.warn when the passed duration is 0. The unit for duration 
> is millis, and it's very possible for a read to take less than a millisecond 
> over a local TCP connection. We are seeing this spam multiple times 
> per millisecond. There's another report on the PR for HDFS-16917.
> Please downgrade the log to DEBUG or remove it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17319) Downgrade noisy InvalidToken log in ShortCircuitCache

2024-01-02 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HDFS-17319:


 Summary: Downgrade noisy InvalidToken log in ShortCircuitCache
 Key: HDFS-17319
 URL: https://issues.apache.org/jira/browse/HDFS-17319
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Bryan Beaudreault


ShortCircuitCache logs an exception whenever InvalidToken is detected (see 
below). As I understand it, this is part of normal operations when block tokens 
are enabled, so this log seems really noisy. I think we should downgrade it to 
DEBUG, or at least remove the stacktrace. It leads someone to think they 
have a problem when they don't.
{code:java}
2024-01-02T16:02:51,621 [hedgedRead-1545] INFO 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: 
ShortCircuitCache(0xbac84bc): could not load 
1437522350_BP-1420092181-ip-1658432093559 due to InvalidToken exception.
org.apache.hadoop.security.token.SecretManager$InvalidToken: access control 
error while attempting to set up short-circuit access to 
/hbase/data/default/hbase-table-1/23f85f2d91e4967ce389d1a09c43e46d/0/609ccd5d7fcb4830a6602ddaea5ed27e
        at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:651)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
        at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:545)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:786)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
        at 
org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:723)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
        at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
        at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360)
 ~[hadoop-hdfs-client-3.3.1.jar:?]
        at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:715) 
~[hadoop-hdfs-client-3.3.1.jar:?]
        at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1180)
 ~[hadoop-hdfs-client-3.3.1.jar:?] {code}
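As a hedged sketch of what the downgrade could look like (the class and method 
below are illustrative, not the actual ShortCircuitCache code), the condition 
would be logged quietly and without a stack trace:
{code:java}
import org.apache.hadoop.security.token.SecretManager.InvalidToken;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only: InvalidToken is treated as an expected event when
// block tokens are enabled, so it is logged at DEBUG with just the message.
public class ShortCircuitTokenLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ShortCircuitTokenLoggingSketch.class);

  static void handleInvalidToken(String cacheKey, InvalidToken e) {
    LOG.debug("{}: could not load replica due to InvalidToken, will refetch a token: {}",
        cacheKey, e.getMessage());
  }

  public static void main(String[] args) {
    handleInvalidToken("ShortCircuitCache(0xbac84bc)",
        new InvalidToken("access control error (example)"));
  }
}
{code}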



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16874) Improve DataNode decommission for Erasure Coding

2024-01-08 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804427#comment-17804427
 ] 

Bryan Beaudreault commented on HDFS-16874:
--

[~jingzhao] is there any update here?

> Improve DataNode decommission for Erasure Coding
> 
>
> Key: HDFS-16874
> URL: https://issues.apache.org/jira/browse/HDFS-16874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ec, erasure-coding
>Reporter: Jing Zhao
>Assignee: Jing Zhao
>Priority: Major
>
> There are a couple of issues with the current DataNode decommission 
> implementation when large amounts of Erasure Coding data are involved in the 
> data re-replication/reconstruction process:
>  # Slowness. In HDFS-8786 we decided to use re-replication for DataNode 
> decommission if the internal EC block is still available. While this strategy 
> reduces the CPU cost of EC reconstruction, it greatly limits the overall data 
> recovery bandwidth, since there is only a single DataNode as the source. As 
> high-density HDD hosts are more and more widely used by HDFS, especially along 
> with Erasure Coding for the warm-data use case, this becomes a big pain for 
> cluster management. In our production environment, decommissioning a DataNode 
> storing several hundred TB of EC data can take several days. HDFS-16613 
> provides an optimization based on the existing mechanism, but more 
> fundamentally we may want to allow EC reconstruction for DataNode 
> decommission so as to achieve much larger recovery bandwidth.
>  # The semantics of the existing EC reconstruction command (the 
> BlockECReconstructionInfoProto msg sent from NN to DN) are not clear. The 
> existing reconstruction command depends on the holes in the 
> srcNodes/liveBlockIndices arrays to indicate the target internal blocks for 
> recovery, but holes can also be caused by the corresponding datanode being too 
> busy to serve as a reconstruction source. As a result, the later DataNode-side 
> reconstruction may not be consistent with the original intention. E.g., if the 
> index of the missing block is 6, and the datanode storing block 0 is busy, the 
> src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 
> 8]. The target datanode may then reconstruct internal block 0 instead of 6 
> (see the sketch below). HDFS-16566 addresses this issue by indicating an 
> excluded-index list. More fundamentally, we can follow the same path but go a 
> step further by adding an optional field that explicitly indicates the target 
> block indices in the command protobuf msg. With that extension the DataNode 
> will no longer use the holes in the src node array to "guess" the 
> reconstruction targets.
> Internally we have developed and applied fixes along the above directions. We 
> have seen significant improvement (a 100+ times speedup) in datanode 
> decommission speed for EC data. The clearer semantics of the reconstruction 
> command protobuf msg also help prevent potential data corruption during EC 
> reconstruction.
> We will use this ticket to track similar fixes for the Apache releases.
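
To make the second point above concrete, here is a standalone sketch (not the 
actual ErasureCodingWorker or protobuf code; names are illustrative) contrasting 
target selection inferred from holes with an explicit target list, using the 
block-6-missing / block-0-busy example from the description:
{code:java}
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch only.
public class EcTargetSelectionSketch {

  /** Old behavior: any index absent from the sources is assumed to be a target. */
  static List<Integer> guessTargetsFromHoles(byte[] liveBlockIndices, int totalBlocks) {
    BitSet live = new BitSet(totalBlocks);
    for (byte i : liveBlockIndices) {
      live.set(i);
    }
    List<Integer> targets = new ArrayList<>();
    for (int i = 0; i < totalBlocks; i++) {
      if (!live.get(i)) {
        targets.add(i); // may include blocks that merely live on busy nodes
      }
    }
    return targets;
  }

  /** Proposed behavior: the command states the targets explicitly. */
  static List<Integer> explicitTargets(byte[] targetBlockIndices) {
    List<Integer> targets = new ArrayList<>();
    for (byte i : targetBlockIndices) {
      targets.add((int) i);
    }
    return targets;
  }

  public static void main(String[] args) {
    // RS-6-3: block 6 is missing; the node holding block 0 is busy and omitted.
    byte[] liveIndices = {1, 2, 3, 4, 5, 7, 8};
    System.out.println(guessTargetsFromHoles(liveIndices, 9)); // [0, 6] -- ambiguous
    System.out.println(explicitTargets(new byte[] {6}));       // [6]    -- unambiguous
  }
}
{code}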



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17284) Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks during block recovery

2024-01-08 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-17284:
-
Component/s: ec

> Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks 
> during block recovery
> --
>
> Key: HDFS-17284
> URL: https://issues.apache.org/jira/browse/HDFS-17284
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Reporter: Hualong Zhang
>Assignee: Hualong Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks 
> during block recovery
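
For anyone skimming: a hedged sketch of the general fix pattern for this class of 
bug (variable names below are illustrative, not the actual NameNode code) is to do 
the multiplication in long arithmetic and clamp before narrowing back to int.
{code:java}
// Illustrative sketch only.
public class ReplicationTaskMathSketch {
  static int computeMaxTasks(int maxStreams, int numLiveNodes, int workMultiplier) {
    long tasks = (long) maxStreams * numLiveNodes * workMultiplier;
    // Clamp so the result never wraps into a negative int.
    return (int) Math.min(tasks, Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    // 100 * 5000 * 10000 = 5,000,000,000 would overflow int without the cast.
    System.out.println(computeMaxTasks(100, 5000, 10000));
  }
}
{code}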



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17364) Use WeakReferencedElasticByteBufferPool in DFSStripedInputStream

2024-01-31 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HDFS-17364:


 Summary: Use WeakReferencedElasticByteBufferPool in 
DFSStripedInputStream
 Key: HDFS-17364
 URL: https://issues.apache.org/jira/browse/HDFS-17364
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Bryan Beaudreault


DFSStripedInputStream uses ElasticByteBufferPool to allocate byte buffers for 
the "curStripeBuf". This buffer is used for non-positional (stateful) reads and 
is allocated with a size of numDataBlocks * cellSize. For RS-6-3-1024k, that 
means each DFSStripedInputStream could allocate a 6 MB buffer. When the input 
stream is finished, the buffer is put back in the pool. Over time, and with 
spikes of concurrent reads, the pool grows and most of the buffers sit there 
unused.
 
WeakReferencedElasticByteBufferPool was introduced in HADOOP-18105 and mitigates 
this issue because the excess buffers can be GC'd once they are no longer 
needed. We should use this same pool in DFSStripedInputStream.
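
A minimal sketch of the proposed swap, assuming a Hadoop release that includes 
HADOOP-18105 on the classpath (the class below is illustrative, not the actual 
DFSStripedInputStream code):
{code:java}
import java.nio.ByteBuffer;
import org.apache.hadoop.io.ByteBufferPool;
import org.apache.hadoop.io.WeakReferencedElasticByteBufferPool;

// Illustrative sketch only.
public class StripeBufferPoolSketch {
  // For RS-6-3-1024k: 6 data blocks * 1 MiB cell size = 6 MiB per stream.
  private static final int NUM_DATA_BLOCKS = 6;
  private static final int CELL_SIZE = 1024 * 1024;

  private static final ByteBufferPool BUFFER_POOL =
      new WeakReferencedElasticByteBufferPool();

  public static void main(String[] args) {
    ByteBuffer curStripeBuf =
        BUFFER_POOL.getBuffer(true /* direct */, NUM_DATA_BLOCKS * CELL_SIZE);
    try {
      // ... the stateful read path would fill and drain the stripe buffer here ...
    } finally {
      // Returned buffers are held via weak references, so a burst of concurrent
      // streams no longer pins 6 MiB per stream in the pool indefinitely.
      BUFFER_POOL.putBuffer(curStripeBuf);
    }
  }
}
{code}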



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-3702) Add an option for NOT writing the blocks locally if there is a datanode on the same box as the client

2024-03-12 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825665#comment-17825665
 ] 

Bryan Beaudreault commented on HDFS-3702:
-

I can't find any evidence that this was ever used in HBase. Just to finally 
close the loop, this has now landed in HBASE-28260 for inclusion in HBase 2.6.0+.

> Add an option for NOT writing the blocks locally if there is a datanode on 
> the same box as the client
> -
>
> Key: HDFS-3702
> URL: https://issues.apache.org/jira/browse/HDFS-3702
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.5.1
>Reporter: Nicolas Liochon
>Assignee: Lei (Eddy) Xu
>Priority: Minor
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-3702.000.patch, HDFS-3702.001.patch, 
> HDFS-3702.002.patch, HDFS-3702.003.patch, HDFS-3702.004.patch, 
> HDFS-3702.005.patch, HDFS-3702.006.patch, HDFS-3702.007.patch, 
> HDFS-3702.008.patch, HDFS-3702.009.patch, HDFS-3702.010.patch, 
> HDFS-3702.011.patch, HDFS-3702.012.patch, HDFS-3702_Design.pdf
>
>
> This is useful for Write-Ahead-Logs: these files are written for recovery 
> only, and are not read when there are no failures.
> Taking HBase as an example, these files will be read only if the process that 
> wrote them (the 'HBase regionserver') dies. This will likely come from a 
> hardware failure, hence the corresponding datanode will be dead as well. So 
> we're writing 3 replicas, but in reality only 2 of them are really useful.
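
For completeness, a hedged example of how a WAL writer can opt out of the local 
replica using the CreateFlag.NO_LOCAL_WRITE flag this issue added; the path, 
buffer size, replication, and block size below are illustrative:
{code:java}
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Illustrative sketch only.
public class NoLocalWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path wal = new Path("/hbase/WALs/example-wal");

    try (FSDataOutputStream out = fs.create(wal,
        FsPermission.getFileDefault(),
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE, CreateFlag.NO_LOCAL_WRITE),
        4096,               // buffer size
        (short) 3,          // replication
        128 * 1024 * 1024,  // block size
        null)) {            // no Progressable
      // Blocks for this file will avoid the local datanode when possible.
      out.writeBytes("recovery-only data");
    }
  }
}
{code}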



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-5837) dfs.namenode.replication.considerLoad does not consider decommissioned nodes

2014-01-27 Thread Bryan Beaudreault (JIRA)
Bryan Beaudreault created HDFS-5837:
---

 Summary: dfs.namenode.replication.considerLoad does not consider 
decommissioned nodes
 Key: HDFS-5837
 URL: https://issues.apache.org/jira/browse/HDFS-5837
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Bryan Beaudreault


In DefaultBlockPlacementPolicy, there is a setting 
dfs.namenode.replication.considerLoad which tries to balance the load of the 
cluster when choosing replica locations.  This code does not take into account 
decommissioned nodes.

The code for considerLoad calculates the load by doing:  TotalClusterLoad /
numNodes.  However, numNodes includes decommissioned nodes (which have 0 load). 
 Therefore, the average load is artificially low.  Example:

TotalLoad = 250
numNodes = 100
decommissionedNodes = 50

avgLoad = 250/100 = 2.50
trueAvgLoad = 250 / (100 - 70) = 8.33

If the real load of the remaining 30 nodes is (on average) 8.33, this is more 
than 2x the calculated average load of 2.50.  This causes these nodes to be 
rejected as replica locations. The final result is that all nodes are rejected, 
and no replicas can be placed.  

See exceptions printed from client during this scenario: 
https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5837) dfs.namenode.replication.considerLoad does not consider decommissioned nodes

2014-01-27 Thread Bryan Beaudreault (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HDFS-5837:


Description: 
In DefaultBlockPlacementPolicy, there is a setting 
dfs.namenode.replication.considerLoad which tries to balance the load of the 
cluster when choosing replica locations.  This code does not take into account 
decommissioned nodes.

The code for considerLoad calculates the load by doing:  TotalClusterLoad / 
numNodes.  However, numNodes includes decommissioned nodes (which have 0 load). 
 Therefore, the average load is artificially low.  Example:

TotalLoad = 250
numNodes = 100
decommissionedNodes = 70
remainingNodes = numNodes - decommissionedNodes = 30

avgLoad = 250/100 = 2.50
trueAvgLoad = 250 / 30 = 8.33

If the real load of the remaining 30 nodes is (on average) 8.33, this is more 
than 2x the calculated average load of 2.50.  This causes these nodes to be 
rejected as replica locations. The final result is that all nodes are rejected, 
and no replicas can be placed.  

See exceptions printed from client during this scenario: 
https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1


  was:
In DefaultBlockPlacementPolicy, there is a setting 
dfs.namenode.replication.considerLoad which tries to balance the load of the 
cluster when choosing replica locations.  This code does not take into account 
decommissioned nodes.

The code for considerLoad calculates the load by doing:  TotalClusterLoad /
numNodes.  However, numNodes includes decommissioned nodes (which have 0 load). 
 Therefore, the average load is artificially low.  Example:

TotalLoad = 250
numNodes = 100
decommissionedNodes = 50

avgLoad = 250/100 = 2.50
trueAvgLoad = 250 / (100 - 70) = 8.33

If the real load of the remaining 30 nodes is (on average) 8.33, this is more 
than 2x the calculated average load of 2.50.  This causes these nodes to be 
rejected as replica locations. The final result is that all nodes are rejected, 
and no replicas can be placed.  

See exceptions printed from client during this scenario: 
https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1



> dfs.namenode.replication.considerLoad does not consider decommissioned nodes
> 
>
> Key: HDFS-5837
> URL: https://issues.apache.org/jira/browse/HDFS-5837
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Bryan Beaudreault
>
> In DefaultBlockPlacementPolicy, there is a setting 
> dfs.namenode.replication.considerLoad which tries to balance the load of the 
> cluster when choosing replica locations.  This code does not take into 
> account decommissioned nodes.
> The code for considerLoad calculates the load by doing:  TotalClusterLoad / 
> numNodes.  However, numNodes includes decommissioned nodes (which have 0 
> load).  Therefore, the average load is artificially low.  Example:
> TotalLoad = 250
> numNodes = 100
> decommissionedNodes = 70
> remainingNodes = numNodes - decommissionedNodes = 30
> avgLoad = 250/100 = 2.50
> trueAvgLoad = 250 / 30 = 8.33
> If the real load of the remaining 30 nodes is (on average) 8.33, this is more 
> than 2x the calculated average load of 2.50.  This causes these nodes to be 
> rejected as replica locations. The final result is that all nodes are 
> rejected, and no replicas can be placed.  
> See exceptions printed from client during this scenario: 
> https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1
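
A small standalone illustration of the bias (not the actual 
BlockPlacementPolicyDefault code), using the numbers from the description above:
{code:java}
// Illustrative sketch only.
public class ConsiderLoadSketch {
  static double avgLoadIncludingDecommissioned(double totalLoad, int numNodes) {
    return totalLoad / numNodes;
  }

  static double avgLoadInService(double totalLoad, int numNodes, int decommissioned) {
    return totalLoad / (numNodes - decommissioned);
  }

  public static void main(String[] args) {
    double totalLoad = 250;
    int numNodes = 100, decommissioned = 70;
    System.out.println(avgLoadIncludingDecommissioned(totalLoad, numNodes));   // 2.5
    System.out.println(avgLoadInService(totalLoad, numNodes, decommissioned)); // ~8.33
    // With a "2x the average" threshold, every in-service node (load ~8.33 >
    // 2 * 2.5) is rejected as a replica target when the skewed average is used.
  }
}
{code}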



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-5837) dfs.namenode.replication.considerLoad does not consider decommissioned nodes

2014-02-09 Thread Bryan Beaudreault (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896080#comment-13896080
 ] 

Bryan Beaudreault commented on HDFS-5837:
-

Thanks!

> dfs.namenode.replication.considerLoad does not consider decommissioned nodes
> 
>
> Key: HDFS-5837
> URL: https://issues.apache.org/jira/browse/HDFS-5837
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha, 3.0.0, 2.0.6-alpha, 2.2.0
>Reporter: Bryan Beaudreault
>Assignee: Tao Luo
> Fix For: 2.3.0
>
> Attachments: HDFS-5837.patch, HDFS-5837_B.patch, HDFS-5837_C.patch, 
> HDFS-5837_branch_2.2.0.patch
>
>
> In DefaultBlockPlacementPolicy, there is a setting 
> dfs.namenode.replication.considerLoad which tries to balance the load of the 
> cluster when choosing replica locations.  This code does not take into 
> account decommissioned nodes.
> The code for considerLoad calculates the load by doing:  TotalClusterLoad / 
> numNodes.  However, numNodes includes decommissioned nodes (which have 0 
> load).  Therefore, the average load is artificially low.  Example:
> TotalLoad = 250
> numNodes = 100
> decommissionedNodes = 70
> remainingNodes = numNodes - decommissionedNodes = 30
> avgLoad = 250/100 = 2.50
> trueAvgLoad = 250 / 30 = 8.33
> If the real load of the remaining 30 nodes is (on average) 8.33, this is more 
> than 2x the calculated average load of 2.50.  This causes these nodes to be 
> rejected as replica locations. The final result is that all nodes are 
> rejected, and no replicas can be placed.  
> See exceptions printed from client during this scenario: 
> https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)