[jira] [Updated] (HDFS-17631) Fix RedundantEditLogInputStream.nextOp() state error when EditLogInputStream.skipUntil() throw IOException

2024-09-29 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17631:
---
Description: 
In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes via 
QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream is 
used to combine multiple remote JournalNode input streams.

The problem is that when reading edit logs with 
RedundantEditLogInputStream.nextOp(), if the first stream's skipUntil() throws 
an IOException (network errors, hardware problems, etc.), the state becomes 
State.OK rather than State.STREAM_FAILED.

The proper, fault-tolerant state sequence would be:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL 
-> State.OK
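The intended failover behavior can be sketched as a small state machine. The following is a simplified illustrative model, not the actual HDFS code (the class, interface, and field names are invented): the key point is that an IOException from skipUntil() must move the state to STREAM_FAILED before failing over to the next stream.

```java
import java.io.IOException;
import java.util.List;

// Simplified sketch of the retry loop intended for
// RedundantEditLogInputStream.nextOp(); all names below are illustrative.
public class RedundantStreamSketch {
  public enum State { SKIP_UNTIL, OK, STREAM_FAILED }

  public interface Stream {
    void skipUntil(long txId) throws IOException;
  }

  public State state = State.SKIP_UNTIL;
  public int cur = 0; // index of the stream currently in use

  // Try each stream in order. On IOException from skipUntil() the state
  // becomes STREAM_FAILED first, then we fail over to the next stream.
  public State skipUntilWithFailover(List<Stream> streams, long txId) {
    while (cur < streams.size()) {
      try {
        state = State.SKIP_UNTIL;
        streams.get(cur).skipUntil(txId);
        state = State.OK; // only reached when skipUntil() succeeded
        return state;
      } catch (IOException e) {
        state = State.STREAM_FAILED; // the fix: record the failure first
        cur++;                       // (try next stream)
      }
    }
    return state;
  }
}
```

With two streams where the first one throws, the observed sequence matches the chain above: SKIP_UNTIL -> STREAM_FAILED -> (next stream) SKIP_UNTIL -> OK.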

  was:
In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes via 
QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream is 
used to combine multiple remote JournalNode input streams.

The problem is that when reading edit logs with 
RedundantEditLogInputStream.nextOp(), if the first stream's skipUntil() throws 
an IOException (network errors, hardware problems, etc.), the state becomes 
State.OK rather than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL 
-> State.OK


> Fix RedundantEditLogInputStream.nextOp()  state error when 
> EditLogInputStream.skipUntil() throw IOException
> ---
>
> Key: HDFS-17631
> URL: https://issues.apache.org/jira/browse/HDFS-17631
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes 
> via QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream 
> is used to combine multiple remote JournalNode input streams.
> The problem is that when reading edit logs with 
> RedundantEditLogInputStream.nextOp(), if the first stream's skipUntil() throws 
> an IOException (network errors, hardware problems, etc.), the state becomes 
> State.OK rather than State.STREAM_FAILED.
> The proper, fault-tolerant state sequence would be:
> State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL 
> -> State.OK



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17631) Fix RedundantEditLogInputStream.nextOp() state error when EditLogInputStream.skipUntil() throw IOException

2024-09-29 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17631:
---
Description: 
In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes via 
QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream is 
used to combine multiple remote JournalNode input streams.

The problem is that when reading edit logs with 
RedundantEditLogInputStream.nextOp(), if the first stream's skipUntil() throws 
an IOException (network errors, hardware problems, etc.), the state becomes 
State.OK rather than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL 
-> State.OK

  was:
In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes via 
QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream is 
used to combine multiple remote JournalNode input streams.

Currently, when EditLogInputStream.skipUntil() throws an IOException in 
RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL


> Fix RedundantEditLogInputStream.nextOp()  state error when 
> EditLogInputStream.skipUntil() throw IOException
> ---
>
> Key: HDFS-17631
> URL: https://issues.apache.org/jira/browse/HDFS-17631
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes 
> via QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream 
> is used to combine multiple remote JournalNode input streams.
> The problem is that when reading edit logs with 
> RedundantEditLogInputStream.nextOp(), if the first stream's skipUntil() throws 
> an IOException (network errors, hardware problems, etc.), the state becomes 
> State.OK rather than State.STREAM_FAILED.
>  
> The proper state sequence would be as below:
> State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL 
> -> State.OK






[jira] [Updated] (HDFS-17631) Fix RedundantEditLogInputStream.nextOp() state error when EditLogInputStream.skipUntil() throw IOException

2024-09-29 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17631:
---
Description: 
In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes via 
QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream is 
used to combine multiple remote JournalNode input streams.

Currently, when EditLogInputStream.skipUntil() throws an IOException in 
RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL

  was:
In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes.

Currently, when EditLogInputStream.skipUntil() throws an IOException in 
RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL


> Fix RedundantEditLogInputStream.nextOp()  state error when 
> EditLogInputStream.skipUntil() throw IOException
> ---
>
> Key: HDFS-17631
> URL: https://issues.apache.org/jira/browse/HDFS-17631
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes 
> via QuorumJournalManager.selectInputStreams(), and RedundantEditLogInputStream 
> is used to combine multiple remote JournalNode input streams.
> Currently, when EditLogInputStream.skipUntil() throws an IOException in 
> RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
> than State.STREAM_FAILED.
> The proper state sequence would be as below:
> State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL






[jira] [Updated] (HDFS-17631) Fix RedundantEditLogInputStream.nextOp() state error when EditLogInputStream.skipUntil() throw IOException

2024-09-29 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17631:
---
Description: 
In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes.

Currently, when EditLogInputStream.skipUntil() throws an IOException in 
RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL

  was:
Currently, when EditLogInputStream.skipUntil() throws an IOException in 
RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL


> Fix RedundantEditLogInputStream.nextOp()  state error when 
> EditLogInputStream.skipUntil() throw IOException
> ---
>
> Key: HDFS-17631
> URL: https://issues.apache.org/jira/browse/HDFS-17631
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In NameNode HA mode, the standby NameNode loads edit logs from JournalNodes.
>  
> Currently, when EditLogInputStream.skipUntil() throws an IOException in 
> RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
> than State.STREAM_FAILED.
> The proper state sequence would be as below:
> State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL






[jira] [Updated] (HDFS-17631) Fix RedundantEditLogInputStream.nextOp() state error when EditLogInputStream.skipUntil() throw IOException

2024-09-29 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17631:
---
Summary: Fix RedundantEditLogInputStream.nextOp()  state error when 
EditLogInputStream.skipUntil() throw IOException  (was: 
RedundantEditLogInputStream.nextOp() will be State.STREAM_FAILED when 
EditLogInputStream.skipUntil() throw IOException)

> Fix RedundantEditLogInputStream.nextOp()  state error when 
> EditLogInputStream.skipUntil() throw IOException
> ---
>
> Key: HDFS-17631
> URL: https://issues.apache.org/jira/browse/HDFS-17631
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> Currently, when EditLogInputStream.skipUntil() throws an IOException in 
> RedundantEditLogInputStream.nextOp(), the state still becomes State.OK rather 
> than State.STREAM_FAILED.
> The proper state sequence would be as below:
> State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL






[jira] [Assigned] (HDFS-17631) RedundantEditLogInputStream.nextOp() will be State.STREAM_FAILED when EditLogInputStream.skipUntil() throw IOException

2024-09-24 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17631:
--

Assignee: liuguanghua

> RedundantEditLogInputStream.nextOp() will be State.STREAM_FAILED when 
> EditLogInputStream.skipUntil() throw IOException
> --
>
> Key: HDFS-17631
> URL: https://issues.apache.org/jira/browse/HDFS-17631
> Project: Hadoop HDFS
>  Issue Type: Bug
> Environment: Currently, when EditLogInputStream.skipUntil() throws an 
> IOException in RedundantEditLogInputStream.nextOp(), the state still becomes 
> State.OK rather than State.STREAM_FAILED.
> The proper state sequence would be as below:
> State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>







[jira] [Created] (HDFS-17631) RedundantEditLogInputStream.nextOp() will be State.STREAM_FAILED when EditLogInputStream.skipUntil() throw IOException

2024-09-24 Thread liuguanghua (Jira)
liuguanghua created HDFS-17631:
--

 Summary: RedundantEditLogInputStream.nextOp() will be 
State.STREAM_FAILED when EditLogInputStream.skipUntil() throw IOException
 Key: HDFS-17631
 URL: https://issues.apache.org/jira/browse/HDFS-17631
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: Currently, when EditLogInputStream.skipUntil() throws an 
IOException in RedundantEditLogInputStream.nextOp(), the state still becomes 
State.OK rather than State.STREAM_FAILED.

The proper state sequence would be as below:

State.SKIP_UNTIL -> State.STREAM_FAILED -> (try next stream) State.SKIP_UNTIL
Reporter: liuguanghua









[jira] [Updated] (HDFS-17592) FastCopy support data copy in different nameservices without federation

2024-07-26 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17592:
---
Description: 
FastCopy is a faster data copy tool. Within a federated cluster or a single 
cluster, FastCopy copies blocks via hardlinks, which is much faster than a 
regular copy.

FastCopy can also support copying data via transfer between different 
nameservices without federation. In theory, this removes one I/O transfer and 
can reduce the copy time by almost half.

 

Test data:

Block size: 128 MB

Data set: 1 TB EC files + 1 TB 3x-replicated files

|distcp map=20|DistCp via FastCopy (HardLink)|DistCp via FastCopy (Transfer)|DistCp (original)|
|Time Spent|5m6.687s|22m44.094s|38m17.024s|
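The hardlink-versus-transfer choice described above can be made concrete with a small sketch. The class and parameter names below are invented for illustration (not real HDFS APIs), assuming the only inputs are whether source and destination share a cluster or a federation's datanode pool:

```java
// Hypothetical sketch of the strategy choice: hardlink when source and
// destination namespaces see the same physical block files (same cluster,
// or nameservices federated over one datanode pool), transfer otherwise.
public class CopyStrategySketch {
  public enum Strategy { HARDLINK, TRANSFER }

  public static Strategy choose(boolean sameCluster, boolean sameFederation) {
    // A hardlink is only possible when both namespaces share datanode
    // volumes; otherwise the block bytes must be transferred over the wire.
    return (sameCluster || sameFederation) ? Strategy.HARDLINK
                                           : Strategy.TRANSFER;
  }
}
```

This matches the timing table: the hardlink path avoids moving block data entirely, while the transfer path still beats a regular copy by skipping one I/O hop.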

 

  was:
FastCopy is a faster data copy tool. Within a federated cluster or a single 
cluster, FastCopy copies blocks via hardlinks, which is much faster than a 
regular copy.

FastCopy can also support copying data via transfer between different 
nameservices without federation. In theory, it could save almost half the time 
compared to a regular copy.


> FastCopy support data copy in different nameservices without federation
> ---
>
> Key: HDFS-17592
> URL: https://issues.apache.org/jira/browse/HDFS-17592
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
> Attachments: FastCopy via Transfer.jpg
>
>
> FastCopy is a faster data copy tool. Within a federated cluster or a single 
> cluster, FastCopy copies blocks via hardlinks, which is much faster than a 
> regular copy.
> FastCopy can also support copying data via transfer between different 
> nameservices without federation. In theory, this removes one I/O transfer and 
> can reduce the copy time by almost half.
>  
> Test data:
> Block size: 128 MB
> Data set: 1 TB EC files + 1 TB 3x-replicated files
>  
> |distcp map=20|DistCp via FastCopy (HardLink)|DistCp via FastCopy (Transfer)|DistCp (original)|
> |Time Spent|5m6.687s|22m44.094s|38m17.024s|
>  






[jira] [Updated] (HDFS-17592) FastCopy support data copy in different nameservices without federation

2024-07-26 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17592:
---
Attachment: FastCopy via Transfer.jpg

> FastCopy support data copy in different nameservices without federation
> ---
>
> Key: HDFS-17592
> URL: https://issues.apache.org/jira/browse/HDFS-17592
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
> Attachments: FastCopy via Transfer.jpg
>
>
> FastCopy is a faster data copy tool. Within a federated cluster or a single 
> cluster, FastCopy copies blocks via hardlinks, which is much faster than a 
> regular copy.
> FastCopy can also support copying data via transfer between different 
> nameservices without federation. In theory, it could save almost half the 
> time compared to a regular copy.






[jira] [Assigned] (HDFS-17592) FastCopy support data copy in different nameservices without federation

2024-07-26 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17592:
--

Assignee: liuguanghua

> FastCopy support data copy in different nameservices without federation
> ---
>
> Key: HDFS-17592
> URL: https://issues.apache.org/jira/browse/HDFS-17592
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>
> FastCopy is a faster data copy tool. Within a federated cluster or a single 
> cluster, FastCopy copies blocks via hardlinks, which is much faster than a 
> regular copy.
> FastCopy can also support copying data via transfer between different 
> nameservices without federation. In theory, it could save almost half the 
> time compared to a regular copy.






[jira] [Created] (HDFS-17592) FastCopy support data copy in different nameservices without federation

2024-07-26 Thread liuguanghua (Jira)
liuguanghua created HDFS-17592:
--

 Summary: FastCopy support data copy in different nameservices 
without federation
 Key: HDFS-17592
 URL: https://issues.apache.org/jira/browse/HDFS-17592
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: liuguanghua


FastCopy is a faster data copy tool. Within a federated cluster or a single 
cluster, FastCopy copies blocks via hardlinks, which is much faster than a 
regular copy.

FastCopy can also support copying data via transfer between different 
nameservices without federation. In theory, it could save almost half the time 
compared to a regular copy.






[jira] [Commented] (HDFS-2139) Fast copy for HDFS.

2024-07-22 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867676#comment-17867676
 ] 

liuguanghua commented on HDFS-2139:
---

-- We already have hadoop distcp. Why do we need hdfs dfs -fastcp?
[~zeekling], the dfs -fastcp command works like dfs -cp, while distcp is a 
MapReduce program that relies on YARN. In some cases we can use fastcp with 
only HDFS, without needing DistCp.

> Fast copy for HDFS.
> ---
>
> Key: HDFS-2139
> URL: https://issues.apache.org/jira/browse/HDFS-2139
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pritam Damania
>Assignee: Rituraj
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, 
> HDFS-2139.patch, image-2022-08-11-11-48-17-994.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as
> follows :
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing top of
> the rack data transfers.
> Note: an extra improvement would be to instruct the datanode to create a
> hardlink of the block file if we are copying a block on the same datanode.
> [~xuzq_zander] provided a design doc: 
> https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing
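The six steps quoted above can be sketched as the following control flow. The interfaces below are hypothetical stand-ins for the NameNode and DataNode RPCs (invented names, not the real HDFS APIs), used only to make the sequence concrete:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fast-copy control flow; all types here are illustrative.
public class FastCopySketch {
  public record Block(long id) {}

  public interface Namenode {
    List<Block> getBlocks(String src);        // step 1: source block metadata
    List<Datanode> getLocations(Block b);     // step 2: datanodes holding b
    Block addEmptyBlock(String dst);          // step 3: allocate a dst block
    void blockReceived(String dst, Block b);  // step 5: datanode's report
  }

  public interface Datanode {
    // step 4: local copy (or hardlink when src and dst share the datanode)
    void copyBlockLocally(Block src, Block dst);
  }

  public static List<Block> fastCopy(Namenode nn, String src, String dst) {
    List<Block> copied = new ArrayList<>();
    for (Block b : nn.getBlocks(src)) {          // step 1
      Datanode dn = nn.getLocations(b).get(0);   // step 2: pick a replica
      Block target = nn.addEmptyBlock(dst);      // step 3
      dn.copyBlockLocally(b, target);            // step 4: no network hop
      nn.blockReceived(dst, target);             // step 5
      copied.add(target);
    }
    return copied;                               // step 6: all blocks copied
  }
}
```

Because each block is copied locally by a datanode that already holds a replica, no block data crosses the top-of-rack network.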






[jira] [Comment Edited] (HDFS-2139) Fast copy for HDFS.

2024-07-21 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867636#comment-17867636
 ] 

liuguanghua edited comment on HDFS-2139 at 7/22/24 2:52 AM:


[~hexiaoqiao], thanks for the reply.

For 3: fastcopy can be used within a federated cluster, within a single 
cluster, and between two different clusters with no federation. The difference 
is that fastcopy uses hardlinks within a federated or single cluster, and uses 
transfer between two different clusters with no federation.

 

Test data:

Block size: 128 MB

Data set: 1 TB EC files + 1 TB 3x-replicated files

|distcp map=20|DistCp via FastCopy (HardLink)|DistCp via FastCopy (Transfer)|DistCp (original)|
|Time Spent|5m6.687s|22m44.094s|38m17.024s|

[~zeekling] , fastcopy can improve data copy efficiency.


was (Author: liuguanghua):
[~hexiaoqiao], thanks for the reply.

For 3: fastcopy can be used within a federated cluster, within a single 
cluster, and between two different clusters with no federation. The difference 
is that fastcopy uses hardlinks within a federated or single cluster, and uses 
transfer between two different clusters with no federation.

 

Test data:

Block size: 128 MB

Data set: 1 TB EC files + 1 TB 3x-replicated files

|distcp map=20|DistCp via FastCopy (HardLink)|DistCp via FastCopy (Transfer)|DistCp (original)|
|Time Spent|5m6.687s|22m44.094s|38m17.024s|

[~zeekling] , fastcopy can improve data copy efficiency.

> Fast copy for HDFS.
> ---
>
> Key: HDFS-2139
> URL: https://issues.apache.org/jira/browse/HDFS-2139
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pritam Damania
>Assignee: Rituraj
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, 
> HDFS-2139.patch, image-2022-08-11-11-48-17-994.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as
> follows :
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing top of
> the rack data transfers.
> Note: an extra improvement would be to instruct the datanode to create a
> hardlink of the block file if we are copying a block on the same datanode.
> [~xuzq_zander] provided a design doc: 
> https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing






[jira] [Commented] (HDFS-2139) Fast copy for HDFS.

2024-07-21 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867636#comment-17867636
 ] 

liuguanghua commented on HDFS-2139:
---

[~hexiaoqiao], thanks for the reply.

For 3: fastcopy can be used within a federated cluster, within a single 
cluster, and between two different clusters with no federation. The difference 
is that fastcopy uses hardlinks within a federated or single cluster, and uses 
transfer between two different clusters with no federation.

 

Test data:

Block size: 128 MB

Data set: 1 TB EC files + 1 TB 3x-replicated files

|distcp map=20|DistCp via FastCopy (HardLink)|DistCp via FastCopy (Transfer)|DistCp (original)|
|Time Spent|5m6.687s|22m44.094s|38m17.024s|

[~zeekling] , fastcopy can improve data copy efficiency.

> Fast copy for HDFS.
> ---
>
> Key: HDFS-2139
> URL: https://issues.apache.org/jira/browse/HDFS-2139
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pritam Damania
>Assignee: Rituraj
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, 
> HDFS-2139.patch, image-2022-08-11-11-48-17-994.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as
> follows :
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing top of
> the rack data transfers.
> Note: an extra improvement would be to instruct the datanode to create a
> hardlink of the block file if we are copying a block on the same datanode.
> [~xuzq_zander] provided a design doc: 
> https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing






[jira] [Updated] (HDFS-17581) Add FastCopy tool and support dfs -fastcp command

2024-07-15 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17581:
---
Description: 
Add a FastCopy tool:

(1) support data copy for replicated files

(2) support data copy for EC files

Also add an hdfs dfs -fastcp command that copies files using fastcopy; fastcp 
is similar to the cp command.

This depends on HDFS-16757.

  was:
Add a FastCopy tool:

(1) support data copy for replicated files

(2) support data copy for EC files

Also add an hdfs dfs -fastcp command that copies files using fastcopy; fastcp 
is similar to the cp command.

 


> Add FastCopy tool and support dfs -fastcp command
> -
>
> Key: HDFS-17581
> URL: https://issues.apache.org/jira/browse/HDFS-17581
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>
> Add a FastCopy tool:
> (1) support data copy for replicated files
> (2) support data copy for EC files
> Also add an hdfs dfs -fastcp command that copies files using fastcopy; 
> fastcp is similar to the cp command.
>  
> This depends on HDFS-16757.






[jira] [Created] (HDFS-17582) Distcp support fastcopy

2024-07-15 Thread liuguanghua (Jira)
liuguanghua created HDFS-17582:
--

 Summary: Distcp support fastcopy 
 Key: HDFS-17582
 URL: https://issues.apache.org/jira/browse/HDFS-17582
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: liuguanghua
Assignee: liuguanghua


DistCp supports fastcopy for distributed data copy within the same nameservice 
or across different nameservices in an HDFS federation cluster.

This depends on:
 # HDFS-16757
 # HDFS-17581






[jira] [Created] (HDFS-17581) Add FastCopy tool and support dfs -fastcp command

2024-07-15 Thread liuguanghua (Jira)
liuguanghua created HDFS-17581:
--

 Summary: Add FastCopy tool and support dfs -fastcp command
 Key: HDFS-17581
 URL: https://issues.apache.org/jira/browse/HDFS-17581
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: liuguanghua
Assignee: liuguanghua


Add a FastCopy tool:

(1) support data copy for replicated files

(2) support data copy for EC files

Also add an hdfs dfs -fastcp command that copies files using fastcopy; fastcp 
is similar to the cp command.

 






[jira] [Commented] (HDFS-17509) RBF: Fix ClientProtocol.concat will throw NPE if tgr is a empty file.

2024-05-16 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847165#comment-17847165
 ] 

liuguanghua commented on HDFS-17509:


Thanks [~xuzq_zander] 

> RBF: Fix ClientProtocol.concat  will throw NPE if tgr is a empty file.
> --
>
> Key: HDFS-17509
> URL: https://issues.apache.org/jira/browse/HDFS-17509
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> hdfs dfs -concat /tmp/merge /tmp/t1 /tmp/t2
> When /tmp/merge is an empty file, this command throws an NPE via DFSRouter.
>  
>  






[jira] [Assigned] (HDFS-17509) RBF: Fix ClientProtocol.concat will throw NPE if tgr is a empty file.

2024-05-16 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17509:
--

Assignee: liuguanghua

> RBF: Fix ClientProtocol.concat  will throw NPE if tgr is a empty file.
> --
>
> Key: HDFS-17509
> URL: https://issues.apache.org/jira/browse/HDFS-17509
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> hdfs dfs -concat /tmp/merge /tmp/t1 /tmp/t2
> When /tmp/merge is an empty file, this command throws an NPE via DFSRouter.
>  
>  






[jira] [Commented] (HDFS-2139) Fast copy for HDFS.

2024-05-15 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846574#comment-17846574
 ] 

liuguanghua commented on HDFS-2139:
---

[~haiyang Hu] Hello. Are you still working on this? I am interested in 
contributing to this work. [~xuzq_zander] I will also be using fastcopy in a 
production environment.

> Fast copy for HDFS.
> ---
>
> Key: HDFS-2139
> URL: https://issues.apache.org/jira/browse/HDFS-2139
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pritam Damania
>Assignee: Rituraj
>Priority: Major
> Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, 
> HDFS-2139.patch, image-2022-08-11-11-48-17-994.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as follows:
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing 
> top-of-rack data transfers.
> Note: an extra improvement would be to instruct the datanode to create a
> hardlink of the block file if we are copying a block on the same datanode.
> [~xuzq_zander] provided a design doc: 
> https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing
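The six steps above can be sketched on a toy in-memory cluster (a hypothetical model, not the real HDFS implementation; the block ids, dict layout, and `-copy` suffix are invented for illustration):

```python
# Toy fast-copy: the namenode is a dict of files -> block lists and
# block -> datanode locations; each datanode is a dict of block -> data.
# Copies stay local to each datanode, so no cross-rack transfer happens.

def fast_copy(namenode, datanodes, src, dst):
    dst_blocks = []
    for src_blk in namenode["files"][src]:            # 1) source block list
        locs = namenode["locations"][src_blk]         # 2) its datanode locations
        dst_blk = src_blk + "-copy"                   # 3) new block for dst
        for dn in locs:                               # 4) local copy on each DN
            datanodes[dn][dst_blk] = datanodes[dn][src_blk]
        namenode["locations"][dst_blk] = list(locs)   # 5) DNs report the copies
        dst_blocks.append(dst_blk)
    namenode["files"][dst] = dst_blocks               # 6) all blocks copied
    return dst_blocks
```

In step 4, a real datanode could hardlink the block file instead of copying when source and destination replicas land on the same node.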






[jira] [Commented] (HDFS-2139) Fast copy for HDFS.

2024-05-07 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844497#comment-17844497
 ] 

liuguanghua commented on HDFS-2139:
---

[~xuzq_zander] Hello. The design doc cannot be viewed because of its 
permissions.

[https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing]

Could you upload a new version to the Attachments? Thanks very much.

> Fast copy for HDFS.
> ---
>
> Key: HDFS-2139
> URL: https://issues.apache.org/jira/browse/HDFS-2139
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pritam Damania
>Assignee: Rituraj
>Priority: Major
> Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, 
> HDFS-2139.patch, image-2022-08-11-11-48-17-994.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as follows:
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing 
> top-of-rack data transfers.
> Note: an extra improvement would be to instruct the datanode to create a
> hardlink of the block file if we are copying a block on the same datanode.
> [~xuzq_zander] provided a design doc: 
> https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing






[jira] [Updated] (HDFS-17509) RBF: Fix ClientProtocol.concat will throw NPE if tgr is a empty file.

2024-04-29 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17509:
---
Summary: RBF: Fix ClientProtocol.concat  will throw NPE if tgr is a empty 
file.  (was: RBF: ClientProtocol.concat  will throw NPE if tgr is a empty file.)

> RBF: Fix ClientProtocol.concat  will throw NPE if tgr is a empty file.
> --
>
> Key: HDFS-17509
> URL: https://issues.apache.org/jira/browse/HDFS-17509
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> hdfs dfs -concat  /tmp/merge /tmp/t1 /tmp/t2
> When /tmp/merge is an empty file, this command throws an NPE via the DFSRouter.
>  
>  






[jira] [Updated] (HDFS-17509) RBF: ClientProtocol.concat will throw NPE if tgr is a empty file.

2024-04-29 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17509:
---
Description: 
hdfs dfs -concat  /tmp/merge /tmp/t1 /tmp/t2

When /tmp/merge is an empty file, this command throws an NPE via the DFSRouter.

 

 

> RBF: ClientProtocol.concat  will throw NPE if tgr is a empty file.
> --
>
> Key: HDFS-17509
> URL: https://issues.apache.org/jira/browse/HDFS-17509
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> hdfs dfs -concat  /tmp/merge /tmp/t1 /tmp/t2
> When /tmp/merge is an empty file, this command throws an NPE via the DFSRouter.
>  
>  






[jira] [Created] (HDFS-17509) RBF: ClientProtocol.concat will throw NPE if tgr is a empty file.

2024-04-29 Thread liuguanghua (Jira)
liuguanghua created HDFS-17509:
--

 Summary: RBF: ClientProtocol.concat  will throw NPE if tgr is a 
empty file.
 Key: HDFS-17509
 URL: https://issues.apache.org/jira/browse/HDFS-17509
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: liuguanghua









[jira] [Comment Edited] (HDFS-16016) BPServiceActor add a new thread to handle IBR

2024-03-21 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829445#comment-17829445
 ] 

liuguanghua edited comment on HDFS-16016 at 3/21/24 9:13 AM:
-

Thanks for the reply, [~zhangxiping].

An IBR contains DELETED_BLOCK, RECEIVED_BLOCK, and RECEIVING_BLOCK reports. 
Mis-ordering of IBR and FBR affects more than just the to_remove blocks.

And the NN should remove blocks that the FBR does not contain when disk 
damage has lost blocks.
 


was (Author: liuguanghua):
Thanks for the reply.

An IBR contains DELETED_BLOCK, RECEIVED_BLOCK, and RECEIVING_BLOCK reports. 
Mis-ordering of IBR and FBR affects more than just the to_remove blocks.

And the NN should remove blocks that the FBR does not contain when disk 
damage has lost blocks.

 

> BPServiceActor add a new thread to handle IBR
> -
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png, 
> image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png, 
> image-2024-03-20-18-31-23-937.png, image-2024-03-21-16-20-46-746.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. 
> We can handle IBR independently to improve the performance of heartbeat and 
> FBR.






[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR

2024-03-21 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829445#comment-17829445
 ] 

liuguanghua commented on HDFS-16016:


Thanks for the reply.

An IBR contains DELETED_BLOCK, RECEIVED_BLOCK, and RECEIVING_BLOCK reports. 
Mis-ordering of IBR and FBR affects more than just the to_remove blocks.

And the NN should remove blocks that the FBR does not contain when disk 
damage has lost blocks.

 

> BPServiceActor add a new thread to handle IBR
> -
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png, 
> image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png, 
> image-2024-03-20-18-31-23-937.png, image-2024-03-21-16-20-46-746.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. 
> We can handle IBR independently to improve the performance of heartbeat and 
> FBR.






[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR

2024-03-21 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829403#comment-17829403
 ] 

liuguanghua commented on HDFS-16016:


In step 4, does it work in the following way?

(1) In a loop: heartbeat -> IBR (if needed) -> FBR (every 6h)

(2) And the DN keeps all blocks (FBR) in memory and merges every IBR into it?

[~zhangxiping], thanks.

> BPServiceActor add a new thread to handle IBR
> -
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png, 
> image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png, 
> image-2024-03-20-18-31-23-937.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. 
> We can handle IBR independently to improve the performance of heartbeat and 
> FBR.






[jira] [Updated] (HDFS-17357) NioInetPeer.close() should close socket connection.

2024-01-26 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17357:
---
Summary: NioInetPeer.close() should close socket connection.  (was: EC: 
NioInetPeer.close() should close socket connection.)

> NioInetPeer.close() should close socket connection.
> ---
>
> Key: HDFS-17357
> URL: https://issues.apache.org/jira/browse/HDFS-17357
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> NioInetPeer.close() currently does not close the socket connection.
> I found 30,000+ leaked connections on a datanode, along with many warning 
> messages like the one below.
> 2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hostname:50010:DataXceiverServer
> When any exception occurs in DataXceiverServer, it executes closeStream:
> IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()
> But NioInetPeer.close() does not close the socket connection, and this 
> leads to connection leakage.
> The close() of Peer's other subclasses is implemented with socket.close(); 
> see EncryptedPeer, DomainPeer, and BasicInetPeer.
> The problem can be reproduced as follows:
> (1) A client writes data to HDFS.
> (2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
> the new Xceiver fails with an IOException, and the socket is not released.
> (3) The client crashes, so no new data is written and client.close() is 
> not executed.
> (4) Socket connections leak between datanodes.
> The connection leakage looks like this:
> dn1
> dn1:57042 dn2:50010 ESTABLISHED
> dn2
> dn2:50010 dn1:57042 ESTABLISHED
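The intended behavior can be sketched in a minimal model (hypothetical Python, not the actual Hadoop code; the class name and structure are invented): a peer's close() must close the wrapped socket, as EncryptedPeer, DomainPeer, and BasicInetPeer already do via socket.close().

```python
import socket

class NioInetPeerSketch:
    """Toy stand-in for a channel-backed peer wrapping a TCP socket."""

    def __init__(self, sock: socket.socket):
        self.sock = sock
        # In the real class, channel-backed in/out streams wrap sock here.

    def close(self):
        # Closing only the in/out streams leaves the TCP connection
        # ESTABLISHED on both ends; the socket itself must be closed
        # so the remote side sees EOF and the FD is released.
        self.sock.close()
```

Closing the socket (rather than just the streams) is what makes the remote datanode observe EOF instead of holding an ESTABLISHED connection forever.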






[jira] [Updated] (HDFS-17357) EC: NioInetPeer.close() should close socket connection.

2024-01-26 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17357:
---
Description: 
NioInetPeer.close() currently does not close the socket connection.

I found 30,000+ leaked connections on a datanode, along with many warning 
messages like the one below.

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
hostname:50010:DataXceiverServer

When any exception occurs in DataXceiverServer, it executes closeStream:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

But NioInetPeer.close() does not close the socket connection, and this 
leads to connection leakage.

The close() of Peer's other subclasses is implemented with socket.close(); 
see EncryptedPeer, DomainPeer, and BasicInetPeer.

The problem can be reproduced as follows:
(1) A client writes data to HDFS.
(2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
the new Xceiver fails with an IOException, and the socket is not released.
(3) The client crashes, so no new data is written and client.close() is not 
executed.
(4) Socket connections leak between datanodes.

The connection leakage looks like this:
dn1
dn1:57042 dn2:50010 ESTABLISHED

dn2
dn2:50010 dn1:57042 ESTABLISHED

  was:
NioInetPeer.close() currently does not close the socket connection.

In my environment, all data was stored with EC.

I found 30,000+ leaked connections on a datanode, along with many warning 
messages like the one below.

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
hostname:50010:DataXceiverServer

When any exception occurs in DataXceiverServer, it executes closeStream:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

But NioInetPeer.close() does not close the socket connection, and this 
leads to connection leakage.

The close() of Peer's other subclasses is implemented with socket.close(); 
see EncryptedPeer, DomainPeer, and BasicInetPeer.

The problem can be reproduced as follows:
(1) A client writes data to HDFS.
(2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
the new Xceiver fails with an IOException, and the socket is not released.
(3) The client crashes, so no new data is written and client.close() is not 
executed.
(4) Socket connections leak between datanodes.

The connection leakage looks like this:
dn1
dn1:57042 dn2:50010 ESTABLISHED

dn2
dn2:50010 dn1:57042 ESTABLISHED


> EC: NioInetPeer.close() should close socket connection.
> ---
>
> Key: HDFS-17357
> URL: https://issues.apache.org/jira/browse/HDFS-17357
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> NioInetPeer.close() currently does not close the socket connection.
> I found 30,000+ leaked connections on a datanode, along with many warning 
> messages like the one below.
> 2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hostname:50010:DataXceiverServer
> When any exception occurs in DataXceiverServer, it executes closeStream:
> IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()
> But NioInetPeer.close() does not close the socket connection, and this 
> leads to connection leakage.
> The close() of Peer's other subclasses is implemented with socket.close(); 
> see EncryptedPeer, DomainPeer, and BasicInetPeer.
> The problem can be reproduced as follows:
> (1) A client writes data to HDFS.
> (2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
> the new Xceiver fails with an IOException, and the socket is not released.
> (3) The client crashes, so no new data is written and client.close() is 
> not executed.
> (4) Socket connections leak between datanodes.
> The connection leakage looks like this:
> dn1
> dn1:57042 dn2:50010 ESTABLISHED
> dn2
> dn2:50010 dn1:57042 ESTABLISHED






[jira] [Updated] (HDFS-17357) NioInetPeer.close() should close socket connection.

2024-01-26 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17357:
---
Description: 
NioInetPeer.close() currently does not close the socket connection.

In my environment, all data was stored with EC.

I found 30,000+ leaked connections on a datanode, along with many warning 
messages like the one below.

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
hostname:50010:DataXceiverServer

When any exception occurs in DataXceiverServer, it executes closeStream:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

But NioInetPeer.close() does not close the socket connection, and this 
leads to connection leakage.

The close() of Peer's other subclasses is implemented with socket.close(); 
see EncryptedPeer, DomainPeer, and BasicInetPeer.

The problem can be reproduced as follows:
(1) A client writes data to HDFS.
(2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
the new Xceiver fails with an IOException, and the socket is not released.
(3) The client crashes, so no new data is written and client.close() is not 
executed.
(4) Socket connections leak between datanodes.

The connection leakage looks like this:
dn1
dn1:57042 dn2:50010 ESTABLISHED

dn2
dn2:50010 dn1:57042 ESTABLISHED

  was:
NioInetPeer.close() currently does not close the socket connection.

In my environment, all data was stored with EC.

I found 30,000+ leaked connections on a datanode, along with many warning 
messages like the one below.

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
hostname:50010:DataXceiverServer

When any exception occurs in DataXceiverServer, it executes closeStream:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

But NioInetPeer.close() does not close the socket connection, and this 
leads to connection leakage.

The close() of Peer's other subclasses is implemented with socket.close(); 
see EncryptedPeer, DomainPeer, and BasicInetPeer.

The problem can be reproduced as follows:
(1) A client writes data to HDFS.
(2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
the new Xceiver fails with an IOException, and the socket is not released.
(3) The client crashes, so no new data is written and client.close() is not 
executed.
(4) Socket connections leak between datanodes.


> NioInetPeer.close() should close socket connection.
> ---
>
> Key: HDFS-17357
> URL: https://issues.apache.org/jira/browse/HDFS-17357
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> NioInetPeer.close() currently does not close the socket connection.
> In my environment, all data was stored with EC.
> I found 30,000+ leaked connections on a datanode, along with many warning 
> messages like the one below.
> 2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hostname:50010:DataXceiverServer
> When any exception occurs in DataXceiverServer, it executes closeStream:
> IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()
> But NioInetPeer.close() does not close the socket connection, and this 
> leads to connection leakage.
> The close() of Peer's other subclasses is implemented with socket.close(); 
> see EncryptedPeer, DomainPeer, and BasicInetPeer.
> The problem can be reproduced as follows:
> (1) A client writes data to HDFS.
> (2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
> the new Xceiver fails with an IOException, and the socket is not released.
> (3) The client crashes, so no new data is written and client.close() is 
> not executed.
> (4) Socket connections leak between datanodes.
> The connection leakage looks like this:
> dn1
> dn1:57042 dn2:50010 ESTABLISHED
> dn2
> dn2:50010 dn1:57042 ESTABLISHED






[jira] [Updated] (HDFS-17357) NioInetPeer.close() should close socket connection.

2024-01-26 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17357:
---
Description: 
NioInetPeer.close() currently does not close the socket connection.

In my environment, all data was stored with EC.

I found 30,000+ leaked connections on a datanode, along with many warning 
messages like the one below.

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
hostname:50010:DataXceiverServer

When any exception occurs in DataXceiverServer, it executes closeStream:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

But NioInetPeer.close() does not close the socket connection, and this 
leads to connection leakage.

The close() of Peer's other subclasses is implemented with socket.close(); 
see EncryptedPeer, DomainPeer, and BasicInetPeer.

The problem can be reproduced as follows:
(1) A client writes data to HDFS.
(2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
the new Xceiver fails with an IOException, and the socket is not released.
(3) The client crashes, so no new data is written and client.close() is not 
executed.
(4) Socket connections leak between datanodes.

  was:
NioInetPeer.close() currently does not close the socket connection.

In my environment, all data was stored with EC.

I found 30,000+ leaked connections on a datanode, along with many warning 
messages like the one below.

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
hostname:50010:DataXceiverServer

When any exception occurs in DataXceiverServer, it executes closeStream:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

But NioInetPeer.close() does not close the socket connection, and this 
leads to connection leakage.

The close() of Peer's other subclasses is implemented with socket.close(); 
see EncryptedPeer, DomainPeer, and BasicInetPeer.

 


> NioInetPeer.close() should close socket connection.
> ---
>
> Key: HDFS-17357
> URL: https://issues.apache.org/jira/browse/HDFS-17357
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> NioInetPeer.close() currently does not close the socket connection.
> In my environment, all data was stored with EC.
> I found 30,000+ leaked connections on a datanode, along with many warning 
> messages like the one below.
> 2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hostname:50010:DataXceiverServer
> When any exception occurs in DataXceiverServer, it executes closeStream:
> IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()
> But NioInetPeer.close() does not close the socket connection, and this 
> leads to connection leakage.
> The close() of Peer's other subclasses is implemented with socket.close(); 
> see EncryptedPeer, DomainPeer, and BasicInetPeer.
> The problem can be reproduced as follows:
> (1) A client writes data to HDFS.
> (2) The datanode's Xceiver count reaches DFS_DATANODE_MAX_RECEIVER_THREADS_KEY; 
> the new Xceiver fails with an IOException, and the socket is not released.
> (3) The client crashes, so no new data is written and client.close() is 
> not executed.
> (4) Socket connections leak between datanodes.






[jira] [Created] (HDFS-17357) NioInetPeer.close() should close socket connection.

2024-01-25 Thread liuguanghua (Jira)
liuguanghua created HDFS-17357:
--

 Summary: NioInetPeer.close() should close socket connection.
 Key: HDFS-17357
 URL: https://issues.apache.org/jira/browse/HDFS-17357
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: liuguanghua


NioInetPeer.close() currently does not close the socket connection.

In my environment, all data was stored with EC.

I found 30,000+ leaked connections on a datanode, along with many warning 
messages like the one below.

2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
hostname:50010:DataXceiverServer

When any exception occurs in DataXceiverServer, it executes closeStream:

IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()

But NioInetPeer.close() does not close the socket connection, and this 
leads to connection leakage.

The close() of Peer's other subclasses is implemented with socket.close(); 
see EncryptedPeer, DomainPeer, and BasicInetPeer.

 






[jira] [Assigned] (HDFS-17357) NioInetPeer.close() should close socket connection.

2024-01-25 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17357:
--

Assignee: liuguanghua

> NioInetPeer.close() should close socket connection.
> ---
>
> Key: HDFS-17357
> URL: https://issues.apache.org/jira/browse/HDFS-17357
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>
> NioInetPeer.close() currently does not close the socket connection.
> In my environment, all data was stored with EC.
> I found 30,000+ leaked connections on a datanode, along with many warning 
> messages like the one below.
> 2024-01-22 15:27:57,500 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hostname:50010:DataXceiverServer
> When any exception occurs in DataXceiverServer, it executes closeStream:
> IOUtils.closeStream(peer) -> Peer.close() -> NioInetPeer.close()
> But NioInetPeer.close() does not close the socket connection, and this 
> leads to connection leakage.
> The close() of Peer's other subclasses is implemented with socket.close(); 
> see EncryptedPeer, DomainPeer, and BasicInetPeer.
>  






[jira] [Updated] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-10 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17311:
---
Description: 
In the Router, I found the log below:

2023-12-29 15:18:54,799 ERROR 
org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
more than 2048 connections at the same time

The log indicates that ConnectionManager.creatorQueue was full at a certain 
point. But my cluster does not have enough users to reach 2048 entries.

This may be due to the following reasons:
 # ConnectionManager.creatorQueue is a queue into which a ConnectionPool is 
offered when it does not have enough ConnectionContexts.
 # The ConnectionCreator thread consumes from creatorQueue and creates more 
ConnectionContexts for a ConnectionPool.
 # Clients concurrently invoke ConnectionManager.getConnection() for the 
same user, which can add the same ConnectionPool to 
ConnectionManager.creatorQueue many times.
 # When creatorQueue is full, a new ConnectionPool cannot be added and this 
error is logged. As a result, a genuinely new ConnectionPool may be unable to 
produce more ConnectionContexts for a new user.

So this PR tries to ensure that creatorQueue does not contain the same 
ConnectionPool more than once.
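The dedup idea can be sketched as follows (a hypothetical model, not the actual Router patch; the class and method names are invented): a bounded queue paired with a set of pending pools, so the same ConnectionPool is never enqueued twice at a time.

```python
import queue
import threading

class DedupCreatorQueue:
    """Bounded work queue that rejects pools already waiting in it."""

    def __init__(self, capacity=2048):
        self._queue = queue.Queue(maxsize=capacity)
        self._pending = set()           # pool ids currently enqueued
        self._lock = threading.Lock()

    def offer(self, pool_id) -> bool:
        """Enqueue pool_id unless it is already pending; True if enqueued."""
        with self._lock:
            if pool_id in self._pending:
                return False            # duplicate offer: keep queue space
            try:
                self._queue.put_nowait(pool_id)
            except queue.Full:
                return False            # queue genuinely full
            self._pending.add(pool_id)
            return True

    def take(self):
        """Dequeue the next pool for the creator thread to expand."""
        pool_id = self._queue.get()
        with self._lock:
            self._pending.discard(pool_id)
        return pool_id
```

With this shape, a burst of concurrent getConnection() calls for one user consumes at most one queue slot, so the 2048-slot queue cannot be exhausted by duplicates.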

  was:
2023-12-29 15:18:54,799 ERROR 
org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
more than 2048 connections at the same time

In my environment, ConnectionManager's creatorQueue was full, but the cluster 
does not have enough users to reach 2048 entries in the router.

Under heavy concurrency, creatorQueue can be offered the same pool more than 
once.

 


> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, I found the log below:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue was full at a certain 
> point. But my cluster does not have enough users to reach 2048 entries.
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue into which a ConnectionPool is 
> offered when it does not have enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients concurrently invoke ConnectionManager.getConnection() for the 
> same user, which can add the same ConnectionPool to 
> ConnectionManager.creatorQueue many times.
>  # When creatorQueue is full, a new ConnectionPool cannot be added and this 
> error is logged. As a result, a genuinely new ConnectionPool may be unable 
> to produce more ConnectionContexts for a new user.
> So this PR tries to ensure that creatorQueue does not contain the same 
> ConnectionPool more than once.






[jira] [Assigned] (HDFS-17300) [SBN READ] A rpc call in Observer should throw ObserverRetryOnActiveException if its stateid is always lower than client stateid for a configured time.

2024-01-09 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17300:
--

Assignee: liuguanghua

> [SBN READ]  A rpc call in Observer should throw 
> ObserverRetryOnActiveException if its stateid is always lower than client 
> stateid for a configured time.
> 
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
>
> Now when the Observer is enabled, the Observer updates its stateid via the 
> EditLogTailer, which tails the editlog from the Active NameNode in near real 
> time. If an rpc call's stateid is lower than the client stateid (which the 
> client may have updated from the Active NameNode via msync), the call will be 
> requeued into the callqueue.
> This PR intends that if an rpc call's stateid stays lower than the client 
> stateid for a configured time, the call should throw 
> ObserverRetryOnActiveException so that the client goes to the Active NameNode 
> for processing.
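
A minimal sketch of the proposed behavior (class, method, and parameter names are assumptions, not the actual HDFS patch): track when a call was first requeued, and fail over to the Active once a configured window passes.

```java
// Sketch only. A call whose stateid is behind the client's is requeued only
// within a configured window; after that it fails over to the Active NameNode
// via ObserverRetryOnActiveException. All names here are assumptions.
class ObserverRetryPolicy {
  static class ObserverRetryOnActiveException extends Exception {}

  private final long maxWaitMs; // configured time a call may stay behind

  ObserverRetryPolicy(long maxWaitMs) {
    this.maxWaitMs = maxWaitMs;
  }

  /**
   * Decide what to do with a queued call.
   *
   * @return true to requeue the call, false to process it on the Observer
   * @throws ObserverRetryOnActiveException once the call has waited longer
   *         than the configured window, telling the client to go to Active
   */
  boolean shouldRequeue(long serverStateId, long clientStateId,
                        long firstQueuedAtMs, long nowMs)
      throws ObserverRetryOnActiveException {
    if (serverStateId >= clientStateId) {
      return false; // Observer has caught up: process normally
    }
    if (nowMs - firstQueuedAtMs > maxWaitMs) {
      throw new ObserverRetryOnActiveException();
    }
    return true; // still behind, still within the window: requeue
  }
}
```

The key design point is bounding the requeue loop: a broken EditLogTailer no longer requeues the same call forever.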






[jira] [Updated] (HDFS-17300) [SBN READ] A rpc call in Observer should throw ObserverRetryOnActiveException if its stateid is always lower than client stateid for a configured time.

2024-01-09 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17300:
---
Summary: [SBN READ]  A rpc call in Observer should throw 
ObserverRetryOnActiveException if its stateid is always lower than client 
stateid for a configured time.  (was: [SBN READ]  A rcp call in Observer should 
throw ObserverRetryOnActiveException if its stateid is always lower than client 
stateid for a configured time.)

> [SBN READ]  A rpc call in Observer should throw 
> ObserverRetryOnActiveException if its stateid is always lower than client 
> stateid for a configured time.
> 
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
>
> Now when the Observer is enabled, the Observer updates its stateid via the 
> EditLogTailer, which tails the editlog from the Active NameNode in near real 
> time. If an rpc call's stateid is lower than the client stateid (which the 
> client may have updated from the Active NameNode via msync), the call will be 
> requeued into the callqueue.
> This PR intends that if an rpc call's stateid stays lower than the client 
> stateid for a configured time, the call should throw 
> ObserverRetryOnActiveException so that the client goes to the Active NameNode 
> for processing.






[jira] [Updated] (HDFS-17300) [SBN READ] A rcp call in Observer should throw ObserverRetryOnActiveException if its stateid is always lower than client stateid for a configured time.

2024-01-09 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17300:
---
Description: 
  

Now when the Observer is enabled, the Observer updates its stateid via the 
EditLogTailer, which tails the editlog from the Active NameNode in near real 
time. If an rpc call's stateid is lower than the client stateid (which the 
client may have updated from the Active NameNode via msync), the call will be 
requeued into the callqueue.

This PR intends that if an rpc call's stateid stays lower than the client 
stateid for a configured time, the call should throw 
ObserverRetryOnActiveException so that the client goes to the Active NameNode 
for processing.

  was:
           Now when Observer NN is used,  if the stateid is delayed , the 
rpcServer will be requeued into callqueue. If EditLogTailer is broken or 
something else wrong , the call will be requeued again and again.  

        So Observer should throw ObserverRetryOnActiveException if stateid is 
always delayed with Active Namenode for a configured time.


> [SBN READ]  A rcp call in Observer should throw 
> ObserverRetryOnActiveException if its stateid is always lower than client 
> stateid for a configured time.
> 
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
>
> Now when the Observer is enabled, the Observer updates its stateid via the 
> EditLogTailer, which tails the editlog from the Active NameNode in near real 
> time. If an rpc call's stateid is lower than the client stateid (which the 
> client may have updated from the Active NameNode via msync), the call will be 
> requeued into the callqueue.
> This PR intends that if an rpc call's stateid stays lower than the client 
> stateid for a configured time, the call should throw 
> ObserverRetryOnActiveException so that the client goes to the Active NameNode 
> for processing.






[jira] [Updated] (HDFS-17300) [SBN READ] A rcp call in Observer should throw ObserverRetryOnActiveException if its stateid is always lower than client stateid for a configured time.

2024-01-09 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17300:
---
Summary: [SBN READ]  A rcp call in Observer should throw 
ObserverRetryOnActiveException if its stateid is always lower than client 
stateid for a configured time.  (was: [SBN READ] Observer should throw 
ObserverRetryOnActiveException if stateid is always delayed with Active 
Namenode for a  configured time)

> [SBN READ]  A rcp call in Observer should throw 
> ObserverRetryOnActiveException if its stateid is always lower than client 
> stateid for a configured time.
> 
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> Now when the Observer NN is used, if the stateid is delayed, the rpc call 
> will be requeued into the callqueue. If the EditLogTailer is broken or 
> something else goes wrong, the call will be requeued again and again.
> So the Observer should throw ObserverRetryOnActiveException if its stateid is 
> always delayed behind the Active NameNode for a configured time.






[jira] (HDFS-17309) RBF: Fix Router Safemode check contidition error

2024-01-05 Thread liuguanghua (Jira)


[ https://issues.apache.org/jira/browse/HDFS-17309 ]


liuguanghua deleted comment on HDFS-17309:


was (Author: liuguanghua):
[~slfan1989]  Ok, I will submit new PR according to this format in the future. 
Thank you.

> RBF: Fix Router Safemode check contidition error
> 
>
> Key: HDFS-17309
> URL: https://issues.apache.org/jira/browse/HDFS-17309
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> With HDFS-17116, the Router safemode check condition uses monotonicNow(). 
> For the code in RouterSafemodeService.periodicInvoke():
> long now = monotonicNow();
> long cacheUpdateTime = stateStore.getCacheUpdateTime();
> boolean isCacheStale = (now - cacheUpdateTime) > this.staleInterval;
>  
> The function monotonicNow() is implemented with System.nanoTime(). The 
> javadoc of System.nanoTime() says:
> This method can only be used to measure elapsed time and is not related to 
> any other notion of system or wall-clock time. The value returned represents 
> nanoseconds since some fixed but arbitrary origin time (perhaps in the 
> future, so values may be negative). 
>  
> The following situation may exist:
> If refreshCaches does not succeed at the beginning, cacheUpdateTime will be 
> 0, and now - cacheUpdateTime measures from an arbitrary origin, so 
> isCacheStale may be either true or false. 
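
The pitfall can be demonstrated directly (a self-contained sketch; method names are illustrative): with an arbitrary nanoTime origin, the unguarded subtraction can report a never-refreshed cache as fresh.

```java
// System.nanoTime() has an arbitrary origin, so comparing "now" against the
// sentinel value 0 (cache never refreshed) yields an arbitrary answer; the
// staleness check must special-case the sentinel. Method names are
// illustrative, not the actual RouterSafemodeService code.
class SafemodeCheck {
  /** Mirrors the check quoted above: wrong when cacheUpdateNs is still 0. */
  static boolean isCacheStaleBuggy(long nowNs, long cacheUpdateNs,
                                   long staleIntervalNs) {
    return (nowNs - cacheUpdateNs) > staleIntervalNs;
  }

  /** Guarded version: a never-refreshed cache is always stale. */
  static boolean isCacheStaleFixed(long nowNs, long cacheUpdateNs,
                                   long staleIntervalNs) {
    if (cacheUpdateNs == 0) {
      return true;
    }
    return (nowNs - cacheUpdateNs) > staleIntervalNs;
  }

  public static void main(String[] args) {
    long staleNs = 60_000_000_000L;     // 60 s stale interval
    long negativeNow = -5_000_000_000L; // nanoTime may legitimately be negative
    // The buggy check calls a never-refreshed cache "fresh":
    System.out.println(isCacheStaleBuggy(negativeNow, 0, staleNs)); // false
    System.out.println(isCacheStaleFixed(negativeNow, 0, staleNs)); // true
  }
}
```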






[jira] [Commented] (HDFS-17309) RBF: Fix Router Safemode check contidition error

2024-01-05 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803508#comment-17803508
 ] 

liuguanghua commented on HDFS-17309:


[~slfan1989]  Ok, I will submit new PR according to this format in the future. 
Thank you.

> RBF: Fix Router Safemode check contidition error
> 
>
> Key: HDFS-17309
> URL: https://issues.apache.org/jira/browse/HDFS-17309
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> With HDFS-17116, the Router safemode check condition uses monotonicNow(). 
> For the code in RouterSafemodeService.periodicInvoke():
> long now = monotonicNow();
> long cacheUpdateTime = stateStore.getCacheUpdateTime();
> boolean isCacheStale = (now - cacheUpdateTime) > this.staleInterval;
>  
> The function monotonicNow() is implemented with System.nanoTime(). The 
> javadoc of System.nanoTime() says:
> This method can only be used to measure elapsed time and is not related to 
> any other notion of system or wall-clock time. The value returned represents 
> nanoseconds since some fixed but arbitrary origin time (perhaps in the 
> future, so values may be negative). 
>  
> The following situation may exist:
> If refreshCaches does not succeed at the beginning, cacheUpdateTime will be 
> 0, and now - cacheUpdateTime measures from an arbitrary origin, so 
> isCacheStale may be either true or false. 






[jira] [Assigned] (HDFS-17325) Doc: Fix the documentation of fs expunge command in FileSystemShell.md

2024-01-05 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17325:
--

Assignee: liuguanghua

> Doc: Fix the documentation of fs expunge command in FileSystemShell.md
> --
>
> Key: HDFS-17325
> URL: https://issues.apache.org/jira/browse/HDFS-17325
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> Fix doc in FileSystemShell.md.
> hadoop fs -expunge --immediate   should be hadoop fs -expunge -immediate
>  
> Usage: hadoop fs [generic options] -expunge [-immediate] [-fs ]






[jira] [Updated] (HDFS-17325) Doc: Fix the documentation of fs expunge command in FileSystemShell.md

2024-01-05 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17325:
---
Description: 
Fix doc in FileSystemShell.md.

hadoop fs -expunge --immediate   should be hadoop fs -expunge -immediate

 


Usage: hadoop fs [generic options] -expunge [-immediate] [-fs ]

> Doc: Fix the documentation of fs expunge command in FileSystemShell.md
> --
>
> Key: HDFS-17325
> URL: https://issues.apache.org/jira/browse/HDFS-17325
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> Fix doc in FileSystemShell.md.
> hadoop fs -expunge --immediate   should be hadoop fs -expunge -immediate
>  
> Usage: hadoop fs [generic options] -expunge [-immediate] [-fs ]






[jira] [Created] (HDFS-17325) Doc: Fix the documentation of fs expunge command in FileSystemShell.md

2024-01-04 Thread liuguanghua (Jira)
liuguanghua created HDFS-17325:
--

 Summary: Doc: Fix the documentation of fs expunge command in 
FileSystemShell.md
 Key: HDFS-17325
 URL: https://issues.apache.org/jira/browse/HDFS-17325
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua









[jira] [Assigned] (HDFS-17324) RBF: Router should not return nameservices that not enable observer read in RpcResponseHeaderProto

2024-01-04 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17324:
--

Assignee: liuguanghua

> RBF: Router should not return nameservices that not enable observer read in 
> RpcResponseHeaderProto
> --
>
> Key: HDFS-17324
> URL: https://issues.apache.org/jira/browse/HDFS-17324
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> {color:#172b4d}Router Observer Read is controlled by 
> RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_DEFAULT_KEY and 
> RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_OVERRIDES.{color}
> {color:#172b4d}If a nameservice does not enable observer reads in the Router, 
> the Router should not return it in RpcResponseHeaderProto.{color}






[jira] [Updated] (HDFS-17324) RBF: Router should not return nameservices that not enable observer read in RpcResponseHeaderProto

2024-01-04 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17324:
---
Description: 
{color:#172b4d}Router Observer Read is controlled by 
RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_DEFAULT_KEY and 
RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_OVERRIDES.{color}

{color:#172b4d}If a nameservice does not enable observer reads in the Router, 
the Router should not return it in RpcResponseHeaderProto.{color}

  was:
{color:#172b4d}Router Observer Read is controled by 
RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_DEFAULT_KEY and 
RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_OVERRIDES.
{color}

{color:#172b4d}If nameservice is not enable for observer read in router, {color}


> RBF: Router should not return nameservices that not enable observer read in 
> RpcResponseHeaderProto
> --
>
> Key: HDFS-17324
> URL: https://issues.apache.org/jira/browse/HDFS-17324
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>
> {color:#172b4d}Router Observer Read is controlled by 
> RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_DEFAULT_KEY and 
> RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_OVERRIDES.{color}
> {color:#172b4d}If a nameservice does not enable observer reads in the Router, 
> the Router should not return it in RpcResponseHeaderProto.{color}






[jira] [Updated] (HDFS-17324) RBF: Router should not return nameservices that not enable observer read in RpcResponseHeaderProto

2024-01-04 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17324:
---
Description: 
{color:#172b4d}Router Observer Read is controled by 
RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_DEFAULT_KEY and 
RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_OVERRIDES.
{color}

{color:#172b4d}If nameservice is not enable for observer read in router, {color}

> RBF: Router should not return nameservices that not enable observer read in 
> RpcResponseHeaderProto
> --
>
> Key: HDFS-17324
> URL: https://issues.apache.org/jira/browse/HDFS-17324
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>
> {color:#172b4d}Router Observer Read is controled by 
> RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_DEFAULT_KEY and 
> RBFConfigKeys.DFS_ROUTER_OBSERVER_READ_OVERRIDES.
> {color}
> {color:#172b4d}If nameservice is not enable for observer read in router, 
> {color}






[jira] [Created] (HDFS-17324) RBF: Router should not return nameservices that not enable observer read in RpcResponseHeaderProto

2024-01-04 Thread liuguanghua (Jira)
liuguanghua created HDFS-17324:
--

 Summary: RBF: Router should not return nameservices that not 
enable observer read in RpcResponseHeaderProto
 Key: HDFS-17324
 URL: https://issues.apache.org/jira/browse/HDFS-17324
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua









[jira] [Updated] (HDFS-17321) RBF: Add RouterAutoMsyncService for auto msync in Router

2024-01-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17321:
---
Description: 
 

Router should have the ability to auto-msync to a nameservice. This ensures 
the router periodically refreshes its record of a namespace's state.

Different from HDFS-17027, this is controlled by the router itself, without 
configuring an AbstractNNFailoverProxyProvider.
Also, HDFS-16890 may lead to many read requests hitting the active NN at the 
same time.

This PR provides a new way to implement auto msync in the Router.

> RBF: Add RouterAutoMsyncService for auto msync in Router
> 
>
> Key: HDFS-17321
> URL: https://issues.apache.org/jira/browse/HDFS-17321
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
>  
> Router should have the ability to auto-msync to a nameservice. This ensures 
> the router periodically refreshes its record of a namespace's state.
> Different from HDFS-17027, this is controlled by the router itself, without 
> configuring an AbstractNNFailoverProxyProvider.
> Also, HDFS-16890 may lead to many read requests hitting the active NN at the 
> same time.
> This PR provides a new way to implement auto msync in the Router.
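
One way to realize such a router-side periodic service is a single scheduled executor; the service name, wiring, and the msync callback below are assumptions, not the actual PR.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only. A scheduler periodically refreshes the router's
// view of every namespace's state id, independent of any client-side
// failover proxy provider. All names here are assumptions.
class AutoMsyncService {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Start periodic refreshes; the runnable would msync each nameservice. */
  void start(Runnable msyncAllNameservices, long periodMs) {
    scheduler.scheduleAtFixedRate(
        msyncAllNameservices, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```

Running the refresh from one router-owned thread, rather than on every client path, avoids the thundering-herd reads on the active NN mentioned above.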






[jira] [Created] (HDFS-17321) RBF: Add RouterAutoMsyncService for auto msync in Router

2024-01-03 Thread liuguanghua (Jira)
liuguanghua created HDFS-17321:
--

 Summary: RBF: Add RouterAutoMsyncService for auto msync in Router
 Key: HDFS-17321
 URL: https://issues.apache.org/jira/browse/HDFS-17321
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua









[jira] [Created] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2023-12-29 Thread liuguanghua (Jira)
liuguanghua created HDFS-17311:
--

 Summary: RBF: ConnectionManager creatorQueue should offer a pool 
that is not already in creatorQueue.
 Key: HDFS-17311
 URL: https://issues.apache.org/jira/browse/HDFS-17311
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua


2023-12-29 15:18:54,799 ERROR 
org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
more than 2048 connections at the same time

In my environment, the ConnectionManager creatorQueue was full, but the 
cluster does not have so many users that it could reach 2048 pairs in the 
router.

Under a large number of concurrent requests, creatorQueue may be offered the 
same pool more than once.

 






[jira] [Created] (HDFS-17309) Fix Router Safemode check contidition error

2023-12-28 Thread liuguanghua (Jira)
liuguanghua created HDFS-17309:
--

 Summary: Fix Router Safemode check contidition error
 Key: HDFS-17309
 URL: https://issues.apache.org/jira/browse/HDFS-17309
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: liuguanghua


With HDFS-17116, the Router safemode check condition uses monotonicNow(). 

For the code in RouterSafemodeService.periodicInvoke():

long now = monotonicNow();

long cacheUpdateTime = stateStore.getCacheUpdateTime();
boolean isCacheStale = (now - cacheUpdateTime) > this.staleInterval;

 

The function monotonicNow() is implemented with System.nanoTime(). The 
javadoc of System.nanoTime() says:

This method can only be used to measure elapsed time and is not related to any 
other notion of system or wall-clock time. The value returned represents 
nanoseconds since some fixed but arbitrary origin time (perhaps in the future, 
so values may be negative). 

 

The following situation may exist:

If refreshCaches does not succeed at the beginning, cacheUpdateTime will be 0, 
and now - cacheUpdateTime measures from an arbitrary origin, so isCacheStale 
may be either true or false. 






[jira] [Created] (HDFS-17306) RBF:Router should not return nameservices that does not enable observer nodes in RpcResponseHeaderProto

2023-12-28 Thread liuguanghua (Jira)
liuguanghua created HDFS-17306:
--

 Summary: RBF:Router should not return nameservices that does not 
enable observer nodes in RpcResponseHeaderProto
 Key: HDFS-17306
 URL: https://issues.apache.org/jira/browse/HDFS-17306
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua


If a cluster has 3 nameservices (ns1, ns2, ns3), where only ns1 has observer 
nodes, clients communicate with the NameNodes via DFSRouter.

If DFS_ROUTER_OBSERVER_READ_DEFAULT_KEY is enabled, the client will receive 
all nameservices in RpcResponseHeaderProto. 

We should reduce the rpc response size when nameservices don't enable observer 
nodes.
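
The filtering itself is simple to sketch (a hypothetical helper, not the actual Router code): drop state ids for nameservices without observer reads before building the response header.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not the actual Router code: before building the
// response header, drop state ids for nameservices that do not have observer
// reads enabled, shrinking the rpc response.
class HeaderStateFilter {
  static Map<String, Long> filterStateIds(Map<String, Long> stateIds,
                                          Set<String> observerEnabled) {
    Map<String, Long> filtered = new HashMap<>(stateIds);
    filtered.keySet().retainAll(observerEnabled); // keep observer-read nss only
    return filtered;
  }
}
```

In the three-nameservice example above, only the ns1 state id would survive the filter.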

 






[jira] [Updated] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a period of time

2023-12-22 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17300:
---
Description: 
Now when the Observer NN is used, if the stateid is delayed, the rpc call will 
be requeued into the callqueue. If the EditLogTailer is broken or something 
else goes wrong, the call will be requeued again and again.

So the Observer should throw ObserverRetryOnActiveException if its stateid is 
always delayed behind the Active NameNode for a configured time.

  was:
Now when Observer NN is used,  if the stateid is delayed , the rpcServer will 
be requeued into callqueue.

If EditLogTailer is broken or something else wrong , the call will be requeued 
again and again.  

So 


> [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is 
> always delayed with Active Namenode for a period of time
> --
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>
> Now when the Observer NN is used, if the stateid is delayed, the rpc call 
> will be requeued into the callqueue. If the EditLogTailer is broken or 
> something else goes wrong, the call will be requeued again and again.
> So the Observer should throw ObserverRetryOnActiveException if its stateid is 
> always delayed behind the Active NameNode for a configured time.






[jira] [Updated] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-22 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17300:
---
Summary: [SBN READ] Observer should throw ObserverRetryOnActiveException if 
stateid is always delayed with Active Namenode for a  configured time  (was: 
[SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is 
always delayed with Active Namenode for a period of time)

> [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is 
> always delayed with Active Namenode for a  configured time
> 
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>
> Now when the Observer NN is used, if the stateid is delayed, the rpc call 
> will be requeued into the callqueue. If the EditLogTailer is broken or 
> something else goes wrong, the call will be requeued again and again.
> So the Observer should throw ObserverRetryOnActiveException if its stateid is 
> always delayed behind the Active NameNode for a configured time.






[jira] [Updated] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a period of time

2023-12-22 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17300:
---
Description: 
Now when Observer NN is used,  if the stateid is delayed , the rpcServer will 
be requeued into callqueue.

If EditLogTailer is broken or something else wrong , the call will be requeued 
again and again.  

So 

> [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is 
> always delayed with Active Namenode for a period of time
> --
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>
> Now when Observer NN is used,  if the stateid is delayed , the rpcServer will 
> be requeued into callqueue.
> If EditLogTailer is broken or something else wrong , the call will be 
> requeued again and again.  
> So 






[jira] [Created] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a period of time

2023-12-22 Thread liuguanghua (Jira)
liuguanghua created HDFS-17300:
--

 Summary: [SBN READ] Observer should throw 
ObserverRetryOnActiveException if stateid is always delayed with Active 
Namenode for a period of time
 Key: HDFS-17300
 URL: https://issues.apache.org/jira/browse/HDFS-17300
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua









[jira] [Resolved] (HDFS-17170) add metris for datanode in function processQueueMessages and reportTo

2023-12-21 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua resolved HDFS-17170.

Resolution: Not A Problem

> add metris for datanode in function processQueueMessages and reportTo
> -
>
> Key: HDFS-17170
> URL: https://issues.apache.org/jira/browse/HDFS-17170
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
> Add two metrics for datanode:
> (1) BPServiceActorAction.reportTo executes errorReport and reportBadBlocks; 
> record the counts of these.
> (2) BPServiceActor.processQueueMessages runs in the heartbeat loop to the 
> NN; record the numOps and time.






[jira] [Updated] (HDFS-17285) RBF: Add a safe mode check period configuration

2023-12-12 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17285:
---
Summary: RBF: Add a safe mode check period configuration  (was: [RBF] 
Decrease dfsrouter safe mode check period.)

> RBF: Add a safe mode check period configuration
> ---
>
> Key: HDFS-17285
> URL: https://issues.apache.org/jira/browse/HDFS-17285
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
> When dfsrouter start, it enters safe mode. And it will cost 1min to leave.
> The log is blow:
> 14:35:23,717 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leave 
> startup safe mode after 3 ms
> 14:35:23,717 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Enter 
> safe mode after 18 ms without reaching the State Store
> 14:35:23,717 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: 
> Entering safe mode
> 14:35:24,996 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: 
> Delaying safemode exit for 28721 milliseconds...
> 14:36:25,037 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: 
> Leaving safe mode after 61319 milliseconds
> It depends on these configs:
> DFS_ROUTER_SAFEMODE_EXTENSION 30s
> DFS_ROUTER_SAFEMODE_EXPIRATION 3min
> DFS_ROUTER_CACHE_TIME_TO_LIVE_MS 1min (the period for checking safe mode)
> Because the DFSRouter rejects write requests while in safe mode, the check
> period should be shorter once refreshCaches is done. And we should remove
> DFS_ROUTER_CACHE_TIME_TO_LIVE_MS from RouterSafemodeService.
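For reference, the three constants above correspond to hdfs-site.xml properties. A hypothetical fragment with the default values quoted in the description; the property names are assumed to match the RBFConfigKeys constants and should be verified against your Hadoop release:

```xml
<!-- Hypothetical hdfs-site.xml fragment; property names are assumptions
     mapped from the RBFConfigKeys constants quoted above. -->
<property>
  <name>dfs.federation.router.safemode.extension</name>
  <value>30s</value>
</property>
<property>
  <name>dfs.federation.router.safemode.expiration</name>
  <value>3m</value>
</property>
<property>
  <name>dfs.federation.router.cache.ttl</name>
  <value>1m</value>
</property>
```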






[jira] [Updated] (HDFS-17285) [RBF] Decrease dfsrouter safe mode check period.

2023-12-11 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17285:
---
Description: 
When the DFSRouter starts, it enters safe mode, and it takes about 1 minute
to leave.

The log is below:

14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leave 
startup safe mode after 3 ms
14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Enter 
safe mode after 18 ms without reaching the State Store
14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Entering 
safe mode
14:35:24,996 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Delaying 
safemode exit for 28721 milliseconds...
14:36:25,037 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leaving 
safe mode after 61319 milliseconds

It depends on these configs:
DFS_ROUTER_SAFEMODE_EXTENSION 30s
DFS_ROUTER_SAFEMODE_EXPIRATION 3min
DFS_ROUTER_CACHE_TIME_TO_LIVE_MS 1min (the period for checking safe mode)

Because the DFSRouter rejects write requests while in safe mode, the check
period should be shorter once refreshCaches is done. And we should remove
DFS_ROUTER_CACHE_TIME_TO_LIVE_MS from RouterSafemodeService.

  was:
When the DFSRouter starts, it enters safe mode, and it takes about 1 minute
to leave.

The log is below:

14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leave 
startup safe mode after 3 ms
14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Enter 
safe mode after 18 ms without reaching the State Store
14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Entering 
safe mode
14:35:24,996 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Delaying 
safemode exit for 28721 milliseconds...
14:36:25,037 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leaving 
safe mode after 61319 milliseconds

It depends on these configs:
DFS_ROUTER_SAFEMODE_EXTENSION 30s
DFS_ROUTER_SAFEMODE_EXPIRATION 3min
DFS_ROUTER_CACHE_TIME_TO_LIVE_MS 1min (the period for checking safe mode)

Because the DFSRouter rejects write requests while in safe mode, the check
period should be shorter once refreshCaches is done. And
DFS_ROUTER_CACHE_TIME_TO_LIVE_MS should be separated from RouterSafemodeService.


> [RBF] Decrease dfsrouter safe mode check period.
> 
>
> Key: HDFS-17285
> URL: https://issues.apache.org/jira/browse/HDFS-17285
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Minor
>
> When the DFSRouter starts, it enters safe mode, and it takes about 1 minute
> to leave.
> The log is below:
> 14:35:23,717 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leave 
> startup safe mode after 3 ms
> 14:35:23,717 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Enter 
> safe mode after 18 ms without reaching the State Store
> 14:35:23,717 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: 
> Entering safe mode
> 14:35:24,996 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: 
> Delaying safemode exit for 28721 milliseconds...
> 14:36:25,037 INFO 
> org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: 
> Leaving safe mode after 61319 milliseconds
> It depends on these configs:
> DFS_ROUTER_SAFEMODE_EXTENSION 30s
> DFS_ROUTER_SAFEMODE_EXPIRATION 3min
> DFS_ROUTER_CACHE_TIME_TO_LIVE_MS 1min (the period for checking safe mode)
> Because the DFSRouter rejects write requests while in safe mode, the check
> period should be shorter once refreshCaches is done. And we should remove
> DFS_ROUTER_CACHE_TIME_TO_LIVE_MS from RouterSafemodeService.






[jira] [Created] (HDFS-17285) [RBF] Decrease dfsrouter safe mode check period.

2023-12-11 Thread liuguanghua (Jira)
liuguanghua created HDFS-17285:
--

 Summary: [RBF] Decrease dfsrouter safe mode check period.
 Key: HDFS-17285
 URL: https://issues.apache.org/jira/browse/HDFS-17285
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua


When the DFSRouter starts, it enters safe mode, and it takes about 1 minute
to leave.

The log is below:

14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leave 
startup safe mode after 3 ms
14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Enter 
safe mode after 18 ms without reaching the State Store
14:35:23,717 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Entering 
safe mode
14:35:24,996 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Delaying 
safemode exit for 28721 milliseconds...
14:36:25,037 INFO 
org.apache.hadoop.hdfs.server.federation.router.RouterSafemodeService: Leaving 
safe mode after 61319 milliseconds

It depends on these configs:
DFS_ROUTER_SAFEMODE_EXTENSION 30s
DFS_ROUTER_SAFEMODE_EXPIRATION 3min
DFS_ROUTER_CACHE_TIME_TO_LIVE_MS 1min (the period for checking safe mode)

Because the DFSRouter rejects write requests while in safe mode, the check
period should be shorter once refreshCaches is done. And
DFS_ROUTER_CACHE_TIME_TO_LIVE_MS should be separated from RouterSafemodeService.






[jira] [Assigned] (HDFS-17269) [RBF] Make listStatus user root trash dir will return all subclusters trash subdirs if user has any mount points in nameservice.

2023-12-01 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17269:
--

Assignee: liuguanghua

> [RBF] Make listStatus user root trash dir will return all subclusters trash 
> subdirs if user has any mount points in nameservice.
> 
>
> Key: HDFS-17269
> URL: https://issues.apache.org/jira/browse/HDFS-17269
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>
> Same scenario as HDFS-17263.
> If the user trash config fs.trash.checkpoint.interval is set to 10 min on
> the namenodes, the trash root dir /user/$USER/.Trash/Current is renamed
> every 10 min to /user/$USER/.Trash/timestamp.
> When the user runs ls on /user/$USER/.Trash, it should return the following:
> /user/$USER/.Trash/Current
> /user/$USER/.Trash/timestamp (this is invisible now)
> So listing the trash root dir should show the user all trash subdirs in
> every nameservice where the user has any mount point.






[jira] [Created] (HDFS-17269) [RBF] Make listStatus user root trash dir will return all subclusters trash subdirs if user has any mount points in nameservice.

2023-12-01 Thread liuguanghua (Jira)
liuguanghua created HDFS-17269:
--

 Summary: [RBF] Make listStatus user root trash dir will return all 
subclusters trash subdirs if user has any mount points in nameservice.
 Key: HDFS-17269
 URL: https://issues.apache.org/jira/browse/HDFS-17269
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua


Same scenario as HDFS-17263.

If the user trash config fs.trash.checkpoint.interval is set to 10 min on the
namenodes, the trash root dir /user/$USER/.Trash/Current is renamed every 10
min to /user/$USER/.Trash/timestamp.

When the user runs ls on /user/$USER/.Trash, it should return the following:

/user/$USER/.Trash/Current

/user/$USER/.Trash/timestamp (this is invisible now)

So listing the trash root dir should show the user all trash subdirs in every
nameservice where the user has any mount point.
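The intended listing behavior can be sketched outside Hadoop: union the child entries that each nameservice reports for the trash root, so every checkpoint dir becomes visible to the client. This is an illustrative sketch with made-up sample data, not DFSRouter code:

```python
def merge_trash_listings(listings_by_ns):
    """Union the child names reported by every nameservice, keeping order."""
    merged = []
    for children in listings_by_ns.values():
        for child in children:
            if child not in merged:
                merged.append(child)
    return merged

# Made-up sample: ns1 already has a checkpoint dir that ns0 does not.
listings = {
    "ns0": ["Current"],
    "ns1": ["Current", "1701406800000"],
}
print(merge_trash_listings(listings))  # ['Current', '1701406800000']
```

With this behavior, ls on the trash root returns both Current and the renamed checkpoint dirs from every subcluster.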






[jira] [Assigned] (HDFS-17263) RBF: Fix client ls trash path cannot get except default nameservices trash path

2023-12-01 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17263:
--

Assignee: liuguanghua

> RBF: Fix client ls trash path cannot get except default nameservices trash 
> path
> ---
>
> Key: HDFS-17263
> URL: https://issues.apache.org/jira/browse/HDFS-17263
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> With HDFS-16024, renames into the Trash are resolved based on the src
> locations. That is great for my usage, but after a period of use I found
> that it causes an issue.
> There are two nameservices, ns0 and ns1, and ns0 is the default nameservice.
> (1) Add mount table entries:
> /home/data -> (ns0, /home/data)
> /data1/test1 -> (ns1, /data1/test1)
> /data2/test2 -> (ns1, /data2/test2)
> (2) mv files to the trash:
> ns0: /user/test-user/.Trash/Current/home/data/file1
> ns1: /user/test-user/.Trash/Current/data1/test1/file1
> (3) A client ls via DFSRouter will not see
> /user/test-user/.Trash/Current/data1.
> (4) A client ls of /user/test-user/.Trash/Current/data2/test2 will return an
> exception.






[jira] [Created] (HDFS-17263) [RBF] Fix client ls trash path cannot get except default nameservices trash path

2023-11-22 Thread liuguanghua (Jira)
liuguanghua created HDFS-17263:
--

 Summary: [RBF] Fix client ls trash path cannot get except default 
nameservices trash path
 Key: HDFS-17263
 URL: https://issues.apache.org/jira/browse/HDFS-17263
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua


With HDFS-16024, renames into the Trash are resolved based on the src
locations. That is great for my usage, but after a period of use I found that
it causes an issue.

There are two nameservices, ns0 and ns1, and ns0 is the default nameservice.

(1) Add mount table entries:

/home/data -> (ns0, /home/data)

/data1/test1 -> (ns1, /data1/test1)

/data2/test2 -> (ns1, /data2/test2)

(2) mv files to the trash:

ns0: /user/test-user/.Trash/Current/home/data/file1

ns1: /user/test-user/.Trash/Current/data1/test1/file1

(3) A client ls via DFSRouter will not see /user/test-user/.Trash/Current/data1.

(4) A client ls of /user/test-user/.Trash/Current/data2/test2 will return an
exception.
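Conceptually, HDFS-16024 resolves a trash path by stripping the /user/USER/.Trash/CHECKPOINT prefix and matching the remainder against the mount table. A minimal sketch of that resolution, using the sample mount entries from the description; this is not the real Router code, just the shape of the lookup:

```python
import re

# Sample mount table from the description; the default nameservice is ns0.
MOUNT_TABLE = {
    "/home/data": "ns0",
    "/data1/test1": "ns1",
    "/data2/test2": "ns1",
}
DEFAULT_NS = "ns0"

def resolve_trash_path(path):
    """Pick the nameservice for a trash path via the mount table."""
    m = re.match(r"^/user/[^/]+/\.Trash/[^/]+(/.*)?$", path)
    if not m or not m.group(1):
        return DEFAULT_NS
    inner = m.group(1)
    # Longest-prefix match against mount points; fall back to the default ns.
    for mount in sorted(MOUNT_TABLE, key=len, reverse=True):
        if inner == mount or inner.startswith(mount + "/"):
            return MOUNT_TABLE[mount]
    return DEFAULT_NS

print(resolve_trash_path("/user/test-user/.Trash/Current/home/data/file1"))    # ns0
print(resolve_trash_path("/user/test-user/.Trash/Current/data1/test1/file1"))  # ns1
```

Cases (3) and (4) arise when this lookup only consults the default nameservice instead of the src-location mount entries.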

 

 






[jira] [Updated] (HDFS-17261) Fix getFileInfo return wrong path when get mountTable path which multi-level

2023-11-21 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17261:
---
Description: 
With DFSRouter, suppose there are two nameservices: ns0 and ns1.
 # Add a mount table entry: /testgetfileinfo/ns1/dir -> (ns1,
/testgetfileinfo/ns1/dir)
 # An hdfs client via DFSRouter accesses the directory: hdfs dfs -ls -d
/testgetfileinfo
 # It returns the wrong path: /testgetfileinfo/testgetfileinfo

 

> Fix getFileInfo return wrong path when get mountTable path which multi-level
> 
>
> Key: HDFS-17261
> URL: https://issues.apache.org/jira/browse/HDFS-17261
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> With DFSRouter, suppose there are two nameservices: ns0 and ns1.
>  # Add a mount table entry: /testgetfileinfo/ns1/dir -> (ns1,
> /testgetfileinfo/ns1/dir)
>  # An hdfs client via DFSRouter accesses the directory: hdfs dfs -ls -d
> /testgetfileinfo
>  # It returns the wrong path: /testgetfileinfo/testgetfileinfo
>  






[jira] [Created] (HDFS-17261) Fix getFileInfo return wrong path when get mountTable path which multi-level

2023-11-21 Thread liuguanghua (Jira)
liuguanghua created HDFS-17261:
--

 Summary: Fix getFileInfo return wrong path when get mountTable 
path which multi-level
 Key: HDFS-17261
 URL: https://issues.apache.org/jira/browse/HDFS-17261
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: liuguanghua









[jira] [Updated] (HDFS-17248) invalid

2023-11-14 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Labels:   (was: pull-request-available)

> invalid
> ---
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
>  
>  






[jira] [Assigned] (HDFS-17249) TestDFSUtil.testIsValidName() run failure

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17249:
--

Assignee: liuguanghua

> TestDFSUtil.testIsValidName() run failure
> -
>
> Key: HDFS-17249
> URL: https://issues.apache.org/jira/browse/HDFS-17249
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
> TestDFSUtil.testIsValidName fails at
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it and add a test case
> in TestDFSUtil.testIsValidName.






[jira] [Updated] (HDFS-17249) TestDFSUtil.testIsValidName() run failure

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17249:
---
Summary: TestDFSUtil.testIsValidName() run failure  (was: Add test case for 
DFSUtil.isValidName)

> TestDFSUtil.testIsValidName() run failure
> -
>
> Key: HDFS-17249
> URL: https://issues.apache.org/jira/browse/HDFS-17249
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> TestDFSUtil.testIsValidName fails at
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it and add a test case
> in TestDFSUtil.testIsValidName.






[jira] [Updated] (HDFS-17249) Add test case for DFSUtil.isValidName

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17249:
---
Description: 
TestDFSUtil.testIsValidName fails at
assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it.

Add a test case in TestDFSUtil.testIsValidName.

  was:
TestDFSUtil.testIsValidName fails at
assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it.

Add a test case in TestDFSUtil.testIsValidName for HDFS-17246.


> Add test case for DFSUtil.isValidName
> -
>
> Key: HDFS-17249
> URL: https://issues.apache.org/jira/browse/HDFS-17249
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> TestDFSUtil.testIsValidName fails at
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it and add a test case
> in TestDFSUtil.testIsValidName.






[jira] [Updated] (HDFS-17249) Add test case for DFSUtil.isValidName

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17249:
---
Description: 
TestDFSUtil.testIsValidName fails at
assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it.

Add a test case in TestDFSUtil.testIsValidName for HDFS-17246.

  was:
TestDFSUtil.testIsValidName fails at
assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it.

Add a test case in TestDFSUtil.testIsValidName for HDFS-17246.


> Add test case for DFSUtil.isValidName
> -
>
> Key: HDFS-17249
> URL: https://issues.apache.org/jira/browse/HDFS-17249
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> TestDFSUtil.testIsValidName fails at
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it and add a test case
> in TestDFSUtil.testIsValidName for HDFS-17246.






[jira] [Updated] (HDFS-17249) Add test case for DFSUtil.isValidName

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17249:
---
Description: 
TestDFSUtil.testIsValidName fails at
assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it.

Add a test case in TestDFSUtil.testIsValidName for HDFS-17246.

> Add test case for DFSUtil.isValidName
> -
>
> Key: HDFS-17249
> URL: https://issues.apache.org/jira/browse/HDFS-17249
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> TestDFSUtil.testIsValidName fails at
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); fix it and add a test case
> in TestDFSUtil.testIsValidName for HDFS-17246.






[jira] [Updated] (HDFS-17249) Add test case for DFSUtil.isValidName

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17249:
---
Summary: Add test case for DFSUtil.isValidName  (was: Fix test case for 
HDFS-17246)

> Add test case for DFSUtil.isValidName
> -
>
> Key: HDFS-17249
> URL: https://issues.apache.org/jira/browse/HDFS-17249
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>







[jira] [Created] (HDFS-17249) Fix test case for HDFS-17246

2023-11-03 Thread liuguanghua (Jira)
liuguanghua created HDFS-17249:
--

 Summary: Fix test case for HDFS-17246
 Key: HDFS-17249
 URL: https://issues.apache.org/jira/browse/HDFS-17249
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: liuguanghua









[jira] [Updated] (HDFS-17248) Fix

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Description: 
 

 

  was:
 

Add a test case for HDFS-17246:

assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246 is
merged.


> Fix 
> 
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
>  
>  






[jira] [Updated] (HDFS-17248) invalid

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Summary: invalid  (was: Fix )

> invalid
> ---
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
>  
>  






[jira] [Updated] (HDFS-17248) Fix

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Summary: Fix   (was: Add Test  case for for building Hadoop on Windows and )

> Fix 
> 
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
>  
> Add a test case for HDFS-17246:
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246
> is merged.






[jira] [Updated] (HDFS-17246) Fix shaded client for building Hadoop on Windows

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17246:
---
Summary: Fix shaded client for building Hadoop on Windows  (was: Fix 
DFSUtilClient.ValidName ERROR)

> Fix shaded client for building Hadoop on Windows
> 
>
> Key: HDFS-17246
> URL: https://issues.apache.org/jira/browse/HDFS-17246
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.4.0
> Environment: Windows 10
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2023-11-03-17-31-14-990.png
>
>
> Currently, the *shaded client* Yetus personality in Hadoop fails to build on 
> Windows - 
> https://github.com/apache/hadoop/blob/4c04a6768c0cb3d5081cfa5d84ffb389d92f5805/dev-support/bin/hadoop.sh#L541-L615.
> This happens due to the integration test failures in Hadoop client modules - 
> https://github.com/apache/hadoop/tree/4c04a6768c0cb3d5081cfa5d84ffb389d92f5805/hadoop-client-modules/hadoop-client-integration-tests.
> There are several issues that need to be addressed in order to get the 
> integration tests working -
> # Set the HADOOP_HOME, needed by the Mini DFS and YARN clusters spawned by 
> the integration tests.
> # Add Hadoop binaries to PATH, so that winutils.exe can be located.
> # Create a new user with Symlink privilege in the Docker image. This is 
> needed for the proper working of Mini YARN cluster, spawned by the 
> integration tests.
> # Fix a bug in DFSUtilClient.java that prevents colon ( *:* ) in the path.
> The colon is used as a delimiter for the PATH variable when specifying
> multiple paths. However, it isn't a delimiter on Windows and must be
> handled appropriately.
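The colon handling in point 4 can be illustrated with a small sketch of a component-wise path check. This is not the actual DFSUtilClient logic; the WINDOWS flag and the drive-letter regex are assumptions showing the shape of the platform-specific fix:

```python
import re

WINDOWS = False  # toggle to emulate the platform-specific handling

def is_valid_name(src):
    """Reject relative paths, '.'/'..' components, and ':' in components,
    except for Windows drive-letter components like 'C:' when WINDOWS is set."""
    if not src.startswith("/") or "//" in src:
        return False
    for element in src.split("/"):
        if element in (".", ".."):
            return False
        # A colon is rejected unless it forms a Windows drive-letter component.
        if ":" in element and not (WINDOWS and re.fullmatch(r"[A-Za-z]:", element)):
            return False
    return True

print(is_valid_name("/foo/bar"))    # True
print(is_valid_name("/foo/:/bar"))  # False on non-Windows
```

Flipping WINDOWS to True would let a hypothetical "/C:/foo" component through while still rejecting arbitrary colons.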






[jira] [Updated] (HDFS-17246) Fix DFSUtilClient.ValidName ERROR

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17246:
---
Summary: Fix DFSUtilClient.ValidName ERROR  (was: Fix shaded client for 
building Hadoop on Windows)

> Fix DFSUtilClient.ValidName ERROR
> -
>
> Key: HDFS-17246
> URL: https://issues.apache.org/jira/browse/HDFS-17246
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.4.0
> Environment: Windows 10
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2023-11-03-17-31-14-990.png
>
>
> Currently, the *shaded client* Yetus personality in Hadoop fails to build on 
> Windows - 
> https://github.com/apache/hadoop/blob/4c04a6768c0cb3d5081cfa5d84ffb389d92f5805/dev-support/bin/hadoop.sh#L541-L615.
> This happens due to the integration test failures in Hadoop client modules - 
> https://github.com/apache/hadoop/tree/4c04a6768c0cb3d5081cfa5d84ffb389d92f5805/hadoop-client-modules/hadoop-client-integration-tests.
> There are several issues that need to be addressed in order to get the 
> integration tests working -
> # Set the HADOOP_HOME, needed by the Mini DFS and YARN clusters spawned by 
> the integration tests.
> # Add Hadoop binaries to PATH, so that winutils.exe can be located.
> # Create a new user with Symlink privilege in the Docker image. This is 
> needed for the proper working of Mini YARN cluster, spawned by the 
> integration tests.
> # Fix a bug in DFSUtilClient.java that prevents colon ( *:* ) in the path.
> The colon is used as a delimiter for the PATH variable when specifying
> multiple paths. However, it isn't a delimiter on Windows and must be
> handled appropriately.






[jira] [Updated] (HDFS-17248) Add Test case for for building Hadoop on Windows and

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Description: 
 

Add a test case for HDFS-17246:

assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246 is
merged.

  was:assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once
HDFS-17246 is merged.


> Add Test  case for for building Hadoop on Windows and 
> --
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
>  
> Add a test case for HDFS-17246:
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246
> is merged.






[jira] [Resolved] (HDFS-17248) Add Test case for for building Hadoop on Windows and

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua resolved HDFS-17248.

Resolution: Invalid

> Add Test  case for for building Hadoop on Windows and 
> --
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
>  
> Add a test case for HDFS-17246:
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246
> is merged.






[jira] [Updated] (HDFS-17248) Add Test case for for building Hadoop on Windows and

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Summary: Add Test  case for for building Hadoop on Windows and   (was: Add 
Test for for building Hadoop on Windows)

> Add Test  case for for building Hadoop on Windows and 
> --
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246
> is merged.






[jira] [Updated] (HDFS-17248) Add Test for for building Hadoop on Windows

2023-11-03 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Summary: Add Test for for building Hadoop on Windows  (was: Fix 
DFSUtilClient.ValidName ERROR)

> Add Test for for building Hadoop on Windows
> ---
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
>
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246
> is merged.






[jira] [Created] (HDFS-17248) Fix DFSUtilClient.ValidName ERROR

2023-11-02 Thread liuguanghua (Jira)
liuguanghua created HDFS-17248:
--

 Summary: Fix DFSUtilClient.ValidName ERROR
 Key: HDFS-17248
 URL: https://issues.apache.org/jira/browse/HDFS-17248
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: liuguanghua


assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246 is
merged.






[jira] [Updated] (HDFS-17248) Fix DFSUtilClient.ValidName ERROR

2023-11-02 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17248:
---
Priority: Minor  (was: Major)

> Fix DFSUtilClient.ValidName ERROR
> -
>
> Key: HDFS-17248
> URL: https://issues.apache.org/jira/browse/HDFS-17248
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>
> assertFalse(DFSUtil.isValidName("/foo/:/bar")); will fail once HDFS-17246 is 
> merged.






[jira] [Updated] (HDFS-14500) NameNode StartupProgress continues to report edit log segments after the LOADING_EDITS phase is finished

2023-09-22 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-14500:
---
Description: 
When testing out a cluster with the edit log tailing fast path feature enabled 
(HDFS-13150), an unrelated issue caused the NameNode to remain in safe mode for 
an extended period of time, preventing the NameNode from fully completing its 
startup sequence. We noticed that the Startup Progress web UI displayed many 
edit log segments (millions of them).

I traced this problem back to {{{}StartupProgress{}}}. Within 
{{{}FSEditLogLoader{}}}, the loader continually tries to update the startup 
progress with a new {{Step}} any time that it loads edits. Per the Javadoc for 
{{{}StartupProgress{}}}, this should be a no-op once startup is completed:
{code:java|title=StartupProgress.java}
 * After startup completes, the tracked data is frozen.  Any subsequent updates
 * or counter increments are no-ops.
{code}
However, {{StartupProgress}} only implements that logic once the _entire_ 
startup sequence has been completed. When {{FSEditLogLoader}} calls 
{{{}addStep(){}}}, it adds it into the {{LOADING_EDITS}} phase:
{code:java|title=FSEditLogLoader.java}
StartupProgress prog = NameNode.getStartupProgress();
Step step = createStartupProgressStep(edits);
prog.beginStep(Phase.LOADING_EDITS, step);
{code}
This phase, in our case, ended long before, so it is nonsensical to continue to 
add steps to it. I believe it is a bug that {{StartupProgress}} accepts such 
steps instead of ignoring them; once a phase is complete, it should no longer 
change.

  was:
When testing out a cluster with the edit log tailing fast path feature enabled 
(HDFS-13150), an unrelated issue caused the NameNode to remain in safe mode for 
an extended period of time, preventing the NameNode from fully completing its 
startup sequence. We noticed that the Startup Progress web UI displayed many 
edit log segments (millions of them).

I traced this problem back to {{StartupProgress}}. Within {{FSEditLogLoader}}, 
the loader continually tries to update the startup progress with a new {{Step}} 
any time that it loads edits. Per the Javadoc for {{StartupProgress}}, this 
should be a no-op once startup is completed:
{code:title=StartupProgress.java}
 * After startup completes, the tracked data is frozen.  Any subsequent updates
 * or counter increments are no-ops.
{code}
However, {{StartupProgress}} only implements that logic once the _entire_ 
startup sequence has been completed. When {{FSEditLogLoader}} calls 
{{addStep()}}, it adds it into the {{LOADING_EDITS}} phase:
{code:title=FSEditLogLoader.java}
StartupProgress prog = NameNode.getStartupProgress();
Step step = createStartupProgressStep(edits);
prog.beginStep(Phase.LOADING_EDITS, step);
{code}
This phase, in our case, ended long before, so it is nonsensical to continue to 
add steps to it. I believe it is a bug that {{StartupProgress}} accepts such 
steps instead of ignoring them; once a phase is complete, it should no longer 
change.
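The behavior argued for above, dropping steps once their phase is complete, can be sketched with a simplified, hypothetical class (this is not the real StartupProgress API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of the proposed behavior: once a phase is marked
// complete, further steps for that phase become no-ops instead of
// accumulating indefinitely.
public class PhaseProgress {
    private final Set<String> completedPhases = new HashSet<>();
    private final List<String> steps = new ArrayList<>();

    public synchronized void endPhase(String phase) {
        completedPhases.add(phase);
    }

    public synchronized void beginStep(String phase, String step) {
        if (completedPhases.contains(phase)) {
            return;  // phase is frozen; drop late-arriving steps
        }
        steps.add(phase + ":" + step);
    }

    public synchronized int stepCount() {
        return steps.size();
    }
}
```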


> NameNode StartupProgress continues to report edit log segments after the 
> LOADING_EDITS phase is finished
> 
>
> Key: HDFS-14500
> URL: https://issues.apache.org/jira/browse/HDFS-14500
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0, 2.9.2, 3.0.3, 2.8.5, 3.1.2
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14500-branch-2.001.patch, HDFS-14500.000.patch, 
> HDFS-14500.001.patch
>
>
> When testing out a cluster with the edit log tailing fast path feature 
> enabled (HDFS-13150), an unrelated issue caused the NameNode to remain in 
> safe mode for an extended period of time, preventing the NameNode from fully 
> completing its startup sequence. We noticed that the Startup Progress web UI 
> displayed many edit log segments (millions of them).
> I traced this problem back to {{{}StartupProgress{}}}. Within 
> {{{}FSEditLogLoader{}}}, the loader continually tries to update the startup 
> progress with a new {{Step}} any time that it loads edits. Per the Javadoc 
> for {{{}StartupProgress{}}}, this should be a no-op once startup is completed:
> {code:java|title=StartupProgress.java}
>  * After startup completes, the tracked data is frozen.  Any subsequent 
> updates
>  * or counter increments are no-ops.
> {code}
> However, {{StartupProgress}} only implements that logic once the _entire_ 
> startup sequence has been completed. When {{FSEditLogLoader}} calls 
> {{{}addStep(){}}}, it adds it into the {{LOADING_EDITS}} phase:
> {code:java|title=FSEditLogLoader.java}
> StartupProgress prog 

[jira] [Created] (HDFS-17186) NamenodeProtocol.versionRequest() threw null in DFSRouter service.

2023-09-11 Thread liuguanghua (Jira)
liuguanghua created HDFS-17186:
--

 Summary: NamenodeProtocol.versionRequest() threw null in DFSRouter 
service.
 Key: HDFS-17186
 URL: https://issues.apache.org/jira/browse/HDFS-17186
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.3.2
Reporter: liuguanghua


In the DFSRouter service, I found the errors below. It seems that 
NamenodeProtocol.versionRequest() failed, and on the retry the client callId 
was found to be null.

2023-09-09 04:00:50,822 ERROR 
org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService: 
Unexpected exception while communicating with hdfs1-nn1:hdfsmaster1-001:8022: 
null
java.lang.IllegalStateException
        at 
org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:494)
        at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:119)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:162)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy17.versionRequest(Unknown Source)
        at 
org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.getNamenodeStatusReport(NamenodeHeartbeatService.java:270)
        at 
org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:218)
        at 
org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:172)
        at 
org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
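A likely reason the log line above ends in "null": Guava's Preconditions.checkState(boolean) throws an IllegalStateException with no detail message, so logging e.getMessage() prints the literal string "null". A minimal stand-in (not the Guava source) demonstrating this:

```java
// Demonstrates why a logged IllegalStateException message can read "null":
// the one-argument checkState variant sets no detail message.
public class CheckStateDemo {
    static void checkState(boolean expression) {  // minimal stand-in for Guava
        if (!expression) {
            throw new IllegalStateException();
        }
    }

    public static String messageFor(boolean ok) {
        try {
            checkState(ok);
            return "ok";
        } catch (IllegalStateException e) {
            return String.valueOf(e.getMessage()); // "null" when no message set
        }
    }
}
```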






[jira] [Updated] (HDFS-17182) DataSetLockManager.lockLeakCheck() is not thread-safe.

2023-09-11 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17182:
---
Summary: DataSetLockManager.lockLeakCheck() is not thread-safe.   (was: 
DataSetLockManager.threadCountMap is not thread-safe. )

> DataSetLockManager.lockLeakCheck() is not thread-safe. 
> ---
>
> Key: HDFS-17182
> URL: https://issues.apache.org/jira/browse/HDFS-17182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>
> threadCountMap is not thread-safe: every other method that touches it is 
> protected by synchronized except lockLeakCheck(). Add synchronized to 
> lockLeakCheck() as well.
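The proposed fix can be sketched with a simplified, hypothetical tracker (the names below are illustrative, not the actual DataSetLockManager code): every method touching the map is synchronized, including the leak check:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the issue: a plain HashMap guarded by synchronized in every
// method except one leaves that one method racy. Making the leak check
// synchronized as well restores mutual exclusion.
public class ThreadCountTracker {
    private final Map<String, Integer> threadCountMap = new HashMap<>();

    public synchronized void acquire(String lockName) {
        threadCountMap.merge(lockName, 1, Integer::sum);
    }

    public synchronized void release(String lockName) {
        threadCountMap.computeIfPresent(lockName, (k, v) -> v > 1 ? v - 1 : null);
    }

    // The fix proposed in this issue: synchronized here too, so iterating
    // the map cannot run concurrently with acquire()/release().
    public synchronized int lockLeakCheck() {
        int leaks = 0;
        for (int count : threadCountMap.values()) {
            leaks += count;
        }
        return leaks;
    }
}
```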






[jira] [Updated] (HDFS-17182) DataSetLockManager.threadCountMap is not thread-safe.

2023-09-11 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17182:
---
Summary: DataSetLockManager.threadCountMap is not thread-safe.   (was: 
threadCountMap is not thread-safe. )

> DataSetLockManager.threadCountMap is not thread-safe. 
> --
>
> Key: HDFS-17182
> URL: https://issues.apache.org/jira/browse/HDFS-17182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>
> threadCountMap is not thread-safe: every other method that touches it is 
> protected by synchronized except lockLeakCheck(). Add synchronized to 
> lockLeakCheck() as well.






[jira] [Assigned] (HDFS-17182) threadCountMap is not thread-safe.

2023-09-08 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17182:
--

Assignee: liuguanghua

> threadCountMap is not thread-safe. 
> ---
>
> Key: HDFS-17182
> URL: https://issues.apache.org/jira/browse/HDFS-17182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>
> threadCountMap is not thread-safe: every other method that touches it is 
> protected by synchronized except lockLeakCheck(). Add synchronized to 
> lockLeakCheck() as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17182) threadCountMap is not thread-safe.

2023-09-07 Thread liuguanghua (Jira)
liuguanghua created HDFS-17182:
--

 Summary: threadCountMap is not thread-safe. 
 Key: HDFS-17182
 URL: https://issues.apache.org/jira/browse/HDFS-17182
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: liuguanghua


threadCountMap is not thread-safe: every other method that touches it is 
protected by synchronized except lockLeakCheck(). Add synchronized to 
lockLeakCheck() as well.






[jira] [Assigned] (HDFS-17129) mis-order of ibr and fbr on datanode

2023-09-06 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reassigned HDFS-17129:
--

Assignee: liuguanghua

> mis-order of ibr and fbr on datanode 
> -
>
> Key: HDFS-17129
> URL: https://issues.apache.org/jira/browse/HDFS-17129
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
> Environment: hdfs3.4.0
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-16016 provides a new thread to handle IBRs. That is a great improvement, 
> but it may cause IBRs and FBRs to arrive out of order.
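One way to avoid such reordering, sketched with illustrative names (this is not the actual Hadoop fix), is to route both report types through a single FIFO queue drained by one sender thread, so reports reach the NameNode in submission order:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: both incremental (IBR) and full (FBR) block reports pass through
// one FIFO queue, so a single sender thread preserves submission order.
public class ReportQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    public synchronized void submit(String report) {
        queue.addLast(report);
    }

    public synchronized String nextToSend() {
        return queue.pollFirst();   // FIFO: submission order is send order
    }
}
```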






[jira] (HDFS-17129) mis-order of ibr and fbr on datanode

2023-09-06 Thread liuguanghua (Jira)


[ https://issues.apache.org/jira/browse/HDFS-17129 ]


liuguanghua deleted comment on HDFS-17129:


was (Author: liuguanghua):
merge into  HDFS-17121

> mis-order of ibr and fbr on datanode 
> -
>
> Key: HDFS-17129
> URL: https://issues.apache.org/jira/browse/HDFS-17129
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
> Environment: hdfs3.4.0
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-16016 provides a new thread to handle IBRs. That is a great improvement, 
> but it may cause IBRs and FBRs to arrive out of order.






[jira] [Reopened] (HDFS-17129) mis-order of ibr and fbr on datanode

2023-09-04 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua reopened HDFS-17129:


> mis-order of ibr and fbr on datanode 
> -
>
> Key: HDFS-17129
> URL: https://issues.apache.org/jira/browse/HDFS-17129
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
> Environment: hdfs3.4.0
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-16016 provides a new thread to handle IBRs. That is a great improvement, 
> but it may cause IBRs and FBRs to arrive out of order.






[jira] [Created] (HDFS-17170) add metrics for datanode in function processQueueMessages and reportTo

2023-08-28 Thread liuguanghua (Jira)
liuguanghua created HDFS-17170:
--

 Summary: add metrics for datanode in function processQueueMessages 
and reportTo
 Key: HDFS-17170
 URL: https://issues.apache.org/jira/browse/HDFS-17170
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: liuguanghua


Add two metrics for the DataNode:

(1) BPServiceActorAction.reportTo executes errorReport and reportBadBlocks; 
record the count of these calls.

(2) BPServiceActor.processQueueMessages runs in the heartbeat loop to the 
NameNode; record its numOps and elapsed time.
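A minimal sketch of the kind of metric proposed, tracking numOps plus cumulative time in the spirit of Hadoop's MutableRate (this stand-in is not the metrics2 API):

```java
// Simplified op-count/latency metric: records how many times an operation
// ran and the average time it took.
public class OpMetric {
    private long numOps;
    private long totalTimeMs;

    public synchronized void add(long elapsedMs) {
        numOps++;
        totalTimeMs += elapsedMs;
    }

    public synchronized long numOps() {
        return numOps;
    }

    public synchronized double avgTimeMs() {
        return numOps == 0 ? 0.0 : (double) totalTimeMs / numOps;
    }
}
```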






[jira] [Resolved] (HDFS-17129) mis-order of ibr and fbr on datanode

2023-07-27 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua resolved HDFS-17129.

Resolution: Abandoned

merge into  HDFS-17121

> mis-order of ibr and fbr on datanode 
> -
>
> Key: HDFS-17129
> URL: https://issues.apache.org/jira/browse/HDFS-17129
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
> Environment: hdfs3.4.0
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-16016 provides a new thread to handle IBRs. That is a great improvement, 
> but it may cause IBRs and FBRs to arrive out of order.






[jira] [Commented] (HDFS-17129) mis-order of ibr and fbr on datanode

2023-07-26 Thread liuguanghua (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747716#comment-17747716
 ] 

liuguanghua commented on HDFS-17129:


Yes, the FBR should hold the lock the whole time. And blockReport() sends the 
IBRs first and then the FBR, so it can prevent the mis-ordering.

> mis-order of ibr and fbr on datanode 
> -
>
> Key: HDFS-17129
> URL: https://issues.apache.org/jira/browse/HDFS-17129
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
> Environment: hdfs3.4.0
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-16016 provides a new thread to handle IBRs. That is a great improvement, 
> but it may cause IBRs and FBRs to arrive out of order.






[jira] [Created] (HDFS-17129) mis-order of ibr and fbr on datanode

2023-07-26 Thread liuguanghua (Jira)
liuguanghua created HDFS-17129:
--

 Summary: mis-order of ibr and fbr on datanode 
 Key: HDFS-17129
 URL: https://issues.apache.org/jira/browse/HDFS-17129
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.4.0
 Environment: hdfs3.4.0
Reporter: liuguanghua


HDFS-16016 provides a new thread to handle IBRs. That is a great improvement, 
but it may cause IBRs and FBRs to arrive out of order.






[jira] [Updated] (HDFS-17121) BPServiceActor to provide new thread to handle FBR

2023-07-25 Thread liuguanghua (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuguanghua updated HDFS-17121:
---
Description: 
After HDFS-16016 , it makes ibr in a thread to avoid heartbeat blocking with 
ibr when require readlock  in Datanode.

 Now fbr should do as this.   The reason is this:

(1)heartbeat maybe block because of fbr with readlock in datanode

(2)fbr will only may return FinalizeCommand 

  was:
# After HDFS-16016, IBRs are handled in a separate thread so the heartbeat is 
not blocked while an IBR waits for the read lock in the DataNode.
 # The FBR should be handled the same way.
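The proposal can be sketched as running the FBR on its own thread so the heartbeat loop never waits on the dataset read lock held during the report (the names below are illustrative, not the actual BPServiceActor API; joining the thread here is only for demonstration):

```java
// Sketch: offload the full block report to a dedicated thread, mirroring
// what HDFS-16016 did for IBRs.
public class FbrOffload {
    private volatile String lastReport;

    public String runFullBlockReportOffThread() {
        Thread fbrThread = new Thread(() -> {
            // ... scan volumes under the read lock, build and send report ...
            lastReport = "FBR-sent";
        }, "fbr-sender");
        fbrThread.start();
        try {
            fbrThread.join();   // demo only; the heartbeat loop would not wait
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lastReport;
    }
}
```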


> BPServiceActor to provide new thread to handle FBR
> --
>
> Key: HDFS-17121
> URL: https://issues.apache.org/jira/browse/HDFS-17121
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: liuguanghua
>Priority: Minor
>
> After HDFS-16016, IBRs are handled in a separate thread so the heartbeat is 
> not blocked while an IBR waits for the read lock in the DataNode.
> The FBR should be handled the same way, for two reasons:
> (1) the heartbeat may block while an FBR holds the read lock in the DataNode
> (2) an FBR can only return a FinalizeCommand






[jira] [Created] (HDFS-17121) BPServiceActor to provide new thread to handle FBR

2023-07-25 Thread liuguanghua (Jira)
liuguanghua created HDFS-17121:
--

 Summary: BPServiceActor to provide new thread to handle FBR
 Key: HDFS-17121
 URL: https://issues.apache.org/jira/browse/HDFS-17121
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Reporter: liuguanghua


# After HDFS-16016, IBRs are handled in a separate thread so the heartbeat is 
not blocked while an IBR waits for the read lock in the DataNode.
 # The FBR should be handled the same way.





