[jira] [Commented] (HDFS-15003) RBF: Make Router support storage type quota.

2019-12-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003867#comment-17003867
 ] 

Xiaoqiao He commented on HDFS-15003:


Hi [~LiJinglun], the OOM error is not related to this PR in my opinion; please 
ignore it. HDFS-14997 is tracking the 'java.lang.OutOfMemoryError: Java heap 
space' that Jenkins reported.

> RBF: Make Router support storage type quota.
> 
>
> Key: HDFS-15003
> URL: https://issues.apache.org/jira/browse/HDFS-15003
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15003.001.patch, HDFS-15003.002.patch, 
> HDFS-15003.003.patch, HDFS-15003.004.patch, HDFS-15003.005.patch, 
> HDFS-15003.006.patch, HDFS-15003.007.patch, HDFS-15003.008.patch
>
>
> Make Router support storage type quota.






[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003868#comment-17003868
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~ayushtkn], [~elgoiri], [~Kevin_Zheng] for your help and reviews.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions, reporting (#sendHeartbeat, #blockReport, 
> #cacheReport) and #processCommand, in the #BPServiceActor main process flow. If 
> processCommand takes a long time, it blocks the report flow. Meanwhile, 
> processCommand can take a long time (over 1000s in the worst case I have met) 
> when the IO load of the DataNode is very high. Since some IO operations are 
> under #datasetLock, it has to wait a long time to acquire #datasetLock when 
> processing some commands (such as #DNA_INVALIDATE). In such cases, the 
> #heartbeat will not be sent to the NameNode in time, and this triggers other 
> disasters.
> I propose to process commands asynchronously so that #BPServiceActor is not 
> blocked from sending heartbeats back to the NameNode under high IO load.
> Notes:
> 1. Lifeline could be an effective solution; however, some old branches do not 
> support this feature.
> 2. IO operations under #datasetLock are a separate issue; I think we should 
> solve it in another JIRA.
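To illustrate the idea, here is a minimal sketch of such a decoupling, with 
hypothetical names (this is not the actual HDFS-14997 patch): commands are 
queued and drained by a dedicated thread, so the heartbeat/report loop never 
waits on slow IO.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: the actor loop enqueues NameNode commands and keeps
// heartbeating; a dedicated worker drains the queue, so a slow command
// (e.g. DNA_INVALIDATE waiting on datasetLock) no longer blocks reports.
class AsyncCommandProcessor {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(1024);
  private final Thread worker = new Thread(this::drain, "CommandProcessingThread");

  void start() {
    worker.setDaemon(true);
    worker.start();
  }

  // Called from the actor loop: returns immediately; false means the queue is full.
  boolean submit(Runnable command) {
    return queue.offer(command);
  }

  private void drain() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        queue.take().run();  // may take a long time; heartbeats are unaffected
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}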






[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2019-12-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Status: Open  (was: Patch Available)

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this measurement and maybe move the timing into the 
> processing thread.






[jira] [Updated] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15051:
---
Status: Open  (was: Patch Available)

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permission when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a user 
> could pass any mode and bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to revoke the WRITE MountTableEntry privilege to the super user only.
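As a rough sketch of the proposal (a hypothetical method, not the actual 
patch), a WRITE operation would short-circuit to a super-user check instead of 
trusting the client-supplied mode:
{code:java}
// Hypothetical sketch: mount table WRITE ignores the entry's own owner/mode
// (which the client controls and could forge) and requires the super user.
public void checkWritePermission(MountTable mountTable)
    throws AccessControlException {
  if (!isSuperUser()) {
    throw new AccessControlException(
        "Permission denied while writing mount table "
        + mountTable.getSourcePath()
        + ": user " + getUser() + " is not the super user.");
  }
}
{code}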






[jira] [Updated] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15051:
---
Status: Patch Available  (was: Open)

Re-trigger Jenkins manually.

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permission when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a user 
> could pass any mode and bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to revoke the WRITE MountTableEntry privilege to the super user only.






[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2019-12-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Status: Patch Available  (was: Open)

Re-trigger Jenkins manually.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this measurement and maybe move the timing into the 
> processing thread.






[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2019-12-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.003.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this measurement and maybe move the timing into the 
> processing thread.






[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2019-12-28 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004667#comment-17004667
 ] 

Xiaoqiao He commented on HDFS-15075:


Update v003: fix the failed unit tests and checkstyle.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this measurement and maybe move the timing into the 
> processing thread.






[jira] [Updated] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15051:
---
Attachment: HDFS-15051.006.patch

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permission when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a user 
> could pass any mode and bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to revoke the WRITE MountTableEntry privilege to the super user only.






[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-28 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004668#comment-17004668
 ] 

Xiaoqiao He commented on HDFS-15051:


v006 rebases on trunk and fixes checkstyle.

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permission when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a user 
> could pass any mode and bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to revoke the WRITE MountTableEntry privilege to the super user only.






[jira] [Created] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2019-12-28 Thread Xiaoqiao He (Jira)
Xiaoqiao He created HDFS-15082:
--

 Summary: RBF: Check each component length of destination path when 
add/update mount entry
 Key: HDFS-15082
 URL: https://issues.apache.org/jira/browse/HDFS-15082
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: rbf
Reporter: Xiaoqiao He
Assignee: Xiaoqiao He


When adding or updating a mount entry, the length of each component of the 
destination path could exceed the filesystem path component length limit; see 
`dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should 
check the length of each component of the destination path when adding or 
updating a mount entry on the Router side.
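As a rough illustration of the intended check (a hypothetical helper; the 
NameNode counts bytes for this limit, so the sketch does too):
{code:java}
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: reject a destination path whose components exceed the
// limit, mirroring what the NameNode enforces via
// dfs.namenode.fs-limits.max-component-length (255 by default).
static void checkComponentLengths(String destPath, int maxComponentLength) {
  for (String component : destPath.split("/")) {
    int length = component.getBytes(StandardCharsets.UTF_8).length;
    if (length > maxComponentLength) {
      throw new IllegalArgumentException("Path component '" + component
          + "' is " + length + " bytes, exceeding the limit of "
          + maxComponentLength);
    }
  }
}
{code}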






[jira] [Updated] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2019-12-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15082:
---
Attachment: HDFS-15082.001.patch
Status: Patch Available  (was: Open)

Submit patch v001, pending Jenkins.

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch
>
>
> When adding or updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should 
> check the length of each component of the destination path when adding or 
> updating a mount entry on the Router side.






[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2019-12-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004700#comment-17004700
 ] 

Xiaoqiao He commented on HDFS-15082:


A mount entry is useless at the Router side if its destination path contains 
components that exceed the length restriction while 
`DFS_NAMENODE_MAX_COMPONENT_LENGTH_KEY` is enabled on the NameNode side; 
actually, it is enabled with a default of 255. So I think we should check it 
and avoid adding/updating unusable entries in the Router mount table.
{quote}is it similar to HDFS-13576?{quote}
They are exactly the same; sorry for not noticing it before.
{quote}Are we having a separate configuration at the Router to specify the path 
length independent of the namespace?{quote}
Any other suggestions? I thought about reusing 
`dfs.namenode.fs-limits.max-component-length` at the beginning, but it seems it 
will be more convenient to configure if we separate them, IMO.

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch
>
>
> When adding or updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should 
> check the length of each component of the destination path when adding or 
> updating a mount entry on the Router side.






[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004715#comment-17004715
 ] 

Xiaoqiao He commented on HDFS-15051:


Thanks [~ayushtkn],
{quote}Give a check what is the scenario with mkdirs(), what all permissions 
are checked, we should keep same as mkdir IMO.{quote}
It makes sense to me; I will update later.

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permission when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a user 
> could pass any mode and bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to revoke the WRITE MountTableEntry privilege to the super user only.






[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2019-12-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005115#comment-17005115
 ] 

Xiaoqiao He commented on HDFS-15082:


Hi [~ayushtkn], thanks for your comments.
This thought comes from [~elgoiri]'s 
[comments|https://issues.apache.org/jira/browse/HDFS-15051?focusedCommentId=16994972&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16994972].
Since we open the add/update mount entry privilege to end users, there are 
cases where these operations are not valid, such as a path component being too 
long. If an end user creates an invalid entry and does not clean it up in time, 
more and more unused entries will pile up, since there is no auto-clean 
mechanism now. This may reduce the performance of query/write/delete 
operations. (Sorry, I did not collect detailed benchmark results; I just 
observed the latency while there were hundreds of thousands of znodes (mount 
table entries & delegation token entries) with ZKDelegationTokenSecretManager.) 
So my thoughts are:
a. Revoke the mount entry operation privilege to the super user only (it looks 
like this solution is not proper for [~elgoiri]'s cases).
b. Limit invalid entries in the mount table (PathComponentTooLongException is 
one case of them).
For the configuration value, the current demo patch is not the final solution; 
more suggestions are welcome. Of course, it is not very graceful for different 
namespaces with different max-component-length values. What about giving a 
bigger default value (such as 512/1024) to limit it? Any other thoughts?

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch
>
>
> When adding or updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should 
> check the length of each component of the destination path when adding or 
> updating a mount entry on the Router side.






[jira] [Updated] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15051:
---
Attachment: HDFS-15051.007.patch

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch, HDFS-15051.007.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permission when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a user 
> could pass any mode and bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to revoke the WRITE MountTableEntry privilege to the super user only.






[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005417#comment-17005417
 ] 

Xiaoqiao He commented on HDFS-15051:


[~ayushtkn], [~elgoiri], v007 tries to check the write permission of the 
nearest parent mount table entry, plus the execute permission of the other 
ancestor entries if they exist, when adding a mount table entry. Please take 
another review. Thanks.

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch, 
> HDFS-15051.006.patch, HDFS-15051.007.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permission when operating on a 
> MountTableEntry; however, the mountTable object comes from the 
> Client/RouterAdmin ({{MountTable mountTable = request.getEntry();}}), so a user 
> could pass any mode and bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to revoke the WRITE MountTableEntry privilege to the super user only.






[jira] [Created] (HDFS-15088) Correct annotation typo of RouterPermissionChecker#checkPermission

2019-12-30 Thread Xiaoqiao He (Jira)
Xiaoqiao He created HDFS-15088:
--

 Summary: Correct annotation typo of 
RouterPermissionChecker#checkPermission
 Key: HDFS-15088
 URL: https://issues.apache.org/jira/browse/HDFS-15088
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: rbf
Reporter: Xiaoqiao He
Assignee: Xiaoqiao He


Correct annotation typo of RouterPermissionChecker#checkPermission.
{code:java}
  /**
   * Whether a mount table entry can be accessed by the current context.
   *
   * @param mountTable
   *  MountTable being accessed
   * @param access
   *  type of action being performed on the cache pool
   * @throws AccessControlException
   *   if mount table cannot be accessed
   */
  public void checkPermission(MountTable mountTable, FsAction access)
  throws AccessControlException {
}
{code}







[jira] [Updated] (HDFS-15088) Correct annotation typo of RouterPermissionChecker#checkPermission

2019-12-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15088:
---
Attachment: HDFS-15088.patch
Status: Patch Available  (was: Open)

> Correct annotation typo of RouterPermissionChecker#checkPermission
> --
>
> Key: HDFS-15088
> URL: https://issues.apache.org/jira/browse/HDFS-15088
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Trivial
> Attachments: HDFS-15088.patch
>
>
> Correct annotation typo of RouterPermissionChecker#checkPermission.
> {code:java}
>   /**
>* Whether a mount table entry can be accessed by the current context.
>*
>* @param mountTable
>*  MountTable being accessed
>* @param access
>*  type of action being performed on the cache pool
>* @throws AccessControlException
>*   if mount table cannot be accessed
>*/
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> }
> {code}






[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register

2020-01-01 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006581#comment-17006581
 ] 

Xiaoqiao He commented on HDFS-15068:


v005 LGTM. I ran the failed unit tests locally, and it seems most of them 
passed (except #TestRedudantBlocks, which seems unrelated to these changes). 
Ping [~elgoiri], [~iwasakims], [~weichiu]: would you like to take another look 
if you have time?

> DataNode could meet deadlock if invoke refreshVolumes when register
> ---
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Aiphago
>Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, 
> HDFS-15068.003.patch, HDFS-15068.004.patch, HDFS-15068.005.patch
>
>
> The DataNode could meet a deadlock when `dfsadmin -reconfig datanode ip:host 
> start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes holds the datanode instance's ownable 
> {{synchronizer}} when it first enters the method, then tries to hold the 
> BPOfferService {{readlock}} for `bpos.getNamespaceInfo()` in the following 
> code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (which is invoked by #register when 
> the DataNode starts, or by #reregister in processCommandFromActor) holds the 
> BPOfferService {{writelock}} first, then tries to hold the datanode instance's 
> ownable {{synchronizer}} in the following method.
> {code:java}
>   synchronized void bpRegistrationSucceeded(DatanodeRegistration 
> bpRegistration,
>   String blockPoolId) throws IOException {
> id = bpRegistration;
> if(!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>   throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>   + bpRegistration.getDatanodeUuid()
>   + ". Expecting " + storage.getDatanodeUuid());
> }
> 
> registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
>   }
> {code}
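A common remedy for this kind of lock-ordering deadlock is to never hold both 
locks at once. A minimal sketch of what #refreshVolumes could do 
(applyVolumeChanges is a hypothetical placeholder, not the actual HDFS-15068 
patch):
{code:java}
// Hypothetical sketch: gather namespace info from each BPOfferService *before*
// taking the DataNode-level lock, so no thread ever holds the DataNode lock
// while waiting on a BPOfferService lock (breaking the cycle in step 1).
List<NamespaceInfo> nsInfos = new ArrayList<>();
for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
  nsInfos.add(bpos.getNamespaceInfo());  // takes only the BPOfferService read lock
}
synchronized (this) {  // DataNode lock acquired with no other lock held
  applyVolumeChanges(nsInfos);  // hypothetical helper for the remaining work
}
{code}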






[jira] [Commented] (HDFS-15091) Cache Admin and Quota Commands Should Check SuperUser Before Taking Lock

2020-01-02 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006735#comment-17006735
 ] 

Xiaoqiao He commented on HDFS-15091:


v02 LGTM, +1 from my side.

> Cache Admin and Quota Commands Should Check SuperUser Before Taking Lock
> 
>
> Key: HDFS-15091
> URL: https://issues.apache.org/jira/browse/HDFS-15091
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15091-01.patch, HDFS-15091-02.patch
>
>
> As of now, all APIs check for the superuser before taking the lock; the same 
> can be done for the cache commands and setQuota.
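The pattern being asked for looks roughly like this (a sketch in the 
FSNamesystem style; the body is elided and the signature is illustrative):
{code:java}
// Sketch: validate the caller's privilege *before* acquiring the namesystem
// write lock, so unauthorized callers fail fast and never contend for it.
void setQuota(String src, long nsQuota, long ssQuota) throws IOException {
  checkSuperuserPrivilege();  // throws AccessControlException for non-superusers
  writeLock();
  try {
    // ... perform the quota update under the lock ...
  } finally {
    writeUnlock();
  }
}
{code}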






[jira] [Commented] (HDFS-15083) Add new trash rpc which move the trash (mkdir and the rename) operation to the server side.

2020-01-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008293#comment-17008293
 ] 

Xiaoqiao He commented on HDFS-15083:


Thanks [~zhuqi] for your work; this is very meaningful. I also wonder whether 
there are backward compatibility issues; IIUC there are some other stub 
filesystems under the filesystem interface.
Just a minor suggestion: we should split this task into adding the new 
interface and the RBF part separately; that will be easier to discuss and push 
forward, IMO.
cc [~ayushtkn], [~ramkumar]: any thoughts about this JIRA?

> Add new trash rpc which move the trash (mkdir and the rename) operation to 
> the server side.
> ---
>
> Key: HDFS-15083
> URL: https://issues.apache.org/jira/browse/HDFS-15083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient, namenode, rbf
>Affects Versions: 2.10.0, 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15083.001.patch
>
>
> The current RBF trash with multiple mounted clusters, from 
> [HDFS-14117|https://issues.apache.org/jira/browse/HDFS-14117], is not a 
> graceful solution.
> If we move the client-side trash operations (mkdir and rename) to the server 
> side, we can not only solve the problem gracefully but also reduce the trash 
> RPC load on the server side by about 50% compared to the original trash, which 
> makes two RPC calls (mkdir and rename).
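For illustration only, a combined server-side RPC might look like this 
(hypothetical interface and signature, not the actual HDFS-15083 patch):
{code:java}
import java.io.IOException;

// Hypothetical sketch: one RPC that performs the mkdir of the trash directory
// and the rename into it on the server side, replacing the client's two calls.
public interface TrashProtocol {
  /**
   * Move src into the caller's trash directory, creating it if necessary.
   * @return true if the move succeeded
   */
  boolean moveToTrash(String src) throws IOException;
}
{code}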






[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2020-01-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008298#comment-17008298
 ] 

Xiaoqiao He commented on HDFS-15079:


Thanks [~ferhui] for your work and the ping. I believe this issue is clear and 
obvious based on the above discussion.
About the solution, it seems to be the same issue as with data locality, so we 
will turn back to that discussion again. Because we would have to change other 
modules, and both solutions have their own pros and cons, we have not reached a 
final conclusion up to now. I would like to list both solutions here:
a. Open a restricted get/set ClientId and CallId interface to some super users.
b. Put the ClientId/CallId into the CallerContext and parse them on the 
NameNode side. The detailed pros and cons are stated in HDFS-13248, FYI.
In my own opinion, I prefer solution a plus strict access control. Pending some 
more suggestions?

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: HDFS-15079.001.patch, UnexpectedOverWriteUT.patch
>
>
> I find there is a critical problem in RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about the overall resolution.
> The problem is:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, and 
> fails over to r1.
> r0 has already sent the create RPC to the namenode (1st create).
> The client creates the HDFS file via r1 (2nd create).
> The client writes the HDFS file and finally closes it (3rd close).
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this would leave the file that had been 
> written as an empty file. This is a critical problem.
> We have encountered this problem: there are many Hive and Spark jobs running 
> on our cluster, and sometimes it occurs.
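For solution b above, a rough sketch of how a Router could tag forwarded calls 
with the origin identity via the existing CallerContext API (the encoding 
format here is a hypothetical choice):
{code:java}
import java.util.Base64;
import org.apache.hadoop.ipc.CallerContext;

// Hypothetical sketch of solution (b): the Router records the originating
// client's id/callId in the CallerContext of the forwarded RPC, so the
// NameNode could key its retry cache on the real client rather than the Router.
class ClientIdentityForwarder {
  static void tagCall(byte[] clientId, int callId) {
    String ctx = "clientId:" + Base64.getEncoder().encodeToString(clientId)
        + ",callId:" + callId;
    CallerContext.setCurrent(new CallerContext.Builder(ctx).build());
  }
}
{code}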






[jira] [Created] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-11 Thread Xiaoqiao He (Jira)
Xiaoqiao He created HDFS-15113:
--

 Summary: Missing IBR when NameNode restart if open processCommand 
async feature
 Key: HDFS-15113
 URL: https://issues.apache.org/jira/browse/HDFS-15113
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Xiaoqiao He
Assignee: Xiaoqiao He


Recently, I met a case where the NameNode was missing blocks after a restart, 
which is related to HDFS-14997.
a. During a NameNode restart, it returns the `DNA_REGISTER` command to a 
DataNode when it receives certain RPC requests from that DataNode.
b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
asynchronously.
{code:java}
  void reRegister() throws IOException {
if (shouldRun()) {
  // re-retrieve namespace info to make sure that, if the NN
  // was restarted, we still match its version (HDFS-2120)
  NamespaceInfo nsInfo = retrieveNamespaceInfo();
  // and re-register
  register(nsInfo);
  scheduler.scheduleHeartbeat();
  // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
  // for sometime.
  if (state == HAServiceState.STANDBY || state == HAServiceState.OBSERVER) {
ibrManager.clearIBRs();
  }
}
  }
{code}
c. As we know, #register triggers a BR immediately.
d. Because #reRegister runs asynchronously, we cannot be sure which runs first: 
sending the FBR or clearing the IBRs. If clearing the IBRs runs first, it is 
OK. But if the FBR is sent first and the IBRs are cleared afterwards, the 
blocks received between these two points in time will be missing until the 
next FBR.
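One possible ordering fix, sketched against the method above (an illustration, 
not necessarily what the attached patch does): clear the IBRs before #register, 
so the FBR can never be followed by a clear that drops freshly received blocks.
{code:java}
// Sketch: make the clear-then-report order explicit. register() triggers the
// FBR, so IBRs cleared beforehand cannot contain blocks received after it.
void reRegister() throws IOException {
  if (shouldRun()) {
    NamespaceInfo nsInfo = retrieveNamespaceInfo();
    if (state == HAServiceState.STANDBY || state == HAServiceState.OBSERVER) {
      ibrManager.clearIBRs();  // moved before register(): no FBR has happened yet
    }
    register(nsInfo);
    scheduler.scheduleHeartbeat();
  }
}
{code}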






[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Attachment: HDFS-15113.001.patch
Status: Patch Available  (was: Open)

Submit v001 to fix this case; a unit test will be attached later.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15113.001.patch
>
>
> Recently, I met a case where the NameNode was missing blocks after a restart, 
> which is related to HDFS-14997.
> a. During a NameNode restart, it returns the `DNA_REGISTER` command to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which runs 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs first, 
> it is OK. But if the FBR is sent first and the IBRs are cleared afterwards, 
> the blocks received between these two points in time will be missing until 
> the next FBR.






[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2020-01-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.004.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this measurement and maybe move the timing into the 
> processing thread.






[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2020-01-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013457#comment-17013457
 ] 

Xiaoqiao He commented on HDFS-15075:


Thanks [~elgoiri] and [~dabecker] for your review comments; v004 tries to 
address them, please take another review.
BTW, I met another issue recently; HDFS-15113 is tracking it and trying to fix 
it. If someone wants to enable this feature, please wait for HDFS-15113.

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch, 
> HDFS-15075.003.patch, HDFS-15075.004.patch
>
>
> HDFS-14997 made command processing asynchronous.
> Right now, we are timing how long it takes to add a command to the queue.
> We should remove this measurement and maybe move the timing into the 
> processing thread.






[jira] [Updated] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-01-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15082:
---
Attachment: HDFS-15082.002.patch

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch
>
>
> When adding or updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should 
> check the length of each component of the destination path when adding or 
> updating a mount entry on the Router side.






[jira] [Commented] (HDFS-15082) RBF: Check each component length of destination path when add/update mount entry

2020-01-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013460#comment-17013460
 ] 

Xiaoqiao He commented on HDFS-15082:


Thanks [~elgoiri] for your reviews and comments. v002 is updated and follows 
the suggestions.

> RBF: Check each component length of destination path when add/update mount 
> entry
> 
>
> Key: HDFS-15082
> URL: https://issues.apache.org/jira/browse/HDFS-15082
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15082.001.patch, HDFS-15082.002.patch
>
>
> When adding or updating a mount entry, the length of each component of the 
> destination path could exceed the filesystem path component length limit; see 
> `dfs.namenode.fs-limits.max-component-length` on the NameNode. So we should 
> check the length of each component of the destination path when adding or 
> updating a mount entry on the Router side.






[jira] [Updated] (HDFS-15088) RBF: Correct annotation typo of RouterPermissionChecker#checkPermission

2020-01-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15088:
---
Summary: RBF: Correct annotation typo of 
RouterPermissionChecker#checkPermission  (was: Correct annotation typo of 
RouterPermissionChecker#checkPermission)

> RBF: Correct annotation typo of RouterPermissionChecker#checkPermission
> ---
>
> Key: HDFS-15088
> URL: https://issues.apache.org/jira/browse/HDFS-15088
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Trivial
> Attachments: HDFS-15088.patch
>
>
> Correct annotation typo of RouterPermissionChecker#checkPermission.
> {code:java}
>   /**
>* Whether a mount table entry can be accessed by the current context.
>*
>* @param mountTable
>*  MountTable being accessed
>* @param access
>*  type of action being performed on the cache pool
>* @throws AccessControlException
>*   if mount table cannot be accessed
>*/
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> }
> {code}






[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Attachment: HDFS-15113.002.patch

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch
>
>
> Recently, I met a case where the NameNode was missing blocks after a restart, 
> which is related to HDFS-14997.
> a. During a NameNode restart, it returns the `DNA_REGISTER` command to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which runs 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs first, 
> it is OK. But if the FBR is sent first and the IBRs are cleared afterwards, 
> the blocks received between these two points in time will be missing until 
> the next FBR.






[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013664#comment-17013664
 ] 

Xiaoqiao He commented on HDFS-15113:


Submit v002 with a unit test. Please take another review if you have time. 
Thanks [~elgoiri].

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch
>
>
> Recently, I met a case where the NameNode was missing blocks after a restart, 
> which is related to HDFS-14997.
> a. During a NameNode restart, it returns the `DNA_REGISTER` command to a 
> DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a BR immediately.
> d. Because #reRegister runs asynchronously, we cannot be sure which runs 
> first: sending the FBR or clearing the IBRs. If clearing the IBRs runs first, 
> it is OK. But if the FBR is sent first and the IBRs are cleared afterwards, 
> the blocks received between these two points in time will be missing until 
> the next FBR.






[jira] [Commented] (HDFS-14672) Backport HDFS-12703 to branch-2

2020-01-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014401#comment-17014401
 ] 

Xiaoqiao He commented on HDFS-14672:


Thanks [~Tao Yang] for your reminder. After checking the commit, I found it can 
be cherry-picked to branch-2.8 and branch-2.9 directly. [~xkrogen], would you 
like to help backport HDFS-12703 to these branches? Thanks.

> Backport HDFS-12703 to branch-2
> ---
>
> Key: HDFS-14672
> URL: https://issues.apache.org/jira/browse/HDFS-14672
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 2.10.0
>
> Attachments: HDFS-12703.branch-2.001.patch, 
> HDFS-12703.branch-2.002.patch, HDFS-12703.branch-2.003.patch
>
>
> Currently, the fix for `decommission monitor exception causes namenode 
> fatal error` is only in trunk (branch-3). This JIRA aims to backport this 
> bugfix to branch-2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15115) Namenode crash caused by NPE in BlockPlacementPolicyDefault when dynamically change logger to debug

2020-01-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014426#comment-17014426
 ] 

Xiaoqiao He commented on HDFS-15115:


This case is very similar to HDFS-11827 in my opinion. I am not sure why 
HDFS-14103 reverted these guard statements on trunk; any thoughts?
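
For reference, a minimal sketch of the guard pattern in question (illustrative 
code, not the committed patch): capture the debug decision in the builder 
reference itself, so a logger switched to debug mid-call cannot leave a later 
use site with a null builder.
{code:java}
// Sketch only: guard later uses on the builder reference, not on a fresh
// LOG.isDebugEnabled() check that may flip to true mid-method.
StringBuilder builder = null;
if (LOG.isDebugEnabled()) {
  builder = new StringBuilder("[");
}
// ... target selection logic runs here ...
if (builder != null) {
  builder.append("]");
  LOG.debug("chooseRandom returning: {}", builder);
}
{code}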

> Namenode crash caused by NPE in BlockPlacementPolicyDefault when dynamically 
> change logger to debug
> ---
>
> Key: HDFS-15115
> URL: https://issues.apache.org/jira/browse/HDFS-15115
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: wangzhixiang
>Priority: Major
> Attachments: HDFS-15115.001.patch
>
>
> To get debug info, we dynamically changed the logger of 
> BlockPlacementPolicyDefault to debug while the NameNode was running. However, 
> the NameNode crashed. From the log, we found NPEs in 
> BlockPlacementPolicyDefault.chooseRandom. The *StringBuilder builder* is used 
> 4 times in the chooseRandom method, but it is only initialized at its first 
> use. If the logger is switched to debug after that point, the *builder* at 
> the remaining use sites is *NULL*, which causes the *NPE*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014429#comment-17014429
 ] 

Xiaoqiao He commented on HDFS-15113:


cc [~elgoiri], [~weichiu], [~brahmareddy]: any suggestions here? Thanks.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch
>
>
> Recently I hit a case where the NameNode was missing blocks after a restart; 
> it is related to HDFS-14997.
> a. During a NameNode restart, the NameNode returns the `DNA_REGISTER` command 
> to a DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a full block report (FBR) immediately.
> d. Because #reRegister runs asynchronously, there is no guarantee which runs 
> first: sending the FBR or clearing the IBRs. If the IBRs are cleared first, 
> everything is fine. But if the FBR is sent first and the IBRs are cleared 
> afterwards, the blocks received between those two points are missing until 
> the next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15113:
---
Attachment: HDFS-15113.003.patch

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch
>
>
> Recently I hit a case where the NameNode was missing blocks after a restart; 
> it is related to HDFS-14997.
> a. During a NameNode restart, the NameNode returns the `DNA_REGISTER` command 
> to a DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a full block report (FBR) immediately.
> d. Because #reRegister runs asynchronously, there is no guarantee which runs 
> first: sending the FBR or clearing the IBRs. If the IBRs are cleared first, 
> everything is fine. But if the FBR is sent first and the IBRs are cleared 
> afterwards, the blocks received between those two points are missing until 
> the next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015065#comment-17015065
 ] 

Xiaoqiao He commented on HDFS-15113:


Thanks all for your comments, [~elgoiri], [~weichiu], [~brahmareddy].
To [~elgoiri]:
{quote}In the test should we have the old case and the new one?{quote}
TestBPOfferService#testIBRClearanceForStandbyOnReRegister already covers most 
restart cases, so I added logic just for this corner case. If we need to split 
them, I would like to do that later. Thanks.
To [~brahmareddy]:
{quote}is this have high chance when "dfs.blockreport.initialDelay" is 
configured with "0"{quote}
Exactly. In my experience we do not set it and use the default value 0, so it 
is very easy to reproduce.
For the unit test, the issue can be reproduced by reverting {{BPServiceActor}} 
and adding the following fault injector between scheduling the heartbeat and 
clearing the IBRs.
{code:java}
  DataNodeFaultInjector.get().waitFullBlockReport();
{code}
Thanks a lot.
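
For completeness, a minimal sketch of wiring that injector in a test 
(hypothetical scaffolding; {{waitFullBlockReport()}} is assumed to be the hook 
added by the patch):
{code:java}
// Stall the re-registration thread so the FBR is sent before the IBR
// queue is cleared, reproducing the lost-IBR ordering.
DataNodeFaultInjector old = DataNodeFaultInjector.get();
DataNodeFaultInjector.set(new DataNodeFaultInjector() {
  @Override
  public void waitFullBlockReport() {
    try {
      Thread.sleep(2000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
});
try {
  // ... restart the NameNode and assert no blocks are reported missing ...
} finally {
  DataNodeFaultInjector.set(old);
}
{code}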

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch
>
>
> Recently I hit a case where the NameNode was missing blocks after a restart; 
> it is related to HDFS-14997.
> a. During a NameNode restart, the NameNode returns the `DNA_REGISTER` command 
> to a DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a full block report (FBR) immediately.
> d. Because #reRegister runs asynchronously, there is no guarantee which runs 
> first: sending the FBR or clearing the IBRs. If the IBRs are cleared first, 
> everything is fine. But if the FBR is sent first and the IBRs are cleared 
> afterwards, the blocks received between those two points are missing until 
> the next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14672) Backport HDFS-12703 to branch-2

2020-01-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-14672:
---
Attachment: HDFS-14672.branch-2.8.001.patch

> Backport HDFS-12703 to branch-2
> ---
>
> Key: HDFS-14672
> URL: https://issues.apache.org/jira/browse/HDFS-14672
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 2.10.0
>
> Attachments: HDFS-12703.branch-2.001.patch, 
> HDFS-12703.branch-2.002.patch, HDFS-12703.branch-2.003.patch, 
> HDFS-14672.branch-2.8.001.patch
>
>
> Currently, the fix for `decommission monitor exception cause namenode fatal` 
> is only in trunk (branch-3). This JIRA aims to backport that bugfix to branch-2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14672) Backport HDFS-12703 to branch-2

2020-01-16 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016843#comment-17016843
 ] 

Xiaoqiao He commented on HDFS-14672:


[~Tao Yang], [^HDFS-14672.branch-2.8.001.patch] offers a patch for branch-2.8 
without careful testing, FYI.

> Backport HDFS-12703 to branch-2
> ---
>
> Key: HDFS-14672
> URL: https://issues.apache.org/jira/browse/HDFS-14672
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 2.10.0
>
> Attachments: HDFS-12703.branch-2.001.patch, 
> HDFS-12703.branch-2.002.patch, HDFS-12703.branch-2.003.patch, 
> HDFS-14672.branch-2.8.001.patch
>
>
> Currently, the fix for `decommission monitor exception cause namenode fatal` 
> is only in trunk (branch-3). This JIRA aims to backport that bugfix to branch-2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15115) Namenode crash caused by NPE in BlockPlacementPolicyDefault when dynamically change logger to debug

2020-01-16 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016844#comment-17016844
 ] 

Xiaoqiao He commented on HDFS-15115:


[~wzx513], would you like to check all similar cases in 
#BlockPlacementPolicyDefault and update the patch? It would be better to add a 
UT as well, as [~weichiu] mentioned above. Thanks.

> Namenode crash caused by NPE in BlockPlacementPolicyDefault when dynamically 
> change logger to debug
> ---
>
> Key: HDFS-15115
> URL: https://issues.apache.org/jira/browse/HDFS-15115
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: wangzhixiang
>Priority: Major
> Attachments: HDFS-15115.001.patch
>
>
> To get debug info, we dynamically changed the logger of 
> BlockPlacementPolicyDefault to debug while the NameNode was running. However, 
> the NameNode crashed. From the log, we found NPEs in 
> BlockPlacementPolicyDefault.chooseRandom. The *StringBuilder builder* is used 
> 4 times in the chooseRandom method, but it is only initialized at its first 
> use. If the logger is switched to debug after that point, the *builder* at 
> the remaining use sites is *NULL*, which causes the *NPE*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15113) Missing IBR when NameNode restart if open processCommand async feature

2020-01-16 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016851#comment-17016851
 ] 

Xiaoqiao He commented on HDFS-15113:


[~elgoiri], [~weichiu], [~brahmareddy], any bandwidth to push this issue 
forward? Considering this is a serious bug and some users want to backport it 
to their own internal branches, it is better to fix it ASAP. I will follow up 
on any suggestions. Thanks again.

> Missing IBR when NameNode restart if open processCommand async feature
> --
>
> Key: HDFS-15113
> URL: https://issues.apache.org/jira/browse/HDFS-15113
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch
>
>
> Recently I hit a case where the NameNode was missing blocks after a restart; 
> it is related to HDFS-14997.
> a. During a NameNode restart, the NameNode returns the `DNA_REGISTER` command 
> to a DataNode when it receives certain RPC requests from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
> if (shouldRun()) {
>   // re-retrieve namespace info to make sure that, if the NN
>   // was restarted, we still match its version (HDFS-2120)
>   NamespaceInfo nsInfo = retrieveNamespaceInfo();
>   // and re-register
>   register(nsInfo);
>   scheduler.scheduleHeartbeat();
>   // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>   // for sometime.
>   if (state == HAServiceState.STANDBY || state == 
> HAServiceState.OBSERVER) {
> ibrManager.clearIBRs();
>   }
> }
>   }
> {code}
> c. As we know, #register triggers a full block report (FBR) immediately.
> d. Because #reRegister runs asynchronously, there is no guarantee which runs 
> first: sending the FBR or clearing the IBRs. If the IBRs are cleared first, 
> everything is fine. But if the FBR is sent first and the IBRs are cleared 
> afterwards, the blocks received between those two points are missing until 
> the next FBR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15087) RBF: Balance/Rename across federation namespaces

2020-02-13 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036004#comment-17036004
 ] 

Xiaoqiao He commented on HDFS-15087:


Thanks [~LiJinglun] for your proposal; it is a very interesting and useful 
feature in my opinion. Some minor concerns from me:
1. The saveTree step generates 2 files, TREE-FILE and TREE-META. I don't see 
the advantage over a single meta file carrying all the information. My concern 
is that we could skip the consistency check and other sanity checks if we keep 
only one meta file.
2. Revoking write permission on the source directory/file alone may not be 
enough to ensure the source directory does not change, since some superuser 
actions skip permission checks on read/write ops. +1 for adding extra 
xattributes to refuse write requests (see the sketch after this comment).
{quote}remove all permissions of the source directory and force 
recoverLease()/close all open files. Normal users can't change the source 
directory anymore, both directories and files.{quote}
3. Any consideration for HA mode? Do we need to distribute the meta file(s) to 
both the ANN and the SBN before executing GraftTree, or do both of them 
request the metadata from external storage? And how do we undo it if the ANN 
grafts part of the inode tree and then an HA failover happens for some reason? 
Otherwise the ANN and SBN would not be strongly consistent, in my opinion.
4. Some use cases do not share DNs across federated clusters, as [~ayushtkn] 
mentioned above; does the hard-link solution need to fall back to block 
transfer in that case?
5. The design doc introduces {{Scheduler}} and {{External Storage}} modules. 
What about using the Router to schedule the rename task and NN local storage 
(maybe the same path where the fsimage is persisted) to keep the metadata? 
Then we would not need extra modules and could reduce the maintenance cost.
Thanks again [~LiJinglun]; I have not reviewed the initial patch carefully, so 
please correct me if I am missing something.
About FastCp: we have used it for 3 years, and it is not obviously different 
from DistCp except that it is more efficient, because of hardlinks (FastCp) 
versus transfers (DistCp), in my opinion. If anyone is interested in FastCp as 
one option for this solution, I would like to push it forward again.
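
As a concrete illustration of the xattr fence in point 2, something along 
these lines could freeze the source tree (a minimal sketch; the attribute name 
"user.hfr.readonly" is invented here, and the real patch may use a different 
marker enforced inside the NameNode):
{code:java}
// Hypothetical sketch of marking a rename source read-only with an xattr;
// assumes an initialized Configuration conf.
FileSystem fs = FileSystem.get(conf);
Path src = new Path("/warehouse/db1");
fs.setXAttr(src, "user.hfr.readonly", new byte[] {1});
// The NameNode side would then reject write ops on paths carrying the
// marker, regardless of the caller's permissions.
{code}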

> RBF: Balance/Rename across federation namespaces
> 
>
> Key: HDFS-15087
> URL: https://issues.apache.org/jira/browse/HDFS-15087
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HDFS-15087.initial.patch, HFR_Rename Across Federation 
> Namespaces.pdf
>
>
> The Xiaomi storage team has developed a new feature called HFR (HDFS 
> Federation Rename) that enables us to do balance/rename across federation 
> namespaces. The idea is to first move the meta to the dst NameNode and then 
> link all the replicas. It has been working in our largest production cluster 
> for 2 months. We use it to balance the namespaces. It turns out HFR is fast 
> and flexible. The details can be found in the design doc. 
> Looking forward to a lively discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16627) Improve BPServiceActor#register log to add NameNode address

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16627.

Hadoop Flags: Reviewed
  Resolution: Fixed

Committed to trunk. Thanks [~slfan1989] for your contributions.

> Improve BPServiceActor#register log to add NameNode address
> ---
>
> Key: HDFS-16627
> URL: https://issues.apache.org/jira/browse/HDFS-16627
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When reading the log, I think the NN address information should be added to 
> make the log more complete.
> The log is as follows:
> {code:java}
> 2022-06-06 06:15:32,715 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(819)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> beginning handshake with NN.
> 2022-06-06 06:15:32,717 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(847)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> successfully registered with NN. {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16627) Improve BPServiceActor#register log to add NameNode address

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16627:
---
Summary: Improve BPServiceActor#register log to add NameNode address  (was: 
improve BPServiceActor#register Log Add NN Addr)

> Improve BPServiceActor#register log to add NameNode address
> ---
>
> Key: HDFS-16627
> URL: https://issues.apache.org/jira/browse/HDFS-16627
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When reading the log, I think the NN address information should be added to 
> make the log more complete.
> The log is as follows:
> {code:java}
> 2022-06-06 06:15:32,715 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(819)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> beginning handshake with NN.
> 2022-06-06 06:15:32,717 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(847)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> successfully registered with NN. {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16627) Improve BPServiceActor#register log to add NameNode address

2022-06-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553086#comment-17553086
 ] 

Xiaoqiao He commented on HDFS-16627:


[~slfan1989] BTW, the Fix Version field should not be set when filing a JIRA. 
Generally it is set by the committer after the PR is checked in. Thanks again.

> Improve BPServiceActor#register log to add NameNode address
> ---
>
> Key: HDFS-16627
> URL: https://issues.apache.org/jira/browse/HDFS-16627
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When reading the log, I think the NN address information should be added to 
> make the log more complete.
> The log is as follows:
> {code:java}
> 2022-06-06 06:15:32,715 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(819)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> beginning handshake with NN.
> 2022-06-06 06:15:32,717 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(847)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> successfully registered with NN. {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16609) Fix Flakes Junit Tests that often report timeouts

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16609.

Hadoop Flags: Reviewed
  Resolution: Fixed

Committed to trunk. Thanks [~slfan1989].

> Fix Flakes Junit Tests that often report timeouts
> -
>
> Key: HDFS-16609
> URL: https://issues.apache.org/jira/browse/HDFS-16609
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> While working on HDFS-16590, the JUnit tests often reported errors. I found 
> that one class of failures is timeouts, and these can be avoided by 
> increasing the timeout values.
> The modified methods are as follows:
> 1.org.apache.hadoop.hdfs.TestFileCreation#testServerDefaultsWithMinimalCaching
> {code:java}
> [ERROR] 
> testServerDefaultsWithMinimalCaching(org.apache.hadoop.hdfs.TestFileCreation) 
>  Time elapsed: 7.136 s  <<< ERROR!
> java.util.concurrent.TimeoutException: 
> Timed out waiting for condition. 
> Thread diagnostics: 
> [WARNING] 
> org.apache.hadoop.hdfs.TestFileCreation.testServerDefaultsWithMinimalCaching(org.apache.hadoop.hdfs.TestFileCreation)
> [ERROR]   Run 1: TestFileCreation.testServerDefaultsWithMinimalCaching:277 
> Timeout Timed out ...
> [INFO]   Run 2: PASS{code}
> 2.org.apache.hadoop.hdfs.TestDFSShell#testFilePermissions
> {code:java}
> [ERROR] testFilePermissions(org.apache.hadoop.hdfs.TestDFSShell)  Time 
> elapsed: 30.022 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.dumpThreads(Native Method)
>   at java.lang.Thread.getStackTrace(Thread.java:1549)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.createTimeoutException(FailOnTimeout.java:182)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.getResult(FailOnTimeout.java:177)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.evaluate(FailOnTimeout.java:128)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> [WARNING] 
> org.apache.hadoop.hdfs.TestDFSShell.testFilePermissions(org.apache.hadoop.hdfs.TestDFSShell)
> [ERROR]   Run 1: TestDFSShell.testFilePermissions TestTimedOut test timed out 
> after 3 mil...
> [INFO]   Run 2: PASS {code}
> 3.org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier#testSPSWhenFileHasExcessRedundancyBlocks
> {code:java}
> [ERROR] 
> testSPSWhenFileHasExcessRedundancyBlocks(org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier)
>   Time elapsed: 67.904 s  <<< ERROR!
> java.util.concurrent.TimeoutException: 
> Timed out waiting for condition. 
> [WARNING] 
> org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks(org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier)
> [ERROR]   Run 1: 
> TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks:1379
>  Timeout
> [ERROR]   Run 2: 
> TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks:1379
>  Timeout
> [INFO]   Run 3: PASS {code}
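
The corresponding fix is simply to give these tests a larger budget; for 
example (values illustrative, not necessarily what the patch uses):
{code:java}
// Raise the per-class budget so a slow-but-correct run does not trip it.
@Rule
public Timeout globalTimeout = Timeout.seconds(120);

// Or widen a single method's timeout:
@Test(timeout = 120_000)
public void testFilePermissions() throws Exception {
  // ...
}
{code}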



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16598) Fix DataNode Fs

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16598:
---
Summary: Fix DataNode Fs  (was: All datanodes 
[DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
 are bad. Aborting...)

> Fix DataNode Fs
> ---
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, this bug turned out to be introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]: the client's 
> block GS may be smaller than the DataNode's when pipeline recovery fails.
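
A rough sketch of the generation-stamp sanity check involved (hedged; the 
names follow the surrounding code loosely, and this is not the committed fix):
{code:java}
// Sketch only: refuse the recovery when the client's generation stamp
// lags the on-disk replica's, instead of corrupting the pipeline later.
long replicaGs = replicaInfo.getGenerationStamp();
long clientGs = block.getGenerationStamp();
if (clientGs < replicaGs) {
  throw new IOException("Generation stamp from client " + clientGs
      + " is older than the replica's " + replicaGs + " for " + block);
}
{code}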



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16598) Fix DataNode FsDatasetImpl lock issue without GS checks.

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16598:
---
Summary: Fix DataNode FsDatasetImpl lock issue without GS checks.  (was: 
Fix DataNode Fs)

> Fix DataNode FsDatasetImpl lock issue without GS checks.
> 
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, this bug turned out to be introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]: the client's 
> block GS may be smaller than the DataNode's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16598) Fix DataNode FsDatasetImpl lock issue without GS checks.

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16598:
---
Parent: HDFS-15382
Issue Type: Sub-task  (was: Bug)

> Fix DataNode FsDatasetImpl lock issue without GS checks.
> 
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, this bug turned out to be introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]: the client's 
> block GS may be smaller than the DataNode's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16598) Fix DataNode FsDatasetImpl lock issue without GS checks.

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16598.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your work!

> Fix DataNode FsDatasetImpl lock issue without GS checks.
> 
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a 
> stack like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, this bug turned out to be introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534]: the client's 
> block GS may be smaller than the DataNode's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16628) RBF: Correct target directory when move to trash for kerberos login user.

2022-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16628:
---
Summary: RBF: Correct target directory when move to trash for kerberos 
login user.  (was: RBF: kerberos user remove Non-default namespace data failed)

> RBF: Correct target directory when move to trash for kerberos login user.
> -
>
> Key: HDFS-16628
> URL: https://issues.apache.org/jira/browse/HDFS-16628
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Removing data via the router will fail for a user like 
> username/d...@hadoop.com



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16628) RBF: Correct target directory when move to trash for kerberos login user.

2022-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-16628:
--

Assignee: Xiping Zhang

> RBF: Correct target directory when move to trash for kerberos login user.
> -
>
> Key: HDFS-16628
> URL: https://issues.apache.org/jira/browse/HDFS-16628
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Removing data via the router will fail for a user like 
> username/d...@hadoop.com



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16628) RBF: Correct target directory when move to trash for kerberos login user.

2022-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16628.

Hadoop Flags: Reviewed
Target Version/s: 3.4.0
  Resolution: Fixed

Committed to trunk. Thanks [~zhangxiping].

> RBF: Correct target directory when move to trash for kerberos login user.
> -
>
> Key: HDFS-16628
> URL: https://issues.apache.org/jira/browse/HDFS-16628
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Removing data via the router will fail for a user like 
> username/d...@hadoop.com



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16600) Deadlock on DataNode

2022-06-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16600:
---
Parent: HDFS-15382
Issue Type: Sub-task  (was: Bug)

> Deadlock on DataNode
> 
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534].
> Deadlock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}
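
One plausible reading of the pairing above: the eviction is triggered while 
the block-pool read lock is still held, so the write-lock request amounts to a 
read-to-write upgrade, which ReentrantReadWriteLock does not support. A 
self-contained sketch of that failure mode (illustrative, not the DataNode 
code):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class UpgradeDeadlock {
  public static void main(String[] args) {
    ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    rw.readLock().lock();
    try {
      // Hangs forever: a thread holding the read lock cannot acquire
      // the write lock, since read -> write upgrade is unsupported.
      rw.writeLock().lock();
    } finally {
      rw.readLock().unlock();
    }
  }
}
{code}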



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16600) Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.

2022-06-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16600:
---
Summary: Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.  
(was: Deadlock on DataNode)

> Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.
> -
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534].
> Deadlock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16600) Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.

2022-06-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16600.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your work!

> Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.
> -
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534].
> Deadlock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16600) Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.

2022-06-17 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555638#comment-17555638
 ] 

Xiaoqiao He commented on HDFS-16600:


[~xuzq_zander] BTW, do you deploy this feature on your prod cluster? If so, 
would you mind sharing some performance results versus running without it? 
Although it has been deployed on my internal cluster for over a year and works 
well, I believe the performance results could differ more or less across 
versions (my internal version is based on branch-2.7). Thanks again.

> Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.
> -
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> The UT 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction 
> failed because of a deadlock introduced by 
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534].
> Deadlock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2022-07-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561997#comment-17561997
 ] 

Xiaoqiao He commented on HDFS-15079:


[~xuzq_zander] Thanks for involving me here. I am not a fan of 
`CallerContext`, but considering this is a serious issue, I totally support 
pushing this solution forward to fix it first, then improving it if needed.
Actually, in my practice, I extended the RPC interface to let a superuser 
get/set the ClientId and CallId for the Router; this solution also applies to 
`data locality` and some other similar cases. It has worked fine for more 
than two years.

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Hui Fei
>Priority: Critical
> Attachments: HDFS-15079.001.patch, HDFS-15079.002.patch, 
> UnexpectedOverWriteUT.patch
>
>
>  I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is that a client with RBF (r0, r1) creates an HDFS file via r0, 
> gets an exception, and fails over to r1:
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns the file that had been 
> written into an empty file. This is a critical problem.
> We have encountered this problem: many Hive and Spark jobs run on our 
> cluster, and it occurs occasionally.
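
One client-side mitigation for this particular failure shape (a hedged 
sketch, not the RBF fix itself; assumes an initialized {{FileSystem fs}}): 
create without the OVERWRITE flag, so the delayed 1st create fails with 
FileAlreadyExistsException instead of silently truncating the completed file.
{code:java}
// Sketch only: omit CreateFlag.OVERWRITE so the late 1st create cannot
// truncate the file already completed by the 2nd create + 3rd close.
FSDataOutputStream out = fs.create(
    new Path("/user/foo/part-0"),
    FsPermission.getFileDefault(),
    EnumSet.of(CreateFlag.CREATE),        // no CreateFlag.OVERWRITE
    4096, (short) 3, 134217728L, null);   // bufferSize, replication, blockSize
{code}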



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16657) Changing pool-level lock to volume-level lock for invalidation of blocks

2022-07-12 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566132#comment-17566132
 ] 

Xiaoqiao He commented on HDFS-16657:


[~yuanbo] Thanks for your proposal. IIRC, we have discussed this issue for a 
while; I think it is time to improve it.
For further improvement, we should consider the cost of acquiring the 
volume-level lock: when processing commands to invalidate blocks, for example, 
it is possible to batch them and acquire the lock less often. I am not sure 
whether other flows, such as the heartbeat, are blocked in similar ways.
Anyway, would you like to contribute and improve it? Thanks again.

> Changing pool-level lock to volume-level lock for invalidation of blocks
> 
>
> Key: HDFS-16657
> URL: https://issues.apache.org/jira/browse/HDFS-16657
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Yuanbo Liu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-07-13-10-25-37-383.png, 
> image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Recently we have seen DataNode heartbeats become slow on a very busy 
> cluster; here is the chart:
> !image-2022-07-13-10-25-37-383.png|width=665,height=245!
>  
> After taking a jstack of the DN, we found the DN heartbeat stuck in block 
> invalidation:
> !image-2022-07-13-10-27-01-386.png|width=658,height=308!
> !image-2022-07-13-10-27-44-258.png|width=502,height=325!
> The key code is:
> {code:java}
> // code placeholder
> try {
>   File blockFile = new File(info.getBlockURI());
>   if (blockFile != null && blockFile.getParentFile() == null) {
> errors.add("Failed to delete replica " + invalidBlks[i]
> +  ". Parent not found for block file: " + blockFile);
> continue;
>   }
> } catch(IllegalArgumentException e) {
>   LOG.warn("Parent directory check failed; replica " + info
>   + " is not backed by a local file");
> } {code}
> The DN is trying to locate the parent path of the block file, so there is 
> disk I/O under the pool-level lock. When the disk becomes very busy with 
> high I/O wait, all pending threads are blocked by the pool-level lock and 
> the heartbeat latency rises. We propose to change the pool-level lock to a 
> volume-level lock for block invalidation.
> cc: [~hexiaoqiao] [~Aiphag0] 
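
A self-contained sketch of the proposal (all names below are invented for 
illustration; the real change lives in FsDatasetImpl's lock manager): take a 
per-volume lock once per invalidation batch, so a slow disk stalls only its 
own volume while the heartbeat and other volumes proceed.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class VolumeLockedInvalidator {
  private final Map<String, ReentrantLock> volumeLocks =
      new ConcurrentHashMap<>();

  void invalidate(String volumeId, List<Long> blockIds) {
    ReentrantLock lock =
        volumeLocks.computeIfAbsent(volumeId, v -> new ReentrantLock());
    lock.lock();                       // once per batch, not per block
    try {
      for (long id : blockIds) {
        deleteBlockFile(volumeId, id); // disk I/O now stalls only this volume
      }
    } finally {
      lock.unlock();
    }
  }

  private void deleteBlockFile(String volumeId, long blockId) {
    // placeholder for the actual replica file deletion
  }
}
{code}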



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16655) OIV: print out erasure coding policy name in oiv Delimited output

2022-07-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16655.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~max2049] for your contributions.

> OIV: print out erasure coding policy name in oiv Delimited output
> -
>
> Key: HDFS-16655
> URL: https://issues.apache.org/jira/browse/HDFS-16655
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> By adding the erasure coding policy name to the oiv output, it will help 
> oiv post-analysis to get an overview of all folders/files with a specified 
> EC policy and to apply internal regulation based on this information. In 
> particular, it will be convenient for the platform to calculate the real 
> storage size of EC files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16655) OIV: print out erasure coding policy name in oiv Delimited output

2022-07-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16655:
---
Component/s: erasure-coding
 (was: tools)

> OIV: print out erasure coding policy name in oiv Delimited output
> -
>
> Key: HDFS-16655
> URL: https://issues.apache.org/jira/browse/HDFS-16655
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> By adding the erasure coding policy name to the oiv output, it will help 
> oiv post-analysis to get an overview of all folders/files with a specified 
> EC policy and to apply internal regulation based on this information. In 
> particular, it will be convenient for the platform to calculate the real 
> storage size of EC files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16658) BlockManager should output some logs when logEveryBlock is true.

2022-07-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16658.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander].

> BlockManager should output some logs when logEveryBlock is true.
> 
>
> Key: HDFS-16658
> URL: https://issues.apache.org/jira/browse/HDFS-16658
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> While locating some abnormal cases in our prod environment, I found that 
> BlockManager outputs no log in `addStoredBlock` even though `logEveryBlock` 
> is true.
> I feel that we need to change the log level from DEBUG to INFO.
> {code:java}
> // Some comments here
> private Block addStoredBlock(final BlockInfo block,
>final Block reportedBlock,
>DatanodeStorageInfo storageInfo,
>DatanodeDescriptor delNodeHint,
>boolean logEveryBlock)
>   throws IOException {
> 
>   if (logEveryBlock) {
> blockLog.debug("BLOCK* addStoredBlock: {} is added to {} (size={})",
> node, storedBlock, storedBlock.getNumBytes());
>   }
> ...
>   }
> {code}
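
The proposed change is essentially the same statement at INFO level (sketch):
{code:java}
// Honor logEveryBlock with an INFO message so per-block adds are visible
// without reconfiguring the whole block state change logger to DEBUG.
if (logEveryBlock) {
  blockLog.info("BLOCK* addStoredBlock: {} is added to {} (size={})",
      node, storedBlock, storedBlock.getNumBytes());
}
{code}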



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16704) Datanode return empty response instead of NPE for GetVolumeInfo during restarting

2022-08-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16704.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander].

> Datanode return empty response instead of NPE for GetVolumeInfo during 
> restarting
> -
>
> Key: HDFS-16704
> URL: https://issues.apache.org/jira/browse/HDFS-16704
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> During DataNode startup, I found some NPEs in the logs:
> {code:java}
> Caused by: java.lang.NullPointerException: Storage not yet initialized
>     at 
> org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:899)
>     at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.getVolumeInfo(DataNode.java:3533)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
>     at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:276)
>     at 
> com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:193)
>     at 
> com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:175)
>  {code}
> This is because the DataNode's storage is not yet initialized when we try to 
> get the DataNode metrics; the related code is below:
> {code:java}
> @Override // DataNodeMXBean
> public String getVolumeInfo() {
>   Preconditions.checkNotNull(data, "Storage not yet initialized");
>   return JSON.toString(data.getVolumeInfoMap());
> } {code}
> The logic is OK, but I feel the more reasonable behaviour is to return an 
> empty response instead of an NPE, because the InfoServer is started before 
> initBlockPool.
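> A minimal sketch of that behaviour (returning an empty string while the 
> storage reference is still null is one option):
> {code:java}
> @Override // DataNodeMXBean
> public String getVolumeInfo() {
>   if (data == null) {
>     // Storage not yet initialized; report nothing instead of throwing.
>     return "";
>   }
>   return JSON.toString(data.getVolumeInfoMap());
> } {code}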



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16717) Replace NPE with IOException in DataNode.class

2022-08-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16717.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander]!

> Replace NPE with IOException in DataNode.class
> --
>
> Key: HDFS-16717
> URL: https://issues.apache.org/jira/browse/HDFS-16717
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the current logic, if the storage is not yet initialized, an NPE is thrown 
> in DataNode.class. Developers and SREs are very sensitive to NPEs, so I feel 
> we can use an IOException instead of an NPE when the storage is not yet 
> initialized.
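> A minimal sketch of the idea (the helper name below is illustrative):
> {code:java}
> private void checkStorageInitialized() throws IOException {
>   if (data == null) {
>     // Report a plain IOException to the caller instead of a bare NPE.
>     throw new IOException("Storage not yet initialized.");
>   }
> }
> {code}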



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16735) Reduce the number of HeartbeatManager loops

2022-08-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16735.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~zhangshuyan] for your contributions!

> Reduce the number of HeartbeatManager loops
> ---
>
> Key: HDFS-16735
> URL: https://issues.apache.org/jira/browse/HDFS-16735
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> HeartbeatManager only processes one dead datanode (and failed storage) per 
> round in heartbeatCheck(); that is to say, if there are ten failed storages, 
> all datanode states need to be scanned ten times, which is unnecessary and a 
> waste of resources. This patch makes the number of bad storages processed per 
> scan configurable, as sketched below.
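> A rough sketch of the configurable loop (the property name below is 
> illustrative, not necessarily the one defined by the patch):
> {code:java}
> final int maxPerScan = conf.getInt(
>     "dfs.namenode.heartbeat.max-failed-storages-per-scan", 1); // hypothetical key
> int processed = 0;
> for (DatanodeStorageInfo failedStorage : failedStorages) {
>   if (processed++ >= maxPerScan) {
>     break; // leave the remainder for the next heartbeatCheck() round
>   }
>   blockManager.removeBlocksAssociatedTo(failedStorage);
> }
> {code}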



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16593) Correct inaccurate BlocksRemoved metric on DataNode side

2022-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16593.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk.

> Correct inaccurate BlocksRemoved metric on DataNode side
> 
>
> Key: HDFS-16593
> URL: https://issues.apache.org/jira/browse/HDFS-16593
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When tracing the root cause of a production issue, I found that the 
> BlocksRemoved metric on the Datanode side was inaccurate.
> {code:java}
> case DatanodeProtocol.DNA_INVALIDATE:
>   //
>   // Some local block(s) are obsolete and can be 
>   // safely garbage-collected.
>   //
>   Block toDelete[] = bcmd.getBlocks();
>   try {
> // using global fsdataset
> dn.getFSDataset().invalidate(bcmd.getBlockPoolId(), toDelete);
>   } catch(IOException e) {
> // Exceptions caught here are not expected to be disk-related.
> throw e;
>   }
>   dn.metrics.incrBlocksRemoved(toDelete.length);
>   break;
> {code}
> This is because even if the invalidate method throws an exception, some blocks 
> may have been successfully deleted internally.
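> A sketch of one way to keep the metric accurate (invalidating block by block 
> is illustrative; a patch may instead count inside invalidate itself):
> {code:java}
> case DatanodeProtocol.DNA_INVALIDATE:
>   Block toDelete[] = bcmd.getBlocks();
>   int removed = 0;
>   IOException failure = null;
>   for (Block b : toDelete) {
>     try {
>       dn.getFSDataset().invalidate(bcmd.getBlockPoolId(), new Block[]{b});
>       removed++;
>     } catch (IOException e) {
>       failure = e; // remember the failure but keep deleting the rest
>     }
>   }
>   dn.metrics.incrBlocksRemoved(removed); // credit only successful deletions
>   if (failure != null) {
>     throw failure;
>   }
>   break;
> {code}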



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16781) Can't the setReplication method set replicas for a whole folder?

2022-09-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16781.

Resolution: Invalid

[~zdltvxq] Thanks for your report. IIRC, setReplication api is kept for long 
time and there is no other interface to recursive setrep (include the latest 
release version). IMO it could prevent from overload of NameNode to process 
many files at one request. Another side, for Shell, it also get all files and 
request setReplication one by one.
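
A client-side sketch of that Shell behaviour (assuming an already-open 
FileSystem fs and a target replication of 3):
{code:java}
RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/dirs"), true);
while (it.hasNext()) {
  // listFiles(..., true) walks the tree and yields files only; one
  // setReplication RPC is issued per file, just like the Shell does.
  fs.setReplication(it.next().getPath(), (short) 3);
}
{code}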

BTW, two recommendations: a) use English when filing a new ticket; b) please 
send mail to u...@hadoop.apache.org or user...@hadoop.apache.org (the latter is 
for Chinese) first when you meet any issue, to reach out. Generally, tickets 
here are for bugfixes/features/improvements by devs.

Thanks again. 

> Can't the setReplication method set replicas for a whole folder?
> --
>
> Key: HDFS-16781
> URL: https://issues.apache.org/jira/browse/HDFS-16781
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfs
>Affects Versions: 2.6.0
> Environment: org.apache.hadoop.fs.FileSystem
>Reporter: zdl
>Priority: Blocker
>
> org.apache.hadoop.fs.FileSystem has a setReplication method that sets the 
> replication factor of a target file, but it only takes effect on a file; if a 
> directory is specified, nothing happens.
> When a directory is specified, I would like replication to be set for all 
> files under it, including files in subdirectories. In fact, the command line 
> hadoop fs -setrep /dirs does achieve this, so why is it not possible through 
> this Java API?
> Or has this issue already been solved in some newer version? Please advise, 
> thanks.
> I want to set replications for a whole folder's files via 
> org.apache.hadoop.fs.FileSystem's setReplication function, while now in 
> v2.6.0 it can only be set on a file.
> Is there any solution, or a later version that solves it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16781) Can't the setReplication method set replicas for a whole folder?

2022-09-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16781:
---
Priority: Minor  (was: Blocker)

> Can't the setReplication method set replicas for a whole folder?
> --
>
> Key: HDFS-16781
> URL: https://issues.apache.org/jira/browse/HDFS-16781
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfs
>Affects Versions: 2.6.0
> Environment: org.apache.hadoop.fs.FileSystem
>Reporter: zdl
>Priority: Minor
>
> org.apache.hadoop.fs.FileSystem has a setReplication method that sets the 
> replication factor of a target file, but it only takes effect on a file; if a 
> directory is specified, nothing happens.
> When a directory is specified, I would like replication to be set for all 
> files under it, including files in subdirectories. In fact, the command line 
> hadoop fs -setrep /dirs does achieve this, so why is it not possible through 
> this Java API?
> Or has this issue already been solved in some newer version? Please advise, 
> thanks.
> I want to set replications for a whole folder's files via 
> org.apache.hadoop.fs.FileSystem's setReplication function, while now in 
> v2.6.0 it can only be set on a file.
> Is there any solution, or a later version that solves it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16781) Can't the setReplication method set replicas for a whole folder?

2022-09-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17609358#comment-17609358
 ] 

Xiaoqiao He commented on HDFS-16781:


Great, welcome, and looking forward to more discussion and contribution.

> Can't the setReplication method set replicas for a whole folder?
> --
>
> Key: HDFS-16781
> URL: https://issues.apache.org/jira/browse/HDFS-16781
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfs
>Affects Versions: 2.6.0
> Environment: org.apache.hadoop.fs.FileSystem
>Reporter: zdl
>Priority: Minor
>
> org.apache.hadoop.fs.FileSystem has a setReplication method that sets the 
> replication factor of a target file, but it only takes effect on a file; if a 
> directory is specified, nothing happens.
> When a directory is specified, I would like replication to be set for all 
> files under it, including files in subdirectories. In fact, the command line 
> hadoop fs -setrep /dirs does achieve this, so why is it not possible through 
> this Java API?
> Or has this issue already been solved in some newer version? Please advise, 
> thanks.
> I want to set replications for a whole folder's files via 
> org.apache.hadoop.fs.FileSystem's setReplication function, while now in 
> v2.6.0 it can only be set on a file.
> Is there any solution, or a later version that solves it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16798) SerialNumberMap should decrease current counter if the item exists

2022-10-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16798:
---
Summary: SerialNumberMap should decrease current counter if the item exists  
(was: SerialNumberMap should decrease the current if the old already exist)

> SerialNumberMap should decrease current counter if the item exists
> -
>
> Key: HDFS-16798
> URL: https://issues.apache.org/jira/browse/HDFS-16798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> While looking into some code related to XATTR, I found there is a bug in 
> SerialNumberMap, as below:
> {code:java}
> public int get(T t) {
>   if (t == null) {
> return 0;
>   }
>   Integer sn = t2i.get(t);
>   if (sn == null) {
> sn = current.getAndIncrement();
> if (sn > max) {
>   current.getAndDecrement();
>   throw new IllegalStateException(name + ": serial number map is full");
> }
> Integer old = t2i.putIfAbsent(t, sn);
> if (old != null) {
>   // here: if the old is not null, we should decrease the current value.
>   return old;
> }
> i2t.put(sn, t);
>   }
>   return sn;
> } {code}
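> A minimal sketch of the fix (give the unused serial number back before 
> returning the existing one):
> {code:java}
> Integer old = t2i.putIfAbsent(t, sn);
> if (old != null) {
>   current.getAndDecrement(); // roll back the counter reserved above
>   return old;
> }
> {code}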
> This bug only causes the capacity of SerialNumberMap to be less than expected; 
> there is no other impact.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16798) SerialNumberMap should decrease current counter if the item exists

2022-10-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16798.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> SerialNumberMap should decrease current counter if the item exists
> -
>
> Key: HDFS-16798
> URL: https://issues.apache.org/jira/browse/HDFS-16798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> While looking into some code related to XATTR, I found there is a bug in 
> SerialNumberMap, as below:
> {code:java}
> public int get(T t) {
>   if (t == null) {
> return 0;
>   }
>   Integer sn = t2i.get(t);
>   if (sn == null) {
> sn = current.getAndIncrement();
> if (sn > max) {
>   current.getAndDecrement();
>   throw new IllegalStateException(name + ": serial number map is full");
> }
> Integer old = t2i.putIfAbsent(t, sn);
> if (old != null) {
>   // here: if the old is not null, we should decrease the current value.
>   return old;
> }
> i2t.put(sn, t);
>   }
>   return sn;
> } {code}
> This bug only causes the capacity of SerialNumberMap to be less than expected; 
> there is no other impact.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16783) Remove the redundant lock in deepCopyReplica

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16783:
---
Parent: HDFS-15382
Issue Type: Sub-task  (was: Improvement)

> Remove the redundant lock in deepCopyReplica
> 
>
> Key: HDFS-16783
> URL: https://issues.apache.org/jira/browse/HDFS-16783
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> When patching the fine-grained locking of the datanode, I found there is a 
> redundant lock in deepCopyReplica; maybe we can remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16783) Remove the redundant lock in deepCopyReplica and getFinalizedBlocks

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16783:
---
Summary: Remove the redundant lock in deepCopyReplica and 
getFinalizedBlocks  (was: Remove the redundant lock in deepCopyReplica)

> Remove the redundant lock in deepCopyReplica and getFinalizedBlocks
> ---
>
> Key: HDFS-16783
> URL: https://issues.apache.org/jira/browse/HDFS-16783
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> When patching the fine-grained locking of the datanode, I found there is a 
> redundant lock in deepCopyReplica; maybe we can remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16783) Remove the redundant lock in deepCopyReplica and getFinalizedBlocks

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16783.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove the redundant lock in deepCopyReplica and getFinalizedBlocks
> ---
>
> Key: HDFS-16783
> URL: https://issues.apache.org/jira/browse/HDFS-16783
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When patching the fine-grained locking of the datanode, I found there is a 
> redundant lock in deepCopyReplica; maybe we can remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16787) Remove the redundant lock in DataSetLockManager#removeLock.

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16787:
---
Summary: Remove the redundant lock in DataSetLockManager#removeLock.  (was: 
Remove redundant lock in DataSetLockManager#removeLock in datanode.)

> Remove the redundant lock in DataSetLockManager#removeLock.
> ---
>
> Key: HDFS-16787
> URL: https://issues.apache.org/jira/browse/HDFS-16787
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> While patching the datanode fine-grained locking, I found there is a redundant 
> lock in DataSetLockManager#removeLock; the code is as below:
> {code:java}
> @Override
> public void removeLock(LockLevel level, String... resources) {
>   String lockName = generateLockName(level, resources);
>   try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
> // Here, this lock is redundant.
> lock.lock();
> lockMap.removeLock(lockName);
>   }
> } {code}
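> The fix is presumably just dropping the inner lock() call, since writeLock() 
> already acquires the lock for the try-with-resources scope (sketch):
> {code:java}
> @Override
> public void removeLock(LockLevel level, String... resources) {
>   String lockName = generateLockName(level, resources);
>   try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
>     lockMap.removeLock(lockName);
>   }
> } {code}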



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16787) Remove the redundant lock in DataSetLockManager#removeLock.

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16787:
---
Parent: HDFS-15382
Issue Type: Sub-task  (was: Improvement)

> Remove the redundant lock in DataSetLockManager#removeLock.
> ---
>
> Key: HDFS-16787
> URL: https://issues.apache.org/jira/browse/HDFS-16787
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> While patching the datanode fine-grained locking, I found there is a redundant 
> lock in DataSetLockManager#removeLock; the code is as below:
> {code:java}
> @Override
> public void removeLock(LockLevel level, String... resources) {
>   String lockName = generateLockName(level, resources);
>   try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
> // Here, this lock is redundant.
> lock.lock();
> lockMap.removeLock(lockName);
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16787) Remove the redundant lock in DataSetLockManager#removeLock.

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16787.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove the redundant lock in DataSetLockManager#removeLock.
> ---
>
> Key: HDFS-16787
> URL: https://issues.apache.org/jira/browse/HDFS-16787
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> While patching the datanode fine-grained locking, I found there is a redundant 
> lock in DataSetLockManager#removeLock; the code is as below:
> {code:java}
> @Override
> public void removeLock(LockLevel level, String... resources) {
>   String lockName = generateLockName(level, resources);
>   try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
> // Here, this lock is redundant.
> lock.lock();
> lockMap.removeLock(lockName);
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10453) ReplicationMonitor thread could get stuck for a long time due to the race between replication and delete of the same file in a large cluster.

2022-10-26 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624847#comment-17624847
 ] 

Xiaoqiao He commented on HDFS-10453:


[~yuyanlei] It works fine in my internal cluster. It has been checked into 
trunk and other active branches. Thanks.

> ReplicationMonitor thread could get stuck for a long time due to the race 
> between replication and delete of the same file in a large cluster.
> ---
>
> Key: HDFS-10453
> URL: https://issues.apache.org/jira/browse/HDFS-10453
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.1, 2.5.2, 2.7.1, 2.6.4
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.3
>
> Attachments: HDFS-10453-branch-2.001.patch, 
> HDFS-10453-branch-2.003.patch, HDFS-10453-branch-2.7.004.patch, 
> HDFS-10453-branch-2.7.005.patch, HDFS-10453-branch-2.7.006.patch, 
> HDFS-10453-branch-2.7.007.patch, HDFS-10453-branch-2.7.008.patch, 
> HDFS-10453-branch-2.7.009.patch, HDFS-10453-branch-2.8.001.patch, 
> HDFS-10453-branch-2.8.002.patch, HDFS-10453-branch-2.9.001.patch, 
> HDFS-10453-branch-2.9.002.patch, HDFS-10453-branch-3.0.001.patch, 
> HDFS-10453-branch-3.0.002.patch, HDFS-10453-trunk.001.patch, 
> HDFS-10453-trunk.002.patch, HDFS-10453.001.patch
>
>
> The ReplicationMonitor thread could get stuck for a long time and, with low 
> probability, lose data. Consider the typical scenario:
> (1) create and close a file with the default replication (3);
> (2) increase the replication of the file (to 10);
> (3) delete the file while ReplicationMonitor is scheduling blocks belonging 
> to that file for replication.
> If the ReplicationMonitor gets stuck, the NameNode prints logs like:
> {code:xml}
> 2016-04-19 10:20:48,083 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> ..
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough 
> replicas: expected size is 7 but only 0 storage types can be selected 
> (replication=10, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK, 
> DISK, DISK, DISK, DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) All required storage types are unavailable:  
> unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> This is because two threads (#NameNodeRpcServer and #ReplicationMonitor) 
> process the same block at the same moment.
> (1) ReplicationMonitor#computeReplicationWorkForBlocks gets blocks to 
> replicate and leaves the global lock.
> (2) FSNamesystem#delete is invoked to delete the blocks, then clears the 
> references in blocksmap, neededReplications, etc. The block's NumBytes is set 
> to NO_ACK (Long.MAX_VALUE), which indicates that the block deletion does not 
> need an explicit ACK from the node.
> (3) ReplicationMonitor#computeReplicationWorkForBlocks continues to 
> chooseTargets for the same blocks, and no node is selected after traversing 
> the whole cluster, because no candidate satisfies the goodness criteria (no 
> remaining space can reach the required size Long.MAX_VALUE).
> During stage (3) the ReplicationMonitor is stuck for a long time, especially 
> in a large cluster. invalidateBlocks & neededReplications keep growing and 
> are never consumed; in the worst case data is lost.
> This can mostly be avoided by skipping chooseTarget for
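>
> A sketch of that skip, based on the NO_ACK marker described in step (2):
> {code:java}
> // Inside ReplicationMonitor#computeReplicationWorkForBlocks, before chooseTargets:
> if (block.getNumBytes() == BlockCommand.NO_ACK) {
>   // The block was deleted concurrently; no datanode can ever satisfy a
>   // required size of Long.MAX_VALUE, so skip scheduling it.
>   continue;
> }
> {code}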

[jira] [Commented] (HDFS-16836) StandbyCheckpointer can still trigger rollback fs image after RU is finalized

2022-11-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631431#comment-17631431
 ] 

Xiaoqiao He commented on HDFS-16836:


[~Lei Yang] Great catch and thanks for your report.
{quote}doCheckpoint() failed.
{quote}
Do you mean doCheckpoint failed but did not throw an exception or error, so 
needRollbackImage was then never set back to false? Would you mind submitting 
a PR to fix it following this guide: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute Thanks.

> StandbyCheckpointer can still trigger rollback fs image after RU is finalized
> -
>
> Key: HDFS-16836
> URL: https://issues.apache.org/jira/browse/HDFS-16836
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Lei Yang
>Priority: Major
>
> StandbyCheckpointer triggers a rollback fsimage when a rolling upgrade (RU) 
> is started.
> When RU is started, a flag (needRollbackImage) is set to true during edit log 
> replay, and it only gets reset to false when doCheckpoint() succeeds.
> Consider the following scenario:
>  # Start RU, needRollbackImage is set to true.
>  # doCheckpoint() failed.
>  # RU is finalized.
>  # namesystem.getFSImage().hasRollbackFSImage() is always false since 
> rollback image cannot be generated once RU is over.
>  # needRollbackImage was never set to false.
>  # Checkpoints threshold(1m txns) and period(1hr) are not honored.
> {code:java}
> StandbyCheckpointer:
> void doWork() {
>  
>   doCheckpoint();
>   // reset needRollbackCheckpoint to false only when we finish a ckpt
>   // for rollback image
>   if (needRollbackCheckpoint
>   && namesystem.getFSImage().hasRollbackFSImage()) {
> namesystem.setCreatedRollbackImages(true);
> namesystem.setNeedRollbackFsImage(false);
>   }
>   lastCheckpointTime = now;
> } {code}
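> One possible remedy (a sketch, not necessarily the committed fix; 
> isRollingUpgrade() stands in for whatever RU-state check fits here):
> {code:java}
> if (needRollbackCheckpoint && !namesystem.isRollingUpgrade()) {
>   // RU was finalized before a rollback image could be produced; clear the
>   // flag so the normal threshold/period checkpointing resumes.
>   namesystem.setNeedRollbackFsImage(false);
> }
> {code}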



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16785) Avoid holding write lock to improve performance when adding volume.

2022-11-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16785:
---
Summary: Avoid holding write lock to improve performance when adding volume.   
(was: DataNode hold BP write lock to scan disk)

> Avoid holding write lock to improve performance when adding volume. 
> -
>
> Key: HDFS-16785
> URL: https://issues.apache.org/jira/browse/HDFS-16785
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> When patching the fine-grained locking of the datanode, I found that 
> `addVolume` holds the write side of the BP lock while scanning the new volume 
> to collect its blocks. If we try to add one full volume that was fixed 
> offline before, it will hold the write lock for a long time.
> The related code is as below:
> {code:java}
> for (final NamespaceInfo nsInfo : nsInfos) {
>   String bpid = nsInfo.getBlockPoolID();
>   try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> fsVolume.addBlockPool(bpid, this.conf, this.timer);
> fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker);
>   } catch (IOException e) {
> LOG.warn("Caught exception when adding " + fsVolume +
> ". Will throw later.", e);
> exceptions.add(e);
>   }
> } {code}
> And I noticed that this lock was added by HDFS-15382, which means this logic 
> was not under the lock before.
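> A sketch of the direction (do the expensive scan outside the lock, then take 
> the BP write lock only for the cheap merge; the mergeAll call below is 
> illustrative):
> {code:java}
> for (final NamespaceInfo nsInfo : nsInfos) {
>   String bpid = nsInfo.getBlockPoolID();
>   // Expensive disk scan into a private temporary map; no lock held here.
>   fsVolume.addBlockPool(bpid, this.conf, this.timer);
>   fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker);
>   try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, bpid)) {
>     // Only merging the scan result into the shared map needs the write lock.
>     volumeMap.mergeAll(tempVolumeMap);
>   }
> } {code}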



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16785) Avoid holding write lock to improve performance when adding volume.

2022-11-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16785:
---
Parent: HDFS-15382
Issue Type: Sub-task  (was: Improvement)

> Avoid holding write lock to improve performance when adding volume. 
> -
>
> Key: HDFS-16785
> URL: https://issues.apache.org/jira/browse/HDFS-16785
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> When patching the fine-grained locking of the datanode, I found that 
> `addVolume` holds the write side of the BP lock while scanning the new volume 
> to collect its blocks. If we try to add one full volume that was fixed 
> offline before, it will hold the write lock for a long time.
> The related code is as below:
> {code:java}
> for (final NamespaceInfo nsInfo : nsInfos) {
>   String bpid = nsInfo.getBlockPoolID();
>   try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> fsVolume.addBlockPool(bpid, this.conf, this.timer);
> fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker);
>   } catch (IOException e) {
> LOG.warn("Caught exception when adding " + fsVolume +
> ". Will throw later.", e);
> exceptions.add(e);
>   }
> } {code}
> And I noticed that this lock was added by HDFS-15382, which means this logic 
> was not under the lock before.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16785) Avoid holding write lock to improve performance when adding volume.

2022-11-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16785.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your work.

> Avoid holding write lock to improve performance when adding volume. 
> -
>
> Key: HDFS-16785
> URL: https://issues.apache.org/jira/browse/HDFS-16785
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When patching the fine-grained locking of the datanode, I found that 
> `addVolume` holds the write side of the BP lock while scanning the new volume 
> to collect its blocks. If we try to add one full volume that was fixed 
> offline before, it will hold the write lock for a long time.
> The related code is as below:
> {code:java}
> for (final NamespaceInfo nsInfo : nsInfos) {
>   String bpid = nsInfo.getBlockPoolID();
>   try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> fsVolume.addBlockPool(bpid, this.conf, this.timer);
> fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker);
>   } catch (IOException e) {
> LOG.warn("Caught exception when adding " + fsVolume +
> ". Will throw later.", e);
> exceptions.add(e);
>   }
> } {code}
> And I noticed that this lock was added by HDFS-15382, which means this logic 
> was not under the lock before.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16836) StandbyCheckpointer can still trigger rollback fs image after RU is finalized

2022-11-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634144#comment-17634144
 ] 

Xiaoqiao He commented on HDFS-16836:


{quote}I mean doCheckpoint can fail and throw exception hence needRollbackImage 
is never reset to false and can leak after RU is done.
{quote}
Thanks. IIUC, when a rollback is triggered, `needRollbackCheckpoint` is set to 
false only after a checkpoint for the rollback image finishes, as the code 
comment says; if we meet an exception and set it back early, it will never 
save a rollback image, right? Not sure if this is the right way.

> StandbyCheckpointer can still trigger rollback fs image after RU is finalized
> -
>
> Key: HDFS-16836
> URL: https://issues.apache.org/jira/browse/HDFS-16836
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Lei Yang
>Priority: Major
>  Labels: pull-request-available
>
> StandbyCheckpointer triggers a rollback fsimage when a rolling upgrade (RU) 
> is started.
> When RU is started, a flag (needRollbackImage) is set to true during edit log 
> replay, and it only gets reset to false when doCheckpoint() succeeds.
> Consider the following scenario:
>  # Start RU, needRollbackImage is set to true.
>  # doCheckpoint() failed.
>  # RU is finalized.
>  # namesystem.getFSImage().hasRollbackFSImage() is always false since 
> rollback image cannot be generated once RU is over.
>  # needRollbackImage was never set to false.
>  # Checkpoints threshold(1m txns) and period(1hr) are not honored.
> {code:java}
> StandbyCheckpointer:
> void doWork() {
>  
>   doCheckpoint();
>   // reset needRollbackCheckpoint to false only when we finish a ckpt
>   // for rollback image
>   if (needRollbackCheckpoint
>   && namesystem.getFSImage().hasRollbackFSImage()) {
> namesystem.setCreatedRollbackImages(true);
> namesystem.setNeedRollbackFsImage(false);
>   }
>   lastCheckpointTime = now;
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16836) StandbyCheckpointer can still trigger rollback fs image after RU is finalized

2022-11-14 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634144#comment-17634144
 ] 

Xiaoqiao He edited comment on HDFS-16836 at 11/15/22 4:10 AM:
--

{quote}I mean doCheckpoint can fail and throw exception hence needRollbackImage 
is never reset to false and can leak after RU is done.
{quote}
Thanks. IIUC, when a rollback is triggered, `needRollbackCheckpoint` is set to 
false only after a checkpoint for the rollback image finishes, as the code 
comment says; if we meet an exception and set it back early, it will never 
save a rollback image, right? Not sure if your proposal is the right way.


was (Author: hexiaoqiao):
{quote}I mean doCheckpoint can fail and throw exception hence needRollbackImage 
is never reset to false and can leak after RU is done.
{quote}
Thanks. IIUC, when a rollback is triggered, `needRollbackCheckpoint` is set to 
false only after a checkpoint for the rollback image finishes, as the code 
comment says; if we meet an exception and set it back early, it will never 
save a rollback image, right? Not sure if this is the right way.

> StandbyCheckpointer can still trigger rollback fs image after RU is finalized
> -
>
> Key: HDFS-16836
> URL: https://issues.apache.org/jira/browse/HDFS-16836
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Lei Yang
>Priority: Major
>  Labels: pull-request-available
>
> StandbyCheckpointer triggers a rollback fsimage when a rolling upgrade (RU) 
> is started.
> When RU is started, a flag (needRollbackImage) is set to true during edit log 
> replay, and it only gets reset to false when doCheckpoint() succeeds.
> Consider the following scenario:
>  # Start RU, needRollbackImage is set to true.
>  # doCheckpoint() failed.
>  # RU is finalized.
>  # namesystem.getFSImage().hasRollbackFSImage() is always false since 
> rollback image cannot be generated once RU is over.
>  # needRollbackImage was never set to false.
>  # Checkpoints threshold(1m txns) and period(1hr) are not honored.
> {code:java}
> StandbyCheckpointer:
> void doWork() {
>  
>   doCheckpoint();
>   // reset needRollbackCheckpoint to false only when we finish a ckpt
>   // for rollback image
>   if (needRollbackCheckpoint
>   && namesystem.getFSImage().hasRollbackFSImage()) {
> namesystem.setCreatedRollbackImages(true);
> namesystem.setNeedRollbackFsImage(false);
>   }
>   lastCheckpointTime = now;
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16836) StandbyCheckpointer can still trigger rollback fs image after RU is finalized

2022-11-15 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634619#comment-17634619
 ] 

Xiaoqiao He commented on HDFS-16836:


{quote}What we have seen is #1 succeeded but #2 failed so 
needRollbackCheckpoint is never set back to false and all the subsequent 
checkpointings are just continuously triggering rollback fsimage for RU even 
after RU is finalized. This bypasses the checkpoint period and threshold 
check.{quote}
Thanks for the detailed explanation. It makes sense to me.

> StandbyCheckpointer can still trigger rollback fs image after RU is finalized
> -
>
> Key: HDFS-16836
> URL: https://issues.apache.org/jira/browse/HDFS-16836
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Lei Yang
>Assignee: Lei Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> StandbyCheckpointer triggers a rollback fsimage when a rolling upgrade (RU) 
> is started.
> When RU is started, a flag (needRollbackImage) is set to true during edit log 
> replay, and it only gets reset to false when doCheckpoint() succeeds.
> Consider the following scenario:
>  # Start RU, needRollbackImage is set to true.
>  # doCheckpoint() failed.
>  # RU is finalized.
>  # namesystem.getFSImage().hasRollbackFSImage() is always false since 
> rollback image cannot be generated once RU is over.
>  # needRollbackImage was never set to false.
>  # Checkpoints threshold(1m txns) and period(1hr) are not honored.
> {code:java}
> StandbyCheckpointer:
> void doWork() {
>  
>   doCheckpoint();
>   // reset needRollbackCheckpoint to false only when we finish a ckpt
>   // for rollback image
>   if (needRollbackCheckpoint
>   && namesystem.getFSImage().hasRollbackFSImage()) {
> namesystem.setCreatedRollbackImages(true);
> namesystem.setNeedRollbackFsImage(false);
>   }
>   lastCheckpointTime = now;
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-29 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641058#comment-17641058
 ] 

Xiaoqiao He commented on HDFS-16855:


[~dingshun] Thanks for your catch. Sorry, I didn't get the relationship between 
#addBlockPool and #replicas here. Would you mind adding some information about 
the deadlock? Thanks.

> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained locking, we found that the datanode 
> couldn't start (maybe a deadlock happened in addBlockPool), so we can remove it.
> {code:java}
> // getspaceused classname
> <property>
>   <name>fs.getspaceused.classname</name>
>   <value>
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>   </value>
> </property> {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set<String> vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer<Iterator<ReplicaInfo>> consumer) {
>   LightWeightResizableGSet<Block, ReplicaInfo> m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> Because it is not the same thread, the write lock cannot be downgraded to a 
> read lock:
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
> final Map<FsVolumeImpl, IOException> unhealthyDataDirs =
>   new ConcurrentHashMap<>();
> List<Thread> blockPoolAddingThreads = new ArrayList<>();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641171#comment-17641171
 ] 

Xiaoqiao He commented on HDFS-16855:


[~dingshun] Thanks for the detailed explanation. IIRC, at the beginning of the 
DataNode fine-grained locking work it indeed could cause a deadlock here, 
hence it was fixed in the following PRs. Sorry, I can't find the related PR 
now. Would you mind checking the logic on branch trunk? More feedback is 
welcome if you meet any issue. Thanks again. cc [~Aiphag0] Any more suggestions?

> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained locking, we found that the datanode 
> couldn't start (maybe a deadlock happened in addBlockPool), so we can remove it.
> {code:java}
> // getspaceused classname
> <property>
>   <name>fs.getspaceused.classname</name>
>   <value>
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>   </value>
> </property> {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set<String> vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer<Iterator<ReplicaInfo>> consumer) {
>   LightWeightResizableGSet<Block, ReplicaInfo> m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> Because it is not the same thread, the write lock cannot be downgraded to a 
> read lock:
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
> final Map<FsVolumeImpl, IOException> unhealthyDataDirs =
>   new ConcurrentHashMap<>();
> List<Thread> blockPoolAddingThreads = new ArrayList<>();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16868) Audit log duplicate problem when an ACE occurs in FSNamesystem.

2022-12-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645937#comment-17645937
 ] 

Xiaoqiao He commented on HDFS-16868:


[~chino71] Great catch! Thanks for your report. Would you like to try to fix it?

> Audit log duplicate problem when an ACE occurs in FSNamesystem.
> ---
>
> Key: HDFS-16868
> URL: https://issues.apache.org/jira/browse/HDFS-16868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Beibei Zhao
>Priority: Major
>  Labels: pull-request-available
>
> checkSuperuserPrivilege calls logAuditEvent and throws the ACE when an 
> AccessControlException occurs.
> {code:java}
>   // This method logs operationName without super user privilege.
>   // It should be called without holding FSN lock.
>   void checkSuperuserPrivilege(String operationName, String path)
>   throws IOException {
> if (isPermissionEnabled) {
>   try {
> FSPermissionChecker.setOperationType(operationName);
> FSPermissionChecker pc = getPermissionChecker();
> pc.checkSuperuserPrivilege(path);
>   } catch(AccessControlException ace){
> logAuditEvent(false, operationName, path);
> throw ace;
>   }
> }
>   }
> {code}
> Its callers, like metaSave, invoke it like this: 
> {code:java}
>   /**
>* Dump all metadata into specified file
>* @param filename
>*/
>   void metaSave(String filename) throws IOException {
> String operationName = "metaSave";
> checkSuperuserPrivilege(operationName);
> ..
> try {
> ..
> metaSave(out);
> ..
>   }
> } finally {
>   readUnlock(operationName, getLockReportInfoSupplier(null));
> }
> logAuditEvent(true, operationName, null);
>   }
> {code}
> But setQuota, addCachePool, modifyCachePool, removeCachePool, 
> createEncryptionZone and reencryptEncryptionZone catch the ACE and log the 
> same message again, which I think is a waste of memory: 
> {code:java}
>   /**
>* Set the namespace quota and storage space quota for a directory.
>* See {@link ClientProtocol#setQuota(String, long, long, StorageType)} for 
> the
>* contract.
>* 
>* Note: This does not support ".inodes" relative path.
>*/
>   void setQuota(String src, long nsQuota, long ssQuota, StorageType type)
>   throws IOException {
> ..
> try {
>   if(!allowOwnerSetQuota) {
> checkSuperuserPrivilege(operationName, src);
>   }
>  ..
> } catch (AccessControlException ace) {
>   logAuditEvent(false, operationName, src);
>   throw ace;
> }
> getEditLog().logSync();
> logAuditEvent(true, operationName, src);
>   }
> {code}
> Maybe we should move the checkSuperuserPrivilege out of the try block as 
> metaSave and other callers do.
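> A sketch of that shape for setQuota (the ACE from the privilege check is then 
> logged only once, inside checkSuperuserPrivilege):
> {code:java}
> void setQuota(String src, long nsQuota, long ssQuota, StorageType type)
>     throws IOException {
>   if (!allowOwnerSetQuota) {
>     // Logs the audit event itself and rethrows; no outer catch needed.
>     checkSuperuserPrivilege(operationName, src);
>   }
>   try {
>     ..
>   } catch (AccessControlException ace) {
>     // Still logs ACEs thrown by the remaining permission checks.
>     logAuditEvent(false, operationName, src);
>     throw ace;
>   }
>   getEditLog().logSync();
>   logAuditEvent(true, operationName, src);
> } {code}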



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16868) Fix audit log duplicate issue when an ACE occurs in FSNamesystem.

2022-12-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16868:
---
Summary: Fix audit log duplicate issue when an ACE occurs in FSNamesystem.  
(was: Audit log duplicate problem when an ACE occurs in FSNamesystem.)

> Fix audit log duplicate issue when an ACE occurs in FSNamesystem.
> -
>
> Key: HDFS-16868
> URL: https://issues.apache.org/jira/browse/HDFS-16868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Beibei Zhao
>Assignee: Beibei Zhao
>Priority: Major
>  Labels: pull-request-available
>
> checkSuperuserPrivilege calls logAuditEvent and throws the ACE when an 
> AccessControlException occurs.
> {code:java}
>   // This method logs operationName without super user privilege.
>   // It should be called without holding FSN lock.
>   void checkSuperuserPrivilege(String operationName, String path)
>   throws IOException {
> if (isPermissionEnabled) {
>   try {
> FSPermissionChecker.setOperationType(operationName);
> FSPermissionChecker pc = getPermissionChecker();
> pc.checkSuperuserPrivilege(path);
>   } catch(AccessControlException ace){
> logAuditEvent(false, operationName, path);
> throw ace;
>   }
> }
>   }
> {code}
> Its callers, like metaSave, invoke it like this: 
> {code:java}
>   /**
>* Dump all metadata into specified file
>* @param filename
>*/
>   void metaSave(String filename) throws IOException {
> String operationName = "metaSave";
> checkSuperuserPrivilege(operationName);
> ..
> try {
> ..
> metaSave(out);
> ..
>   }
> } finally {
>   readUnlock(operationName, getLockReportInfoSupplier(null));
> }
> logAuditEvent(true, operationName, null);
>   }
> {code}
> But setQuota, addCachePool, modifyCachePool, removeCachePool, 
> createEncryptionZone and reencryptEncryptionZone catch the ACE and log the 
> same message again, which I think is a waste of memory: 
> {code:java}
>   /**
>* Set the namespace quota and storage space quota for a directory.
>* See {@link ClientProtocol#setQuota(String, long, long, StorageType)} for 
> the
>* contract.
>* 
>* Note: This does not support ".inodes" relative path.
>*/
>   void setQuota(String src, long nsQuota, long ssQuota, StorageType type)
>   throws IOException {
> ..
> try {
>   if(!allowOwnerSetQuota) {
> checkSuperuserPrivilege(operationName, src);
>   }
>  ..
> } catch (AccessControlException ace) {
>   logAuditEvent(false, operationName, src);
>   throw ace;
> }
> getEditLog().logSync();
> logAuditEvent(true, operationName, src);
>   }
> {code}
> Maybe we should move the checkSuperuserPrivilege out of the try block as 
> metaSave and other callers do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16868) Fix audit log duplicate issue when an ACE occurs in FSNamesystem.

2022-12-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16868.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix audit log duplicate issue when an ACE occurs in FSNamesystem.
> -
>
> Key: HDFS-16868
> URL: https://issues.apache.org/jira/browse/HDFS-16868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Beibei Zhao
>Assignee: Beibei Zhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> checkSuperuserPrivilege calls logAuditEvent and throws the ACE when an 
> AccessControlException occurs.
> {code:java}
>   // This method logs operationName without super user privilege.
>   // It should be called without holding FSN lock.
>   void checkSuperuserPrivilege(String operationName, String path)
>   throws IOException {
> if (isPermissionEnabled) {
>   try {
> FSPermissionChecker.setOperationType(operationName);
> FSPermissionChecker pc = getPermissionChecker();
> pc.checkSuperuserPrivilege(path);
>   } catch(AccessControlException ace){
> logAuditEvent(false, operationName, path);
> throw ace;
>   }
> }
>   }
> {code}
> Its callers, like metaSave, invoke it like this: 
> {code:java}
>   /**
>* Dump all metadata into specified file
>* @param filename
>*/
>   void metaSave(String filename) throws IOException {
> String operationName = "metaSave";
> checkSuperuserPrivilege(operationName);
> ..
> try {
> ..
> metaSave(out);
> ..
>   }
> } finally {
>   readUnlock(operationName, getLockReportInfoSupplier(null));
> }
> logAuditEvent(true, operationName, null);
>   }
> {code}
> But setQuota, addCachePool, modifyCachePool, removeCachePool, 
> createEncryptionZone and reencryptEncryptionZone catch the ACE and log the 
> same message again, which I think is a waste of memory: 
> {code:java}
>   /**
>* Set the namespace quota and storage space quota for a directory.
>* See {@link ClientProtocol#setQuota(String, long, long, StorageType)} for 
> the
>* contract.
>* 
>* Note: This does not support ".inodes" relative path.
>*/
>   void setQuota(String src, long nsQuota, long ssQuota, StorageType type)
>   throws IOException {
> ..
> try {
>   if(!allowOwnerSetQuota) {
> checkSuperuserPrivilege(operationName, src);
>   }
>  ..
> } catch (AccessControlException ace) {
>   logAuditEvent(false, operationName, src);
>   throw ace;
> }
> getEditLog().logSync();
> logAuditEvent(true, operationName, src);
>   }
> {code}
> Maybe we should move the checkSuperuserPrivilege out of the try block as 
> metaSave and other callers do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16868) Fix audit log duplicate issue when an ACE occurs in FSNamesystem.

2022-12-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16868:
---
Issue Type: Improvement  (was: Bug)

> Fix audit log duplicate issue when an ACE occurs in FSNamesystem.
> -
>
> Key: HDFS-16868
> URL: https://issues.apache.org/jira/browse/HDFS-16868
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Beibei Zhao
>Assignee: Beibei Zhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> checkSuperuserPrivilege calls logAuditEvent and throws the ACE when an 
> AccessControlException occurs.
> {code:java}
>   // This method logs operationName without super user privilege.
>   // It should be called without holding FSN lock.
>   void checkSuperuserPrivilege(String operationName, String path)
>   throws IOException {
> if (isPermissionEnabled) {
>   try {
> FSPermissionChecker.setOperationType(operationName);
> FSPermissionChecker pc = getPermissionChecker();
> pc.checkSuperuserPrivilege(path);
>   } catch(AccessControlException ace){
> logAuditEvent(false, operationName, path);
> throw ace;
>   }
> }
>   }
> {code}
> Its callers, like metaSave, invoke it like this: 
> {code:java}
>   /**
>* Dump all metadata into specified file
>* @param filename
>*/
>   void metaSave(String filename) throws IOException {
> String operationName = "metaSave";
> checkSuperuserPrivilege(operationName);
> ..
> try {
> ..
>   metaSave(out);
> ..
> } finally {
>   readUnlock(operationName, getLockReportInfoSupplier(null));
> }
> logAuditEvent(true, operationName, null);
>   }
> {code}
> but setQuota, addCachePool, modifyCachePool, removeCachePool, 
> createEncryptionZone and reencryptEncryptionZone catch the ace and log the 
> same message again, which duplicates the audit entry; I think this is wasteful: 
> {code:java}
>   /**
>* Set the namespace quota and storage space quota for a directory.
>* See {@link ClientProtocol#setQuota(String, long, long, StorageType)} for 
> the
>* contract.
>* 
>* Note: This does not support ".inodes" relative path.
>*/
>   void setQuota(String src, long nsQuota, long ssQuota, StorageType type)
>   throws IOException {
> ..
> try {
>   if (!allowOwnerSetQuota) {
> checkSuperuserPrivilege(operationName, src);
>   }
>  ..
> } catch (AccessControlException ace) {
>   logAuditEvent(false, operationName, src);
>   throw ace;
> }
> getEditLog().logSync();
> logAuditEvent(true, operationName, src);
>   }
> {code}
> Maybe we should move checkSuperuserPrivilege out of the try block, as 
> metaSave and the other callers do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16898) Make write lock fine-grain in processCommandFromActor method

2023-02-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-16898:
--

Assignee: ZhangHB

> Make write lock fine-grain in processCommandFromActor method
> 
>
> Key: HDFS-16898
> URL: https://issues.apache.org/jira/browse/HDFS-16898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> Now, in the processCommandFromActor method, we have the following code:
>  
> {code:java}
> writeLock();
> try {
>   if (actor == bpServiceToActive) {
> return processCommandFromActive(cmd, actor);
>   } else {
> return processCommandFromStandby(cmd, actor);
>   }
> } finally {
>   writeUnlock();
> } {code}
> If processCommandFromActive takes a long time, the write lock is not 
> released for the whole duration.
>  
> This may block the updateActorStatesFromHeartbeat method in 
> offerService; furthermore, it can drive the DataNode's lastContact very 
> high, and the node may even be marked dead once lastContact exceeds 600s.
> {code:java}
> bpos.updateActorStatesFromHeartbeat(
> this, resp.getNameNodeHaState());{code}
> Here we can make the write lock in processCommandFromActor fine-grained to 
> address this problem; see the sketch below.
>  
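> One way to narrow the critical section is sketched below: hold the lock only 
> long enough to read bpServiceToActive, and run the potentially slow command 
> processing outside it (whether a read lock suffices for that comparison is 
> an assumption of this sketch): 
> {code:java}
> // Decide active vs. standby under the lock, then process outside it, so
> // a slow processCommandFromActive no longer blocks
> // updateActorStatesFromHeartbeat and the heartbeat path.
> final boolean isActiveActor;
> readLock();
> try {
>   isActiveActor = (actor == bpServiceToActive);
> } finally {
>   readUnlock();
> }
> if (isActiveActor) {
>   return processCommandFromActive(cmd, actor);
> } else {
>   return processCommandFromStandby(cmd, actor);
> }
> {code}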



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16898) Make write lock fine-grain in processCommandFromActor method

2023-02-05 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684410#comment-17684410
 ] 

Xiaoqiao He commented on HDFS-16898:


Add ZhangHB to contributors group, and assign this ticket to him.

> Make write lock fine-grain in processCommandFromActor method
> 
>
> Key: HDFS-16898
> URL: https://issues.apache.org/jira/browse/HDFS-16898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> Now, in the processCommandFromActor method, we have the following code:
>  
> {code:java}
> writeLock();
> try {
>   if (actor == bpServiceToActive) {
> return processCommandFromActive(cmd, actor);
>   } else {
> return processCommandFromStandby(cmd, actor);
>   }
> } finally {
>   writeUnlock();
> } {code}
> If processCommandFromActive takes a long time, the write lock is not 
> released for the whole duration.
>  
> This may block the updateActorStatesFromHeartbeat method in 
> offerService; furthermore, it can drive the DataNode's lastContact very 
> high, and the node may even be marked dead once lastContact exceeds 600s.
> {code:java}
> bpos.updateActorStatesFromHeartbeat(
> this, resp.getNameNodeHaState());{code}
> Here we can make the write lock in processCommandFromActor fine-grained to 
> address this problem.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16898) Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat

2023-02-07 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16898:
---
Summary: Remove write lock for processCommandFromActor of DataNode to 
reduce impact on heartbeat  (was: Make write lock fine-grain in 
processCommandFromActor method)

> Remove write lock for processCommandFromActor of DataNode to reduce impact on 
> heartbeat
> ---
>
> Key: HDFS-16898
> URL: https://issues.apache.org/jira/browse/HDFS-16898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> Now, in the processCommandFromActor method, we have the following code:
>  
> {code:java}
> writeLock();
> try {
>   if (actor == bpServiceToActive) {
> return processCommandFromActive(cmd, actor);
>   } else {
> return processCommandFromStandby(cmd, actor);
>   }
> } finally {
>   writeUnlock();
> } {code}
> If processCommandFromActive takes a long time, the write lock is not 
> released for the whole duration.
>  
> This may block the updateActorStatesFromHeartbeat method in 
> offerService; furthermore, it can drive the DataNode's lastContact very 
> high, and the node may even be marked dead once lastContact exceeds 600s.
> {code:java}
> bpos.updateActorStatesFromHeartbeat(
> this, resp.getNameNodeHaState());{code}
> Here we can make the write lock in processCommandFromActor fine-grained to 
> address this problem.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16898) Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat

2023-02-07 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16898.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove write lock for processCommandFromActor of DataNode to reduce impact on 
> heartbeat
> ---
>
> Key: HDFS-16898
> URL: https://issues.apache.org/jira/browse/HDFS-16898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Now, in the processCommandFromActor method, we have the following code:
>  
> {code:java}
> writeLock();
> try {
>   if (actor == bpServiceToActive) {
> return processCommandFromActive(cmd, actor);
>   } else {
> return processCommandFromStandby(cmd, actor);
>   }
> } finally {
>   writeUnlock();
> } {code}
> If processCommandFromActive takes a long time, the write lock is not 
> released for the whole duration.
>  
> This may block the updateActorStatesFromHeartbeat method in 
> offerService; furthermore, it can drive the DataNode's lastContact very 
> high, and the node may even be marked dead once lastContact exceeds 600s.
> {code:java}
> bpos.updateActorStatesFromHeartbeat(
> this, resp.getNameNodeHaState());{code}
> Here we can make the write lock in processCommandFromActor fine-grained to 
> address this problem.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16939) Fix the thread safety bug in LowRedundancyBlocks

2023-03-06 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16939.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix the thread safety bug in LowRedundancyBlocks
> 
>
> Key: HDFS-16939
> URL: https://issues.apache.org/jira/browse/HDFS-16939
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The remove method in LowRedundancyBlocks is not protected by synchronized. 
> This method is private and is called by BlockManager. As a result, 
> priorityQueues is at risk of being accessed concurrently by multiple 
> threads.
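> The fix implied by this description is small; a sketch, assuming the private 
> overload keeps its current signature and the elided body is unchanged: 
> {code:java}
> // Marking the private overload synchronized as well means every caller,
> // including the BlockManager paths, serializes on the same monitor that
> // already guards the public methods touching priorityQueues.
> synchronized boolean remove(BlockInfo block, int priLevel) {
>   ..
> }
> {code}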



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16939) Fix the thread safety bug in LowRedundancyBlocks

2023-03-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698373#comment-17698373
 ] 

Xiaoqiao He commented on HDFS-16939:


[~ste...@apache.org] Thanks for your reminder; [~zhangshuyan] is checking now. 
However, I am not sure whether we can be ready before next week. IMO this 
bugfix should not block the release progress, because it is a corner case and 
it does not impact core logic even if it is triggered. Anyway, I will sync 
here no matter whether it is ready or not. Thanks again.

> Fix the thread safety bug in LowRedundancyBlocks
> 
>
> Key: HDFS-16939
> URL: https://issues.apache.org/jira/browse/HDFS-16939
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The remove method in LowRedundancyBlocks is not protected by synchronized. 
> This method is private and is called by BlockManager. As a result, 
> priorityQueues is at risk of being accessed concurrently by multiple 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16939) Fix the thread safety bug in LowRedundancyBlocks

2023-03-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16939:
---
Fix Version/s: 3.3.5

> Fix the thread safety bug in LowRedundancyBlocks
> 
>
> Key: HDFS-16939
> URL: https://issues.apache.org/jira/browse/HDFS-16939
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> The remove method in LowRedundancyBlocks is not protected by synchronized. 
> This method is private and is called by BlockManager. As a result, 
> priorityQueues is at risk of being accessed concurrently by multiple 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16939) Fix the thread safety bug in LowRedundancyBlocks

2023-03-11 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699191#comment-17699191
 ] 

Xiaoqiao He commented on HDFS-16939:


[~ste...@apache.org] This PR has been committed to branch-3.3 and branch-3.3.5. 
FYI.

> Fix the thread safety bug in LowRedundancyBlocks
> 
>
> Key: HDFS-16939
> URL: https://issues.apache.org/jira/browse/HDFS-16939
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> The remove method in LowRedundancyBlocks is not protected by synchronized. 
> This method is private and is called by BlockManager. As a result, 
> priorityQueues is at risk of being accessed concurrently by multiple 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16948) Update log of BlockManager#chooseExcessRedundancyStriped when EC internal block is moved by balancer

2023-03-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-16948:
---
Summary: Update log of BlockManager#chooseExcessRedundancyStriped when EC 
internal block is moved by balancer  (was: Do not print "excess types chosen 
... is empty" when EC internal block is moved by balancer)

> Update log of BlockManager#chooseExcessRedundancyStriped when EC internal 
> block is moved by balancer
> 
>
> Key: HDFS-16948
> URL: https://issues.apache.org/jira/browse/HDFS-16948
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: Kidd5368
>Assignee: Kidd5368
>Priority: Major
>  Labels: pull-request-available
>
> This is a follow-up of HDFS-16179. When an EC internal block is moved by the 
> balancer, it triggers chooseExcessRedundancyStriped(), and the parameter 
> delNodeHint is not null. So the nonExcess list is modified in 
> processChosenExcessRedundancy(), and normally the size of nonExcess is 
> reduced to the expected EC group size; that is why the annoying log "excess 
> types chosen ... is empty." is printed.
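> To illustrate why the message fires (a sketch; the names follow the 
> description above, and the exact logging call is an assumption rather than 
> the committed patch): 
> {code:java}
> // After processChosenExcessRedundancy() trims nonExcess down to the
> // expected EC group size for a balancer-triggered move (delNodeHint is
> // not null), no storage type is left to pick, so excessTypes ends up empty.
> if (excessTypes.isEmpty()) {
>   // A balancer move is expected to land here, so debug level avoids the
>   // noisy per-move message this issue complains about.
>   LOG.debug("excess types chosen for block {} is empty.", storedBlock);
> }
> {code}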



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


