[jira] [Assigned] (HDFS-16428) Source path with storagePolicy set will cause wrong typeConsumed in rename operation
[ https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-16428: -- Assignee: lei w > Source path with storagePolicy set will cause wrong typeConsumed in rename > operation > --- > > Key: HDFS-16428 > URL: https://issues.apache.org/jira/browse/HDFS-16428 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode >Reporter: lei w >Assignee: lei w >Priority: Major > Labels: pull-request-available > Attachments: example.txt > > Time Spent: 1h > Remaining Estimate: 0h > > When computing quota in the rename operation, we use the storage policy of the target > directory to compute the src quota usage. This causes a wrong value of > typeConsumed when the source path has a storage policy set. I provided a unit > test to demonstrate this situation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
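A minimal sketch of the scenario described in HDFS-16428 above, assuming a running HDFS cluster reachable through the default Configuration. The paths, the COLD policy, and the quota value are illustrative assumptions; this is not the unit test attached to the issue.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.QuotaUsage;
import org.apache.hadoop.fs.StorageType;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RenameTypeConsumedSketch {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    Path src = new Path("/src");   // source directory with its own storage policy
    Path dst = new Path("/dst");   // target directory with a storage-type quota

    // The source keeps its own policy (COLD -> ARCHIVE); the target uses the default policy.
    dfs.setStoragePolicy(src, "COLD");
    dfs.setQuotaByStorageType(dst, StorageType.ARCHIVE, 10L * 1024 * 1024 * 1024);

    QuotaUsage before = dfs.getQuotaUsage(dst);
    dfs.rename(new Path(src, "file"), new Path(dst, "file"));
    QuotaUsage after = dfs.getQuotaUsage(dst);

    // If the rename computed the source usage with the target directory's policy,
    // the ARCHIVE delta will not match the file's real consumption under COLD.
    System.out.println("ARCHIVE consumed before: "
        + before.getTypeConsumed(StorageType.ARCHIVE));
    System.out.println("ARCHIVE consumed after:  "
        + after.getTypeConsumed(StorageType.ARCHIVE));
  }
}
{code}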
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-16083: --- Attachment: HDFS-16083.005.1.patch > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > HDFS-16083.005.patch, activeRollEdits.png > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid the Observer NameNode from triggering the active > NameNode log roll? We configured 'dfs.ha.log-roll.period' to 300 (5 > minutes), yet the active NameNode receives rollEditLog RPCs as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-16083: --- Attachment: (was: HDFS-16083.005.1.patch) > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > HDFS-16083.005.patch, activeRollEdits.png > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid the Observer NameNode from triggering the active > NameNode log roll? We configured 'dfs.ha.log-roll.period' to 300 (5 > minutes), yet the active NameNode receives rollEditLog RPCs as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-16083: --- Status: Open (was: Patch Available) > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > HDFS-16083.005.patch, activeRollEdits.png > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid the Observer NameNode from triggering the active > NameNode log roll? We configured 'dfs.ha.log-roll.period' to 300 (5 > minutes), yet the active NameNode receives rollEditLog RPCs as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-16083: --- Attachment: HDFS-16083.005.1.patch Status: Patch Available (was: Open) > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, HDFS-16083.005.1.patch, > HDFS-16083.005.patch, activeRollEdits.png > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid the Observer NameNode from triggering the active > NameNode log roll? We configured 'dfs.ha.log-roll.period' to 300 (5 > minutes), yet the active NameNode receives rollEditLog RPCs as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-16083: --- Attachment: HDFS-16083.004.patch Status: Patch Available (was: Open) Re-submit v04. > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, HDFS-16083.004.patch, activeRollEdits.png > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid the Observer NameNode from triggering the active > NameNode log roll? We configured 'dfs.ha.log-roll.period' to 300 (5 > minutes), yet the active NameNode receives rollEditLog RPCs as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371915#comment-17371915 ] Jinglun commented on HDFS-16083: Hi [~lei w], thanks for your patch; some comments. In EditLogTailer.java: # I prefer using `shouldRollLog` instead of avoidTriggerActiveLogRoll. {code:java} if (shouldRollLog && tooLongSinceLastLoad() && lastRollTriggerTxId < lastLoadedTxnId) {{code} In TestStandbyRollEditsLogOnly.java: # The test case and setup method should not be static. # We need a license header for the new file. In TestStandbyRollEditsLogOnly#testOnlyStandbyRollEditlog: # When you compare observerRollTimeMs1, could you use assertEquals instead of assertTrue? # The message of the assert should be more specific. Something like: "Standby should roll the log." and "The observer is not expected to roll the log." # I'd prefer using standbyInitialRollTime and standbyLastRollTime instead of the numbered standbyRollTimeMs1 and standbyRollTimeMs2. # The sleep time is too long; can we make it faster? In TestStandbyRollEditsLogOnly#testTransObToStandbyThenRollLog: # It fails; could you give it a check? # The verification logic is very similar to testOnlyStandbyRollEditlog; can we extract the common part into a new method? # The idea of this test is good. We can transition the state and verify the roll edit more times. Maybe do it 3 times? There are also some checkstyle issues; please follow the Jenkins suggestions. I'll re-submit v03 as v04 to trigger Jenkins. > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch, > HDFS-16083.003.patch, activeRollEdits.png > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we forbid the Observer NameNode from triggering the active > NameNode log roll? We configured 'dfs.ha.log-roll.period' to 300 (5 > minutes), yet the active NameNode receives rollEditLog RPCs as shown in > activeRollEdits.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
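To make the assertion-related review points above concrete, here is a small illustrative example of the suggested style. The variable names and values are hypothetical and are not taken from the attached patch.

{code:java}
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class RollAssertionStyleExample {
  @Test
  public void rollTimeAssertions() {
    long observerRollTimeBefore = 100L;   // placeholder values
    long observerRollTimeAfter = 100L;
    long standbyInitialRollTime = 100L;
    long standbyLastRollTime = 200L;

    // Prefer assertEquals with a specific message over assertTrue(a == b).
    assertEquals("The observer is not expected to roll the log.",
        observerRollTimeBefore, observerRollTimeAfter);
    assertTrue("Standby should roll the log.",
        standbyLastRollTime > standbyInitialRollTime);
  }
}
{code}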
[jira] [Commented] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368012#comment-17368012 ] Jinglun commented on HDFS-16083: Hi [~lei w], thanks for your report! The description makes sense to me. I had a quick look at rollEdit, and the redundant roll edit does seem to exist. Could you add some logs from the active NameNode showing that it actually rolls edits more frequently than configured in 'dfs.ha.log-roll.period'? Also, we need a unit test in the patch to make it solid. > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we prohibit the Observer NameNode from triggering > rollEditLog? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16083) Forbid Observer NameNode from triggering active NameNode log roll
[ https://issues.apache.org/jira/browse/HDFS-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-16083: -- Assignee: lei w > Forbid Observer NameNode from triggering active NameNode log roll > -- > > Key: HDFS-16083 > URL: https://issues.apache.org/jira/browse/HDFS-16083 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: lei w >Assignee: lei w >Priority: Minor > Attachments: HDFS-16083.001.patch, HDFS-16083.002.patch > > > When the Observer NameNode is enabled in the cluster, the Active NameNode > will receive rollEditLog RPC requests from both the Standby NameNode and the Observer > NameNode within a short time. The Observer NameNode's rollEditLog request is a > repetitive operation, so should we prohibit the Observer NameNode from triggering > rollEditLog? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16038) DataNode does not recognize Observer Node when cluster adds an observer node
[ https://issues.apache.org/jira/browse/HDFS-16038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354202#comment-17354202 ] Jinglun commented on HDFS-16038: I mean you can update the package and the configuration at the same time. A DataNode with the old package doesn't know about the existence of the Observer, so there won't be the HAServiceState.observer issue. Would you like to share more details about your upgrade process and why updating the package and the configuration at the same time doesn't work for you? > DataNode does not recognize Observer Node when cluster adds an observer node > - > > Key: HDFS-16038 > URL: https://issues.apache.org/jira/browse/HDFS-16038 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: lei w >Priority: Critical > > When an Observer node is added to the cluster, the DataNode will not be able > to recognize HAServiceState.observer, because we did not upgrade > the DataNode. Generally, it takes a long time for a big cluster to > upgrade the DataNodes. So should we add a switch to replace the Observer > state with the Standby state when the DataNode cannot recognize the > HAServiceState.observer state? > The following are some error messages from the DataNode: > {code:java} > 11:14:31,812 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > IOException in offerService > com.google.protobuf.InvalidProtocolBufferException: Message missing required > fields: haStatus.state > at > com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:81) > at > com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:71) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353991#comment-17353991 ] Jinglun edited comment on HDFS-15973 at 5/30/21, 11:44 AM: --- I made a mistake when committing to trunk. I was working in a new environment and forgot to update my git user.name and user.email. It left Chinese characters in the commit message, which might confuse people. So I reverted it and re-committed with the correct message. I sincerely apologize to anyone who is disturbed by the commit message. Very sorry. was (Author: lijinglun): I made a mistake when committing to trunk. I was working in a new environment and forgot to update my git user.name and user.email. It left Chinese characters in the commit message, which might confuse people. So I reverted it and re-committed with the correct message. I sincerely apologize to everyone who is disturbed by the commit message. Very sorry. > RBF: Add permission check before doing router federation rename. > > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch, HDFS-15973.010.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353991#comment-17353991 ] Jinglun commented on HDFS-15973: I made a mistake when committing to trunk. I was working in a new environment and forgot to update my git user.name and user.email. It left Chinese characters in the commit message, which might confuse people. So I reverted it and re-committed with the correct message. I sincerely apologize to everyone who is disturbed by the commit message. Very sorry. > RBF: Add permission check before doing router federation rename. > > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch, HDFS-15973.010.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353971#comment-17353971 ] Jinglun commented on HDFS-15973: Committed to trunk. Thanks [~elgoiri] for the review! > RBF: Add permission check before doing router federation rename. > > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch, HDFS-15973.010.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Summary: RBF: Add permission check before doing router federation rename. (was: RBF: Add permission check before doting router federation rename.) > RBF: Add permission check before doing router federation rename. > > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch, HDFS-15973.010.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Resolution: Fixed Status: Resolved (was: Patch Available) > RBF: Add permission check before doing router federation rename. > > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch, HDFS-15973.010.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16038) DataNode does not recognize Observer Node when cluster adds an observer node
[ https://issues.apache.org/jira/browse/HDFS-16038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353945#comment-17353945 ] Jinglun commented on HDFS-16038: Hi [~lei w], thanks for your report. I have one question. IMO, when the observer is added to the cluster, the DataNode won't automatically recognize it. The administrator needs to update both the configuration and the package of the DataNode so that it can recognize the address of the observer and the `HAServiceState.observer`. If we update both the configuration and the package, we won't run into the situation where the DataNode doesn't recognize the HAServiceState.observer. > DataNode does not recognize Observer Node when cluster adds an observer node > - > > Key: HDFS-16038 > URL: https://issues.apache.org/jira/browse/HDFS-16038 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: lei w >Priority: Critical > > When an Observer node is added to the cluster, the DataNode will not be able > to recognize HAServiceState.observer, because we did not upgrade > the DataNode. Generally, it takes a long time for a big cluster to > upgrade the DataNodes. So should we add a switch to replace the Observer > state with the Standby state when the DataNode cannot recognize the > HAServiceState.observer state? > The following are some error messages from the DataNode: > {code:java} > 11:14:31,812 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > IOException in offerService > com.google.protobuf.InvalidProtocolBufferException: Message missing required > fields: haStatus.state > at > com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:81) > at > com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:71) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352232#comment-17352232 ] Jinglun commented on HDFS-15973: I'll wait one day for further comments. After that I'll commit this. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch, HDFS-15973.010.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.010.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch, HDFS-15973.010.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350781#comment-17350781 ] Jinglun commented on HDFS-15973: Hi [~elgoiri], thanks for your comments! {quote}is just removing the sleep good enough? {quote} Yes, I think so. The sleep was meant to make sure the test directories are all created. The `cluster.createTestDirectoriesNamenode()` actually verifies whether the path exists after creating it, so there is no need to wait. Fixed the whitespace and changed rpc to capitals. Submitted v10. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.009.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch, > HDFS-15973.009.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350410#comment-17350410 ] Jinglun commented on HDFS-15973: Hi [~elgoiri], thanks for your comments! Submitted v09 following your suggestions. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348939#comment-17348939 ] Jinglun commented on HDFS-15973: Hi [~elgoiri], could you help review v08? Thanks! > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347640#comment-17347640 ] Jinglun commented on HDFS-15294: {quote}if source directory be writting all the time, is it means Federation balance will never exit? {quote} Hi [~zhengchenyu], nice comments. HDFS-15640 has introduced a new option: 'diffThreshold'. If the number of diff entries is no greater than this threshold and the open-files check is satisfied (no open files, or force-close all open files), fedBalance will go to the final round of distcp. By specifying the diff threshold we can make the federation balance job exit. Does it work for your situation? I'll take a review of HDFS-15750. > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update the mount table in the Router if RBF mode is specified. > 3. Deal with the src data: move it to trash, delete it, or skip it. > The design of the fedbalance tool comes from the discussion in HDFS-15087. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
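A rough control-flow sketch of how a diffThreshold can bound the sync loop described in the comment above. The helper methods are hypothetical stand-ins, not the actual FedBalance or DistCpProcedure code.

{code:java}
public class DiffThresholdLoopSketch {
  private final int diffThreshold;

  public DiffThresholdLoopSketch(int diffThreshold) {
    this.diffThreshold = diffThreshold;
  }

  /** Sync incrementally until the remaining diff is small enough, then finish. */
  public void run() {
    while (snapshotDiffSize() > diffThreshold) {
      incrementalDistcpRound();   // distcp driven by snapshot diff
    }
    // Diff is within the threshold and the open-file check passed:
    // run the final round, then move on (e.g. update the mount table).
    finalDistcpRound();
  }

  // Hypothetical stand-ins for illustration only.
  private int snapshotDiffSize() { return 0; }
  private void incrementalDistcpRound() { }
  private void finalDistcpRound() { }
}
{code}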
[jira] [Comment Edited] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347640#comment-17347640 ] Jinglun edited comment on HDFS-15294 at 5/19/21, 12:40 PM: --- {quote}if source directory be writting all the time, is it means Federation balance will never exit? {quote} Hi [~zhengchenyu], nice comments. HDFS-15640 has introduced an option: 'diffThreshold'. If the number of diff entries is no greater than this threshold and the open-files check is satisfied (no open files, or force-close all open files), fedBalance will go to the final round of distcp. By specifying the diff threshold we can make the federation balance job exit. Does it work for your situation? I'll take a review of HDFS-15750. was (Author: lijinglun): {quote}if source directory be writting all the time, is it means Federation balance will never exit? {quote} Hi [~zhengchenyu], nice comments. HDFS-15640 has introduced a new option: 'diffThreshold'. If the number of diff entries is no greater than this threshold and the open-files check is satisfied (no open files, or force-close all open files), fedBalance will go to the final round of distcp. By specifying the diff threshold we can make the federation balance job exit. Does it work for your situation? I'll take a review of HDFS-15750. > Federation balance tool > --- > > Key: HDFS-15294 > URL: https://issues.apache.org/jira/browse/HDFS-15294 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Fix For: 3.4.0 > > Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, > HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, > HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, > HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf > > > This jira introduces a new HDFS federation balance tool to balance data > across different federation namespaces. It uses Distcp to copy data from the > source path to the target path. > The process is: > 1. Use distcp and snapshot diff to sync data between src and dst until they > are the same. > 2. Update the mount table in the Router if RBF mode is specified. > 3. Deal with the src data: move it to trash, delete it, or skip it. > The design of the fedbalance tool comes from the discussion in HDFS-15087. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
[ https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-13671: -- Assignee: Haibin Huang > Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet > -- > > Key: HDFS-13671 > URL: https://issues.apache.org/jira/browse/HDFS-13671 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0, 3.0.3 >Reporter: Yiqun Lin >Assignee: Haibin Huang >Priority: Major > > The NameNode hung when deleting large files/blocks. The stack info: > {code} > "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 > tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) > {code} > In the current deletion logic in the NameNode, there are mainly two steps: > * Collect the INodes and all blocks to be deleted, then delete the INodes. > * Remove blocks chunk by chunk in a loop. > Actually the first step should be the more expensive operation and take > more time. However, we now always see the NN hang during the remove-block > operation. > Looking into this, we introduced a new structure {{FoldedTreeSet}} to get > better performance in dealing with FBRs/IBRs. But compared with the earlier > implementation of the remove-block logic, {{FoldedTreeSet}} seems slower > since it takes additional time to balance tree nodes. When there are many > blocks to be removed/deleted, it looks bad. > For the get-type operations in {{DatanodeStorageInfo}}, we only provide > {{getBlockIterator}} to return a block iterator and no other get operation > with a specified block. Do we still need to use {{FoldedTreeSet}} in > {{DatanodeStorageInfo}}? As we know, {{FoldedTreeSet}} benefits Get, not > Update. Maybe we can revert this to the earlier implementation. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
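A self-contained sketch of the "remove blocks chunk by chunk" step mentioned in the description above. The chunk size, the lock object, and removeChunk() are illustrative assumptions rather than the real FSNamesystem code.

{code:java}
import java.util.List;

public class ChunkedBlockRemovalSketch {
  private static final int BLOCK_DELETION_INCREMENT = 1000; // assumed chunk size
  private final Object namesystemLock = new Object();       // stand-in for the FSN lock

  /** Remove the collected block ids chunk by chunk, re-taking the lock per chunk. */
  public void removeBlocks(List<Long> collectedBlocks) {
    int start = 0;
    while (start < collectedBlocks.size()) {
      int end = Math.min(start + BLOCK_DELETION_INCREMENT, collectedBlocks.size());
      synchronized (namesystemLock) {
        removeChunk(collectedBlocks.subList(start, end));
      }
      start = end;
    }
  }

  // Hypothetical: in the NameNode this is where blocks leave the blocks map and
  // the per-storage structures (the FoldedTreeSet removals discussed above).
  private void removeChunk(List<Long> chunk) { }
}
{code}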
[jira] [Commented] (HDFS-13671) Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet
[ https://issues.apache.org/jira/browse/HDFS-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347285#comment-17347285 ] Jinglun commented on HDFS-13671: In Xiaomi we have seen the same slow deletion problem. [~huanghaibin] solved this by reverting the FoldedTreeSet. Would you like to contribute your work here, [~huanghaibin]? > Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet > -- > > Key: HDFS-13671 > URL: https://issues.apache.org/jira/browse/HDFS-13671 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0, 3.0.3 >Reporter: Yiqun Lin >Priority: Major > > The NameNode hung when deleting large files/blocks. The stack info: > {code} > "IPC Server handler 4 on 8020" #87 daemon prio=5 os_prio=0 > tid=0x7fb505b27800 nid=0x94c3 runnable [0x7fa861361000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.compare(FoldedTreeSet.java:474) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:849) > at > org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:252) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3813) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3617) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4270) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4244) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4180) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4164) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:871) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:311) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:625) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) > {code} > In the current deletion logic in the NameNode, there are mainly two steps: > * Collect the INodes and all blocks to be deleted, then delete the INodes. > * Remove blocks chunk by chunk in a loop. > Actually the first step should be the more expensive operation and take > more time. However, we now always see the NN hang during the remove-block > operation. > Looking into this, we introduced a new structure {{FoldedTreeSet}} to get > better performance in dealing with FBRs/IBRs. But compared with the earlier > implementation of the remove-block logic, {{FoldedTreeSet}} seems slower > since it takes additional time to balance tree nodes. When there are many > blocks to be removed/deleted, it looks bad. > For the get-type operations in {{DatanodeStorageInfo}}, we only provide > {{getBlockIterator}} to return a block iterator and no other get operation > with a specified block. Do we still need to use {{FoldedTreeSet}} in > {{DatanodeStorageInfo}}? As we know, {{FoldedTreeSet}} benefits Get, not > Update. Maybe we can revert this to the earlier implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346104#comment-17346104 ] Jinglun commented on HDFS-15973: Submitted v08 to fix checkstyle. The failed unit test runs well in my local environment, so it is not related. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.008.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch, HDFS-15973.008.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.007.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346003#comment-17346003 ] Jinglun commented on HDFS-15973: Hi [~elgoiri], thanks for your comments! The failed test is not related; I tested it and it works fine. Completed the javadocs of RouterFederationRename and updated the description in HDFSRouterFederation.md. Uploaded v07. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch, HDFS-15973.007.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.006.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344359#comment-17344359 ] Jinglun commented on HDFS-15973: Hi [~elgoiri], thanks for your nice comments! Updated the check of the snapshot path and permission. Moved testPermissionCheck() to a new test class. Submitted v06. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch, > HDFS-15973.006.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342263#comment-17342263 ] Jinglun commented on HDFS-15973: Hi [~zhengzhuobinzzb] [~elgoiri], do you have time to help review v05? Thanks very much! > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341339#comment-17341339 ] Jinglun commented on HDFS-15973: Since HDFS-15923 is resolved, I submitted v05 based on the authentication fix. > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doing router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.005.patch > RBF: Add permission check before doing router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch, HDFS-15973.005.patch > > > The router federation rename lacks a permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16006) TestRouterFederationRename is flaky
[ https://issues.apache.org/jira/browse/HDFS-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341300#comment-17341300 ] Jinglun commented on HDFS-16006: Hi [~elgoiri] [~hexiaoqiao], HDFS-15923 fixed this issue. The timeout was changed from 10s to 20s, and the case TestRouterFederationRename#testCounter is OK now. > TestRouterFederationRename is flaky > --- > > Key: HDFS-16006 > URL: https://issues.apache.org/jira/browse/HDFS-16006 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: Akira Ajisaka >Priority: Major > Attachments: patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt > > > {quote} > [ERROR] Errors: > [ERROR] > TestRouterFederationRename.testCounter:440->Object.wait:502->Object.wait:-2 ? > TestTimedOut > [ERROR] TestRouterFederationRename.testSetup:145 ? Remote The directory > /src cannot be... > [ERROR] TestRouterFederationRename.testSetup:145 ? Remote The directory > /src cannot be... > {quote} > https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2970/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15923) RBF: Authentication failed when rename across sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15923: --- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk. Thanks for [~zhengzhuobinzzb]'s contribution and [~elgoiri]'s review! > RBF: Authentication failed when rename across sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Fix For: 3.4.0 > > Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, > HDFS-15923.003.patch, HDFS-15923.stack-trace, > hdfs-15923-fix-security-issue.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Rename across subclusters in an RBF and Kerberos environment will encounter > the following two errors: > # Save object to the journal. > # Precheck tries to get the src file status. > So, we need to use the Router login UGI's doAs to create the DistcpProcedure and > TrashProcedure and submit the job. > > Besides, we should check the user's permission for the src and dst paths on the router side > before doing the internal rename. (HDFS-15973) > First: Save object to the journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > or
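Editor's note: the committed fix described above runs the cross-subcluster rename inside the Router's login UGI. Below is a minimal sketch of that idea, not the committed patch; submitRenameJob() is a placeholder for building the DistcpProcedure and TrashProcedure and submitting the job, and only UserGroupInformation.getLoginUser() and doAs() are real Hadoop APIs.
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch only: submitRenameJob() is a placeholder for constructing the
// DistcpProcedure and TrashProcedure and submitting the balance job.
public class RouterRenameSubmitSketch {
  static void submitAsRouterLogin(String src, String dst) throws Exception {
    // The Router's login UGI holds the Kerberos TGT obtained from its keytab.
    UserGroupInformation routerUgi = UserGroupInformation.getLoginUser();
    routerUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // Everything inside doAs (procedure construction, journal writes, job
      // submission) authenticates with the Router's own credentials instead
      // of the RPC handler's remote-user UGI, which carries no TGT.
      submitRenameJob(src, dst);  // placeholder, not a real Hadoop method
      return null;
    });
  }

  private static void submitRenameJob(String src, String dst) {
    // placeholder body
  }
}
{code}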
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340649#comment-17340649 ] Jinglun commented on HDFS-15923: +1 on v03. Waiting one day for further comments. After that I'll commit this. > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, > HDFS-15923.003.patch, HDFS-15923.stack-trace, > hdfs-15923-fix-security-issue.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Router Login UGI doAs create DistcpProcedure and > TrashProcedure and submit Job. > > Beside, we should check user permission for src and dst path in router side > before do rename internal. (HDFS-15973) > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.ha
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332893#comment-17332893 ] Jinglun commented on HDFS-15923: LGTM. +1 on v002. > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, > HDFS-15923.stack-trace, hdfs-15923-fix-security-issue.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332465#comment-17332465 ] Jinglun commented on HDFS-15923: Hi [~zhengzhuobinzzb], thanks your explanation ! Only some minor comments. 1. I think we can just remove the code comment at RouterFederationRename.java#L114. 2. TestRouterFederationRenameInKerberosEnv.java#L129 the code comment could be removed too. Other than that the patch is good to me. Nice work ! > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Attachments: HDFS-15923.001.patch, HDFS-15923.stack-trace, > hdfs-15923-fix-security-issue.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hado
[jira] [Updated] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15923: --- Attachment: hdfs-15923-fix-security-issue.patch > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Attachments: HDFS-15923.stack-trace, > hdfs-15923-fix-security-issue.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.cre
[jira] [Updated] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15923: --- Attachment: HDFS-15923.stack-trace > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Attachments: HDFS-15923.stack-trace > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) 
> at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544) > at > o
[jira] [Comment Edited] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331517#comment-17331517 ] Jinglun edited comment on HDFS-15923 at 4/25/21, 1:04 PM: -- Hi [~zhengzhuobinzzb], nice work ! Here are some comments. 1. The unit test couldn't pass on my local environment. I got a NPE when the MiniDFSCluster trys to verify the registered datanodes. The stack trace is uploaded. I think it is caused by the Datanode doesn't load the security configuration(The DataNode will call UserGroupInformation.setConfiguration and change the authenticationMethod to SIMPLE). I did a little change, see _hdfs-15923-fix-security-issue.patch._ I don't know why it worked well on your environment and yetus, do you know why ? 2. Does the TestRouterFederationRenameInKerberosEnv need to extend the ClientBaseWithFixes and why ? 3. I'd prefer reviewing your patch from jira. Could you change to Jira when submitting your next patch ? was (Author: lijinglun): Hi [~zhengzhuobinzzb], nice work ! Here are some comments. 1. The unit test couldn't pass on my local environment. I got a NPE when the MiniDFSCluster trys to verify the registered datanodes. The stack trace is uploaded. I think it is caused by the Datanode doesn't load the security configuration(The DataNode will call UserGroupInformation.setConfiguration and change the authenticationMethod to SIMPLE). I did a little change, see _fix-datanode-security-issue.patch._ I don't know why it worked well on your environment and yetus, do you know why ? 2. Does the TestRouterFederationRenameInKerberosEnv need to extend the ClientBaseWithFixes and why ? 3. I'd prefer reviewing your patch from jira. Could you change to Jira when submitting your next patch ? > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. 
> {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >
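Editor's note: the NPE analysis in the comment above comes down to process-wide UGI state, since whichever component calls UserGroupInformation.setConfiguration() last decides the authentication method. A hedged sketch of the kind of test setup this implies follows; it is not the attached hdfs-15923-fix-security-issue.patch, and the class and method names are illustrative.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch only: apply the Kerberos settings to the static UGI state before any
// DataNode in the MiniDFSCluster re-initializes it with a SIMPLE configuration.
public class SecureMiniClusterSetupSketch {
  static Configuration enableKerberos(Configuration conf) {
    conf.set("hadoop.security.authentication", "kerberos");
    // UserGroupInformation keeps a single process-wide configuration, so a
    // component started with a SIMPLE configuration can silently flip the
    // authentication method for the whole test JVM.
    UserGroupInformation.setConfiguration(conf);
    return conf;
  }
}
{code}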
[jira] [Comment Edited] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331517#comment-17331517 ] Jinglun edited comment on HDFS-15923 at 4/25/21, 1:03 PM: -- Hi [~zhengzhuobinzzb], nice work ! Here are some comments. 1. The unit test couldn't pass on my local environment. I got a NPE when the MiniDFSCluster trys to verify the registered datanodes. The stack trace is uploaded. I think it is caused by the Datanode doesn't load the security configuration(The DataNode will call UserGroupInformation.setConfiguration and change the authenticationMethod to SIMPLE). I did a little change, see _fix-datanode-security-issue.patch._ I don't know why it worked well on your environment and yetus, do you know why ? 2. Does the TestRouterFederationRenameInKerberosEnv need to extend the ClientBaseWithFixes and why ? 3. I'd prefer reviewing your patch from jira. Could you change to Jira when submitting your next patch ? was (Author: lijinglun): Hi [~zhengzhuobinzzb], nice work ! Here are some comments. 1. The unit test couldn't pass on my local environment. I got a NPE when the MiniDFSCluster trys to verify the registered datanodes. The stack trace is uploaded. I think it is caused by the Datanode doesn't load the security configuration. I did a little change, see _fix-datanode-security-issue.patch._ I don't know why it worked well on your environment and yetus, do you know why ? 2. Does the TestRouterFederationRenameInKerberosEnv need to extend the ClientBaseWithFixes and why ? 3. I'd prefer reviewing your patch from jira. Could you change to Jira when submitting your next patch ? > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. 
> {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at ja
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331517#comment-17331517 ] Jinglun commented on HDFS-15923: Hi [~zhengzhuobinzzb], nice work ! Here are some comments. 1. The unit test couldn't pass on my local environment. I got a NPE when the MiniDFSCluster trys to verify the registered datanodes. The stack trace is uploaded. I think it is caused by the Datanode doesn't load the security configuration. I did a little change, see _fix-datanode-security-issue.patch._ I don't know why it worked well on your environment and yetus, do you know why ? 2. Does the TestRouterFederationRenameInKerberosEnv need to extend the ClientBaseWithFixes and why ? 3. I'd prefer reviewing your patch from jira. Could you change to Jira when submitting your next patch ? > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) >
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331438#comment-17331438 ] Jinglun commented on HDFS-15923: OK, I am going to review this. Hi [~elgoiri] , [~ayushtkn] could you help adding zhoubin zheng as a contributor ? > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create
[jira] [Assigned] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-15923: -- Assignee: (was: Jinglun) > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:471) >
[jira] [Assigned] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-15923: -- Assignee: Jinglun > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Assignee: Jinglun >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 1h 10m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFi
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325026#comment-17325026 ] Jinglun commented on HDFS-15973: Hi [~zhengzhuobinzzb], thanks for your comments! The security mode was not considered in v03, thanks for the explanation! Submitting v04. > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.004.patch > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: (was: HDFS-15973.004.patch) > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.004.patch > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch, HDFS-15973.004.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.003.patch > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322571#comment-17322571 ] Jinglun commented on HDFS-15973: Submitting v03 to fix checkstyle. The failed unit tests are unrelated. > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch, > HDFS-15973.003.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322569#comment-17322569 ] Jinglun commented on HDFS-15973: Hi [~zhengzhuobinzzb], thanks for your comments. {quote}I think access check also need credentials in kerberos environment {quote} I don't fully understand; could you describe it in more detail? > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322214#comment-17322214 ] Jinglun commented on HDFS-15973: Submit v02 using FileSystem.access(). > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.002.patch > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch, HDFS-15973.002.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321889#comment-17321889 ] Jinglun commented on HDFS-15973: Hi [~zhengzhuobinzzb], thanks for your comments. Using FileSystem.access() is better; I overlooked the extra RPC :P. I'll submit v02 using access(), and the second point can be handled too. > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
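A minimal sketch of the FileSystem.access() approach discussed above, assuming the caller's UGI and the resolved source and destination paths are already available; the class name, the method name and the WRITE_EXECUTE choice are illustrative and are not the code in the patch:
{code:java}
import java.io.IOException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.security.UserGroupInformation;

public class RenamePermissionCheckSketch {
  /**
   * Ask the NameNodes whether the caller may rename src into dstParent,
   * running the access() RPCs as the caller's UGI. An
   * AccessControlException is thrown if either check fails.
   */
  static void checkCanRename(final Configuration conf,
      UserGroupInformation callerUgi, final Path src, final Path dstParent)
      throws IOException, InterruptedException {
    callerUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // Rename needs write + execute on the source parent directory ...
      FileSystem srcFs = src.getFileSystem(conf);
      srcFs.access(src.getParent(), FsAction.WRITE_EXECUTE);
      // ... and write + execute on the destination parent directory.
      FileSystem dstFs = dstParent.getFileSystem(conf);
      dstFs.access(dstParent, FsAction.WRITE_EXECUTE);
      return null;
    });
  }
}
{code}
FileSystem.access() lets the NameNode itself evaluate the permission for the calling user, so the Router does not have to re-implement the HDFS permission model locally.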
[jira] [Comment Edited] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320887#comment-17320887 ] Jinglun edited comment on HDFS-15923 at 4/14/21, 3:42 PM: -- Hi [~zhengzhuobinzzb], you are right ! Please continue with your work, we still need some test cases. {quote}In the current code logic, storing tasks in Journal does not use super users and Kerberos credentials. (Because when RPC executes Call, it uses the corresponding Ugi's doAs, and the Ugi does not have a Kerberberos certificate.) {quote} I'll start a new Jira(HDFS-15973) to resolve the permission check issue. was (Author: lijinglun): Hi [~zhengzhuobinzzb], you are right ! Please continue with your work, we still need some test cases. {quote}In the current code logic, storing tasks in Journal does not use super users and Kerberos credentials. (Because when RPC executes Call, it uses the corresponding Ugi's doAs, and the Ugi does not have a Kerberberos certificate.) {quote} I'll start a new Jira to resolve the permission check issue. > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 50m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. 
> {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$P
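A minimal sketch of the proxy-UGI doAs pattern described in HDFS-15923 above, only to show where the doAs wrapping goes; the DistcpProcedure/TrashProcedure construction and the job submission are left as a placeholder, and the class and method names are made up for the illustration:
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUgiSketch {
  /**
   * Run the balance-job preparation as the remote user instead of the
   * Router's own login user, so the RPCs carry the caller's identity.
   */
  static void submitAsUser(String remoteUser) throws Exception {
    UserGroupInformation routerUgi = UserGroupInformation.getLoginUser();
    UserGroupInformation proxyUgi =
        UserGroupInformation.createProxyUser(remoteUser, routerUgi);
    proxyUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // Build DistcpProcedure / TrashProcedure and submit the BalanceJob
      // here; kept as a placeholder in this sketch.
      return null;
    });
  }
}
{code}
Note that proxying also requires the usual hadoop.proxyuser.* settings on the services the Router talks to.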
[jira] [Commented] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321099#comment-17321099 ] Jinglun commented on HDFS-15973: Submit the initial patch. The patch introduces the RouterINode class to hold the file status. First it collects the file status of the src and the dst and saves them into a RouterINode array. Then it uses RouterPermissionChecker (very similar to FsPermissionChecker) to do the permission check. > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
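As a rough illustration of what such a checker does, a bare owner/group/other check against one collected FileStatus could look like the sketch below. This is not the RouterPermissionChecker from the patch, which, like FsPermissionChecker, also has to consider super users, ACLs and sticky bits; the class and method names are invented:
{code:java}
import java.util.Arrays;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.UserGroupInformation;

final class SimplePermissionCheckSketch {
  /** Check one collected FileStatus against the caller's user and groups. */
  static void check(UserGroupInformation ugi, FileStatus status, FsAction access)
      throws AccessControlException {
    FsAction allowed;
    if (ugi.getShortUserName().equals(status.getOwner())) {
      allowed = status.getPermission().getUserAction();
    } else if (Arrays.asList(ugi.getGroupNames()).contains(status.getGroup())) {
      allowed = status.getPermission().getGroupAction();
    } else {
      allowed = status.getPermission().getOtherAction();
    }
    if (!allowed.implies(access)) {
      throw new AccessControlException("Permission denied: user="
          + ugi.getShortUserName() + ", access=" + access
          + ", path=" + status.getPath());
    }
  }
}
{code}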
[jira] [Updated] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15973: --- Attachment: HDFS-15973.001.patch Status: Patch Available (was: Open) > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15973.001.patch > > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15973) RBF: Add permission check before doting router federation rename.
[ https://issues.apache.org/jira/browse/HDFS-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-15973: -- Assignee: Jinglun > RBF: Add permission check before doting router federation rename. > - > > Key: HDFS-15973 > URL: https://issues.apache.org/jira/browse/HDFS-15973 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > > The router federation rename is lack of permission check. It is a security > issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15973) RBF: Add permission check before doting router federation rename.
Jinglun created HDFS-15973: -- Summary: RBF: Add permission check before doting router federation rename. Key: HDFS-15973 URL: https://issues.apache.org/jira/browse/HDFS-15973 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jinglun The router federation rename lacks a permission check. It is a security issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15923: --- Parent: HDFS-15747 Issue Type: Sub-task (was: Bug) > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 50m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:544) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedF
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320887#comment-17320887 ] Jinglun commented on HDFS-15923: Hi [~zhengzhuobinzzb], you are right ! Please continue with your work, we still need some test cases. {quote}In the current code logic, storing tasks in Journal does not use super users and Kerberos credentials. (Because when RPC executes Call, it uses the corresponding Ugi's doAs, and the Ugi does not have a Kerberberos certificate.) {quote} I'll start a new Jira to resolve the permission check issue. > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 50m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedF
[jira] [Commented] (HDFS-15972) Fedbalance only copies data partially when there's existing opened file
[ https://issues.apache.org/jira/browse/HDFS-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320123#comment-17320123 ] Jinglun commented on HDFS-15972: Hi [~coconut_icecream], thanks for your report ! I'll try to reproduce then dig into it this week. > Fedbalance only copies data partially when there's existing opened file > --- > > Key: HDFS-15972 > URL: https://issues.apache.org/jira/browse/HDFS-15972 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Felix N >Priority: Major > > If there are opened files when fedbalance is run and data is being written to > these files, fedbalance might skip the newly written data. > Steps to recreate the issue: > # Create a dummy file /test/file with some data: {{echo "start" | hdfs dfs > -appendToFile /test/file}} > # Start writing to the file: {{hdfs dfs -appendToFile /test/file}} but do > not stop writing > # Run fedbalance: {{hadoop fedbalance submit hdfs://ns1/test > hdfs://ns2/test}} > # Write something to the file while fedbalance is running, "end" for > example, then stop writing > # After fedbalance is done, {{hdfs://ns2/test/file}} should only contain > "start" while {{hdfs://ns1/user/hadoop/.Trash/Current/test/file}} contains > "start\nend" > Fedbalance is run with default configs and arguments so no diff should happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320117#comment-17320117 ] Jinglun commented on HDFS-15923: Hi [~zhengzhuobinzzb], I'll take over this, hope you don't mind. The description of this Jira is not precise. After I finish the patch I'll start a new Jira to deal with the permission issue. > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 40m > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. > {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:530) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:8
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename accross sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311555#comment-17311555 ] Jinglun commented on HDFS-15923: Hi [~zhengzhuobinzzb], thanks your question ! In the current design the journal and the distcp procedure are all done with the Router's kerberos credential (a super user). Both the journal path and the yarn queue are configured by the administrator. The super user's credential is also used for preserving all the permissions in distcp. So we shouldn't use the user's ugi. The user's ugi won't have write access of the journal path. The ugi doesn't have access of the super user's yarn queue too. But there is an issue about the user's ugi: "The Router doesn't do any permission check before doing the Router Federation Rename". We should check both the source and the dst with the user's ugi before submitting the Balance Job. Let me know your thoughts. If you also agree with the permission issue, are you interested in fixing it ? > RBF: Authentication failed when rename accross sub clusters > > > Key: HDFS-15923 > URL: https://issues.apache.org/jira/browse/HDFS-15923 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: zhuobin zheng >Priority: Major > Labels: RBF, pull-request-available, rename > Time Spent: 0.5h > Remaining Estimate: 0h > > Rename accross subcluster with RBF and Kerberos environment. Will encounter > the following two errors: > # Save Object to journal. > # Precheck try to get src file status > So, we need use Proxy UGI doAs create DistcpProcedure and TrashProcedure and > submit Job. > In patch i use proxy ugi doAs above method. It worked. > But there are another strange thing and this patch not solve: > Router use ugi itself to submit the Distcp job. But not user ugi or proxy > ugi. This may cause excessive distcp permissions. > First: Save Object to journal. 
> {code:java} > // code placeholder > 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622) > at > org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818) > at > org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636) > at org.apache.hadoop.ipc.Client.call(Client.java:1452) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy11.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy12.create(Unknown Source) > at > org.apache.hadoop.h
[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303875#comment-17303875 ] Jinglun commented on HDFS-15899: Submit v02 fix checkstyle. The failed unit tests are not related. > Remove rpcThreadPool from DeadNodeDetector. > --- > > Key: HDFS-15899 > URL: https://issues.apache.org/jira/browse/HDFS-15899 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15899.001.patch, HDFS-15899.002.patch > > > The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The > purpose is to use the thread pool timeout to monitor the probe timeout. But > the rpc client already has a timeout. We can use the rpc client timeout > instead of the thread pool timeout and remove the rpcThreadPool. > The rpcThreadPool introduces additional complexity for probing the DataNode. > The probe task waiting in the busy rpcThreadPool might exceed the configured > timeout. The probe task will be marked as failed even it is not scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
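The trade-off described in the HDFS-15899 description above can be sketched in plain Java: a thread-pool Future timeout also counts the time a probe spends queued behind busy threads, while a direct call bounded only by the RPC client's own timeout does not. The Probe interface and the method names below are made up for the illustration and are not the DeadNodeDetector code:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ProbeTimeoutSketch {

  /** Hypothetical probe; in the real detector this is an RPC to a DataNode. */
  interface Probe {
    boolean probe() throws Exception;
  }

  /**
   * Old pattern: the Future timeout also counts the time the task waits in
   * the pool, so a probe stuck behind busy threads can be marked failed
   * before it even runs.
   */
  static boolean probeWithPool(ExecutorService pool, Probe p, long timeoutMs) {
    Future<Boolean> future = pool.submit(p::probe);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      future.cancel(true);
      return false;
    }
  }

  /**
   * New pattern: call directly and rely on the RPC client's own timeout to
   * bound the wait, so only the actual probe time is measured.
   */
  static boolean probeDirect(Probe p) {
    try {
      return p.probe();
    } catch (Exception e) {
      return false;
    }
  }
}
{code}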
[jira] [Updated] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15899: --- Attachment: HDFS-15899.002.patch > Remove rpcThreadPool from DeadNodeDetector. > --- > > Key: HDFS-15899 > URL: https://issues.apache.org/jira/browse/HDFS-15899 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15899.001.patch, HDFS-15899.002.patch > > > The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The > purpose is to use the thread pool timeout to monitor the probe timeout. But > the rpc client already has a timeout. We can use the rpc client timeout > instead of the thread pool timeout and remove the rpcThreadPool. > The rpcThreadPool introduces additional complexity for probing the DataNode. > The probe task waiting in the busy rpcThreadPool might exceed the configured > timeout. The probe task will be marked as failed even it is not scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302372#comment-17302372 ] Jinglun commented on HDFS-15899: Submit v01. Hi [~leosun08], could you help reviewing this, thanks ! > Remove rpcThreadPool from DeadNodeDetector. > --- > > Key: HDFS-15899 > URL: https://issues.apache.org/jira/browse/HDFS-15899 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15899.001.patch > > > The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The > purpose is to use the thread pool timeout to monitor the probe timeout. But > the rpc client already has a timeout. We can use the rpc client timeout > instead of the thread pool timeout and remove the rpcThreadPool. > The rpcThreadPool introduces additional complexity for probing the DataNode. > The probe task waiting in the busy rpcThreadPool might exceed the configured > timeout. The probe task will be marked as failed even it is not scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15899: --- Attachment: HDFS-15899.001.patch Status: Patch Available (was: Open) > Remove rpcThreadPool from DeadNodeDetector. > --- > > Key: HDFS-15899 > URL: https://issues.apache.org/jira/browse/HDFS-15899 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15899.001.patch > > > The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The > purpose is to use the thread pool timeout to monitor the probe timeout. But > the rpc client already has a timeout. We can use the rpc client timeout > instead of the thread pool timeout and remove the rpcThreadPool. > The rpcThreadPool introduces additional complexity for probing the DataNode. > The probe task waiting in the busy rpcThreadPool might exceed the configured > timeout. The probe task will be marked as failed even it is not scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
Jinglun created HDFS-15899: -- Summary: Remove rpcThreadPool from DeadNodeDetector. Key: HDFS-15899 URL: https://issues.apache.org/jira/browse/HDFS-15899 Project: Hadoop HDFS Issue Type: Improvement Reporter: Jinglun The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The purpose is to use the thread pool timeout to monitor the probe timeout. But the rpc client already has a timeout. We can use the rpc client timeout instead of the thread pool timeout and remove the rpcThreadPool. The rpcThreadPool introduces additional complexity for probing the DataNode. The probe task waiting in the busy rpcThreadPool might exceed the configured timeout. The probe task will be marked as failed even it is not scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15899) Remove rpcThreadPool from DeadNodeDetector.
[ https://issues.apache.org/jira/browse/HDFS-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-15899: -- Assignee: Jinglun > Remove rpcThreadPool from DeadNodeDetector. > --- > > Key: HDFS-15899 > URL: https://issues.apache.org/jira/browse/HDFS-15899 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > > The DeadNodeDetector uses a thread pool to do all the probe rpc calls. The > purpose is to use the thread pool timeout to monitor the probe timeout. But > the rpc client already has a timeout. We can use the rpc client timeout > instead of the thread pool timeout and remove the rpcThreadPool. > The rpcThreadPool introduces additional complexity for probing the DataNode. > The probe task waiting in the busy rpcThreadPool might exceed the configured > timeout. The probe task will be marked as failed even it is not scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15809: --- Attachment: HDFS-15809.007.patch > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch, > HDFS-15809.006.patch, HDFS-15809.007.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15809: --- Attachment: HDFS-15809.006.patch > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch, > HDFS-15809.006.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296505#comment-17296505 ] Jinglun edited comment on HDFS-15809 at 3/6/21, 10:51 AM: -- After a discussion with [~leosun08], we decide to abandon the Collections.synchronizedSet(new LinkedHashSet<>()) plan because the interface is not friendly and makes the patch more complicated. Also it makes the probe queue hard to spy(set wrapped by synchronized would be final). Thanks [~leosun08]'s suggestions for design and unit tests. Submit v05 fixing unit test. was (Author: lijinglun): After a offline discussion with [~leosun08], we decide to abandon the Collections.synchronizedSet(new LinkedHashSet<>()) plan because the interface is not friendly and makes the patch more complicated. Also it makes the probe queue hard to spy(set wrapped by synchronized would be final). Thanks [~leosun08]'s suggestions for design and unit tests. Submit v05 fixing unit test. > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296505#comment-17296505 ] Jinglun commented on HDFS-15809: After an offline discussion with [~leosun08], we decided to abandon the Collections.synchronizedSet(new LinkedHashSet<>()) plan because the interface is not friendly and makes the patch more complicated. It also makes the probe queue hard to spy on (a set wrapped by synchronized would be final). Thanks to [~leosun08] for the suggestions on the design and unit tests. Submit v05 fixing the unit test. > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15809: --- Attachment: HDFS-15809.005.patch > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch, HDFS-15809.004.patch, HDFS-15809.005.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289599#comment-17289599 ] Jinglun commented on HDFS-15809: Hi [~leosun08], thanks for your comments. Submit v04 using LinkedHashSet. The test case testDeadNodeDetectionDeadNodeProbe can cover the situation. It verifies the whole process of the DeadNodeDetector: one node should first be put into the suspect queue, then marked as dead, and finally probed from the dead queue multiple times. With the original implementation, the 3 datanodes would not all be dead. > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch, HDFS-15809.004.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
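A toy version of the rotation idea behind the LinkedHashSet change, assuming made-up field names and a probe queue of length 100: keep the dead set in insertion order and move a node to the tail once it has been offered for probing, so later nodes eventually reach the head instead of the same first 30 nodes being re-queued every round. This is only an illustration, not the DeadNodeDetector internals:
{code:java}
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DeadNodeRotationSketch {
  // Insertion-ordered dead set; nodes already offered for probing are moved
  // to the tail so every node eventually reaches the head of the iteration.
  private final Set<String> deadNodes = new LinkedHashSet<>();
  private final BlockingQueue<String> probeQueue = new ArrayBlockingQueue<>(100);

  synchronized void addDeadNode(String node) {
    deadNodes.add(node);
  }

  synchronized void checkDeadNodes() {
    for (String node : deadNodes.toArray(new String[0])) {
      if (!probeQueue.offer(node)) {
        break;                 // probe queue is full, stop for this round
      }
      deadNodes.remove(node);  // remove and re-add so the node moves to the
      deadNodes.add(node);     // tail and unprobed nodes come first next time
    }
  }
}
{code}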
[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15809: --- Attachment: HDFS-15809.004.patch > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch, HDFS-15809.004.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15845) RBF: Router fails to start due to NoClassDefFoundError for hadoop-federation-balance
[ https://issues.apache.org/jira/browse/HDFS-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17288800#comment-17288800 ] Jinglun commented on HDFS-15845: Hi [~tasanuma], thanks for your nice fix ! LGTM +1. > RBF: Router fails to start due to NoClassDefFoundError for > hadoop-federation-balance > > > Key: HDFS-15845 > URL: https://issues.apache.org/jira/browse/HDFS-15845 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > $ hdfs dfsrouter > ... > 2021-02-22 17:21:55,400 ERROR router.DFSRouter: Failed to start router > java.lang.NoClassDefFoundError: > org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure > at > org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.(RouterClientProtocol.java:195) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.(RouterRpcServer.java:394) > at > org.apache.hadoop.hdfs.server.federation.router.Router.createRpcServer(Router.java:391) > at > org.apache.hadoop.hdfs.server.federation.router.Router.serviceInit(Router.java:188) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) > at > org.apache.hadoop.hdfs.server.federation.router.DFSRouter.main(DFSRouter.java:69) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.tools.fedbalance.procedure.BalanceProcedure > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 6 more > 2021-02-22 17:21:55,402 INFO util.ExitUtil: Exiting with status 1: > java.lang.NoClassDefFoundError: > org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure > 2021-02-22 17:21:55,404 INFO router.DFSRouter: SHUTDOWN_MSG: > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15845) RBF: Router fails to start due to NoClassDefFoundError for hadoop-federation-balance
[ https://issues.apache.org/jira/browse/HDFS-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17288799#comment-17288799 ] Jinglun commented on HDFS-15845: My bad ! I missed the classpath of command dfsrouter. Hi [~tasanuma], would you please give a try of adding `hadoop_add_to_classpath_tools hadoop-federation-balance` to the command dfsrouter. Like below. {quote}File: hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs b/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs {quote} {quote}dfsrouter) HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true" HADOOP_CLASSNAME='org.apache.hadoop.hdfs.server.federation.router.DFSRouter' hadoop_add_to_classpath_tools hadoop-federation-balance // add this. ;;{quote} > RBF: Router fails to start due to NoClassDefFoundError for > hadoop-federation-balance > > > Key: HDFS-15845 > URL: https://issues.apache.org/jira/browse/HDFS-15845 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > $ hdfs dfsrouter > ... > 2021-02-22 17:21:55,400 ERROR router.DFSRouter: Failed to start router > java.lang.NoClassDefFoundError: > org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure > at > org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.(RouterClientProtocol.java:195) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.(RouterRpcServer.java:394) > at > org.apache.hadoop.hdfs.server.federation.router.Router.createRpcServer(Router.java:391) > at > org.apache.hadoop.hdfs.server.federation.router.Router.serviceInit(Router.java:188) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) > at > org.apache.hadoop.hdfs.server.federation.router.DFSRouter.main(DFSRouter.java:69) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.tools.fedbalance.procedure.BalanceProcedure > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 6 more > 2021-02-22 17:21:55,402 INFO util.ExitUtil: Exiting with status 1: > java.lang.NoClassDefFoundError: > org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure > 2021-02-22 17:21:55,404 INFO router.DFSRouter: SHUTDOWN_MSG: > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (HDFS-15845) RBF: Router fails to start due to NoClassDefFoundError for hadoop-federation-balance
[ https://issues.apache.org/jira/browse/HDFS-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15845: --- Comment: was deleted (was: My bad ! I missed the classpath of command dfsrouter. Hi [~tasanuma], would you please give a try of adding `hadoop_add_to_classpath_tools hadoop-federation-balance` to the command dfsrouter. Like below. {quote}File: hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs b/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/hdfs {quote} {quote}dfsrouter) HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true" HADOOP_CLASSNAME='org.apache.hadoop.hdfs.server.federation.router.DFSRouter' hadoop_add_to_classpath_tools hadoop-federation-balance // add this. ;;{quote} ) > RBF: Router fails to start due to NoClassDefFoundError for > hadoop-federation-balance > > > Key: HDFS-15845 > URL: https://issues.apache.org/jira/browse/HDFS-15845 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {noformat} > $ hdfs dfsrouter > ... > 2021-02-22 17:21:55,400 ERROR router.DFSRouter: Failed to start router > java.lang.NoClassDefFoundError: > org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure > at > org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.(RouterClientProtocol.java:195) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.(RouterRpcServer.java:394) > at > org.apache.hadoop.hdfs.server.federation.router.Router.createRpcServer(Router.java:391) > at > org.apache.hadoop.hdfs.server.federation.router.Router.serviceInit(Router.java:188) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) > at > org.apache.hadoop.hdfs.server.federation.router.DFSRouter.main(DFSRouter.java:69) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.tools.fedbalance.procedure.BalanceProcedure > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 6 more > 2021-02-22 17:21:55,402 INFO util.ExitUtil: Exiting with status 1: > java.lang.NoClassDefFoundError: > org/apache/hadoop/tools/fedbalance/procedure/BalanceProcedure > 2021-02-22 17:21:55,404 INFO router.DFSRouter: SHUTDOWN_MSG: > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17287477#comment-17287477 ] Jinglun commented on HDFS-15809: Submitted v03 to fix the checkstyle and unit test issues. > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue's length limit is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that have already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15809: --- Attachment: HDFS-15809.003.patch > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch, > HDFS-15809.003.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17287014#comment-17287014 ] Jinglun commented on HDFS-15809: I hadn't dealt with the checkstyle complaint and the patch is out of date now (cry). Re-uploading v02 to trigger Jenkins. > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue's length limit is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that have already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15809: --- Attachment: HDFS-15809.002.patch > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch, HDFS-15809.002.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.
[ https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286871#comment-17286871 ] Jinglun commented on HDFS-15806: Hi [~ayushtkn], thanks for your comments! {quote}before this was there some kind of memory leak, or these threads were getting cleared later? {quote} At Xiaomi we use the dead node detector feature only for HBase. HBase doesn't close the file system or the DFS client, so we hadn't noticed the leak before. Recently we found that the dead node detector won't remove alive nodes from the dead node set, as described in HDFS-15809. So I started reviewing the whole feature and found this leak bug. {quote}Secondly, for the shutdown is there some specific order, or it is just random {quote} It is random. Most of the threads are connected by queues (the producer-consumer model), so the order of stopping the producer or the consumer is not a problem. 1) The DeadNodeDetector thread is responsible for adding nodes from the _suspectAndDeadNodes_ set to the _deadNodesProbeQueue_. 2) The _probeDeadNodesSchedulerThr_ is responsible for taking nodes from the _deadNodesProbeQueue_ and submitting probe tasks to the _probeDeadNodesThreadPool_. 3) The _probeSuspectNodesSchedulerThr_ is responsible for taking nodes from the _suspectNodesProbeQueue_ and submitting probe tasks to the _probeSuspectNodesThreadPool_. 4) All the probe tasks submit getDatanodeInfo RPC calls to the _rpcThreadPool_. Some other thoughts: the thread model is a little complicated and could be improved. For example, I think we could make the RPC call inside the probe task instead of submitting it to the rpcThreadPool. I need to figure out the purpose of the original design first, and may start a new Jira for the thread improvement later. > DeadNodeDetector should close all the threads when it is closed. > > > Key: HDFS-15806 > URL: https://issues.apache.org/jira/browse/HDFS-15806 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15806.001.patch > > > The DeadNodeDetector doesn't close all the threads when it is closed. This > Jira tries to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
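To make the shutdown discussion in the comment above concrete, here is a minimal, self-contained Java sketch of the idea: because every producer/consumer pair is decoupled by a blocking queue, close() can stop the looping threads and the pools in any order. The class and field names only mirror the ones mentioned in the comment; this is not the actual Hadoop implementation.
{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch only: illustrates closing all detector threads without leaks. */
public class DeadNodeDetectorShutdownSketch implements AutoCloseable {
  // Names mirror the fields mentioned in the comment; bodies are stubs.
  private final Thread detectorThr = loop("DeadNodeDetector");
  private final Thread probeDeadNodesSchedulerThr = loop("probeDeadNodesSchedulerThr");
  private final Thread probeSuspectNodesSchedulerThr = loop("probeSuspectNodesSchedulerThr");
  private final ExecutorService probeDeadNodesThreadPool = Executors.newFixedThreadPool(2);
  private final ExecutorService probeSuspectNodesThreadPool = Executors.newFixedThreadPool(2);
  private final ExecutorService rpcThreadPool = Executors.newFixedThreadPool(4);

  private static Thread loop(String name) {
    Thread t = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Thread.sleep(100); // stands in for taking from / offering to a queue
        } catch (InterruptedException e) {
          return; // interrupted by close()
        }
      }
    }, name);
    t.start();
    return t;
  }

  @Override
  public void close() {
    // Interrupt the looping threads; ordering does not matter because each
    // producer/consumer pair is decoupled by a queue.
    detectorThr.interrupt();
    probeDeadNodesSchedulerThr.interrupt();
    probeSuspectNodesSchedulerThr.interrupt();
    // Shut down every pool so no probe task or RPC worker is leaked.
    for (ExecutorService pool : new ExecutorService[] {
        probeDeadNodesThreadPool, probeSuspectNodesThreadPool, rpcThreadPool}) {
      pool.shutdownNow();
      try {
        pool.awaitTermination(1, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    try (DeadNodeDetectorShutdownSketch detector = new DeadNodeDetectorShutdownSketch()) {
      Thread.sleep(300); // pretend the owning client did some work
    } // close() stops every thread; the JVM can now exit cleanly
  }
}
{noformat}
The sketch shows why the shutdown order can be random, as the comment argues: each stage only communicates through queues, so stopping a producer before or after its consumer leaves nothing blocked forever.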
[jira] [Comment Edited] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286832#comment-17286832 ] Jinglun edited comment on HDFS-15809 at 2/19/21, 3:23 AM: -- Hi [~leosun08], thanks for your comments. The solution in v01 introduces a new deduplicated queue that won't accept a node which is already queued. The size of the queue is not fixed either, so all the dead nodes can be added to the deduplicated queue. Thus duplicated dead nodes can no longer be repeatedly added to the probe queue. Because the queue itself is deduplicated we don't need to worry about the queue size exploding: it is never greater than the number of datanodes. Shuffle is a good idea and is a much simpler way, but I think the deduplicated way is more efficient because there are no duplicated probes. Adjusting the queue size won't fix the problem because the queue accepts duplicated nodes; even if the queue size is 10 it could still be filled up with the first 30 nodes. was (Author: lijinglun): Hi [~leosun08], thanks you comments. The solution in v01 is to avoid adding duplicated dead nodes to the probe queue. So the queue won't be filled up with duplicated dead nodes. Shuffle is a good idea and is a much simpler way. I also agree with the shuffle way. > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue's length limit is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that have already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
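For readers following along, here is a rough sketch of what such a deduplicated, unbounded probe queue could look like. The class name, generics, and method set are illustrative assumptions; this is not the v01 patch itself.
{noformat}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * A queue that silently rejects an element which is already waiting in it,
 * so a periodic scan can never fill it with repeated nodes.
 */
public class DedupBlockingQueue<E> {
  private final LinkedBlockingQueue<E> queue = new LinkedBlockingQueue<>();
  private final Set<E> queued = ConcurrentHashMap.newKeySet();

  /** Adds the element only if it is not already waiting in the queue. */
  public boolean offer(E e) {
    if (!queued.add(e)) {
      return false;      // duplicate: drop instead of growing the queue
    }
    queue.offer(e);      // effectively unbounded, so this always succeeds
    return true;
  }

  /** Removes and returns the next element, blocking if the queue is empty. */
  public E take() throws InterruptedException {
    E e = queue.take();
    queued.remove(e);    // allow the node to be re-queued later if needed
    return e;
  }

  public int size() {
    return queue.size();
  }
}
{noformat}
With this structure the queue holds at most one entry per datanode, so its size is bounded by the number of datanodes, matching the argument in the comment above.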
[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286832#comment-17286832 ] Jinglun commented on HDFS-15809: Hi [~leosun08], thanks for your comments. The solution in v01 is to avoid adding duplicated dead nodes to the probe queue, so the queue won't be filled up with duplicated dead nodes. Shuffle is a good idea and is a much simpler way; I also agree with the shuffle approach. > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue's length limit is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that have already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15809: --- Attachment: HDFS-15809.001.patch Status: Patch Available (was: Open) > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276326#comment-17276326 ] Jinglun commented on HDFS-15809: Hi [~leosun08], could you help review this? Thanks! > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15809.001.patch > > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue's length limit is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that have already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
Jinglun created HDFS-15809: -- Summary: DeadNodeDetector doesn't remove live nodes from dead node set. Key: HDFS-15809 URL: https://issues.apache.org/jira/browse/HDFS-15809 Project: Hadoop HDFS Issue Type: Bug Reporter: Jinglun We found the dead node detector might never remove the alive nodes from the dead node set in a big cluster. For example: # 200 nodes are added to the dead node set by DeadNodeDetector. # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the deadNodesProbeQueue because the queue's length limit is 100. # The probe threads start working and probe 30 nodes. # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the same as the last time. So the 30 nodes that have already been probed are added to the queue again. # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
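As a toy illustration of the report above, using its own numbers (200 dead nodes, a probe queue capped at 100, roughly 30 probes between scheduling rounds), the following self-contained snippet shows how a bounded queue plus a fixed iteration order starves the nodes at the tail of the set. It is not the real DeadNodeDetector code; names and loop counts are assumptions made only for the demo.
{noformat}
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy reproduction of the starvation described in the issue report. */
public class DeadNodeStarvationDemo {
  public static void main(String[] args) {
    Set<String> deadNodes = new LinkedHashSet<>();
    for (int i = 0; i < 200; i++) {
      deadNodes.add("dn-" + i);
    }
    BlockingQueue<String> deadNodesProbeQueue = new ArrayBlockingQueue<>(100);

    for (int round = 0; round < 8; round++) {
      // checkDeadNodes(): re-offer every dead node, in the same order each time.
      for (String dn : deadNodes) {
        deadNodesProbeQueue.offer(dn); // returns false once full; node is silently dropped
      }
      // The probe threads only get through ~30 nodes before the next round.
      for (int probed = 0; probed < 30; probed++) {
        String dn = deadNodesProbeQueue.poll();
        System.out.println("round " + round + " probes " + dn);
      }
    }
    // dn-100 .. dn-199 never appear in the output: once the queue is full they
    // are always dropped by offer(), so a node in that range that comes back
    // alive is never re-probed and never leaves the dead node set. Meanwhile
    // the head of the iteration order keeps being re-offered round after round.
  }
}
{noformat}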
[jira] [Assigned] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.
[ https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-15809: -- Assignee: Jinglun > DeadNodeDetector doesn't remove live nodes from dead node set. > -- > > Key: HDFS-15809 > URL: https://issues.apache.org/jira/browse/HDFS-15809 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > > We found the dead node detector might never remove the alive nodes from the > dead node set in a big cluster. For example: > # 200 nodes are added to the dead node set by DeadNodeDetector. > # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the > deadNodesProbeQueue because the queue limited length is 100. > # The probe threads start working and probe 30 nodes. > # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead > node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the > same as the last time. So the 30 nodes that has already been probed are added > to the queue again. > # Repeat 3 and 4. But we always add the first 30 nodes from the dead set. If > they are all dead then the live nodes behind them could never be recovered. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.
[ https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15806: --- Attachment: HDFS-15806.001.patch Status: Patch Available (was: Open) > DeadNodeDetector should close all the threads when it is closed. > > > Key: HDFS-15806 > URL: https://issues.apache.org/jira/browse/HDFS-15806 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15806.001.patch > > > The DeadNodeDetector doesn't close all the threads when it is closed. This > Jira trys to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.
[ https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276095#comment-17276095 ] Jinglun commented on HDFS-15806: Hi [~leosun08], would you help review this? Thanks! > DeadNodeDetector should close all the threads when it is closed. > > > Key: HDFS-15806 > URL: https://issues.apache.org/jira/browse/HDFS-15806 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15806.001.patch > > > The DeadNodeDetector doesn't close all the threads when it is closed. This > Jira tries to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.
[ https://issues.apache.org/jira/browse/HDFS-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun reassigned HDFS-15806: -- Assignee: Jinglun > DeadNodeDetector should close all the threads when it is closed. > > > Key: HDFS-15806 > URL: https://issues.apache.org/jira/browse/HDFS-15806 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > > The DeadNodeDetector doesn't close all the threads when it is closed. This > Jira trys to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15806) DeadNodeDetector should close all the threads when it is closed.
Jinglun created HDFS-15806: -- Summary: DeadNodeDetector should close all the threads when it is closed. Key: HDFS-15806 URL: https://issues.apache.org/jira/browse/HDFS-15806 Project: Hadoop HDFS Issue Type: Bug Reporter: Jinglun The DeadNodeDetector doesn't close all the threads when it is closed. This Jira tries to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15661) The DeadNodeDetector shouldn't be shared by different DFSClients.
[ https://issues.apache.org/jira/browse/HDFS-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271212#comment-17271212 ] Jinglun commented on HDFS-15661: Hi [~leosun08], thanks for your comments! Fixed checkstyle and submitted v05. The failed unit tests run fine on my local machine, so they should not be related. > The DeadNodeDetector shouldn't be shared by different DFSClients. > - > > Key: HDFS-15661 > URL: https://issues.apache.org/jira/browse/HDFS-15661 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15661.001.patch, HDFS-15661.002.patch, > HDFS-15661.003.patch, HDFS-15661.004.patch, HDFS-15661.005.patch > > > Currently the DeadNodeDetector is a member of ClientContext. That means it is > shared by many different DFSClients. When one DFSClient.close() is invoked, > the DeadNodeDetector thread would be interrupted and impact other DFSClients. > From the original design of HDFS-13571 we could see the DeadNodeDetector is > supposed to share dead nodes of many input streams from the same client. > We should make the DeadNodeDetector a member of DFSClient instead of > ClientContext. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
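A condensed sketch of the ownership change this issue describes, with invented class names standing in for DFSClient, ClientContext, and the detector; the point is only that a per-client detector can be closed without affecting other clients, not how the real Hadoop classes are wired.
{noformat}
/** Sketch only: hypothetical stand-ins showing per-client detector ownership. */
class DetectorSketch implements AutoCloseable {
  private final Thread worker = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(100); // stands in for probing suspect/dead datanodes
      } catch (InterruptedException e) {
        return;
      }
    }
  }, "DeadNodeDetector");

  DetectorSketch() { worker.start(); }

  @Override
  public void close() { worker.interrupt(); }
}

class ClientSketch implements AutoCloseable {
  // Before the change: the detector lived in a context object shared by many
  // clients, so one client's close() interrupted probing for all of them.
  // After the change: every client owns its own detector instance.
  private final DetectorSketch deadNodeDetector = new DetectorSketch();

  @Override
  public void close() {
    deadNodeDetector.close(); // stops only this client's detector
  }
}
{noformat}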