[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783072#comment-17783072 ]

Viraj Jasani commented on HDFS-16016:
-------------------------------------

I think so too. Do you think you might be able to share logs in the meantime? I just want to get some more clarity on the sequence and correlate it with the namenode processing the report. If you are not able to share the logs, that's fine too. Thanks

> BPServiceActor add a new thread to handle IBR
> ---------------------------------------------
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: JiangHua Zhu
> Assignee: Viraj Jasani
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png
>
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things: FBR, IBR, and heartbeat.
> We can handle IBR independently to improve the performance of heartbeat and FBR.
[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782674#comment-17782674 ]

Viraj Jasani commented on HDFS-16016:
-------------------------------------

If possible, could you please also share DN and NN logs for the affected block and block reports?

> BPServiceActor add a new thread to handle IBR
> ---------------------------------------------
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: JiangHua Zhu
> Assignee: Viraj Jasani
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png
>
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things: FBR, IBR, and heartbeat.
> We can handle IBR independently to improve the performance of heartbeat and FBR.
[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782670#comment-17782670 ]

Viraj Jasani commented on HDFS-16016:
-------------------------------------

Interesting, thanks for reporting this [~yuanbo], let me try reproducing this on some heavy test env. Btw, you might also be interested in HDFS-17121 and HDFS-17129. We have also been using this patch in prod for quite some time now.

> BPServiceActor add a new thread to handle IBR
> ---------------------------------------------
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: JiangHua Zhu
> Assignee: Viraj Jasani
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png
>
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things: FBR, IBR, and heartbeat.
> We can handle IBR independently to improve the performance of heartbeat and FBR.
[jira] [Assigned] (HDFS-16938) Utility to trigger heartbeat and wait until BP thread queue is fully processed
[ https://issues.apache.org/jira/browse/HDFS-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Jasani reassigned HDFS-16938:
-----------------------------------

Assignee: (was: Viraj Jasani)

> Utility to trigger heartbeat and wait until BP thread queue is fully processed
> ------------------------------------------------------------------------------
>
> Key: HDFS-16938
> URL: https://issues.apache.org/jira/browse/HDFS-16938
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> As a follow-up to HDFS-16935, we should provide a utility to trigger a heartbeat
> and wait until the BP thread queue is fully processed. This would ensure 100%
> consistency w.r.t. the active namenode being able to receive bad block reports
> from the given datanode. This utility would resolve flakes for the tests that
> rely on the namenode's awareness of the bad blocks reported by datanodes.
[jira] [Commented] (HDFS-17129) mis-order of ibr and fbr on datanode
[ https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747712#comment-17747712 ]

Viraj Jasani commented on HDFS-17129:
-------------------------------------

Thanks for filing this [~liuguanghua]. As discussed on the PR, are we planning to use a lock to prevent the mis-order?

> mis-order of ibr and fbr on datanode
> ------------------------------------
>
> Key: HDFS-17129
> URL: https://issues.apache.org/jira/browse/HDFS-17129
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.4.0
> Environment: hdfs3.4.0
> Reporter: liuguanghua
> Priority: Major
>
> HDFS-16016 provides a new thread to handle IBR. That is a great improvement.
> But it may cause a mis-order of IBR and FBR.
[jira] [Created] (HDFS-17041) RBF: Fix putAll impl for mysql and file based state stores
Viraj Jasani created HDFS-17041:
-----------------------------------

Summary: RBF: Fix putAll impl for mysql and file based state stores
Key: HDFS-17041
URL: https://issues.apache.org/jira/browse/HDFS-17041
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Viraj Jasani
Assignee: Viraj Jasani

Only the zookeeper based state store allows all records to be inserted even when a few of them already exist and "errorIfExists" is true. The file/fs based and mysql based putAll implementations, however, fail the whole putAll operation immediately after encountering a single record that already exists while "errorIfExists" is true (which is the case when inserting records for the first time).

For all implementations, we should allow inserts of the records that do not already exist and report any record that already exists as a failure, rather than failing the whole operation and not trying to insert valid records.
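For illustration, here is a minimal, self-contained sketch of the putAll semantics described above, using a Map-backed stand-in; the class and method names are hypothetical and not taken from the actual state store drivers:
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PutAllSketch {
  private final Map<String, String> store = new LinkedHashMap<>();

  /**
   * Inserts every record that does not already exist. When errorIfExists
   * is true, keys that already exist are collected and reported as
   * failures instead of aborting the whole batch.
   */
  public List<String> putAll(Map<String, String> records, boolean errorIfExists) {
    List<String> failedKeys = new ArrayList<>();
    for (Map.Entry<String, String> e : records.entrySet()) {
      if (errorIfExists && store.containsKey(e.getKey())) {
        failedKeys.add(e.getKey()); // report the conflict, but keep going
        continue;
      }
      store.put(e.getKey(), e.getValue());
    }
    return failedKeys; // valid records were still inserted
  }
}
{code}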
[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.
[ https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724137#comment-17724137 ]

Viraj Jasani commented on HDFS-17017:
-------------------------------------

{quote}btw. lets fix something when it is broken, or give suggestions on how the fix could be better rather than telling ok, this is a minor thing, it won't happen, or it was the tests fault old people didn't cover, anybody can miss anything in the code, lets not make the one who found a valid bug even small feel any less when it is very valid use case, not dragging it further, saw this as a repetitive occurrence, just a friendly 2 cents rest upto you
{quote}
Ayush, I am not sure why you would think that I am making anyone feel bad for this. I really appreciate this fix, and I have suggested a test change to make the fix even better with a solid test. It has never been my intention to say that old people didn't cover something; I would never say that. Hadoop is a massive codebase, and things can be missed. For this fix, I agree that it was a miss from HDFS-16521; I am not denying it. I hope I have not offended you, and if you felt that way, I apologize, that was never my purpose. I was just providing my viewpoint that while testing the changes, I only used it with "-live" because my use case was "live but not slow", but you are right that the use case could be different too.

> Fix the issue of arguments number limit in report command in DFSAdmin.
> ----------------------------------------------------------------------
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of
> arguments of 7, such as:
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
> [-enteringmaintenance] [-inmaintenance] [-slownodes]
[jira] [Comment Edited] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.
[ https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121 ]

Viraj Jasani edited comment on HDFS-17017 at 5/19/23 5:55 AM:
--------------------------------------------------------------

Anyways, in order to prevent "any new argument for any of the existing commands" from running into a similar "exhausting max arguments" case, what can we do? Can we write a test that parses all possible arguments for the given command ("-report" in this case), passes them all, and ensures that the output return code/exit code still remains 0? If we have such a test, then whenever someone introduces a new argument in the future, the test will automatically pass the argument to the command and the test would fail, forcing the dev to handle the "max argument" case.

[~haiyang Hu] I have attached the patch on the PR to make the test more robust and cover the missing case of identifying whether we have exceeded the max arguments and need to adjust the max arguments allowed for the -report command. Thank you.

was (Author: vjasani):
Anyways, in order to prevent "any new argument for any of the existing commands" from running into a similar "exhausting max arguments" case, what can we do? Can we write a test that parses all possible arguments for the given command ("-report" in this case), passes them all, and ensures that the output return code/exit code still remains 0? If we have such a test, then whenever someone introduces a new argument in the future, the test will automatically pass the argument to the command and the test would fail, forcing the dev to handle the "max argument" case.

> Fix the issue of arguments number limit in report command in DFSAdmin.
> ----------------------------------------------------------------------
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of
> arguments of 7, such as:
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
> [-enteringmaintenance] [-inmaintenance] [-slownodes]
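As a rough illustration of the guard-rail test idea above, a sketch along these lines could exercise every documented "-report" argument in one invocation, so a future argument added without bumping the max-args limit would fail the test. It assumes the surrounding test class already sets up a MiniDFSCluster, a Configuration named conf, and a DFSAdmin instance named dfsAdmin; those names are assumptions, not the committed test:
{code:java}
// Pass all documented -report arguments at once and expect success.
String[] allReportArgs = {"-report", "-live", "-dead", "-decommissioning",
    "-enteringmaintenance", "-inmaintenance", "-slownodes"};
int exitCode = org.apache.hadoop.util.ToolRunner.run(conf, dfsAdmin, allReportArgs);
org.junit.Assert.assertEquals(
    "-report should accept all of its documented arguments", 0, exitCode);
{code}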
[jira] [Comment Edited] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.
[ https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121 ]

Viraj Jasani edited comment on HDFS-17017 at 5/19/23 5:25 AM:
--------------------------------------------------------------

Anyways, in order to prevent "any new argument for any of the existing commands" from running into a similar "exhausting max arguments" case, what can we do? Can we write a test that parses all possible arguments for the given command ("-report" in this case), passes them all, and ensures that the output return code/exit code still remains 0? If we have such a test, then whenever someone introduces a new argument in the future, the test will automatically pass the argument to the command and the test would fail, forcing the dev to handle the "max argument" case.

was (Author: vjasani):
Anyways, in order to prevent any new argument for any of the existing commands from getting into a similar case of exhausting max arguments, what can we do? Can we write a test that parses all possible arguments for the given command ("-report" in this case), passes them all, and ensures that the output return code/exit code still remains 0? If we have such a test, then whenever someone introduces a new argument in the future, the test will automatically pass the argument to the command and the test would likely fail, forcing the dev to handle the "max argument" case.

> Fix the issue of arguments number limit in report command in DFSAdmin.
> ----------------------------------------------------------------------
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of
> arguments of 7, such as:
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
> [-enteringmaintenance] [-inmaintenance] [-slownodes]
[jira] [Comment Edited] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.
[ https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121 ]

Viraj Jasani edited comment on HDFS-17017 at 5/19/23 5:24 AM:
--------------------------------------------------------------

Anyways, in order to prevent any new argument for any of the existing commands from getting into a similar case of exhausting max arguments, what can we do? Can we write a test that parses all possible arguments for the given command ("-report" in this case), passes them all, and ensures that the output return code/exit code still remains 0? If we have such a test, then whenever someone introduces a new argument in the future, the test will automatically pass the argument to the command and the test would likely fail, forcing the dev to handle the "max argument" case.

was (Author: vjasani):
Anyways, in order to prevent any new argument for -report from getting into a similar case, what can we do? Can we write a test that parses all possible arguments for the given command ("-report" in this case), passes them all, and ensures that the output return code/exit code remains 0?

> Fix the issue of arguments number limit in report command in DFSAdmin.
> ----------------------------------------------------------------------
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of
> arguments of 7, such as:
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
> [-enteringmaintenance] [-inmaintenance] [-slownodes]
[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.
[ https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121 ]

Viraj Jasani commented on HDFS-17017:
-------------------------------------

Anyways, in order to prevent any new argument for -report from getting into a similar case, what can we do? Can we write a test that parses all possible arguments for the given command ("-report" in this case), passes them all, and ensures that the output return code/exit code remains 0?

> Fix the issue of arguments number limit in report command in DFSAdmin.
> ----------------------------------------------------------------------
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of
> arguments of 7, such as:
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
> [-enteringmaintenance] [-inmaintenance] [-slownodes]
[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.
[ https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724119#comment-17724119 ]

Viraj Jasani commented on HDFS-17017:
-------------------------------------

For the command output, it's not an intersection as such, but from the slow node use case perspective, we want to know how many nodes are live and how many among them are slow. So the use case of getting slow nodes is usually coupled with live nodes (the intention is for the user/client to keep only "live - slow" nodes for dfs ops, i.e. live nodes that are not slow). But I agree that from the general usability viewpoint, the user should be able to print all or any combination of categories.

> Fix the issue of arguments number limit in report command in DFSAdmin.
> ----------------------------------------------------------------------
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of
> arguments of 7, such as:
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
> [-enteringmaintenance] [-inmaintenance] [-slownodes]
[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.
[ https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724101#comment-17724101 ]

Viraj Jasani commented on HDFS-17017:
-------------------------------------

Functionally this change makes sense because all of them are arguments; however, in practice "-slownodes" is mostly only meant to be used with "-live". We don't have "slow" and "dead or decommissioned or in-maintenance" nodes anyway. Thanks for attempting to fix the max argument issue [~haiyang Hu]!

> Fix the issue of arguments number limit in report command in DFSAdmin.
> ----------------------------------------------------------------------
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of
> arguments of 7, such as:
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning]
> [-enteringmaintenance] [-inmaintenance] [-slownodes]
[jira] [Created] (HDFS-17020) RBF: mount table addAll should print failed records in std error
Viraj Jasani created HDFS-17020:
-----------------------------------

Summary: RBF: mount table addAll should print failed records in std error
Key: HDFS-17020
URL: https://issues.apache.org/jira/browse/HDFS-17020
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

Now that state store putAll supports returning the keys of failed records, the addAll command for mount entries should also support printing the failed records to standard error.
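A minimal sketch of what the proposed error reporting could look like inside the admin command; the response accessor and the exit-code convention are assumptions here, not the final API:
{code:java}
// After the bulk addAll call, surface partial failures on stderr and
// return a non-zero exit code so scripts can detect them.
List<String> failedEntries = response.getFailedRecordKeys(); // hypothetical accessor
if (!failedEntries.isEmpty()) {
  System.err.println("Failed to add mount points: " + failedEntries);
  return 1; // assumed convention: non-zero on partial failure
}
return 0;
{code}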
[jira] [Commented] (HDFS-17009) RBF: state store putAll should also return failed records
[ https://issues.apache.org/jira/browse/HDFS-17009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723566#comment-17723566 ]

Viraj Jasani commented on HDFS-17009:
-------------------------------------

Looks like the Jira to GitHub link didn't work, so let me link the PR manually.

> RBF: state store putAll should also return failed records
> ----------------------------------------------------------
>
> Key: HDFS-17009
> URL: https://issues.apache.org/jira/browse/HDFS-17009
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 3.4.0
>
> State store implementations allow adding/updating multiple records using
> putAll. The implementation returns whether all records were successfully
> added or updated. We should also allow the implementation to return which
> records failed to get updated.
[jira] [Created] (HDFS-17009) RBF: state store putAll should also return failed records
Viraj Jasani created HDFS-17009:
-----------------------------------

Summary: RBF: state store putAll should also return failed records
Key: HDFS-17009
URL: https://issues.apache.org/jira/browse/HDFS-17009
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

State store implementations allow adding/updating multiple records using putAll. The implementation returns whether all records were successfully added or updated. We should also allow the implementation to return which records failed to get updated.
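To make the proposal concrete, one possible shape for a richer putAll result is sketched below; the names are hypothetical and the committed API may differ:
{code:java}
import java.util.Collections;
import java.util.List;

// Instead of a bare boolean, the driver reports which records failed so
// callers (such as the mount table admin) can act on them.
public final class PutResult<T> {
  private final boolean allSucceeded;
  private final List<T> failedRecords;

  public PutResult(boolean allSucceeded, List<T> failedRecords) {
    this.allSucceeded = allSucceeded;
    this.failedRecords = Collections.unmodifiableList(failedRecords);
  }

  public boolean isAllSucceeded() { return allSucceeded; }
  public List<T> getFailedRecords() { return failedRecords; }
}
{code}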
[jira] [Created] (HDFS-17008) Fix rbf jdk 11 javadoc warnings
Viraj Jasani created HDFS-17008:
-----------------------------------

Summary: Fix rbf jdk 11 javadoc warnings
Key: HDFS-17008
URL: https://issues.apache.org/jira/browse/HDFS-17008
Project: Hadoop HDFS
Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani

HDFS-16978 excluded proto packages from maven-javadoc-plugin for rbf, hence we now have JDK 11 javadoc warnings (e.g. [here|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5554/14/artifact/out/results-javadoc-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.txt]).
[jira] [Commented] (HDFS-16978) RBF: Admin command to support bulk add of mount points
[ https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721628#comment-17721628 ]

Viraj Jasani commented on HDFS-16978:
-------------------------------------

Will create follow-up jiras soon.

> RBF: Admin command to support bulk add of mount points
> -------------------------------------------------------
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0
>
> All state store implementations support adding multiple state store records
> using a single putAll() implementation. We should provide a new router admin API
> to support bulk addition of mount table entries that can utilize this bulk
> add implementation at the state store level.
> For more than one mount point to be added, the goal of bulk addition should be
> # To reduce frequent router calls
> # To avoid frequent state store cache refreshes with each single mount
> point addition
[jira] [Commented] (HDFS-16978) RBF: Admin command to support bulk add of mount points
[ https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721626#comment-17721626 ]

Viraj Jasani commented on HDFS-16978:
-------------------------------------

Thanks again [~ayushtkn] [~elgoiri] [~simbadzina] !!!

> RBF: Admin command to support bulk add of mount points
> -------------------------------------------------------
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0
>
> All state store implementations support adding multiple state store records
> using a single putAll() implementation. We should provide a new router admin API
> to support bulk addition of mount table entries that can utilize this bulk
> add implementation at the state store level.
> For more than one mount point to be added, the goal of bulk addition should be
> # To reduce frequent router calls
> # To avoid frequent state store cache refreshes with each single mount
> point addition
[jira] [Commented] (HDFS-11063) Set NameNode RPC server handler thread name with more descriptive information about the RPC call.
[ https://issues.apache.org/jira/browse/HDFS-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720248#comment-17720248 ]

Viraj Jasani commented on HDFS-11063:
-------------------------------------

Thanks [~cnauroth], I was going through a similar observation on thread dumps and, with some search, was able to find this old Jira; glad to see some discussion is already present. Do you think it is still worth pursuing this today? Maybe we can make this an opt-in behavior, just in case any user would be in favor of disabling it to avoid redundant info in logs? At least this would be quite helpful for debugging thread dumps.

> Set NameNode RPC server handler thread name with more descriptive information
> about the RPC call.
> ------------------------------------------------------------------------------
>
> Key: HDFS-11063
> URL: https://issues.apache.org/jira/browse/HDFS-11063
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Chris Nauroth
> Priority: Major
>
> We often run {{jstack}} on a NameNode process as a troubleshooting step if it
> is suffering high load or appears to be hanging. By reading the stack trace,
> we can identify if a caller is blocked inside an expensive operation. This
> would be even more helpful if we updated the RPC server handler thread name
> with more descriptive information about the RPC call. This could include the
> calling user, the called RPC method, and the most significant argument to
> that method (most likely the path).
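A sketch of the thread-renaming idea under discussion; this is illustrative only, and methodName, callerUser, and mostSignificantArg are placeholders for values the RPC server already has in scope, not actual Hadoop variables:
{code:java}
// Temporarily rename the handler thread while a call is in flight so
// jstack output shows who is doing what, then restore the original name.
Thread handler = Thread.currentThread();
String originalName = handler.getName();
try {
  // e.g. "IPC Server handler 3 (getBlockLocations from alice for /data/f1)"
  handler.setName(originalName + " (" + methodName + " from " + callerUser
      + " for " + mostSignificantArg + ")");
  // ... process the RPC call ...
} finally {
  handler.setName(originalName); // avoid leaking per-call names
}
{code}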
[jira] [Created] (HDFS-16998) RBF: Add ops metrics for getSlowDatanodeReport in RouterClientActivity
Viraj Jasani created HDFS-16998:
-----------------------------------

Summary: RBF: Add ops metrics for getSlowDatanodeReport in RouterClientActivity
Key: HDFS-16998
URL: https://issues.apache.org/jira/browse/HDFS-16998
Project: Hadoop HDFS
Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani
[jira] [Commented] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points
[ https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711710#comment-17711710 ]

Viraj Jasani commented on HDFS-16978:
-------------------------------------

Regarding the use case: I had to migrate mount points from zk to hdfs (zk being a hotspot for multiple use cases, and the router is just additional load on it), and during that time I realized that we have no way to bulk add all mount points in one shot, hence I thought of adding this improvement.
{quote}anyway, should be adjusted in an existing commands like router -add ; and like -update , and so on
{quote}
That still adds each mount point separately, right? We are still not adding/updating mount points in one shot. It's all about using putAll() at the state store level impl. If you are not fine with me pursuing this, please let me know and I will not create a PR.

> RBF: New Router admin command to support bulk add of mount points
> ------------------------------------------------------------------
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Minor
>
> All state store implementations support adding multiple state store records
> using a single putAll() implementation. We should provide a new router admin API
> to support bulk addition of mount table entries that can utilize this bulk
> add implementation at the state store level.
> For more than one mount point to be added, the goal of bulk addition should be
> # To reduce frequent router calls
> # To avoid frequent state store cache refreshes with each single mount
> point addition
[jira] [Comment Edited] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points
[ https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711692#comment-17711692 ]

Viraj Jasani edited comment on HDFS-16978 at 4/13/23 6:05 AM:
--------------------------------------------------------------

{quote}Mount table operations are admin operations not user operations.
{quote}
I understand, but having an admin endpoint for adding multiple mount entries as part of a "single router admin command" rather than "multiple -add commands" is only an optimization for reducing multiple router calls as well as reducing state store cache refreshes. We already have putAll() that all state stores implement, so why not use it from the router admin? The goal of this Jira is meant to be an optimization for an admin operation. [~ayushtkn] [~goiri] [~elgoiri]

was (Author: vjasani):
{quote}Mount table operations are admin operations not user operations.
{quote}
I understand, but having an admin endpoint for adding multiple mount entries as part of a "single router admin command" rather than "multiple -add commands" is only an optimization for reducing multiple router calls as well as reducing state store cache refreshes. We already have putAll() that all state stores implement, so why not use it from the router admin? The goal of this Jira is meant to be an optimization for an admin operation. [~ayushtkn] [~goiri] [~inigoiri]

> RBF: New Router admin command to support bulk add of mount points
> ------------------------------------------------------------------
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> All state store implementations support adding multiple state store records
> using a single putAll() implementation. We should provide a new router admin API
> to support bulk addition of mount table entries that can utilize this bulk
> add implementation at the state store level.
> For more than one mount point to be added, the goal of bulk addition should be
> # To reduce frequent router calls
> # To avoid frequent state store cache refreshes with each single mount
> point addition
[jira] [Commented] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points
[ https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711692#comment-17711692 ]

Viraj Jasani commented on HDFS-16978:
-------------------------------------

{quote}Mount table operations are admin operations not user operations.
{quote}
I understand, but having an admin endpoint for adding multiple mount entries as part of a "single router admin command" rather than "multiple -add commands" is only an optimization for reducing multiple router calls as well as reducing state store cache refreshes. We already have putAll() that all state stores implement, so why not use it from the router admin? The goal of this Jira is meant to be an optimization for an admin operation. [~ayushtkn] [~goiri] [~inigoiri]

> RBF: New Router admin command to support bulk add of mount points
> ------------------------------------------------------------------
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> All state store implementations support adding multiple state store records
> using a single putAll() implementation. We should provide a new router admin API
> to support bulk addition of mount table entries that can utilize this bulk
> add implementation at the state store level.
> For more than one mount point to be added, the goal of bulk addition should be
> # To reduce frequent router calls
> # To avoid frequent state store cache refreshes with each single mount
> point addition
[jira] [Created] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points
Viraj Jasani created HDFS-16978:
-----------------------------------

Summary: RBF: New Router admin command to support bulk add of mount points
Key: HDFS-16978
URL: https://issues.apache.org/jira/browse/HDFS-16978
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

All state store implementations support adding multiple state store records using a single putAll() implementation. We should provide a new router admin API to support bulk addition of mount table entries that can utilize this bulk add implementation at the state store level.

For more than one mount point to be added, the goal of bulk addition should be
# To reduce frequent router calls
# To avoid frequent state store cache refreshes with each single mount point addition
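For a rough sense of the API direction, a hypothetical bulk-add call could look like the following; the request/response names are illustrative and not necessarily the committed API:
{code:java}
// Many entries travel in one admin RPC, so the state store's putAll()
// is invoked once and routers refresh their caches once.
AddMountTableEntriesRequest request =
    AddMountTableEntriesRequest.newInstance(mountPoints); // List of mount entries
AddMountTableEntriesResponse response =
    mountTableManager.addMountTableEntries(request);
{code}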
[jira] [Created] (HDFS-16973) RBF: MountTableResolver cache size lookup should take read lock
Viraj Jasani created HDFS-16973:
-----------------------------------

Summary: RBF: MountTableResolver cache size lookup should take read lock
Key: HDFS-16973
URL: https://issues.apache.org/jira/browse/HDFS-16973
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

The mount table resolver location cache gets invalidated by taking the write lock as part of addEntry/removeEntry/refreshEntries calls. Since the write lock exclusively updates the cache, getDestinationForPath already takes the read lock before accessing the cache. Similarly, retrieval of the cache size should also take the read lock.
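A self-contained sketch of the locking discipline described above, assuming a ReentrantReadWriteLock guards the location cache; the names are illustrative, not copied from MountTableResolver:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CacheSizeSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> locationCache = new HashMap<>();

  public long getCacheSize() {
    lock.readLock().lock(); // read lock: consistent with getDestinationForPath
    try {
      return locationCache.size();
    } finally {
      lock.readLock().unlock();
    }
  }

  public void invalidateCache() {
    lock.writeLock().lock(); // writers still take the exclusive lock
    try {
      locationCache.clear();
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}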
[jira] [Commented] (HDFS-16969) Restart DataNode but keep showing ClosedChannelException in DataNode
[ https://issues.apache.org/jira/browse/HDFS-16969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707459#comment-17707459 ]

Viraj Jasani commented on HDFS-16969:
-------------------------------------

Could you try porting HDFS-16535 or, if possible, deploying the latest release 3.3.5?

> Restart DataNode but keep showing ClosedChannelException in DataNode
> ---------------------------------------------------------------------
>
> Key: HDFS-16969
> URL: https://issues.apache.org/jira/browse/HDFS-16969
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.3.2
> Reporter: Huibo Peng
> Priority: Major
>
> We use Hadoop 3.3.2 + HBase 2.3.6 in our production environment. When restarting
> a DataNode to enable some configs, the ClosedChannelException keeps showing in
> the DataNode log.
> {code:java}
> 2023-03-09 12:00:42,456 WARN [ShortCircuitCache_SlotReleaser] shortcircuit.DfsClientShmManager: EndpointShmManager(DatanodeInfoWithStorage[10.22.128.111:9866,DS-d0865093-7868-4d6b-8163-252f2dd4a40c,DISK], parent=ShortCircuitShmManager(250ff108)): error shutting down shm: got IOException calling shutdown(SHUT_RDWR)
> java.nio.channels.ClosedChannelException
>         at org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
>         at org.apache.hadoop.net.unix.DomainSocket.shutdown(DomainSocket.java:393)
>         at org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.shutdown(DfsClientShmManager.java:362)
>         at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:241)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748) {code}
[jira] [Created] (HDFS-16967) RBF: File based state stores should allow concurrent access to the records
Viraj Jasani created HDFS-16967:
-----------------------------------

Summary: RBF: File based state stores should allow concurrent access to the records
Key: HDFS-16967
URL: https://issues.apache.org/jira/browse/HDFS-16967
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

The file based state store implementations (StateStoreFileImpl and StateStoreFileSystemImpl) should allow updating as well as reading of the state store records concurrently rather than serially. Concurrent access to the record files on the hdfs based store seems to improve the state store cache loading performance by more than 10x.

For instance, in order to maintain data integrity, when any mount table record(s) is updated, the cache is reloaded. This reload operation seems to gain significant performance improvement from concurrent access to the mount table records.
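A rough, self-contained sketch of the serial-to-concurrent change described above; readRecord() here is a placeholder for reading one record file from the store, not the actual Hadoop method:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentLoadSketch {
  static String readRecord(String path) {
    // stands in for an expensive per-file read over HDFS
    return "record@" + path;
  }

  public static void main(String[] args) throws Exception {
    List<String> paths = List.of("/store/r1", "/store/r2", "/store/r3");
    ExecutorService pool = Executors.newFixedThreadPool(16);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String p : paths) {
        futures.add(pool.submit(() -> readRecord(p))); // reads run in parallel
      }
      List<String> records = new ArrayList<>();
      for (Future<String> f : futures) {
        records.add(f.get()); // also surfaces any per-file failure
      }
      System.out.println(records);
    } finally {
      pool.shutdown();
    }
  }
}
{code}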
[jira] [Updated] (HDFS-16959) RBF: State store cache loading metrics
[ https://issues.apache.org/jira/browse/HDFS-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Jasani updated HDFS-16959:
--------------------------------
Summary: RBF: State store cache loading metrics  (was: RBF: state store cache loading metrics)

> RBF: State store cache loading metrics
> ---------------------------------------
>
> Key: HDFS-16959
> URL: https://issues.apache.org/jira/browse/HDFS-16959
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> With an increasing num of state store records (like mount points), it would be
> good to be able to get cache loading metrics like avg time for cache load
> during refresh, num of times the cache is loaded, etc.
[jira] [Created] (HDFS-16959) RBF: state store cache loading metrics
Viraj Jasani created HDFS-16959:
-----------------------------------

Summary: RBF: state store cache loading metrics
Key: HDFS-16959
URL: https://issues.apache.org/jira/browse/HDFS-16959
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

With an increasing num of state store records (like mount points), it would be good to be able to get cache loading metrics like avg time for cache load during refresh, num of times the cache is loaded, etc.
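A sketch of what such metrics could look like using Hadoop's metrics2 MutableRate, which tracks both the number of operations and the average time; the field and method names are illustrative, and registration of the metrics source is omitted:
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.lib.MutableRate;
import org.apache.hadoop.util.Time;

// Illustrative field on a metrics source class:
@Metric("State store cache load")
private MutableRate cacheLoad; // exposes num-ops and avg load time

private void loadCache() {
  long start = Time.monotonicNow();
  // ... reload all records from the state store ...
  cacheLoad.add(Time.monotonicNow() - start); // one sample per refresh
}
{code}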
[jira] [Commented] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt
[ https://issues.apache.org/jira/browse/HDFS-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701416#comment-17701416 ]

Viraj Jasani commented on HDFS-16957:
-------------------------------------

Thanks [~elgoiri], created PR [https://github.com/apache/hadoop/pull/5487]

> RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful
> attempt
> --------------------------------------------------------------------------
>
> Key: HDFS-16957
> URL: https://issues.apache.org/jira/browse/HDFS-16957
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> DFS router admin returns a non-zero status code for an unsuccessful attempt to
> add or update a mount point. However, the same is not the case with removal of
> a mount point.
> For instance,
> {code:java}
> bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
> ..
> ..
> Cannot add destination at ns1 /data4
>
> echo $?
> 255 {code}
> {code:java}
> /hadoop/bin/hdfs dfsrouteradmin -rm /data4
> ..
> ..
> Cannot remove mount point /data4
>
> echo $?
> 0{code}
> Removal of a mount point should stay consistent with the other options and
> return a non-zero (unsuccessful) status code.
[jira] [Comment Edited] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt
[ https://issues.apache.org/jira/browse/HDFS-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701380#comment-17701380 ]

Viraj Jasani edited comment on HDFS-16957 at 3/16/23 8:57 PM:
--------------------------------------------------------------

[~ayushtkn] [~goiri] [~elgoiri] [~hexiaoqiao] could you please let me know if you agree with this?

was (Author: vjasani):
[~ayushtkn] [~goiri] [~hexiaoqiao] could you please let me know if you agree with this?

> RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful
> attempt
> --------------------------------------------------------------------------
>
> Key: HDFS-16957
> URL: https://issues.apache.org/jira/browse/HDFS-16957
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> DFS router admin returns a non-zero status code for an unsuccessful attempt to
> add or update a mount point. However, the same is not the case with removal of
> a mount point.
> For instance,
> {code:java}
> bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
> ..
> ..
> Cannot add destination at ns1 /data4
>
> echo $?
> 255 {code}
> {code:java}
> /hadoop/bin/hdfs dfsrouteradmin -rm /data4
> ..
> ..
> Cannot remove mount point /data4
>
> echo $?
> 0{code}
> Removal of a mount point should stay consistent with the other options and
> return a non-zero (unsuccessful) status code.
[jira] [Commented] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt
[ https://issues.apache.org/jira/browse/HDFS-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701380#comment-17701380 ]

Viraj Jasani commented on HDFS-16957:
-------------------------------------

[~ayushtkn] [~goiri] [~hexiaoqiao] could you please let me know if you agree with this?

> RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful
> attempt
> --------------------------------------------------------------------------
>
> Key: HDFS-16957
> URL: https://issues.apache.org/jira/browse/HDFS-16957
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> DFS router admin returns a non-zero status code for an unsuccessful attempt to
> add or update a mount point. However, the same is not the case with removal of
> a mount point.
> For instance,
> {code:java}
> bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
> ..
> ..
> Cannot add destination at ns1 /data4
>
> echo $?
> 255 {code}
> {code:java}
> /hadoop/bin/hdfs dfsrouteradmin -rm /data4
> ..
> ..
> Cannot remove mount point /data4
>
> echo $?
> 0{code}
> Removal of a mount point should stay consistent with the other options and
> return a non-zero (unsuccessful) status code.
[jira] [Created] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt
Viraj Jasani created HDFS-16957:
-----------------------------------

Summary: RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt
Key: HDFS-16957
URL: https://issues.apache.org/jira/browse/HDFS-16957
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Viraj Jasani
Assignee: Viraj Jasani

DFS router admin returns a non-zero status code for an unsuccessful attempt to add or update a mount point. However, the same is not the case with removal of a mount point.

For instance,
{code:java}
bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
..
..
Cannot add destination at ns1 /data4

echo $?
255 {code}
{code:java}
/hadoop/bin/hdfs dfsrouteradmin -rm /data4
..
..
Cannot remove mount point /data4

echo $?
0{code}
Removal of a mount point should stay consistent with the other options and return a non-zero (unsuccessful) status code.
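A sketch of the fix direction, not the exact RouterAdmin diff: propagate the boolean result of the remove call into the process exit code, mirroring what -add and -update already do. The removeMount helper name is an assumption:
{code:java}
// Propagate the failure instead of silently keeping exit code 0.
boolean removed = removeMount(argv[i]); // hypothetical existing helper
if (!removed) {
  System.err.println("Cannot remove mount point " + argv[i]);
  exitCode = -1; // previously left at 0, making failures look successful
}
{code}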
[jira] [Created] (HDFS-16953) RBF Mount table store APIs should update cache only if state store record is successfully updated
Viraj Jasani created HDFS-16953:
-----------------------------------

Summary: RBF Mount table store APIs should update cache only if state store record is successfully updated
Key: HDFS-16953
URL: https://issues.apache.org/jira/browse/HDFS-16953
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

The RBF mount table state store APIs addMountTableEntry, updateMountTableEntry and removeMountTableEntry perform a cache refresh for all routers regardless of the actual record update result. If the record fails to get updated on the zookeeper/file based store impl, reloading the cache for all routers would be unnecessary.

For instance, simultaneously adding a new mount point could lead to a failure for the second call if the first call has not added the new entry by the time the second call retrieves the mount table entry from getMountTableEntries before attempting to call addMountTableEntry.
{code:java}
DEBUG [{cluster}/{ip}:8111] ipc.Client - IPC Client (1826699684) connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user}
IPC Client (1826699684) connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user} sending #1 org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocol.addMountTableEntry
DEBUG [{cluster}/{ip}:8111 from {user}] ipc.Client - IPC Client (1826699684) connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user} got value #1
DEBUG [main] ipc.ProtobufRpcEngine2 - Call: addMountTableEntry took 24ms
DEBUG [{cluster}/{ip}:8111 from {user}] ipc.Client - IPC Client (1826699684) connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user}: closed
DEBUG [{cluster}/{ip}:8111 from {user}] ipc.Client - IPC Client (1826699684) connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user}: stopped, remaining connections 0
TRACE [main] ipc.ProtobufRpcEngine2 - 1: Response <- nn-0-{ns}.{cluster}/{ip}:8111: addMountTableEntry {status: false}

Cannot add mount point /data503 {code}
The failure to write the new record:
{code:java}
INFO [IPC Server handler 0 on default port 8111] impl.StateStoreZooKeeperImpl - Cannot write record "/hdfs-federation/MountTable/0SLASH0data503", it already exists {code}
Since the successful call has already refreshed the cache for all routers, the second call that failed should not have refreshed the cache for all routers again, as everyone already has the updated records in cache.
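A minimal sketch of the proposed control flow, with hypothetical method names: only trigger the router-wide cache refresh when the state store write actually succeeded.
{code:java}
// Skip the expensive cluster-wide refresh when the write failed.
boolean recordWritten = stateStore.putRecord(mountEntry); // hypothetical call
if (recordWritten) {
  refreshCachesOnAllRouters(); // hypothetical helper; only on success
}
return recordWritten;
{code}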
[jira] [Created] (HDFS-16947) RBF NamenodeHeartbeatService to report error for not being able to register namenode in state store
Viraj Jasani created HDFS-16947:
-----------------------------------

Summary: RBF NamenodeHeartbeatService to report error for not being able to register namenode in state store
Key: HDFS-16947
URL: https://issues.apache.org/jira/browse/HDFS-16947
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

The namenode heartbeat service should provide an error with the full stacktrace if it cannot register the namenode in the state store. As of today, we only log an info msg. For the zookeeper based impl, this might mean either a) the curator manager is not initialized, or b) it fails to write to the znode after exhausting retries. For either of these cases, reporting only an INFO log might not be good enough and we might have to look for errors elsewhere.

Sample example:
{code:java}
2023-02-20 23:10:33,714 DEBUG [NamenodeHeartbeatService {ns} nn0-0] router.NamenodeHeartbeatService - Received service state: ACTIVE from HA namenode: {ns}-nn0:nn-0-{ns}.{cluster}:9000
2023-02-20 23:10:33,731 INFO [NamenodeHeartbeatService {ns} nn0-0] impl.MembershipStoreImpl - Inserting new NN registration: nn-0.namenode.{cluster}:->{ns}:nn0:nn-0-{ns}.{cluster}:9000-ACTIVE
2023-02-20 23:10:33,731 INFO [NamenodeHeartbeatService {ns} nn0-0] router.NamenodeHeartbeatService - Cannot register namenode in the State Store {code}
If we could log the full stacktrace:
{code:java}
2023-02-21 00:20:24,691 ERROR [NamenodeHeartbeatService {ns} nn0-0] router.NamenodeHeartbeatService - Cannot register namenode in the State Store
org.apache.hadoop.hdfs.server.federation.store.StateStoreUnavailableException: State Store driver StateStoreZooKeeperImpl in nn-0.namenode.{cluster} is not ready.
        at org.apache.hadoop.hdfs.server.federation.store.driver.StateStoreDriver.verifyDriverReady(StateStoreDriver.java:158)
        at org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreZooKeeperImpl.putAll(StateStoreZooKeeperImpl.java:235)
        at org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreBaseImpl.put(StateStoreBaseImpl.java:74)
        at org.apache.hadoop.hdfs.server.federation.store.impl.MembershipStoreImpl.namenodeHeartbeat(MembershipStoreImpl.java:179)
        at org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:381)
        at org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:317)
        at org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.lambda$periodicInvoke$0(NamenodeHeartbeatService.java:244)
        ...
        ... {code}
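The essence of the proposed change, sketched with SLF4J (the logging facade Hadoop uses): log at ERROR and pass the caught exception so the full stacktrace is emitted. The resolver/report names and the caught exception type are stand-ins inferred from the stacktrace above, not the exact patch:
{code:java}
try {
  resolver.registerNamenode(report); // as in the stacktrace above
} catch (IOException e) {
  // before: LOG.info("Cannot register namenode in the State Store");
  LOG.error("Cannot register namenode in the State Store", e);
}
{code}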
[jira] [Commented] (HDFS-16941) Path.suffix raises NullPointerException
[ https://issues.apache.org/jira/browse/HDFS-16941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696510#comment-17696510 ]

Viraj Jasani commented on HDFS-16941:
-------------------------------------

Oh, I meant moving the Jira itself to Hadoop (options: From "More", select "Move") :) But perhaps what you did is also fine. Thanks

> Path.suffix raises NullPointerException
> ----------------------------------------
>
> Key: HDFS-16941
> URL: https://issues.apache.org/jira/browse/HDFS-16941
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hadoop-client, hdfs
> Affects Versions: 3.3.2
> Reporter: Patrick Grandjean
> Priority: Minor
>
> Calling the Path.suffix method on root raises a NullPointerException. Tested
> with hadoop-client-api 3.3.2.
> Scenario:
> {code:java}
> import org.apache.hadoop.fs.*
>
> Path root = new Path("/")
> root.getParent == null  // true
> root.suffix("bar")      // NPE is raised
> {code}
> Stack:
> {code:none}
> 23/03/03 15:13:18 ERROR Uncaught throwable from user code: java.lang.NullPointerException
>         at org.apache.hadoop.fs.Path.<init>(Path.java:104)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>         at org.apache.hadoop.fs.Path.suffix(Path.java:361)
> {code}
[jira] [Commented] (HDFS-16941) Path.suffix raises NullPointerException
[ https://issues.apache.org/jira/browse/HDFS-16941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696479#comment-17696479 ]

Viraj Jasani commented on HDFS-16941:
-------------------------------------

We can also move this Jira to HADOOP, as Path is used by all FileSystem implementations.

> Path.suffix raises NullPointerException
> ----------------------------------------
>
> Key: HDFS-16941
> URL: https://issues.apache.org/jira/browse/HDFS-16941
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hadoop-client, hdfs
> Affects Versions: 3.3.2
> Reporter: Patrick Grandjean
> Priority: Minor
>
> Calling the Path.suffix method on root raises a NullPointerException. Tested
> with hadoop-client-api 3.3.2.
> Scenario:
> {code:java}
> import org.apache.hadoop.fs.*
>
> Path root = new Path("/")
> root.getParent == null  // true
> root.suffix("bar")      // NPE is raised
> {code}
> Stack:
> {code:none}
> 23/03/03 15:13:18 ERROR Uncaught throwable from user code: java.lang.NullPointerException
>         at org.apache.hadoop.fs.Path.<init>(Path.java:104)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>         at org.apache.hadoop.fs.Path.suffix(Path.java:361)
> {code}
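A defensive caller-side workaround sketch until the NPE is addressed: avoid calling suffix() on the root path, whose getParent() is null. The choice of "/bar" as the desired result for root is an assumption:
{code:java}
import org.apache.hadoop.fs.Path;

Path p = new Path("/");
Path withSuffix = (p.getParent() == null)
    ? new Path("/bar")    // assumption: "/bar" is the intended result for root
    : p.suffix("bar");    // safe for any non-root path
{code}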
[jira] [Created] (HDFS-16938) Utility to trigger heartbeat and wait until BP thread queue is fully processed
Viraj Jasani created HDFS-16938:
-----------------------------------

Summary: Utility to trigger heartbeat and wait until BP thread queue is fully processed
Key: HDFS-16938
URL: https://issues.apache.org/jira/browse/HDFS-16938
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani

As a follow-up to HDFS-16935, we should provide a utility to trigger a heartbeat and wait until the BP thread queue is fully processed. This would ensure 100% consistency w.r.t. the active namenode being able to receive bad block reports from the given datanode. This utility would resolve flakes for the tests that rely on the namenode's awareness of the bad blocks reported by datanodes.
[jira] [Assigned] (HDFS-16935) TestFsDatasetImpl.testReportBadBlocks brittle
[ https://issues.apache.org/jira/browse/HDFS-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Jasani reassigned HDFS-16935:
-----------------------------------

Assignee: Viraj Jasani

> TestFsDatasetImpl.testReportBadBlocks brittle
> ----------------------------------------------
>
> Key: HDFS-16935
> URL: https://issues.apache.org/jira/browse/HDFS-16935
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Affects Versions: 3.4.0, 3.3.5, 3.3.9
> Reporter: Steve Loughran
> Assignee: Viraj Jasani
> Priority: Minor
>
> Jenkins failure as the sleep() time is not long enough.
> {code}
> Failing for the past 1 build (Since #4 )
> Took 7.4 sec.
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>         at org.junit.Assert.fail(Assert.java:89)
>         at org.junit.Assert.failNotEquals(Assert.java:835)
>         at org.junit.Assert.assertEquals(Assert.java:647)
>         at org.junit.Assert.assertEquals(Assert.java:633)
> {code}
> The assert is after a 3s sleep waiting for reports coming in.
> {code}
> dataNode.reportBadBlocks(block, dataNode.getFSDataset()
>     .getFsVolumeReferences().get(0));
> Thread.sleep(3000); // 3s sleep
> BlockManagerTestUtil.updateState(cluster.getNamesystem().getBlockManager());
> // Verify the bad block has been reported to namenode
> Assert.assertEquals(1, cluster.getNamesystem().getCorruptReplicaBlocks()); // here
> {code}
> LambdaTestUtils.eventually() should be used around this assert, maybe with an
> even shorter initial delay so that on faster systems the test is faster.
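A sketch of the suggested fix: replace the fixed 3s sleep with a polling assertion. This assumes LambdaTestUtils has an eventually(timeoutMillis, intervalMillis, assertion) overload; the exact signature is worth re-checking against the codebase:
{code:java}
import org.apache.hadoop.test.LambdaTestUtils;

// Polls up to 10s, re-checking every 100ms, instead of a fixed 3s sleep.
LambdaTestUtils.eventually(10_000, 100, () -> {
  BlockManagerTestUtil.updateState(cluster.getNamesystem().getBlockManager());
  Assert.assertEquals(1, cluster.getNamesystem().getCorruptReplicaBlocks());
});
{code}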
[jira] [Commented] (HDFS-16935) TestFsDatasetImpl.testReportBadBlocks brittle
[ https://issues.apache.org/jira/browse/HDFS-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693371#comment-17693371 ] Viraj Jasani commented on HDFS-16935: - If we run this test in debug mode, it can be reproduced locally too. > TestFsDatasetImpl.testReportBadBlocks brittle > - > > Key: HDFS-16935 > URL: https://issues.apache.org/jira/browse/HDFS-16935 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.4.0, 3.3.5, 3.3.9 >Reporter: Steve Loughran >Priority: Minor > > Jenkins failure as the sleep() time is not long enough > {code} > Failing for the past 1 build (Since #4 ) > Took 7.4 sec. > Error Message > expected:<1> but was:<0> > Stacktrace > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > {code} > The assert runs after a 3s sleep waiting for reports to come in. > {code} > dataNode.reportBadBlocks(block, dataNode.getFSDataset() > .getFsVolumeReferences().get(0)); > Thread.sleep(3000); // 3s sleep > BlockManagerTestUtil.updateState(cluster.getNamesystem() > .getBlockManager()); > // Verify the bad block has been reported to namenode > Assert.assertEquals(1, > cluster.getNamesystem().getCorruptReplicaBlocks()); // here > {code} > LambdaTestUtils.eventually() should be used around this assert, maybe with an > even shorter initial delay so that the test runs faster on faster systems. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16925) Namenode audit log to only include IP address of client
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16925: Description: With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to InetAddress#getHostAddress, to save the host name with the IPC Connection object. When we perform InetAddress#getHostName, toString() of InetAddress would automatically print \{hostName}/\{hostIPAddress} if the hostname is already resolved: {code:java} /** * Converts this IP address to a {@code String}. The * string returned is of the form: hostname / literal IP * address. * * If the host name is unresolved, no reverse name service lookup * is performed. The hostname part will be represented by an empty string. * * @return a string representation of this IP address. */ public String toString() { String hostName = holder().getHostName(); return ((hostName != null) ? hostName : "") + "/" + getHostAddress(); }{code} For namenode audit logs, this means that when the dfs client makes filesystem updates, the audit logs would also print the host name in addition to the IP address. In order to maintain compatibility, the purpose of this Jira is to let the audit log retrieve only the IP address from the InetAddress and print it. was: With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to InetAddress#getHostAddress, to save the host name with the IPC Connection object. When we perform InetAddress#getHostName, toString() of InetAddress would automatically print \{hostName}/\{hostIPAddress} if the hostname is already resolved: {code:java} /** * Converts this IP address to a {@code String}. The * string returned is of the form: hostname / literal IP * address. * * If the host name is unresolved, no reverse name service lookup * is performed. The hostname part will be represented by an empty string. * * @return a string representation of this IP address. */ public String toString() { String hostName = holder().getHostName(); return ((hostName != null) ? hostName : "") + "/" + getHostAddress(); }{code} For namenode audit logs, this means that when the dfs client makes filesystem updates, the audit logs would also print the host name in addition to the IP address. We have some tests that perform regex pattern matching to identify the audit log pattern; we will have to change them to reflect the change in host address. > Namenode audit log to only include IP address of client > --- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save the host name with the IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if the hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when the dfs client makes filesystem > updates, the audit logs would also print the host name in addition to the IP > address. > In order to maintain compatibility, the purpose of this Jira is to let the > audit log retrieve only the IP address from the InetAddress and print it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
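To make the distinction concrete, a minimal JDK-only demonstration (the host and printed addresses are illustrative):
{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class AuditAddrDemo {
  public static void main(String[] args) throws UnknownHostException {
    InetAddress addr = InetAddress.getByName("localhost");
    // Once the host name is resolved, toString() prints "hostname/ip".
    System.out.println(addr);                  // e.g. "localhost/127.0.0.1"
    // getHostAddress() is IP-only, which keeps the audit log format stable.
    System.out.println(addr.getHostAddress()); // e.g. "127.0.0.1"
  }
}
{code}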
[jira] [Updated] (HDFS-16925) Namenode audit log to only include IP address of client
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16925: Summary: Namenode audit log to only include IP address of client (was: Fix regex pattern for namenode audit log tests) > Namenode audit log to only include IP address of client > --- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save the host name with the IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if the hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when the dfs client makes filesystem > updates, the audit logs would also print the host name in addition to the IP > address. We have some tests that perform regex pattern > matching to identify the audit log pattern; we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani resolved HDFS-16918. - Resolution: Won't Fix > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on an Envoy proxy setup, depending on the socket timeout > configured at envoy, network connection issues or packet loss could be > observed. All the envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such a proxy based setup could result in socket connection issues > b/w datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > the active namenode in the cluster, i.e. does not receive a heartbeat response > in time from the active namenode (even though the active namenode is not > terminated), it is not of much use. We should be able to provide configurable > behavior such that if a given datanode cannot receive a heartbeat response from > the active namenode within a configurable time duration, it should terminate > itself to avoid impacting the availability SLA. This is specifically helpful > when the underlying deployment or observability framework (e.g. K8S) can start > up the datanode automatically upon its shutdown (unless it is being restarted > as part of a rolling upgrade) and help the newly brought up datanode (in case > of k8s, a new pod with dynamically changing nodes) establish new socket > connections to the active and standby namenodes. This should be an opt-in > behavior and not the default one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16918: Description: While deploying Hdfs on an Envoy proxy setup, depending on the socket timeout configured at envoy, network connection issues or packet loss could be observed. All the envoys basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such a proxy based setup could result in socket connection issues b/w datanode and namenode. Many deployment frameworks provide auto-start functionality when any of the hadoop daemons are stopped. If a given datanode does not stay connected to the active namenode in the cluster, i.e. does not receive a heartbeat response in time from the active namenode (even though the active namenode is not terminated), it is not of much use. We should be able to provide configurable behavior such that if a given datanode cannot receive a heartbeat response from the active namenode within a configurable time duration, it should terminate itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon its shutdown (unless it is being restarted as part of a rolling upgrade) and help the newly brought up datanode (in case of k8s, a new pod with dynamically changing nodes) establish new socket connections to the active and standby namenodes. This should be an opt-in behavior and not the default one. was: While deploying Hdfs on an Envoy proxy setup, depending on the socket timeout configured at envoy, network connection issues or packet loss could be observed. All the envoys basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such a proxy based setup could result in socket connection issues b/w datanode and namenode. Many deployment frameworks provide auto-start functionality when any of the hadoop daemons are stopped. If a given datanode does not stay connected to the active namenode in the cluster, i.e. does not receive a heartbeat response in time from the active namenode (even though the active namenode is not terminated), it is not of much use. We should be able to provide configurable behavior such that if a given datanode cannot receive a heartbeat response from the active namenode within a configurable time duration, it should terminate itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon its shutdown (unless it is being restarted as part of a rolling upgrade) and help the newly brought up datanode (in case of k8s, a new pod with dynamically changing nodes) establish new socket connections to the active and standby namenodes. This should be an opt-in behavior and not the default one. In a distributed system, it is essential to have robust fail-fast mechanisms in place to prevent issues related to network partitioning. The system must be designed to prevent further degradation of availability and consistency in the event of a network partition. Several distributed systems offer fail-safe approaches, and for some, partition tolerance is critical to the extent that even a few seconds of heartbeat loss can trigger the removal of an application server instance from the cluster. For instance, a majority of ZooKeeper clients utilize ephemeral nodes for this purpose to make the system reliable, fault-tolerant and strongly consistent in the event of a network partition. From the hdfs architecture viewpoint, it is crucial to understand the critical role that the active and observer namenodes play in file system operations. In a large-scale cluster, if the datanodes holding the same block (primary and replicas) lose connection to both active and observer namenodes for a significant amount of time, delaying the process of shutting down such datanodes and restarting them to re-establish the connection with the namenodes (assuming the active namenode is alive; the assumption is important in the event of a network partition to reestablish the connection) will further deteriorate the availability of the service. This scenario underscores the importance of resolving network partitioning. This is a real use case for hdfs and it is not prudent to assume that every deployment or cluster management application must be able to restart datanodes based on JMX metrics, as this would introduce another application to resolve the network partition impact of hdfs. Besides, popular cluster management applications are not typically used in all cloud-native env. Even if these cluster management applications are deployed, certain security constraints may restrict their access to JMX metrics and prevent them from interfering with hdfs operations. The applications that can only trigger alerts for users based on set parameters (for instance, missing blocks > 0) are allowed to access JMX metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16918: Description: While deploying Hdfs on an Envoy proxy setup, depending on the socket timeout configured at envoy, network connection issues or packet loss could be observed. All the envoys basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such a proxy based setup could result in socket connection issues b/w datanode and namenode. Many deployment frameworks provide auto-start functionality when any of the hadoop daemons are stopped. If a given datanode does not stay connected to the active namenode in the cluster, i.e. does not receive a heartbeat response in time from the active namenode (even though the active namenode is not terminated), it is not of much use. We should be able to provide configurable behavior such that if a given datanode cannot receive a heartbeat response from the active namenode within a configurable time duration, it should terminate itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon its shutdown (unless it is being restarted as part of a rolling upgrade) and help the newly brought up datanode (in case of k8s, a new pod with dynamically changing nodes) establish new socket connections to the active and standby namenodes. This should be an opt-in behavior and not the default one. In a distributed system, it is essential to have robust fail-fast mechanisms in place to prevent issues related to network partitioning. The system must be designed to prevent further degradation of availability and consistency in the event of a network partition. Several distributed systems offer fail-safe approaches, and for some, partition tolerance is critical to the extent that even a few seconds of heartbeat loss can trigger the removal of an application server instance from the cluster. For instance, a majority of ZooKeeper clients utilize ephemeral nodes for this purpose to make the system reliable, fault-tolerant and strongly consistent in the event of a network partition. From the hdfs architecture viewpoint, it is crucial to understand the critical role that the active and observer namenodes play in file system operations. In a large-scale cluster, if the datanodes holding the same block (primary and replicas) lose connection to both active and observer namenodes for a significant amount of time, delaying the process of shutting down such datanodes and restarting them to re-establish the connection with the namenodes (assuming the active namenode is alive; the assumption is important in the event of a network partition to reestablish the connection) will further deteriorate the availability of the service. This scenario underscores the importance of resolving network partitioning. This is a real use case for hdfs and it is not prudent to assume that every deployment or cluster management application must be able to restart datanodes based on JMX metrics, as this would introduce another application to resolve the network partition impact of hdfs. Besides, popular cluster management applications are not typically used in all cloud-native env. Even if these cluster management applications are deployed, certain security constraints may restrict their access to JMX metrics and prevent them from interfering with hdfs operations. The applications that can only trigger alerts for users based on set parameters (for instance, missing blocks > 0) are allowed to access JMX metrics. was: While deploying Hdfs on an Envoy proxy setup, depending on the socket timeout configured at envoy, network connection issues or packet loss could be observed. All the envoys basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such a proxy based setup could result in socket connection issues b/w datanode and namenode. Many deployment frameworks provide auto-start functionality when any of the hadoop daemons are stopped. If a given datanode does not stay connected to the active namenode in the cluster, i.e. does not receive a heartbeat response in time from the active namenode (even though the active namenode is not terminated), it is not of much use. We should be able to provide configurable behavior such that if a given datanode cannot receive a heartbeat response from the active namenode within a configurable time duration, it should terminate itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon its shutdown (unless it is being restarted as part of a rolling upgrade) and help the newly brought up datanode (in case of k8s, a new pod with dynamically changing nodes) establish new socket connections to the active and standby namenodes. This should be an opt-in behavior and not the default one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16925) Fix regex pattern for namenode audit log tests
Viraj Jasani created HDFS-16925: --- Summary: Fix regex pattern for namenode audit log tests Key: HDFS-16925 URL: https://issues.apache.org/jira/browse/HDFS-16925 Project: Hadoop HDFS Issue Type: Task Reporter: Viraj Jasani Assignee: Viraj Jasani With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to InetAddress#getHostAddress, to save the host name with the IPC Connection object. When we perform InetAddress#getHostName, toString() of InetAddress would automatically print \{hostName}/\{hostIPAddress} if the hostname is already resolved: {code:java} /** * Converts this IP address to a {@code String}. The * string returned is of the form: hostname / literal IP * address. * * If the host name is unresolved, no reverse name service lookup * is performed. The hostname part will be represented by an empty string. * * @return a string representation of this IP address. */ public String toString() { String hostName = holder().getHostName(); return ((hostName != null) ? hostName : "") + "/" + getHostAddress(); }{code} For namenode audit logs, this means that when the dfs client makes filesystem updates, the audit logs would also print the host name in addition to the IP address. We have some tests that perform regex pattern matching to identify the audit log pattern; we will have to change them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
Viraj Jasani created HDFS-16918: --- Summary: Optionally shut down datanode if it does not stay connected to active namenode Key: HDFS-16918 URL: https://issues.apache.org/jira/browse/HDFS-16918 Project: Hadoop HDFS Issue Type: New Feature Reporter: Viraj Jasani Assignee: Viraj Jasani While deploying Hdfs on an Envoy proxy setup, depending on the socket timeout configured at envoy, network connection issues or packet loss could be observed. All the envoys basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such a proxy based setup could result in socket connection issues b/w datanode and namenode. Many deployment frameworks provide auto-start functionality when any of the hadoop daemons are stopped. If a given datanode does not stay connected to the active namenode in the cluster, i.e. does not receive a heartbeat response in time from the active namenode (even though the active namenode is not terminated), it is not of much use. We should be able to provide configurable behavior such that if a given datanode cannot receive a heartbeat response from the active namenode within a configurable time duration, it should terminate itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon its shutdown (unless it is being restarted as part of a rolling upgrade) and help the newly brought up datanode (in case of k8s, a new pod with dynamically changing nodes) establish new socket connections to the active and standby namenodes. This should be an opt-in behavior and not the default one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
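Since this feature was later resolved as Won't Fix, there is no committed implementation; purely as an illustration of the opt-in idea, a self-contained sketch in which every name and threshold is hypothetical:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Hypothetical fail-fast check; class, fields and thresholds are illustrative. */
public class HeartbeatFailFast {
  private static final Logger LOG = LoggerFactory.getLogger(HeartbeatFailFast.class);

  private final long maxSilenceMs; // opt-in, configurable threshold
  private volatile long lastHeartbeatResponseMs = now();

  public HeartbeatFailFast(long maxSilenceMs) {
    this.maxSilenceMs = maxSilenceMs;
  }

  /** Call when a heartbeat response is actually received from the active NN. */
  public void onHeartbeatResponse() {
    lastHeartbeatResponseMs = now();
  }

  /** True if the datanode should terminate so the orchestrator can restart it. */
  public boolean shouldTerminate() {
    long silence = now() - lastHeartbeatResponseMs;
    if (silence > maxSilenceMs) {
      LOG.error("No heartbeat response from active NN for {} ms", silence);
      return true;
    }
    return false;
  }

  private static long now() {
    return System.nanoTime() / 1_000_000; // monotonic millis
  }
}
{code}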
[jira] [Updated] (HDFS-16907) BP service actor LastHeartbeat is not sufficient to track realtime connection breaks
[ https://issues.apache.org/jira/browse/HDFS-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16907: Description: BP service actor LastHeartbeat is not sufficient to track realtime connection breaks. Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode that it is connected to. However, this is updated even if the connection to the namenode is broken. Suppose the actor thread keeps heartbeating to the namenode and suddenly the socket connection is broken. When this happens, for a certain time duration, the actor thread consistently keeps updating _lastHeartbeatTime_ before even initiating the heartbeat connection with the namenode. If the connection cannot be established even after RPC retries are exhausted, an IOException is thrown. This means that a heartbeat response has not been received from the namenode. In the loop, the actor thread keeps trying to connect for the heartbeat, and the last heartbeat stays close to 1/2s even though in reality no response is being received from the namenode. Sample exception from the BP service actor thread, during which LastHeartbeat stays very low: {code:java} 2023-02-03 22:34:55,725 WARN [xyz:9000] datanode.DataNode - IOException in offerService java.io.EOFException: End of File Exception between local host is: "dn-0"; destination host is: "nn-1":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) at org.apache.hadoop.ipc.Client.call(Client.java:1495) at org.apache.hadoop.ipc.Client.call(Client.java:1392) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129) at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code} Attaching screenshots of how the last heartbeat value looks when the above error is consistently getting logged. The last heartbeat response time is important to initiate any auto-recovery from the datanode. Hence, we should introduce LastHeartbeatResponseTime, which only gets updated if the BP service actor thread was successfully able to retrieve a response from the namenode. was: Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode that it is connected to. However, this is updated even if the connection to the namenode is broken. Suppose the actor thread keeps heartbeating to the namenode and suddenly the socket connection is broken. When this happens, for a certain time duration, the actor thread consistently keeps updating _lastHeartbeatTime_ before even initiating the heartbeat connection with the namenode. If the connection cannot be established even after RPC retries are exhausted, an IOException is thrown. This means that a heartbeat response has not been received from the namenode. In the loop, the actor thread keeps trying to connect for the heartbeat, and the last heartbeat stays close to 1/2s even though in reality no response is being received from the namenode. Sample exception from the BP service actor thread, during which LastHeartbeat stays very low: {code:java} 2023-02-03 22:34:55,725 WARN [xyz:9000] datanode.DataNode - IOException in offerService java.io.EOFException: End of File Exception between local host is: "dn-0"; destination host is: "nn-1":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) at org.apache.hadoop.ipc.Client.call(Client.java:1495) at org.apache.hadoop.ipc.Client.call(Client.java:1392) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129) at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code} Attaching screenshots of how the last heartbeat value looks when the above error is consistently getting logged. The last heartbeat response time is important to initiate any auto-recovery from the datanode. Hence, we should introduce LastHeartbeatResponseTime, which only gets updated if the BP service actor thread was successfully able to retrieve a response from the namenode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16907) Add LastHeartbeatResponseTime for BP service actor
[ https://issues.apache.org/jira/browse/HDFS-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16907: Summary: Add LastHeartbeatResponseTime for BP service actor (was: BP service actor LastHeartbeat is not sufficient to track realtime connection breaks) > Add LastHeartbeatResponseTime for BP service actor > -- > > Key: HDFS-16907 > URL: https://issues.apache.org/jira/browse/HDFS-16907 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Attachments: Screenshot 2023-02-03 at 6.12.24 PM.png > > > BP service actor LastHeartbeat is not sufficient to track realtime connection > breaks. > Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode > that it is connected to. However, this is updated even if the connection to > the namenode is broken. > Suppose the actor thread keeps heartbeating to the namenode and suddenly the > socket connection is broken. When this happens, for a certain time duration, > the actor thread consistently keeps updating _lastHeartbeatTime_ before even > initiating the heartbeat connection with the namenode. If the connection > cannot be established even after RPC retries are exhausted, an IOException is > thrown. This means that a heartbeat response has not been received from the > namenode. In the loop, the actor thread keeps trying to connect for the > heartbeat, and the last heartbeat stays close to 1/2s even though in reality > no response is being received from the namenode. > > Sample exception from the BP service actor thread, during which LastHeartbeat > stays very low: > {code:java} > 2023-02-03 22:34:55,725 WARN [xyz:9000] datanode.DataNode - IOException in > offerService > java.io.EOFException: End of File Exception between local host is: "dn-0"; > destination host is: "nn-1":9000; : java.io.EOFException; For more details > see: http://wiki.apache.org/hadoop/EOFException > at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) > at org.apache.hadoop.ipc.Client.call(Client.java:1495) > at org.apache.hadoop.ipc.Client.call(Client.java:1392) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129) > at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890) > at java.lang.Thread.run(Thread.java:750) > Caused by: java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884) > at > org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code} > Attaching screenshots of how the last heartbeat value looks when the above > error is consistently getting logged. > > The last heartbeat response time is important to initiate any auto-recovery > from the datanode. Hence, we should introduce LastHeartbeatResponseTime, which > only gets updated if the BP service actor thread was successfully able to > retrieve a response from the namenode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
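The core idea fits in a few lines; the following is a self-contained sketch, not the committed HDFS-16907 change, and the class and method names are illustrative:
{code:java}
import java.io.IOException;

/** Sketch: refresh the response-time metric only after a successful RPC. */
class HeartbeatTracker {
  volatile long lastHeartbeatTime;          // old metric: refreshed before the RPC
  volatile long lastHeartbeatResponseTime;  // new metric: refreshed only on success

  interface HeartbeatRpc { void send() throws IOException; }

  void heartbeat(HeartbeatRpc rpc) throws IOException {
    lastHeartbeatTime = System.currentTimeMillis(); // updated even if the NN is unreachable
    rpc.send();                                     // throws when the connection is broken
    lastHeartbeatResponseTime = System.currentTimeMillis(); // reached only on success
  }
}
{code}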
[jira] [Created] (HDFS-16907) BP service actor LastHeartbeat is not sufficient to track realtime connection breaks
Viraj Jasani created HDFS-16907: --- Summary: BP service actor LastHeartbeat is not sufficient to track realtime connection breaks Key: HDFS-16907 URL: https://issues.apache.org/jira/browse/HDFS-16907 Project: Hadoop HDFS Issue Type: Improvement Reporter: Viraj Jasani Assignee: Viraj Jasani Attachments: Screenshot 2023-02-03 at 6.12.24 PM.png Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode that it is connected to. However, this is updated even if the connection to the namenode is broken. Suppose the actor thread keeps heartbeating to the namenode and suddenly the socket connection is broken. When this happens, for a certain time duration, the actor thread consistently keeps updating _lastHeartbeatTime_ before even initiating the heartbeat connection with the namenode. If the connection cannot be established even after RPC retries are exhausted, an IOException is thrown. This means that a heartbeat response has not been received from the namenode. In the loop, the actor thread keeps trying to connect for the heartbeat, and the last heartbeat stays close to 1/2s even though in reality no response is being received from the namenode. Sample exception from the BP service actor thread, during which LastHeartbeat stays very low: {code:java} 2023-02-03 22:34:55,725 WARN [xyz:9000] datanode.DataNode - IOException in offerService java.io.EOFException: End of File Exception between local host is: "dn-0"; destination host is: "nn-1":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) at org.apache.hadoop.ipc.Client.call(Client.java:1495) at org.apache.hadoop.ipc.Client.call(Client.java:1392) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129) at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code} Attaching screenshots of how the last heartbeat value looks when the above error is consistently getting logged. The last heartbeat response time is important to initiate any auto-recovery from the datanode. Hence, we should introduce LastHeartbeatResponseTime, which only gets updated if the BP service actor thread was successfully able to retrieve a response from the namenode. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice
[ https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16902: Description: Recently came across a k8s environment where, randomly, some datanode pods are not able to stay connected to all namenode pods (e.g. the last heartbeat time sometimes stays higher than 2 hr). When any standby namenode becomes active, any datanode that has not been heartbeating to it for quite some time would not be able to send any further block reports, leading to missing replicas immediately after namenode failover, which could only be resolved with a datanode pod restart. While the issue seems env specific, BPServiceActor's offer service could use some logging improvements. It is also good to get the namenode status exposed with BPServiceActorInfo to identify any lags from the datanode side in recognizing the updated Active namenode status with heartbeats. was: Recently came across a k8s environment where, randomly, some datanode pods are not able to stay connected to all namenode pods (e.g. the last heartbeat time sometimes stays higher than 2 hr). When a new namenode becomes active, any datanode that is not heartbeating to it would not be able to send any further block reports, sometimes leading to missing replicas, which would be resolved only with a datanode pod restart. While the issue seems env specific, BPServiceActor's offer service could use some logging improvements. It is also good to get the namenode status exposed with BPServiceActorInfo to identify any lags from the datanode side in recognizing the updated Active namenode status with heartbeats. > Add Namenode status to BPServiceActor metrics and improve logging in > offerservice > - > > Key: HDFS-16902 > URL: https://issues.apache.org/jira/browse/HDFS-16902 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > > Recently came across a k8s environment where, randomly, some datanode pods are > not able to stay connected to all namenode pods (e.g. the last heartbeat time > sometimes stays higher than 2 hr). When any standby namenode becomes active, > any datanode that has not been heartbeating to it for quite some time would not > be able to send any further block reports, leading to missing replicas > immediately after namenode failover, which could only be resolved with a > datanode pod restart. > While the issue seems env specific, BPServiceActor's offer service could use > some logging improvements. It is also good to get the namenode status exposed > with BPServiceActorInfo to identify any lags from the datanode side in > recognizing the updated Active namenode status with heartbeats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
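One way the namenode status could be surfaced, sketched against a plain info map rather than the actual BPServiceActor change; the keys and fields are assumptions:
{code:java}
import java.util.HashMap;
import java.util.Map;

/** Sketch; key and field names are illustrative, not the committed patch. */
class ActorInfo {
  String namenodeAddress;
  long lastHeartbeatSecs;
  String namenodeHaState; // e.g. "ACTIVE" or "STANDBY"; null until first response

  Map<String, String> toInfoMap() {
    Map<String, String> m = new HashMap<>();
    m.put("NamenodeAddress", namenodeAddress);
    m.put("LastHeartbeat", String.valueOf(lastHeartbeatSecs));
    m.put("NamenodeHaState", namenodeHaState == null ? "Unknown" : namenodeHaState);
    return m;
  }
}
{code}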
[jira] [Created] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice
Viraj Jasani created HDFS-16902: --- Summary: Add Namenode status to BPServiceActor metrics and improve logging in offerservice Key: HDFS-16902 URL: https://issues.apache.org/jira/browse/HDFS-16902 Project: Hadoop HDFS Issue Type: Task Reporter: Viraj Jasani Assignee: Viraj Jasani Recently came across a k8s environment where, randomly, some datanode pods are not able to stay connected to all namenode pods (e.g. the last heartbeat time sometimes stays higher than 2 hr). When a new namenode becomes active, any datanode that is not heartbeating to it would not be able to send any further block reports, sometimes leading to missing replicas, which would be resolved only with a datanode pod restart. While the issue seems env specific, BPServiceActor's offer service could use some logging improvements. It is also good to get the namenode status exposed with BPServiceActorInfo to identify any lags from the datanode side in recognizing the updated Active namenode status with heartbeats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16891) Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel
Viraj Jasani created HDFS-16891: --- Summary: Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel Key: HDFS-16891 URL: https://issues.apache.org/jira/browse/HDFS-16891 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.3.4 Reporter: Viraj Jasani Assignee: Viraj Jasani If we enable parallel loading and persisting of inodes from/to the fs image, we get the benefit of improved performance. However, while loading the sub-sections INODE_DIR_SUB and INODE_SUB, if we encounter any errors, we use a copy-on-write list to maintain the list of exceptions. Since our use case is not to iterate over this list while executor threads are adding new elements to it, using copy-on-write is a bit of an overhead for this use case. It would be better to synchronize adding new elements to the list rather than having the list copy all elements over every time a new element is added. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
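A minimal JDK-only demonstration of the trade-off (not the HDFS-16891 patch itself): CopyOnWriteArrayList copies the whole backing array on every add, while a synchronized wrapper just locks and appends, which suits a collect-then-inspect pattern like gathering loader exceptions:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class ExceptionListDemo {
  public static void main(String[] args) {
    // Copy-on-write: every add() allocates and copies a new backing array.
    List<Exception> cow = new CopyOnWriteArrayList<>();
    // Synchronized wrapper: add() locks and appends in place.
    List<Exception> sync = Collections.synchronizedList(new ArrayList<>());

    cow.add(new Exception("sub-section load failed"));
    sync.add(new Exception("sub-section load failed"));
    System.out.println(cow.size() + " " + sync.size()); // 1 1
  }
}
{code}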
[jira] [Created] (HDFS-16887) Log start and end of phase/step in startup progress
Viraj Jasani created HDFS-16887: --- Summary: Log start and end of phase/step in startup progress Key: HDFS-16887 URL: https://issues.apache.org/jira/browse/HDFS-16887 Project: Hadoop HDFS Issue Type: Improvement Reporter: Viraj Jasani Assignee: Viraj Jasani As part of Namenode startup progress, we have multiple phases, and steps within each phase, that are instantiated. While the startup progress view can be instantiated with the current view of a phase/step, having at least DEBUG logs for startup progress would be helpful to identify when a particular step for LOADING_FSIMAGE/SAVING_CHECKPOINT/LOADING_EDITS started and ended. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
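The kind of DEBUG logging meant here can be sketched with a plain SLF4J logger; the phase/step names mirror those above, but the wrapper class is illustrative:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative wrapper around startup-progress step transitions. */
class StartupStepLogger {
  private static final Logger LOG = LoggerFactory.getLogger(StartupStepLogger.class);

  void beginStep(String phase, String step) {
    LOG.debug("Beginning phase {} step {}", phase, step); // e.g. LOADING_FSIMAGE
  }

  void endStep(String phase, String step) {
    LOG.debug("Ending phase {} step {}", phase, step);
  }
}
{code}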
[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19
[ https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652123#comment-17652123 ] Viraj Jasani commented on HDFS-16652: - {quote}I can cherry-pick other branches too. Please let me know which all branches. only for branch-3.3.? {quote} Thank you [~brahmareddy]! IMHO back-porting to branch-3.3 would be great (we had to keep this patch on the 3.3 branch anyway, due to the severity of the reported vulnerability). > Upgrade jquery datatable version references to v1.10.19 > --- > > Key: HDFS-16652 > URL: https://issues.apache.org/jira/browse/HDFS-16652 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: HDFS-16652.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Upgrade jquery datatable version references in hdfs webapp to v1.10.19 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627290#comment-17627290 ] Viraj Jasani commented on HDFS-16829: - {quote}I think if there is one node only, we can give it to try that we have syncBlock always true in such cases {quote} If there is only one node, can having syncBlock set to true have much latency impact? > Delay deleting blocks with older generation stamp until the block is fully > replicated. > -- > > Key: HDFS-16829 > URL: https://issues.apache.org/jira/browse/HDFS-16829 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.10.1 >Reporter: Rushabh Shah >Priority: Critical > > We encountered this data loss issue in one of our production clusters which > runs the hbase service. We received a missing block alert in this cluster. This > error was logged in the datanode holding the block. > {noformat} > 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - > nodeA:51010:DataXceiver error processing READ_BLOCK operation src: > /nodeA:31722 dst: > java.io.IOException: Offset 64410559 and length 4096 don't match block > BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 ) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298) > at java.lang.Thread.run(Thread.java:750) > {noformat} > The node +nodeA+ has this block blk_3317546151_2244173222 with file length: > 59158528 but the length of this block according to the namenode is 64414655 > (according to fsck). > This is the sequence of events for this block. > > 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and > genstamp: 2244173147. > 2. The first datanode in the pipeline (this physical host was also running the > region server process, which was the hdfs client) was restarting at the same time. > Unfortunately this node was sick and didn't log anything in either the > datanode process or the regionserver process during the time of block creation. > 3. Namenode updated the pipeline with just the first node. > 4. Namenode logged updatePipeline success with just the 1st node nodeA with block > size: 64414655 and new generation stamp: 2244173222. > 5. Namenode asked nodeB and nodeC to delete the block since it has the old > generation stamp. > 6. All the reads (client reads and data transfer reads) from nodeA are > failing with the above stack trace. > See logs below from the namenode and nodeB and nodeC. 
> {noformat} > Logs from namenode - > 2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - > BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 > , nodeC:51010 for > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147, newGS=2244173222, > newLength=64414655, newNodes=[nodeA:51010], > client=DFSClient_NONMAPREDUCE_1038417265_1) > 2022-10-23 12:36:35,004 INFO [on default port 8020] hdfs.StateChange - DIR* > completeFile: is closed by DFSClient_NONMAPREDUCE_1038417265_1 > {noformat} > {noformat} > - Logs from nodeB - > 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - > Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 > from nodeA:30302 > 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - > PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, > type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating > 2022-10-23 12:36:39,738 INFO [/data-2/hdfs/current] > impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 > blk_3317546151_2244173147 file > /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151 > {noformat} > > {noformat} > - Logs from nodeC - > 2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode - > Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 > from nodeB:56486 > 2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode - > PacketResponder: BP-9588
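For context on the syncBlock question above: with the public client API, syncBlock semantics correspond to passing CreateFlag.SYNC_BLOCK at create time or calling hsync() on the stream. A minimal sketch (the path, replication and sizes are illustrative, and this does not by itself address the generation stamp bug discussed here):
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class SyncBlockDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/sync-block-demo");
    try (FSDataOutputStream out = fs.create(p, FsPermission.getFileDefault(),
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE, CreateFlag.SYNC_BLOCK),
        4096, (short) 3, 128L * 1024 * 1024, null)) {
      out.write("payload".getBytes(StandardCharsets.UTF_8));
      out.hsync(); // wait for the datanodes to flush the data to disk
    }
  }
}
{code}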
[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19
[ https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602007#comment-17602007 ] Viraj Jasani commented on HDFS-16652: - [~groot] I am talking about YARN-8854. I have commented on YARN-8854 as well to get clarification on the title vs the commit diff. This current Jira is good; my only request is that it would be good to backport the [PR|https://github.com/apache/hadoop/pull/4562] to branch-3.3 also. > Upgrade jquery datatable version references to v1.10.19 > --- > > Key: HDFS-16652 > URL: https://issues.apache.org/jira/browse/HDFS-16652 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: HDFS-16652.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Upgrade jquery datatable version references in hdfs webapp to v1.10.19 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19
[ https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16652: Target Version/s: 3.4.0, 3.3.9 > Upgrade jquery datatable version references to v1.10.19 > --- > > Key: HDFS-16652 > URL: https://issues.apache.org/jira/browse/HDFS-16652 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: HDFS-16652.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Upgrade jquery datatable version references in hdfs webapp to v1.10.19 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19
[ https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601972#comment-17601972 ] Viraj Jasani commented on HDFS-16652: - Looks like the YARN-8854 title says it upgraded datatable to 1.10.19 but the patch upgraded it to 1.10.18. Let me try to clarify on the Jira. > Upgrade jquery datatable version references to v1.10.19 > --- > > Key: HDFS-16652 > URL: https://issues.apache.org/jira/browse/HDFS-16652 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: HDFS-16652.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Upgrade jquery datatable version references in hdfs webapp to v1.10.19 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19
[ https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601970#comment-17601970 ] Viraj Jasani commented on HDFS-16652: - FYI [~apurtell] reg the jquery datatable vulnerability on the 3.3 release line. It seems that HDFS-6407 added datatable 1.10.7 in HDFS and the version has not been upgraded for HDFS since. YARN-8854 did upgrade datatable to 1.10.18 but only for Yarn. > Upgrade jquery datatable version references to v1.10.19 > --- > > Key: HDFS-16652 > URL: https://issues.apache.org/jira/browse/HDFS-16652 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: HDFS-16652.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Upgrade jquery datatable version references in hdfs webapp to v1.10.19 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19
[ https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601968#comment-17601968 ] Viraj Jasani commented on HDFS-16652: - [~dmmkr] thanks for this work. Are you planning to create a backport PR for branch-3.3 as well? > Upgrade jquery datatable version references to v1.10.19 > --- > > Key: HDFS-16652 > URL: https://issues.apache.org/jira/browse/HDFS-16652 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: HDFS-16652.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Upgrade jquery datatable version references in hdfs webapp to v1.10.19 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error
[ https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani reassigned HDFS-16702: --- Assignee: Steve Vaughan (was: Viraj Jasani) > MiniDFSCluster should report cause of exception in assertion error > -- > > Key: HDFS-16702 > URL: https://issues.apache.org/jira/browse/HDFS-16702 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs > Environment: Tests running in the Hadoop dev environment image. >Reporter: Steve Vaughan >Assignee: Steve Vaughan >Priority: Minor > Labels: pull-request-available > Time Spent: 3.5h > Remaining Estimate: 0h > > When the MiniDFSCluster detects that an exception caused an exit, it should > include that exception as the cause for the AssertionError that it throws. > The current AssertionError simply reports the message "Test resulted in an > unexpected exit" and provides a stack trace to the location of the check for > an exit exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error
[ https://issues.apache.org/jira/browse/HDFS-16702 ] Viraj Jasani deleted comment on HDFS-16702: - was (Author: vjasani): In fact, we can make a generic change to ExitException so that its object always prints the cause of the ExitException. > MiniDFSCluster should report cause of exception in assertion error > -- > > Key: HDFS-16702 > URL: https://issues.apache.org/jira/browse/HDFS-16702 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs > Environment: Tests running in the Hadoop dev environment image. >Reporter: Steve Vaughan >Assignee: Viraj Jasani >Priority: Minor > > When the MiniDFSCluster detects that an exception caused an exit, it should > include that exception as the cause for the AssertionError that it throws. > The current AssertionError simply reports the message "Test resulted in an > unexpected exit" and provides a stack trace to the location of the check for > an exit exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error
[ https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573963#comment-17573963 ] Viraj Jasani commented on HDFS-16702: - In fact, we can make a generic change to ExitException so that it always prints the cause of the ExitException. > MiniDFSCluster should report cause of exception in assertion error > -- > > Key: HDFS-16702 > URL: https://issues.apache.org/jira/browse/HDFS-16702 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs > Environment: Tests running in the Hadoop dev environment image. >Reporter: Steve Vaughan >Assignee: Viraj Jasani >Priority: Minor > > When the MiniDFSCluster detects that an exception caused an exit, it should > include that exception as the cause for the AssertionError that it throws. > The current AssertionError simply reports the message "Test resulted in an > unexpected exit" and provides a stack trace to the location of the check for > an exit exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
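A minimal sketch of the idea discussed above, assuming Hadoop's ExitUtil helpers (terminateCalled(), getFirstExitException()); the surrounding class and method names are illustrative, not the actual MiniDFSCluster code:
{code:java}
import org.apache.hadoop.util.ExitUtil;
import org.apache.hadoop.util.ExitUtil.ExitException;

public final class ExitCheck {
  private ExitCheck() {
  }

  /** Fail the test if a System.exit() was intercepted, preserving its cause. */
  public static void assertNoUnexpectedExit() {
    if (ExitUtil.terminateCalled()) {
      ExitException ee = ExitUtil.getFirstExitException();
      // Chain the intercepted exit as the cause so the real failure shows up
      // in the stack trace instead of only "Test resulted in an unexpected exit".
      throw new AssertionError("Test resulted in an unexpected exit", ee);
    }
  }
}
{code}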
[jira] [Commented] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error
[ https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573962#comment-17573962 ] Viraj Jasani commented on HDFS-16702: - I did encounter this some time back and had a similar thought, but somehow missed creating a Jira. Let me take this up? > MiniDFSCluster should report cause of exception in assertion error > -- > > Key: HDFS-16702 > URL: https://issues.apache.org/jira/browse/HDFS-16702 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs > Environment: Tests running in the Hadoop dev environment image. >Reporter: Steve Vaughan >Priority: Minor > > When the MiniDFSCluster detects that an exception caused an exit, it should > include that exception as the cause for the AssertionError that it throws. > The current AssertionError simply reports the message "Test resulted in an > unexpected exit" and provides a stack trace to the location of the check for > an exit exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error
[ https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani reassigned HDFS-16702: --- Assignee: Viraj Jasani > MiniDFSCluster should report cause of exception in assertion error > -- > > Key: HDFS-16702 > URL: https://issues.apache.org/jira/browse/HDFS-16702 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs > Environment: Tests running in the Hadoop dev environment image. >Reporter: Steve Vaughan >Assignee: Viraj Jasani >Priority: Minor > > When the MiniDFSCluster detects that an exception caused an exit, it should > include that exception as the cause for the AssertionError that it throws. > The current AssertionError simply reports the message "Test resulted in an > unexpected exit" and provides a stack trace to the location of the check for > an exit exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16637) TestHDFSCLI#testAll consistently failing
[ https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187 ] Viraj Jasani edited comment on HDFS-16637 at 6/20/22 7:43 PM: -- No worries at all [~jianghuazhu], it happens to everyone; build results are sometimes overlooked in a hurry, and we learn from it later :) Thanks for your contributions! was (Author: vjasani): No worries at all [~jianghuazhu], it happens to everyone; build results are sometimes overlooked in a hurry, and we learn from it later :) > TestHDFSCLI#testAll consistently failing > > > Key: HDFS-16637 > URL: https://issues.apache.org/jira/browse/HDFS-16637 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > The failure seems to have been caused by output change introduced by > HDFS-16581. > {code:java} > 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(146)) - Detailed results: > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(147)) - > --2022-06-19 15:41:16,184 [Listener at > localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(156)) - > --- > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(157)) - Test ID: [629] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(158)) - Test Description: > [printTopology: verifying that the topology map is what we expect] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(159)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs > hdfs://localhost:51486 -printTopology] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(167)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(174)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(178)) - Comparator: > [RegexpAcrossOutputComparator] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(180)) - Comparision result: > [fail] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(182)) - Expected output: > [^Rack: > \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)] > 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(184)) - Actual output: > [Rack: /rack1 > 127.0.0.1:51487 (localhost) In Service > 127.0.0.1:51491 (localhost) In ServiceRack: /rack2 > 127.0.0.1:51500 (localhost) In Service > 127.0.0.1:51496 (localhost) In Service > 127.0.0.1:51504 (localhost) In ServiceRack: /rack3 > 127.0.0.1:51508 (localhost) In ServiceRack: /rack4 > 127.0.0.1:51512 (localhost) In Service > 127.0.0.1:51516 (localhost) In Service] > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16637) TestHDFSCLI#testAll consistently failing
[ https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187 ] Viraj Jasani edited comment on HDFS-16637 at 6/20/22 7:43 PM: -- No worries at all [~jianghuazhu], it happens to everyone; build results are sometimes overlooked in a hurry, and we learn from it later :) was (Author: vjasani): No worries at all [~jianghuazhu], it happens to everyone; build results are sometimes overlooked in a hurry, and we learn from it :) > TestHDFSCLI#testAll consistently failing > > > Key: HDFS-16637 > URL: https://issues.apache.org/jira/browse/HDFS-16637 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > The failure seems to have been caused by output change introduced by > HDFS-16581. > {code:java} > 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(146)) - Detailed results: > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(147)) - > --2022-06-19 15:41:16,184 [Listener at > localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(156)) - > --- > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(157)) - Test ID: [629] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(158)) - Test Description: > [printTopology: verifying that the topology map is what we expect] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(159)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs > hdfs://localhost:51486 -printTopology] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(167)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(174)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(178)) - Comparator: > [RegexpAcrossOutputComparator] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(180)) - Comparision result: > [fail] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(182)) - Expected output: > [^Rack: > \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)] > 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(184)) - Actual output: > [Rack: /rack1 > 127.0.0.1:51487 (localhost) In Service > 127.0.0.1:51491 (localhost) In ServiceRack: /rack2 > 127.0.0.1:51500 (localhost) In Service > 127.0.0.1:51496 (localhost) In Service > 127.0.0.1:51504 (localhost) In ServiceRack: /rack3 > 127.0.0.1:51508 (localhost) In ServiceRack: /rack4 > 127.0.0.1:51512 (localhost) In Service > 127.0.0.1:51516 (localhost) In Service] > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16637) TestHDFSCLI#testAll consistently failing
[ https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187 ] Viraj Jasani edited comment on HDFS-16637 at 6/20/22 7:02 PM: -- No worries at all [~jianghuazhu], it happens to everyone; build results are sometimes overlooked in a hurry, and we learn from it :) was (Author: vjasani): No worries at all [~jianghuazhu], this is not carelessness at all; it happens to everyone :) > TestHDFSCLI#testAll consistently failing > > > Key: HDFS-16637 > URL: https://issues.apache.org/jira/browse/HDFS-16637 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > The failure seems to have been caused by output change introduced by > HDFS-16581. > {code:java} > 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(146)) - Detailed results: > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(147)) - > --2022-06-19 15:41:16,184 [Listener at > localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(156)) - > --- > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(157)) - Test ID: [629] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(158)) - Test Description: > [printTopology: verifying that the topology map is what we expect] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(159)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs > hdfs://localhost:51486 -printTopology] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(167)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(174)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(178)) - Comparator: > [RegexpAcrossOutputComparator] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(180)) - Comparision result: > [fail] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(182)) - Expected output: > [^Rack: > \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)] > 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(184)) - Actual output: > [Rack: /rack1 > 127.0.0.1:51487 (localhost) In Service > 127.0.0.1:51491 (localhost) In ServiceRack: /rack2 > 127.0.0.1:51500 (localhost) In Service > 127.0.0.1:51496 (localhost) In Service > 127.0.0.1:51504 (localhost) In ServiceRack: /rack3 > 127.0.0.1:51508 (localhost) In ServiceRack: /rack4 > 127.0.0.1:51512 (localhost) In Service > 127.0.0.1:51516 (localhost) In Service] > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16637) TestHDFSCLI#testAll consistently failing
[ https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187 ] Viraj Jasani commented on HDFS-16637: - No worries at all [~jianghuazhu], this is not carelessness at all; it happens to everyone :) > TestHDFSCLI#testAll consistently failing > > > Key: HDFS-16637 > URL: https://issues.apache.org/jira/browse/HDFS-16637 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The failure seems to have been caused by output change introduced by > HDFS-16581. > {code:java} > 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(146)) - Detailed results: > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(147)) - > --2022-06-19 15:41:16,184 [Listener at > localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(156)) - > --- > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(157)) - Test ID: [629] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(158)) - Test Description: > [printTopology: verifying that the topology map is what we expect] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(159)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs > hdfs://localhost:51486 -printTopology] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(167)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(174)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(178)) - Comparator: > [RegexpAcrossOutputComparator] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(180)) - Comparision result: > [fail] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(182)) - Expected output: > [^Rack: > \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)] > 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(184)) - Actual output: > [Rack: /rack1 > 127.0.0.1:51487 (localhost) In Service > 127.0.0.1:51491 (localhost) In ServiceRack: /rack2 > 127.0.0.1:51500 (localhost) In Service > 127.0.0.1:51496 (localhost) In Service > 127.0.0.1:51504 (localhost) In ServiceRack: /rack3 > 127.0.0.1:51508 (localhost) In ServiceRack: /rack4 > 127.0.0.1:51512 (localhost) In Service > 127.0.0.1:51516 (localhost) In Service] > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16637) TestHDFSCLI#testAll consistently failing
Viraj Jasani created HDFS-16637: --- Summary: TestHDFSCLI#testAll consistently failing Key: HDFS-16637 URL: https://issues.apache.org/jira/browse/HDFS-16637 Project: Hadoop HDFS Issue Type: Test Reporter: Viraj Jasani Assignee: Viraj Jasani The failure seems to have been caused by output change introduced by HDFS-16581. {code:java} 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(146)) - Detailed results: 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(147)) - --2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(156)) - --- 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(157)) - Test ID: [629] 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(158)) - Test Description: [printTopology: verifying that the topology map is what we expect] 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(159)) - 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs hdfs://localhost:51486 -printTopology] 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(167)) - 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(174)) - 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(178)) - Comparator: [RegexpAcrossOutputComparator] 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(180)) - Comparision result: [fail] 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(182)) - Expected output: [^Rack: \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)] 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(184)) - Actual output: [Rack: /rack1 127.0.0.1:51487 (localhost) In Service 127.0.0.1:51491 (localhost) In ServiceRack: /rack2 127.0.0.1:51500 (localhost) In Service 127.0.0.1:51496 (localhost) In Service 127.0.0.1:51504 (localhost) In ServiceRack: /rack3 127.0.0.1:51508 (localhost) In ServiceRack: /rack4 127.0.0.1:51512 (localhost) In Service 127.0.0.1:51516 (localhost) In Service] {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
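To make the regex mismatch above concrete, here is a small standalone check; the adjusted pattern is only an illustration of the needed change, and the actual fix lands in the test's expected-output config (e.g. testHDFSConf.xml) and may differ:
{code:java}
import java.util.regex.Pattern;

public class TopologyRegexCheck {
  public static void main(String[] args) {
    // Shape of the new -printTopology output: each address now carries a state.
    String actual = "Rack: /rack1\n"
        + "   127.0.0.1:51487 (localhost) In Service\n"
        + "   127.0.0.1:51491 (localhost) In Service";
    // Old expectation: only whitespace between "(host)" and the next address.
    Pattern old = Pattern.compile("^Rack: \\/rack1\\s*127\\.0\\.0\\.1:\\d+\\s"
        + "\\([-.a-zA-Z0-9]+\\)\\s*127\\.0\\.0\\.1:\\d+\\s\\([-.a-zA-Z0-9]+\\)");
    // Adjusted expectation: tolerate the "In Service" state suffix.
    Pattern adjusted = Pattern.compile("^Rack: \\/rack1\\s*127\\.0\\.0\\.1:\\d+\\s"
        + "\\([-.a-zA-Z0-9]+\\)\\sIn Service\\s*127\\.0\\.0\\.1:\\d+\\s\\([-.a-zA-Z0-9]+\\)");
    System.out.println(old.matcher(actual).find());      // prints: false
    System.out.println(adjusted.matcher(actual).find()); // prints: true
  }
}
{code}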
[jira] [Created] (HDFS-16634) Dynamically adjust slow peer report size on JMX metrics
Viraj Jasani created HDFS-16634: --- Summary: Dynamically adjust slow peer report size on JMX metrics Key: HDFS-16634 URL: https://issues.apache.org/jira/browse/HDFS-16634 Project: Hadoop HDFS Issue Type: Task Reporter: Viraj Jasani Assignee: Viraj Jasani On a busy cluster, it sometimes takes a bit of time for a deleted node's (removed from the cluster) "slow node report" to get removed from the slow peer json report on Namenode JMX metrics. In the meantime, the user should be able to browse through more entries in the report by adjusting, i.e. reconfiguring, "dfs.datanode.max.nodes.to.report", so that the list size can be adjusted without having to bounce the active Namenode just for this purpose. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
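For reference, once a property is reconfigurable it can be applied live through dfsadmin; a rough sketch of the flow (the nn-host:8020 address is a placeholder for the active Namenode's RPC endpoint):
{code:java}
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;

public class ReconfigMaxNodesToReport {
  public static void main(String[] args) throws Exception {
    // Equivalent to running, after editing dfs.datanode.max.nodes.to.report
    // in hdfs-site.xml:
    //   hdfs dfsadmin -reconfig namenode <nn-host>:<rpc-port> start
    //   hdfs dfsadmin -reconfig namenode <nn-host>:<rpc-port> status
    DFSAdmin admin = new DFSAdmin(new HdfsConfiguration());
    admin.run(new String[] {"-reconfig", "namenode", "nn-host:8020", "start"});
    admin.run(new String[] {"-reconfig", "namenode", "nn-host:8020", "status"});
  }
}
{code}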
[jira] [Assigned] (HDFS-15982) Deleted data using HTTP API should be saved to the trash
[ https://issues.apache.org/jira/browse/HDFS-15982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani reassigned HDFS-15982: --- Assignee: (was: Viraj Jasani) > Deleted data using HTTP API should be saved to the trash > > > Key: HDFS-15982 > URL: https://issues.apache.org/jira/browse/HDFS-15982 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs, hdfs-client, httpfs, webhdfs >Reporter: Bhavik Patel >Priority: Major > Labels: pull-request-available > Attachments: Screenshot 2021-04-23 at 4.19.42 PM.png, Screenshot > 2021-04-23 at 4.36.57 PM.png > > Time Spent: 13h 20m > Remaining Estimate: 0h > > If we delete data from the Web UI, it should first be moved to the > configured/default Trash directory, and after the trash interval time, it > should be removed. Currently, data is directly removed from the system. [This > behavior should be the same as the CLI cmd.] > This can be helpful when the user accidentally deletes data from the Web UI. > Similarly, we should provide a "Skip Trash" option in the HTTP API as well, > which should be accessible through the Web UI. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16618) sync_file_range error should include more volume and file info
[ https://issues.apache.org/jira/browse/HDFS-16618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16618: Priority: Minor (was: Major) > sync_file_range error should include more volume and file info > -- > > Key: HDFS-16618 > URL: https://issues.apache.org/jira/browse/HDFS-16618 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Minor > > We have recently seen multiple sync_file_range errors with "Bad file > descriptor"; it would be good to include more volume stats as well as file > offset/length info with the error log to get some more insights. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16618) sync_file_range error should include more volume and file info
Viraj Jasani created HDFS-16618: --- Summary: sync_file_range error should include more volume and file info Key: HDFS-16618 URL: https://issues.apache.org/jira/browse/HDFS-16618 Project: Hadoop HDFS Issue Type: Task Reporter: Viraj Jasani Assignee: Viraj Jasani We have recently seen multiple sync_file_range errors with "Bad file descriptor"; it would be good to include more volume stats as well as file offset/length info with the error log to get some more insights. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
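As a sketch, the extra context could be attached where the native call is made. This assumes Hadoop's NativeIO.POSIX sync_file_range wrapper; the helper class itself is hypothetical, not the actual Datanode code:
{code:java}
import java.io.FileDescriptor;

import org.apache.hadoop.io.nativeio.NativeIO;
import org.apache.hadoop.io.nativeio.NativeIOException;

public final class SyncFileRangeHelper {
  private SyncFileRangeHelper() {
  }

  /** Re-throw sync_file_range failures with file/offset/length context. */
  public static void syncFileRange(FileDescriptor fd, String path, long offset,
      long nbytes) throws NativeIOException {
    try {
      NativeIO.POSIX.syncFileRangeIfPossible(fd, offset, nbytes,
          NativeIO.POSIX.SYNC_FILE_RANGE_WRITE);
    } catch (NativeIOException e) {
      // Keep the errno but enrich the message with what was being synced.
      NativeIOException enriched = new NativeIOException(
          "sync_file_range failed for " + path + " (offset=" + offset
              + ", nbytes=" + nbytes + "): " + e.getMessage(), e.getErrno());
      enriched.initCause(e);
      throw enriched;
    }
  }
}
{code}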
[jira] [Updated] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits
[ https://issues.apache.org/jira/browse/HDFS-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16595: Release Note: Namenode metrics that represent Slownode Json now include three important factors (median, median absolute deviation, upper latency limit) that can help users determine how urgently a given slownode requires manual intervention. > Slow peer metrics - add median, mad and upper latency limits > > > Key: HDFS-16595 > URL: https://issues.apache.org/jira/browse/HDFS-16595 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Slow datanode metrics include slow node and its reporting node details. With > HDFS-16582, we added the aggregate latency that is perceived by the reporting > nodes. > In order to get more insights into how the outlier slownode's latencies > differ from the rest of the nodes, we should also expose median, median > absolute deviation and the calculated upper latency limit details. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits
Viraj Jasani created HDFS-16595: --- Summary: Slow peer metrics - add median, mad and upper latency limits Key: HDFS-16595 URL: https://issues.apache.org/jira/browse/HDFS-16595 Project: Hadoop HDFS Issue Type: New Feature Reporter: Viraj Jasani Assignee: Viraj Jasani Slow datanode metrics include slow node and its reporting node details. With HDFS-16582, we added the aggregate latency that is perceived by the reporting nodes. In order to get more insights into how the outlier slownode's latencies differ from the rest of the nodes, we should also expose median, median absolute deviation and the calculated upper latency limit details. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
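For intuition, the three factors relate as follows. This is a self-contained illustration; the constants 1.4826 and 3 follow the common MAD-based outlier rule and are assumptions here, not necessarily the exact values used by Hadoop's outlier detector:
{code:java}
import java.util.Arrays;

public class SlowPeerStats {
  static double median(double[] sorted) {
    int n = sorted.length;
    return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  }

  public static void main(String[] args) {
    double[] latencies = {1.2, 1.4, 1.3, 1.5, 9.8}; // ms; one obvious outlier
    double[] sorted = latencies.clone();
    Arrays.sort(sorted);
    double med = median(sorted);

    // Median absolute deviation, scaled so it estimates stddev for normal data.
    double[] dev = new double[latencies.length];
    for (int i = 0; i < latencies.length; i++) {
      dev[i] = Math.abs(latencies[i] - med);
    }
    Arrays.sort(dev);
    double mad = 1.4826 * median(dev);

    // Nodes whose latency exceeds this limit get flagged as slow.
    double upperLimit = med + 3 * mad;
    System.out.printf("median=%.2f mad=%.2f upperLimit=%.2f%n", med, mad, upperLimit);
  }
}
{code}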
[jira] [Commented] (HDFS-16582) Expose aggregate latency of slow node as perceived by the reporting node
[ https://issues.apache.org/jira/browse/HDFS-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538569#comment-17538569 ] Viraj Jasani commented on HDFS-16582: - FYI [~stack] > Expose aggregate latency of slow node as perceived by the reporting node > > > Key: HDFS-16582 > URL: https://issues.apache.org/jira/browse/HDFS-16582 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > > When any datanode is reported to be slower by another node, we expose the > slow node as well as the reporting nodes list for the slow node. However, we > don't provide latency numbers of the slownode as reported by the reporting > node. Having the latency exposed in the metrics would be really helpful for > operators to keep track of how far behind a given slow node is performing > compared to the rest of the nodes in the cluster. > The operator should be able to gather aggregated latencies of all slow nodes > with their reporting nodes in Namenode metrics. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16582) Expose aggregate latency of slow node as perceived by the reporting node
Viraj Jasani created HDFS-16582: --- Summary: Expose aggregate latency of slow node as perceived by the reporting node Key: HDFS-16582 URL: https://issues.apache.org/jira/browse/HDFS-16582 Project: Hadoop HDFS Issue Type: New Feature Reporter: Viraj Jasani Assignee: Viraj Jasani When any datanode is reported to be slower by another node, we expose the slow node as well as the reporting nodes list for the slow node. However, we don't provide latency numbers of the slownode as reported by the reporting node. Having the latency exposed in the metrics would be really helpful for operators to keep track of how far behind a given slow node is performing compared to the rest of the nodes in the cluster. The operator should be able to gather aggregated latencies of all slow nodes with their reporting nodes in Namenode metrics. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16568) dfsadmin -reconfig option to start/query reconfig on all live datanodes
[ https://issues.apache.org/jira/browse/HDFS-16568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16568: Target Version/s: 3.4.0, 3.3.4 (was: 3.4.0) > dfsadmin -reconfig option to start/query reconfig on all live datanodes > --- > > Key: HDFS-16568 > URL: https://issues.apache.org/jira/browse/HDFS-16568 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > DFSAdmin provides an option to initiate or query the status of a > reconfiguration operation on only a specific host, based on the host:port > provided by the user. It would be good to provide the ability to initiate > such operations in bulk, on all live datanodes. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes
[ https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16521: Target Version/s: 3.4.0, 3.3.4 (was: 3.4.0) > DFS API to retrieve slow datanodes > -- > > Key: HDFS-16521 > URL: https://issues.apache.org/jira/browse/HDFS-16521 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 6h 40m > Remaining Estimate: 0h > > Providing DFS API to retrieve slow nodes would help add an additional option > to "dfsadmin -report" that lists slow datanodes info for operators to take a > look, specifically useful filter for larger clusters. > The other purpose of such API is for HDFS downstreamers without direct access > to namenode http port (only rpc port accessible) to retrieve slownodes. > Moreover, > [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java] > in HBase currently has to rely on its own way of marking and excluding slow > nodes while 1) creating pipelines and 2) handling ack, based on factors like > the data length of the packet, processing time with last ack timestamp, > whether flush to replicas is finished etc. If it can utilize slownode API > from HDFS to exclude nodes appropriately while writing block, a lot of its > own post-ack computation of slow nodes can be _saved_ or _improved_ or based > on further experiment, we could find _better solution_ to manage slow node > detection logic both in HDFS and HBase. However, in order to collect more > data points and run more POC around this area, HDFS should provide API for > downstreamers to efficiently utilize slownode info for such critical > low-latency use-case (like writing WALs). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16568) dfsadmin -reconfig option to start/query reconfig on all live datanodes
Viraj Jasani created HDFS-16568: --- Summary: dfsadmin -reconfig option to start/query reconfig on all live datanodes Key: HDFS-16568 URL: https://issues.apache.org/jira/browse/HDFS-16568 Project: Hadoop HDFS Issue Type: New Feature Reporter: Viraj Jasani Assignee: Viraj Jasani DFSAdmin provides an option to initiate or query the status of a reconfiguration operation on only a specific host, based on the host:port provided by the user. It would be good to provide the ability to initiate such operations in bulk, on all live datanodes. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
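A rough sketch of what the bulk path could look like under the hood, assuming the existing per-datanode reconfiguration RPC (ClientDatanodeProtocol#startReconfiguration) and live-node listing; the actual DFSAdmin wiring and CLI keyword may differ:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DFSUtilClient;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.DatanodeReportType;

public class BulkDatanodeReconfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    // List live datanodes, then kick off reconfiguration on each one.
    for (DatanodeInfo dn : dfs.getDataNodeStats(DatanodeReportType.LIVE)) {
      ClientDatanodeProtocol dnProxy = DFSUtilClient
          .createClientDatanodeProtocolProxy(dn, conf, 60000, false);
      dnProxy.startReconfiguration(); // poll later via getReconfigurationStatus()
      System.out.println("Started reconfiguration on " + dn.getXferAddr());
    }
  }
}
{code}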
[jira] [Updated] (HDFS-16528) Reconfigure slow peer enable for Namenode
[ https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16528: Fix Version/s: 3.3.4 (was: 3.3.0) > Reconfigure slow peer enable for Namenode > - > > Key: HDFS-16528 > URL: https://issues.apache.org/jira/browse/HDFS-16528 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.4 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > HDFS-16396 provides reconfig options for several configs associated with > slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 have added some > slownodes related configs as the reconfig options in Namenode. > The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as > reconfigurable option for Namenode (similar to how HDFS-16396 has included it > for Datanode). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16528) Reconfigure slow peer enable for Namenode
[ https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530574#comment-17530574 ] Viraj Jasani commented on HDFS-16528: - Thank you for the review [~tomscut] ! > Reconfigure slow peer enable for Namenode > - > > Key: HDFS-16528 > URL: https://issues.apache.org/jira/browse/HDFS-16528 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > HDFS-16396 provides reconfig options for several configs associated with > slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 have added some > slownodes related configs as the reconfig options in Namenode. > The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as > reconfigurable option for Namenode (similar to how HDFS-16396 has included it > for Datanode). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16528) Reconfigure slow peer enable for Namenode
[ https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16528 started by Viraj Jasani. --- > Reconfigure slow peer enable for Namenode > - > > Key: HDFS-16528 > URL: https://issues.apache.org/jira/browse/HDFS-16528 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 3h 40m > Remaining Estimate: 0h > > HDFS-16396 provides reconfig options for several configs associated with > slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 have added some > slownodes related configs as the reconfig options in Namenode. > The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as > reconfigurable option for Namenode (similar to how HDFS-16396 has included it > for Datanode). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16528) Reconfigure slow peer enable for Namenode
[ https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16528: Status: Patch Available (was: In Progress) > Reconfigure slow peer enable for Namenode > - > > Key: HDFS-16528 > URL: https://issues.apache.org/jira/browse/HDFS-16528 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 3h 40m > Remaining Estimate: 0h > > HDFS-16396 provides reconfig options for several configs associated with > slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 have added some > slownodes related configs as the reconfig options in Namenode. > The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as > reconfigurable option for Namenode (similar to how HDFS-16396 has included it > for Datanode). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes
[ https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16521: Description: Providing DFS API to retrieve slow nodes would help add an additional option to "dfsadmin -report" that lists slow datanodes info for operators to take a look, specifically useful filter for larger clusters. The other purpose of such API is for HDFS downstreamers without direct access to namenode http port (only rpc port accessible) to retrieve slownodes. Moreover, [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java] in HBase currently has to rely on its own way of marking and excluding slow nodes while 1) creating pipelines and 2) handling ack, based on factors like the data length of the packet, processing time with last ack timestamp, whether flush to replicas is finished etc. If it can utilize slownode API from HDFS to exclude nodes appropriately while writing block, a lot of its own post-ack computation of slow nodes can be _saved_ or _improved_ or based on further experiment, we could find _better solution_ to manage slow node detection logic both in HDFS and HBase. However, in order to collect more data points and run more POC around this area, HDFS should provide API for downstreamers to efficiently utilize slownode info for such critical low-latency use-case (like writing WALs). was: Providing DFS API to retrieve slow nodes would help add an additional option to "dfsadmin -report" that lists slow datanodes info for operators to take a look, specifically useful filter for larger clusters. The other purpose of such API is for HDFS downstreamers without direct access to namenode http port (only rpc port accessible) to retrieve slownodes. Moreover, [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java] in HBase currently has to rely on its own way of marking and excluding slow nodes while 1) creating pipelines and 2) handling ack, based on factors like the data length of the packet, processing time with last ack timestamp, whether flush to replicas is finished etc. If it can utilize slownode API from HDFS to exclude nodes appropriately while writing block, a lot of its own post-ack computation of slow nodes can be _saved_ or _improved_ or based on further experiment, we could find _better solution_ to manage slow node detection logic both in HDFS and HBase. However, in order to collect more data points and run more POC around this area, at least we should expect HDFS to provide API for downstreamers to efficiently utilize slownode info for such critical low-latency use-case (like writing WALs). > DFS API to retrieve slow datanodes > -- > > Key: HDFS-16521 > URL: https://issues.apache.org/jira/browse/HDFS-16521 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > Providing DFS API to retrieve slow nodes would help add an additional option > to "dfsadmin -report" that lists slow datanodes info for operators to take a > look, specifically useful filter for larger clusters. > The other purpose of such API is for HDFS downstreamers without direct access > to namenode http port (only rpc port accessible) to retrieve slownodes. > Moreover, > [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java] > in HBase currently has to rely on its own way of marking and excluding slow > nodes while 1) creating pipelines and 2) handling ack, based on factors like > the data length of the packet, processing time with last ack timestamp, > whether flush to replicas is finished etc. If it can utilize slownode API > from HDFS to exclude nodes appropriately while writing block, a lot of its > own post-ack computation of slow nodes can be _saved_ or _improved_ or based > on further experiment, we could find _better solution_ to manage slow node > detection logic both in HDFS and HBase. However, in order to collect more > data points and run more POC around this area, HDFS should provide API for > downstreamers to efficiently utilize slownode info for such critical > low-latency use-case (like writing WALs). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
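For downstreamers, consuming the API over RPC could look roughly like this; the method name getSlowDatanodeStats follows this patch but should be verified against the Hadoop version in use:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class SlowNodeProbe {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    // Candidates to exclude from a low-latency write pipeline (e.g. WALs).
    for (DatanodeInfo slowNode : dfs.getSlowDatanodeStats()) {
      System.out.println("Slow datanode: " + slowNode.getXferAddr());
    }
  }
}
{code}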
[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes
[ https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16521: Description: Providing DFS API to retrieve slow nodes would help add an additional option to "dfsadmin -report" that lists slow datanodes info for operators to take a look, specifically useful filter for larger clusters. The other purpose of such API is for HDFS downstreamers without direct access to namenode http port (only rpc port accessible) to retrieve slownodes. Moreover, [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java] in HBase currently has to rely on its own way of marking and excluding slow nodes while 1) creating pipelines and 2) handling ack, based on factors like the data length of the packet, processing time with last ack timestamp, whether flush to replicas is finished etc. If it can utilize slownode API from HDFS to exclude nodes appropriately while writing block, a lot of its own post-ack computation of slow nodes can be _saved_ or _improved_ or based on further experiment, we could find _better solution_ to manage slow node detection logic both in HDFS and HBase. However, in order to collect more data points and run more POC around this area, at least we should expect HDFS to provide API for downstreamers to efficiently utilize slownode info for such critical low-latency use-case (like writing WALs). was: Providing DFS API to retrieve slow nodes would help add an additional option to "dfsadmin -report" that lists slow datanodes info for operators to take a look, specifically useful filter for larger clusters. The other purpose of such API is for HDFS downstreamers without direct access to namenode http port (only rpc port accessible) to retrieve slownodes. > DFS API to retrieve slow datanodes > -- > > Key: HDFS-16521 > URL: https://issues.apache.org/jira/browse/HDFS-16521 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > Providing DFS API to retrieve slow nodes would help add an additional option > to "dfsadmin -report" that lists slow datanodes info for operators to take a > look, specifically useful filter for larger clusters. > The other purpose of such API is for HDFS downstreamers without direct access > to namenode http port (only rpc port accessible) to retrieve slownodes. > Moreover, > [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java] > in HBase currently has to rely on its own way of marking and excluding slow > nodes while 1) creating pipelines and 2) handling ack, based on factors like > the data length of the packet, processing time with last ack timestamp, > whether flush to replicas is finished etc. If it can utilize slownode API > from HDFS to exclude nodes appropriately while writing block, a lot of its > own post-ack computation of slow nodes can be _saved_ or _improved_ or based > on further experiment, we could find _better solution_ to manage slow node > detection logic both in HDFS and HBase. However, in order to collect more > data points and run more POC around this area, at least we should expect HDFS > to provide API for downstreamers to efficiently utilize slownode info for > such critical low-latency use-case (like writing WALs). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes
[ https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16521: Description: Providing DFS API to retrieve slow nodes would help add an additional option to "dfsadmin -report" that lists slow datanodes info for operators to take a look, specifically useful filter for larger clusters. The other purpose of such API is for HDFS downstreamers without direct access to namenode http port (only rpc port accessible) to retrieve slownodes. was: In order to build some automation around slow datanodes that regularly show up in the slow peer tracking report, e.g. decommission such nodes and queue them up for external processing and add them back later to the cluster after fixing issues etc, we should expose DFS API to retrieve all slow nodes at a given time. Providing such API would also help add an additional option to "dfsadmin -report" that lists slow datanodes info for operators to take a look, specifically useful filter for larger clusters. > DFS API to retrieve slow datanodes > -- > > Key: HDFS-16521 > URL: https://issues.apache.org/jira/browse/HDFS-16521 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 2h 40m > Remaining Estimate: 0h > > Providing DFS API to retrieve slow nodes would help add an additional option > to "dfsadmin -report" that lists slow datanodes info for operators to take a > look, specifically useful filter for larger clusters. > The other purpose of such API is for HDFS downstreamers without direct access > to namenode http port (only rpc port accessible) to retrieve slownodes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes
[ https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16521: Target Version/s: 3.4.0 > DFS API to retrieve slow datanodes > -- > > Key: HDFS-16521 > URL: https://issues.apache.org/jira/browse/HDFS-16521 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 2h 40m > Remaining Estimate: 0h > > In order to build some automation around slow datanodes that regularly show > up in the slow peer tracking report, e.g. decommission such nodes and queue > them up for external processing and add them back later to the cluster after > fixing issues etc, we should expose DFS API to retrieve all slow nodes at a > given time. > Providing such API would also help add an additional option to "dfsadmin > -report" that lists slow datanodes info for operators to take a look, > specifically useful filter for larger clusters. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16481) Provide support to set Http and Rpc ports in MiniJournalCluster
[ https://issues.apache.org/jira/browse/HDFS-16481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519332#comment-17519332 ] Viraj Jasani commented on HDFS-16481: - Thanks [~aajisaka]! > Provide support to set Http and Rpc ports in MiniJournalCluster > --- > > Key: HDFS-16481 > URL: https://issues.apache.org/jira/browse/HDFS-16481 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.3 > > Time Spent: 6h 10m > Remaining Estimate: 0h > > We should provide support for clients to set Http and Rpc ports of > JournalNodes in MiniJournalCluster. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
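With this change, pinning JournalNode ports in a test could look like the sketch below; the builder method names follow this patch but are assumptions, so verify against your Hadoop version:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.qjournal.MiniJournalCluster;

public class FixedPortJournalCluster {
  public static void main(String[] args) throws Exception {
    MiniJournalCluster cluster = new MiniJournalCluster.Builder(new Configuration())
        .numJournalNodes(3)
        .setHttpPorts(8481, 8482, 8483) // one HTTP port per JournalNode
        .setRpcPorts(8485, 8486, 8487)  // one RPC port per JournalNode
        .build();
    try {
      // ... run tests against the fixed ports ...
    } finally {
      cluster.shutdown();
    }
  }
}
{code}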
[jira] [Created] (HDFS-16528) Reconfigure slow peer enable for Namenode
Viraj Jasani created HDFS-16528: --- Summary: Reconfigure slow peer enable for Namenode Key: HDFS-16528 URL: https://issues.apache.org/jira/browse/HDFS-16528 Project: Hadoop HDFS Issue Type: Task Reporter: Viraj Jasani Assignee: Viraj Jasani HDFS-16396 provides reconfig options for several configs associated with slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 have added some slownodes related configs as the reconfig options in Namenode. The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as reconfigurable option for Namenode (similar to how HDFS-16396 has included it for Datanode). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16522) Set Http and Ipc ports for Datanodes in MiniDFSCluster
[ https://issues.apache.org/jira/browse/HDFS-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16522: Status: Patch Available (was: In Progress) > Set Http and Ipc ports for Datanodes in MiniDFSCluster > -- > > Key: HDFS-16522 > URL: https://issues.apache.org/jira/browse/HDFS-16522 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > We should provide options to set Http and Ipc ports for Datanodes in > MiniDFSCluster. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16521) DFS API to retrieve slow datanodes
[ https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16521 started by Viraj Jasani. --- > DFS API to retrieve slow datanodes > -- > > Key: HDFS-16521 > URL: https://issues.apache.org/jira/browse/HDFS-16521 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > In order to build some automation around slow datanodes that regularly show > up in the slow peer tracking report, e.g. decommission such nodes and queue > them up for external processing and add them back later to the cluster after > fixing issues etc, we should expose DFS API to retrieve all slow nodes at a > given time. > Providing such API would also help add an additional option to "dfsadmin > -report" that lists slow datanodes info for operators to take a look, > specifically useful filter for larger clusters. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes
[ https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16521: Status: Patch Available (was: In Progress) > DFS API to retrieve slow datanodes > -- > > Key: HDFS-16521 > URL: https://issues.apache.org/jira/browse/HDFS-16521 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > In order to build some automation around slow datanodes that regularly show > up in the slow peer tracking report, e.g. decommission such nodes and queue > them up for external processing and add them back later to the cluster after > fixing issues etc, we should expose DFS API to retrieve all slow nodes at a > given time. > Providing such API would also help add an additional option to "dfsadmin > -report" that lists slow datanodes info for operators to take a look, > specifically useful filter for larger clusters. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16522) Set Http and Ipc ports for Datanodes in MiniDFSCluster
[ https://issues.apache.org/jira/browse/HDFS-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16522 started by Viraj Jasani. --- > Set Http and Ipc ports for Datanodes in MiniDFSCluster > -- > > Key: HDFS-16522 > URL: https://issues.apache.org/jira/browse/HDFS-16522 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > We should provide options to set Http and Ipc ports for Datanodes in > MiniDFSCluster. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16522) Set Http and Ipc ports for Datanodes in MiniDFSCluster
Viraj Jasani created HDFS-16522: --- Summary: Set Http and Ipc ports for Datanodes in MiniDFSCluster Key: HDFS-16522 URL: https://issues.apache.org/jira/browse/HDFS-16522 Project: Hadoop HDFS Issue Type: Task Reporter: Viraj Jasani Assignee: Viraj Jasani We should provide options to set Http and Ipc ports for Datanodes in MiniDFSCluster. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
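A sketch of what the proposed builder surface could look like for tests; setDnHttpPorts/setDnIpcPorts are the assumed method names for this task, so verify against the final patch:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class FixedPortMiniDFSCluster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(2)
        .setDnHttpPorts(9864, 9865) // one HTTP port per datanode
        .setDnIpcPorts(9867, 9868)  // one IPC port per datanode
        .build();
    try {
      cluster.waitActive();
      // ... run tests against the fixed datanode ports ...
    } finally {
      cluster.shutdown();
    }
  }
}
{code}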
[jira] [Created] (HDFS-16521) DFS API to retrieve slow datanodes
Viraj Jasani created HDFS-16521: --- Summary: DFS API to retrieve slow datanodes Key: HDFS-16521 URL: https://issues.apache.org/jira/browse/HDFS-16521 Project: Hadoop HDFS Issue Type: New Feature Reporter: Viraj Jasani Assignee: Viraj Jasani In order to build automation around slow datanodes that regularly show up in the slow peer tracking report, e.g. decommissioning such nodes, queueing them up for external processing, and adding them back to the cluster once the underlying issues are fixed, we should expose a DFS API that retrieves all slow nodes at a given time. Providing such an API would also help add an option to "dfsadmin -report" that lists slow datanode info for operators to review, a particularly useful filter for larger clusters.
[jira] [Updated] (HDFS-16502) Reconfigure Block Invalidate limit
[ https://issues.apache.org/jira/browse/HDFS-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16502: Description: Based on the cluster load, it would be helpful to be able to tune the block invalidate limit (dfs.block.invalidate.limit). The only way to do this today without restarting the Namenode is by reconfiguring the heartbeat interval, since the effective limit is derived as {code:java} Math.max(heartbeatInt * 20, blockInvalidateLimit){code} This logic is not straightforward, operators are usually unaware of it (it is undocumented), and updating the heartbeat interval is not always desirable. We should provide the ability to alter the block invalidate limit without affecting the heartbeat interval on a live cluster, in order to adjust load at the Datanode level. We should also take this opportunity to move the (heartbeatInterval * 20) computation into a common method. was: Based on the cluster load, it would be helpful to be able to tune the block invalidate limit (dfs.block.invalidate.limit). The only way to do this today without restarting the Namenode is by reconfiguring the heartbeat interval, since the effective limit is derived as {code:java} Math.max(heartbeatInt * 20, blockInvalidateLimit){code} This logic is not straightforward, operators are usually unaware of it (it is undocumented), and updating the heartbeat interval is not always desirable. We should provide the ability to alter the block invalidate limit without affecting the heartbeat interval on a live cluster, in order to adjust load at the Datanode level. > Reconfigure Block Invalidate limit > -- > > Key: HDFS-16502 > URL: https://issues.apache.org/jira/browse/HDFS-16502 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > > Based on the cluster load, it would be helpful to be able to tune the block > invalidate limit (dfs.block.invalidate.limit). The only way to do this today > without restarting the Namenode is by reconfiguring the heartbeat interval, > since the effective limit is derived as > {code:java} > Math.max(heartbeatInt * 20, blockInvalidateLimit){code} > This logic is not straightforward, operators are usually unaware of it (it is > undocumented), and updating the heartbeat interval is not always desirable. > We should provide the ability to alter the block invalidate limit without > affecting the heartbeat interval on a live cluster, in order to adjust load > at the Datanode level. > We should also take this opportunity to move the (heartbeatInterval * 20) > computation into a common method.
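To make the derivation above concrete, the sketch below centralizes the effective-limit computation in one helper instead of duplicating it, which is roughly what "move the computation into a common method" asks for. The class and method names are illustrative only, not the ones in the patch, and the 20x multiplier is kept hard-coded to match the current behavior:
{code:java}
/**
 * Illustrative helper; names are hypothetical, not from the patch.
 */
public final class BlockInvalidateLimits {
  /** Multiplier currently hard-coded where the limit is derived. */
  private static final int LIMIT_MULTIPLIER = 20;

  private BlockInvalidateLimits() {
  }

  /**
   * Effective limit: at least 20x the heartbeat interval (in seconds).
   * A longer heartbeat therefore silently raises the invalidate limit,
   * which is the non-obvious coupling operators trip over.
   */
  public static int effectiveLimit(long heartbeatIntervalSec,
      int configuredBlockInvalidateLimit) {
    return Math.max((int) (LIMIT_MULTIPLIER * heartbeatIntervalSec),
        configuredBlockInvalidateLimit);
  }
}
{code}
Once dfs.block.invalidate.limit is reconfigurable, an operator could presumably apply a new value on a live Namenode through the existing generic reconfiguration flow, e.g. "hdfs dfsadmin -reconfig namenode <host:ipc_port> start" after updating the configuration, rather than touching the heartbeat interval.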