[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR

2023-11-05 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783072#comment-17783072
 ] 

Viraj Jasani commented on HDFS-16016:
-

I think so too. Do you think you might be able to share logs in the meantime? I 
just want to get some more clarity on the sequence and correlate with namenode 
processing the report. If you might not be able to share the logs, that's fine 
too.

Thanks

> BPServiceActor add a new thread to handle IBR
> -
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. 
> We can handle IBR independently to improve the performance of heartbeat and 
> FBR.
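The idea in the description can be illustrated with a minimal sketch in plain Java. This is not the actual Hadoop code: the class `IbrSenderSketch`, its methods, and the string events are hypothetical, and the real BPServiceActor is considerably more involved. Pending incremental block reports are queued and drained by a dedicated thread, so the loop that sends heartbeats and full block reports never waits on them.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; not the real BPServiceActor.
class IbrSenderSketch {
  private final BlockingQueue<String> pendingIbrs = new LinkedBlockingQueue<>();
  private final List<String> sentToNameNode = new CopyOnWriteArrayList<>();
  private volatile boolean running = true;
  private final Thread ibrThread = new Thread(this::drainLoop, "ibr-sender-sketch");

  void start() {
    ibrThread.start();
  }

  // Called when a block is received or deleted; returns immediately.
  void enqueue(String blockEvent) {
    pendingIbrs.add(blockEvent);
  }

  // Runs on the dedicated thread, independent of heartbeat and FBR work.
  private void drainLoop() {
    while (running || !pendingIbrs.isEmpty()) {
      try {
        String event = pendingIbrs.poll(50, TimeUnit.MILLISECONDS);
        if (event != null) {
          sentToNameNode.add(event); // stand-in for the RPC to the namenode
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  // Lets the thread drain what is already queued, then waits for it to exit.
  void stop() {
    running = false;
    try {
      ibrThread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  List<String> sent() {
    return sentToNameNode;
  }
}
```

Once IBRs leave the main loop, their ordering relative to full block reports is no longer implied by program order, which is the concern raised later in HDFS-17129.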



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR

2023-11-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782674#comment-17782674
 ] 

Viraj Jasani commented on HDFS-16016:
-

If possible, could you please also share DN and NN logs for the affected block 
and block reports?

> BPServiceActor add a new thread to handle IBR
> -
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. 
> We can handle IBR independently to improve the performance of heartbeat and 
> FBR.






[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR

2023-11-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782670#comment-17782670
 ] 

Viraj Jasani commented on HDFS-16016:
-

Interesting, thanks for reporting this [~yuanbo]. Let me try reproducing this 
on a heavily loaded test env. By the way, you might also be interested in 
HDFS-17121 and HDFS-17129.

We have also been using this patch in prod for quite some time now.

> BPServiceActor add a new thread to handle IBR
> -
>
> Key: HDFS-16016
> URL: https://issues.apache.org/jira/browse/HDFS-16016
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: JiangHua Zhu
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
> Attachments: image-2023-11-03-18-11-54-502.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. 
> We can handle IBR independently to improve the performance of heartbeat and 
> FBR.






[jira] [Assigned] (HDFS-16938) Utility to trigger heartbeat and wait until BP thread queue is fully processed

2023-09-19 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HDFS-16938:
---

Assignee: (was: Viraj Jasani)

> Utility to trigger heartbeat and wait until BP thread queue is fully processed
> --
>
> Key: HDFS-16938
> URL: https://issues.apache.org/jira/browse/HDFS-16938
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> As a follow-up to HDFS-16935, we should provide utility to trigger heartbeat 
> and wait until BP thread queue is fully processed. This would ensure 100% 
> consistency w.r.t active namenode being able to receive bad block reports 
> from the given datanode. This utility would resolve flakes for the tests that 
> rely on namenode's awareness of the reported bad blocks by datanodes.
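A minimal sketch of what such a wait utility could look like, assuming the pending work is visible as a queue. The class `QueueDrainWaiter` and its method are hypothetical names and not part of Hadoop's test utilities.

```java
import java.util.Queue;

// Hypothetical test helper; names are illustrative, not Hadoop APIs.
class QueueDrainWaiter {
  // Polls until the queue is empty or the timeout elapses.
  // Returns true if the queue drained in time, false otherwise.
  static boolean waitUntilDrained(Queue<?> queue, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!queue.isEmpty()) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }
}
```

An empty queue only shows that every action has been picked up, not that the last one has finished, so a real utility would probably also need a completion signal from the processing thread.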






[jira] [Commented] (HDFS-17129) mis-order of ibr and fbr on datanode

2023-07-26 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747712#comment-17747712
 ] 

Viraj Jasani commented on HDFS-17129:
-

Thanks for filing this [~liuguanghua], as discussed on the PR, are we planning 
to use lock to prevent mis-order?

> mis-order of ibr and fbr on datanode 
> -
>
> Key: HDFS-17129
> URL: https://issues.apache.org/jira/browse/HDFS-17129
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
> Environment: hdfs3.4.0
>Reporter: liuguanghua
>Priority: Major
>
> HDFS-16016 provides a new thread to handle IBR. That is a great improvement, 
> but it may cause mis-ordering of IBR and FBR.






[jira] [Created] (HDFS-17041) RBF: Fix putAll impl for mysql and file based state stores

2023-06-07 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-17041:
---

 Summary: RBF: Fix putAll impl for mysql and file based state stores
 Key: HDFS-17041
 URL: https://issues.apache.org/jira/browse/HDFS-17041
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Only the zookeeper based state store continues inserting records even when a 
few of them already exist and "errorIfExists" is true. The file/fs and mysql 
based putAll implementations, however, fail the whole putAll operation 
immediately after encountering a single record that already exists when 
"errorIfExists" is true (which is the case while inserting records for the 
first time).

For all implementations, we should allow inserts of the records that do not 
already exist and report any record as failure that already exists, rather than 
failing the whole operation and not trying to insert valid records.
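The proposed behavior can be sketched with a small in-memory store. This is an illustration under assumptions, not the RBF state store driver API: the class `PutAllSketch` and its simplified `putAll` signature are hypothetical, and what happens when "errorIfExists" is false is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory store; not the real state store driver API.
class PutAllSketch {
  private final Map<String, String> records = new LinkedHashMap<>();

  // Inserts every record that does not exist yet and returns the keys that
  // were rejected. An empty list means the whole batch succeeded.
  List<String> putAll(Map<String, String> batch, boolean errorIfExists) {
    List<String> failedKeys = new ArrayList<>();
    for (Map.Entry<String, String> entry : batch.entrySet()) {
      if (records.containsKey(entry.getKey())) {
        if (errorIfExists) {
          failedKeys.add(entry.getKey());
        }
        // Assumption: an existing record is left untouched either way.
        continue; // keep processing the rest of the batch
      }
      records.put(entry.getKey(), entry.getValue());
    }
    return failedKeys;
  }

  boolean contains(String key) {
    return records.containsKey(key);
  }
}
```

The point of the sketch is the `continue`: one existing record is reported as a failure, but the valid records after it are still inserted.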






[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.

2023-05-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724137#comment-17724137
 ] 

Viraj Jasani commented on HDFS-17017:
-

{quote}btw. lets fix something when it is broken, or give suggestions on how 
the fix could be better rather than telling ok, this is a minor thing, it won't 
happen, or it was the tests fault old people didn't cover, anybody can miss 
anything in the code, lets not make the one who found a valid bug even small 
feel any less when it is very valid use case, not dragging it further, saw this 
as a repetitive occurrence, just a friendly 2 cents rest upto you
{quote}
Ayush, I am not sure why you would think that I am making anyone feel bad about 
this. I really appreciate this fix, and I have suggested a test change to make 
the fix even better with a solid test. It has never been my intention to say 
that old ppl didn't cover something; I would never say that. Hadoop is a 
massive codebase and things can be missed. For this fix, I agree that it was a 
miss from HDFS-16521. I am not denying it.

I hope I have not offended you, and if you felt that way, I apologize; that was 
never my purpose. I was just providing my viewpoint: while testing the changes, 
I only used it with "-live" because my usecase was "live but not slow", but you 
are right that the usecase could be different too.

> Fix the issue of arguments number limit in report command in DFSAdmin.
> --
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of 
> arguments of 7, such as :
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 
> [-enteringmaintenance] [-inmaintenance] [-slownodes]






[jira] [Comment Edited] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.

2023-05-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121
 ] 

Viraj Jasani edited comment on HDFS-17017 at 5/19/23 5:55 AM:
--

anyways, in order to prevent "any new argument for any of the existing 
commands" from running into similar "exhausting max arguments" case, what can 
we do? can we write a test that can parse all possible arguments for the given 
command ("-report" in this case) and pass them all and ensure that the output 
return code/exit code still remains 0?

if we have such test, then whenever someone introduces a new argument in 
future, the test will automatically pass the argument to the command and the 
test would fail, forcing dev to handle the "max argument" case.

 

[~haiyang Hu] i have attached the patch on the PR to make the test more robust, 
and cover the missing case of identifying whether we have exceeded max 
arguments and need to adjust max arguments allowed for -report command. Thank 
you.
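A minimal sketch of the kind of test described in this comment, using a toy command rather than the real DFSAdmin code. The class `ReportCommandSketch`, the `MAX_ARGS` constant, and the -1 error code are assumptions made for illustration; only the flag names come from the usage string in the issue description.

```java
import java.util.Arrays;
import java.util.List;

// Toy stand-in for an admin command; not the real DFSAdmin code.
class ReportCommandSketch {
  // Flags taken from the usage string quoted in the issue description.
  static final List<String> FLAGS = Arrays.asList(
      "-live", "-dead", "-decommissioning",
      "-enteringmaintenance", "-inmaintenance", "-slownodes");

  // "-report" itself plus every optional flag.
  static final int MAX_ARGS = 7;

  // Returns 0 on success and -1 on a usage error, like a CLI exit code.
  static int run(String... argv) {
    if (argv.length == 0 || !"-report".equals(argv[0]) || argv.length > MAX_ARGS) {
      return -1;
    }
    for (int i = 1; i < argv.length; i++) {
      if (!FLAGS.contains(argv[i])) {
        return -1;
      }
    }
    return 0;
  }

  // The proposed check: pass every known flag at once and expect success.
  static int runWithAllFlags() {
    String[] argv = new String[FLAGS.size() + 1];
    argv[0] = "-report";
    for (int i = 0; i < FLAGS.size(); i++) {
      argv[i + 1] = FLAGS.get(i);
    }
    return run(argv);
  }
}
```

Because `runWithAllFlags` is driven by the flag list, adding a flag without raising `MAX_ARGS` makes it return -1, so a test asserting an exit code of 0 fails and points at the limit.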


was (Author: vjasani):
anyways, in order to prevent "any new argument for any of the existing 
commands" from running into similar "exhausting max arguments" case, what can 
we do? can we write a test that can parse all possible arguments for the given 
command ("-report" in this case) and pass them all and ensure that the output 
return code/exit code still remains 0?

if we have such test, then whenever someone introduces a new argument in 
future, the test will automatically pass the argument to the command and the 
test would fail, forcing dev to handle the "max argument" case.

> Fix the issue of arguments number limit in report command in DFSAdmin.
> --
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of 
> arguments of 7, such as :
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 
> [-enteringmaintenance] [-inmaintenance] [-slownodes]






[jira] [Comment Edited] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.

2023-05-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121
 ] 

Viraj Jasani edited comment on HDFS-17017 at 5/19/23 5:25 AM:
--

anyways, in order to prevent "any new argument for any of the existing 
commands" from running into similar "exhausting max arguments" case, what can 
we do? can we write a test that can parse all possible arguments for the given 
command ("-report" in this case) and pass them all and ensure that the output 
return code/exit code still remains 0?

if we have such test, then whenever someone introduces a new argument in 
future, the test will automatically pass the argument to the command and the 
test would fail, forcing dev to handle the "max argument" case.


was (Author: vjasani):
anyways, in order to prevent any new argument for any of the existing commands 
to get into similar case of exhausting max arguments, what can we do? can we 
write a test that can parse all possible arguments for the given command 
("-report" in this case) and pass them all and ensure that the output return 
code/exit code still remains 0?

if we have such test, then whenever someone introduces a new argument in 
future, the test will automatically pass the argument to the command and the 
test would likely fail, forcing dev to handle the "max argument" case.

> Fix the issue of arguments number limit in report command in DFSAdmin.
> --
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of 
> arguments of 7, such as :
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 
> [-enteringmaintenance] [-inmaintenance] [-slownodes]






[jira] [Comment Edited] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.

2023-05-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121
 ] 

Viraj Jasani edited comment on HDFS-17017 at 5/19/23 5:24 AM:
--

anyways, in order to prevent any new argument for any of the existing commands 
to get into similar case of exhausting max arguments, what can we do? can we 
write a test that can parse all possible arguments for the given command 
("-report" in this case) and pass them all and ensure that the output return 
code/exit code still remains 0?

if we have such test, then whenever someone introduces a new argument in 
future, the test will automatically pass the argument to the command and the 
test would likely fail, forcing dev to handle the "max argument" case.


was (Author: vjasani):
anyways, in order to prevent any new argument for -report to get into similar 
case, what can we do? can we write a test that can parse all possible arguments 
for the given command ("-report" in this case) and pass them all and ensure 
that the output return code/exit code remains 0?

> Fix the issue of arguments number limit in report command in DFSAdmin.
> --
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of 
> arguments of 7, such as :
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 
> [-enteringmaintenance] [-inmaintenance] [-slownodes]






[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.

2023-05-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724121#comment-17724121
 ] 

Viraj Jasani commented on HDFS-17017:
-

anyways, in order to prevent any new argument for -report to get into similar 
case, what can we do? can we write a test that can parse all possible arguments 
for the given command ("-report" in this case) and pass them all and ensure 
that the output return code/exit code remains 0?

> Fix the issue of arguments number limit in report command in DFSAdmin.
> --
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of 
> arguments of 7, such as :
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 
> [-enteringmaintenance] [-inmaintenance] [-slownodes]






[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.

2023-05-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724119#comment-17724119
 ] 

Viraj Jasani commented on HDFS-17017:
-

For the command output, it's not an intersection as such. From the slow node 
usecase perspective, we want to know how many nodes are live and how many of 
them are slow, so getting slow nodes is usually coupled with live nodes (the 
intention is for the user/client to keep only "live - slow" nodes for dfs ops, 
i.e. live nodes that are not slow). I agree, though, that from the general 
usability viewpoint, the user should be able to print all or any combination of 
categories.
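The "live - slow" selection is a plain set difference. A minimal sketch, assuming a client has already parsed the two lists of node names from the report output; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical client-side helper; not part of DFSAdmin.
class LiveNotSlowSketch {
  // Returns the live nodes that are not reported as slow ("live - slow").
  static Set<String> usableNodes(Set<String> live, Set<String> slow) {
    Set<String> result = new LinkedHashSet<>(live); // copy, keep input intact
    result.removeAll(slow);
    return result;
  }
}
```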

> Fix the issue of arguments number limit in report command in DFSAdmin.
> --
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of 
> arguments of 7, such as :
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 
> [-enteringmaintenance] [-inmaintenance] [-slownodes]






[jira] [Commented] (HDFS-17017) Fix the issue of arguments number limit in report command in DFSAdmin.

2023-05-18 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724101#comment-17724101
 ] 

Viraj Jasani commented on HDFS-17017:
-

Functionally this change makes sense because all of them are arguments, however 
in practice "-slownodes" is mostly only meant to be used with "-live". We 
anyways don't have "slow" and "dead or decommissioned or inmaintenance" nodes.

Thanks for attempting to fix the max argument [~haiyang Hu]!

> Fix the issue of arguments number limit in report command in DFSAdmin.
> --
>
> Key: HDFS-17017
> URL: https://issues.apache.org/jira/browse/HDFS-17017
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, the DFSAdmin report command should support a maximum number of 
> arguments of 7, such as :
> hdfs dfsadmin [-report] [-live] [-dead] [-decommissioning] 
> [-enteringmaintenance] [-inmaintenance] [-slownodes]






[jira] [Created] (HDFS-17020) RBF: mount table addAll should print failed records in std error

2023-05-18 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-17020:
---

 Summary: RBF: mount table addAll should print failed records in 
std error
 Key: HDFS-17020
 URL: https://issues.apache.org/jira/browse/HDFS-17020
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Now that state store putAll supports returning failed records keys, addAll 
command for mount entries should also support printing failed records in the 
standard error.
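A minimal sketch of the reporting step, assuming the admin command already holds the list of failed keys returned by putAll. The class `AddAllReportSketch`, the message wording, and the exit codes are hypothetical and do not reflect the actual router admin output.

```java
import java.io.PrintStream;
import java.util.List;

// Hypothetical reporting helper; messages and exit codes are illustrative.
class AddAllReportSketch {
  // Prints each failed key to the error stream and returns a CLI-style exit code.
  static int report(List<String> failedKeys, PrintStream out, PrintStream err) {
    if (failedKeys.isEmpty()) {
      out.println("Successfully added all mount points");
      return 0;
    }
    for (String key : failedKeys) {
      err.println("Cannot add mount point " + key);
    }
    return -1;
  }
}
```

Passing the streams in keeps the helper testable; a command would call it with `System.out` and `System.err`.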






[jira] [Commented] (HDFS-17009) RBF: state store putAll should also return failed records

2023-05-17 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723566#comment-17723566
 ] 

Viraj Jasani commented on HDFS-17009:
-

Looks like the Jira to Github link didn't work so let me link the PR manually.

> RBF: state store putAll should also return failed records
> -
>
> Key: HDFS-17009
> URL: https://issues.apache.org/jira/browse/HDFS-17009
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.4.0
>
>
> State store implementations allow adding/updating multiple records using 
> putAll. The implementation returns whether all records were successfully 
> added or updated. We should also allow the implementation to return which 
> records failed to get updated.






[jira] [Created] (HDFS-17009) RBF: state store putAll should also return failed records

2023-05-11 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-17009:
---

 Summary: RBF: state store putAll should also return failed records
 Key: HDFS-17009
 URL: https://issues.apache.org/jira/browse/HDFS-17009
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


State store implementations allow adding/updating multiple records using 
putAll. The implementation returns whether all records were successfully added 
or updated. We should also allow the implementation to return which records 
failed to get updated.






[jira] [Created] (HDFS-17008) Fix rbf jdk 11 javadoc warnings

2023-05-11 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-17008:
---

 Summary: Fix rbf jdk 11 javadoc warnings
 Key: HDFS-17008
 URL: https://issues.apache.org/jira/browse/HDFS-17008
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


HDFS-16978 excluded proto packages from maven-javadoc-plugin for rbf, hence now 
we have JDK 11 javadoc warnings (e.g. 
[here|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5554/14/artifact/out/results-javadoc-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1.txt]).






[jira] [Commented] (HDFS-16978) RBF: Admin command to support bulk add of mount points

2023-05-10 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721628#comment-17721628
 ] 

Viraj Jasani commented on HDFS-16978:
-

will create follow-up jiras soon

> RBF: Admin command to support bulk add of mount points
> --
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> All state store implementations support adding multiple state store records 
> using a single putAll() implementation. We should provide a new router admin 
> API to support bulk addition of mount table entries that can utilize this 
> bulk add implementation at state store level.
> For more than one mount point to be added, the goal of bulk addition should be
>  # To reduce frequent router calls
>  # To avoid frequent state store cache refreshes with each single mount 
> point addition
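The two goals in the description can be made concrete with a toy model that counts cache refreshes. The class `MountTableSketch` is hypothetical, and the assumption that every write triggers exactly one refresh is a simplification for the sketch, not a statement about the router's actual caching.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a mount table with a refresh counter; not the RBF implementation.
class MountTableSketch {
  private final Map<String, String> entries = new LinkedHashMap<>();
  private int cacheRefreshes = 0;

  // One call and one cache refresh per mount point.
  void add(String mountPoint, String target) {
    entries.put(mountPoint, target);
    refreshCache();
  }

  // One call and one cache refresh for the whole batch.
  void addAll(Map<String, String> batch) {
    entries.putAll(batch);
    refreshCache();
  }

  private void refreshCache() {
    cacheRefreshes++;
  }

  int cacheRefreshes() {
    return cacheRefreshes;
  }

  int size() {
    return entries.size();
  }
}
```

With three mount points, three `add` calls cost three refreshes in this model while one `addAll` costs one.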






[jira] [Commented] (HDFS-16978) RBF: Admin command to support bulk add of mount points

2023-05-10 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721626#comment-17721626
 ] 

Viraj Jasani commented on HDFS-16978:
-

Thanks again [~ayushtkn] [~elgoiri] [~simbadzina] !!!

> RBF: Admin command to support bulk add of mount points
> --
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> All state store implementations support adding multiple state store records 
> using a single putAll() implementation. We should provide a new router admin 
> API to support bulk addition of mount table entries that can utilize this 
> bulk add implementation at state store level.
> For more than one mount point to be added, the goal of bulk addition should be
>  # To reduce frequent router calls
>  # To avoid frequent state store cache refreshes with each single mount 
> point addition






[jira] [Commented] (HDFS-11063) Set NameNode RPC server handler thread name with more descriptive information about the RPC call.

2023-05-06 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720248#comment-17720248
 ] 

Viraj Jasani commented on HDFS-11063:
-

Thanks [~cnauroth]. I made a similar observation while going through thread 
dumps and, after some searching, found this old Jira. Glad to see some 
discussion is already present.

Do you think it is still worth pursuing this today? Maybe we can make this an 
opt-in behavior just in case any user would be in favor of disabling it to 
avoid redundant info in logs? At least this would be quite helpful for 
debugging thread dumps.

> Set NameNode RPC server handler thread name with more descriptive information 
> about the RPC call.
> -
>
> Key: HDFS-11063
> URL: https://issues.apache.org/jira/browse/HDFS-11063
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chris Nauroth
>Priority: Major
>
> We often run {{jstack}} on a NameNode process as a troubleshooting step if it 
> is suffering high load or appears to be hanging.  By reading the stack trace, 
> we can identify if a caller is blocked inside an expensive operation.  This 
> would be even more helpful if we updated the RPC server handler thread name 
> with more descriptive information about the RPC call.  This could include the 
> calling user, the called RPC method, and the most significant argument to 
> that method (most likely the path).
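A minimal sketch of the renaming idea in plain Java. The class `HandlerThreadNamingSketch`, the name format, and the example values are hypothetical; the actual RPC server handler code is not shown here.

```java
import java.util.function.Supplier;

// Hypothetical wrapper; not the actual RPC server handler code.
class HandlerThreadNamingSketch {
  // Runs the call with a descriptive thread name and restores the old name afterwards.
  static <T> T callWithName(String user, String method, String path, Supplier<T> call) {
    Thread current = Thread.currentThread();
    String original = current.getName();
    current.setName(original + " [user=" + user + " method=" + method + " path=" + path + "]");
    try {
      return call.get();
    } finally {
      current.setName(original); // restore even if the call throws
    }
  }
}
```

While the call is in flight, a `jstack` dump would show the extended name for that thread. Restoring the name in `finally` keeps an idle handler from carrying details of a finished call.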






[jira] [Created] (HDFS-16998) RBF: Add ops metrics for getSlowDatanodeReport in RouterClientActivity

2023-05-03 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16998:
---

 Summary: RBF: Add ops metrics for getSlowDatanodeReport in 
RouterClientActivity
 Key: HDFS-16998
 URL: https://issues.apache.org/jira/browse/HDFS-16998
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani









[jira] [Commented] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points

2023-04-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711710#comment-17711710
 ] 

Viraj Jasani commented on HDFS-16978:
-

Regarding the usecase: I had to migrate mount points from zk to hdfs (zk was a 
hotspot for multiple usecases, and the router just added load), and during that 
time I realized that we have no way to bulk add all mount points in one shot. 
Hence I thought of adding this improvement.
{quote}anyway, should be adjusted in an existing commands like router -add 
 ;  and like -update ,  and 
so on
{quote}
That still adds each point separately right? We are still not adding/updating 
mount points in one shot. It's all about using putAll() at state store level 
impl.

 

If you are not fine with me pursuing this, please let me know and I will not 
create PR.

> RBF: New Router admin command to support bulk add of mount points
> -
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Minor
>
> All state store implementations support adding multiple state store records 
> using a single putAll() implementation. We should provide a new router admin 
> API to support bulk addition of mount table entries that can utilize this 
> bulk add implementation at state store level.
> For more than one mount point to be added, the goal of bulk addition should be
>  # To reduce frequent router calls
>  # To avoid frequent state store cache refreshes with each single mount 
> point addition






[jira] [Comment Edited] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points

2023-04-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711692#comment-17711692
 ] 

Viraj Jasani edited comment on HDFS-16978 at 4/13/23 6:05 AM:
--

{quote}Mount table operations are admin operations not user operations.
{quote}
I understand, but having an admin endpoint for adding multiple mount entries as 
part of a "single router admin command" rather than "multiple -add commands" is 
only an optimization to reduce the number of router calls as well as state 
store cache refreshes.

We already have putAll(), which all state stores implement, so why not use it 
from the router admin? This Jira is meant to be an optimization for admin 
operations.

 

[~ayushtkn] [~goiri] [~elgoiri] 


was (Author: vjasani):
{quote}Mount table operations are admin operations not user operations.
{quote}
I understand, but having an admin endpoint for adding multiple mount entries as 
part of a "single router admin command" rather than "multiple -add commands" is 
only an optimization to reduce the number of router calls as well as state 
store cache refreshes.

We already have putAll(), which all state stores implement, so why not use it 
from the router admin? This Jira is meant to be an optimization for admin 
operations.

 

[~ayushtkn] [~goiri] [~inigoiri] 

> RBF: New Router admin command to support bulk add of mount points
> -
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> All state store implementations support adding multiple state store records 
> using a single putAll() implementation. We should provide a new router admin 
> API to support bulk addition of mount table entries that can utilize this 
> bulk add implementation at the state store level.
> For more than one mount point to be added, the goal of bulk addition should be
>  # To reduce frequent router calls
>  # To avoid frequent state store cache refreshes with each single mount 
> point addition






[jira] [Commented] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points

2023-04-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711692#comment-17711692
 ] 

Viraj Jasani commented on HDFS-16978:
-

{quote}Mount table operations are admin operations not user operations.
{quote}
I understand, but having an admin endpoint for adding multiple mount entries as 
part of a "single router admin command" rather than "multiple -add commands" is 
only an optimization to reduce the number of router calls as well as state 
store cache refreshes.

We already have putAll(), which all state stores implement, so why not use it 
from the router admin? This Jira is meant to be an optimization for admin 
operations.

 

[~ayushtkn] [~goiri] [~inigoiri] 

> RBF: New Router admin command to support bulk add of mount points
> -
>
> Key: HDFS-16978
> URL: https://issues.apache.org/jira/browse/HDFS-16978
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> All state store implementations support adding multiple state store records 
> using a single putAll() implementation. We should provide a new router admin 
> API to support bulk addition of mount table entries that can utilize this 
> bulk add implementation at the state store level.
> For more than one mount point to be added, the goal of bulk addition should be
>  # To reduce frequent router calls
>  # To avoid frequent state store cache refreshes with each single mount 
> point addition






[jira] [Created] (HDFS-16978) RBF: New Router admin command to support bulk add of mount points

2023-04-12 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16978:
---

 Summary: RBF: New Router admin command to support bulk add of 
mount points
 Key: HDFS-16978
 URL: https://issues.apache.org/jira/browse/HDFS-16978
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


All state store implementations support adding multiple state store records 
using a single putAll() implementation. We should provide a new router admin 
API to support bulk addition of mount table entries that can utilize this bulk 
add implementation at the state store level.

For more than one mount point to be added, the goal of bulk addition should be
 # To reduce frequent router calls
 # To avoid frequent state store cache refreshes with each single mount point 
addition
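
The two goals above can be illustrated with a minimal sketch. It assumes a simplified in-memory store: `MountStoreSketch`, `put`, `putAll`, and `refreshCache` are hypothetical names chosen for illustration and are not the RBF state store API. The only point is that N single adds cause N cache refreshes, while one bulk add causes one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a state store; not the real RBF API. */
class MountStoreSketch {
  private final Map<String, String> records = new LinkedHashMap<>();
  private int cacheRefreshes = 0;

  /** Adds one record, then refreshes the cache: one refresh per call. */
  boolean put(String mountPoint, String destination) {
    if (records.containsKey(mountPoint)) {
      return false;
    }
    records.put(mountPoint, destination);
    refreshCache();
    return true;
  }

  /** Adds all records in one shot, then refreshes the cache once. */
  boolean putAll(Map<String, String> entries) {
    for (String mountPoint : entries.keySet()) {
      if (records.containsKey(mountPoint)) {
        return false; // assumption: reject the whole batch on any conflict
      }
    }
    records.putAll(entries);
    refreshCache();
    return true;
  }

  private void refreshCache() {
    cacheRefreshes++; // stands in for reloading the cache on every router
  }

  int getCacheRefreshes() {
    return cacheRefreshes;
  }

  int size() {
    return records.size();
  }

  public static void main(String[] args) {
    MountStoreSketch single = new MountStoreSketch();
    single.put("/data1", "ns1:/data1");
    single.put("/data2", "ns1:/data2");
    single.put("/data3", "ns1:/data3");

    Map<String, String> batch = new LinkedHashMap<>();
    batch.put("/data1", "ns1:/data1");
    batch.put("/data2", "ns1:/data2");
    batch.put("/data3", "ns1:/data3");
    MountStoreSketch bulk = new MountStoreSketch();
    bulk.putAll(batch);

    System.out.println("single adds: " + single.getCacheRefreshes() + " refreshes");
    System.out.println("bulk add: " + bulk.getCacheRefreshes() + " refresh");
  }
}
```

Whether a conflicting entry should fail the whole batch or only that entry is a design choice left open here; the sketch rejects the batch to keep the example short.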






[jira] [Created] (HDFS-16973) RBF: MountTableResolver cache size lookup should take read lock

2023-04-04 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16973:
---

 Summary: RBF: MountTableResolver cache size lookup should take 
read lock
 Key: HDFS-16973
 URL: https://issues.apache.org/jira/browse/HDFS-16973
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Mount table resolver location cache gets invalidated by taking write lock as 
part of addEntry/removeEntry/refreshEntries calls. Since the write lock 
exclusively updates the cache, getDestinationForPath already takes read lock 
before accessing the cache. Similarly, retrieval of the cache size should also 
take the read lock.
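
A minimal sketch of the locking pattern described above, assuming a plain map guarded by a `ReentrantReadWriteLock`. The class `LocationCacheSketch` and its fields are hypothetical and only mirror the method names mentioned in this issue; this is not the MountTableResolver code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical cache guarded by a read/write lock; illustration only. */
class LocationCacheSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> cache = new HashMap<>();

  /** Mutations take the write lock, so they exclude all readers. */
  void addEntry(String path, String destination) {
    lock.writeLock().lock();
    try {
      cache.put(path, destination);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Invalidation also takes the write lock. */
  void invalidate() {
    lock.writeLock().lock();
    try {
      cache.clear();
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Lookups take the read lock, so they can run concurrently. */
  String getDestinationForPath(String path) {
    lock.readLock().lock();
    try {
      return cache.get(path);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** The size lookup takes the same read lock as other readers. */
  long getCacheSize() {
    lock.readLock().lock();
    try {
      return cache.size();
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    LocationCacheSketch sketch = new LocationCacheSketch();
    sketch.addEntry("/data1", "ns1:/data1");
    System.out.println(sketch.getCacheSize());
  }
}
```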






[jira] [Commented] (HDFS-16969) Restart DataNode but keep showing ClosedChannelException in DataNode

2023-03-31 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707459#comment-17707459
 ] 

Viraj Jasani commented on HDFS-16969:
-

Could you try porting HDFS-16535 or, if possible, deploying the latest release, 3.3.5?

> Restart DataNode but keep showing ClosedChannelException in DataNode 
> -
>
> Key: HDFS-16969
> URL: https://issues.apache.org/jira/browse/HDFS-16969
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.2
>Reporter: Huibo Peng
>Priority: Major
>
> We use Hadoop 3.3.2 + HBase 2.3.6 in a production environment. When 
> restarting the DataNode to enable some configs, ClosedChannelException keeps 
> showing up in the DataNode log.
> {code:java}
> 2023-03-09 12:00:42,456 WARN  [ShortCircuitCache_SlotReleaser] 
> shortcircuit.DfsClientShmManager: 
> EndpointShmManager(DatanodeInfoWithStorage[10.22.128.111:9866,DS-d0865093-7868-4d6b-8163-252f2dd4a40c,DISK],
>  parent=ShortCircuitShmManager(250ff108)): error shutting down shm: got 
> IOException calling shutdown(SHUT_RDWR)
> java.nio.channels.ClosedChannelException
>         at 
> org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
>         at 
> org.apache.hadoop.net.unix.DomainSocket.shutdown(DomainSocket.java:393)
>         at 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.shutdown(DfsClientShmManager.java:362)
>         at 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:241)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748) {code}






[jira] [Created] (HDFS-16967) RBF: File based state stores should allow concurrent access to the records

2023-03-30 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16967:
---

 Summary: RBF: File based state stores should allow concurrent 
access to the records
 Key: HDFS-16967
 URL: https://issues.apache.org/jira/browse/HDFS-16967
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


File based state store implementations (StateStoreFileImpl and 
StateStoreFileSystemImpl) should allow updating as well as reading of the state 
store records concurrently rather than serially. Concurrent access to the 
record files on the HDFS based store seems to improve the state store cache 
loading performance by more than 10x.

For instance, in order to maintain data integrity, when any mount table 
record is updated, the cache is reloaded. This reload operation seems to gain 
a significant performance improvement from concurrent access to the mount 
table records.
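
A minimal sketch of concurrent record loading, assuming one record per file on the local filesystem. `ConcurrentRecordLoadSketch`, `writeSampleRecords`, and `loadAll` are hypothetical names; the real file based stores and their record format are not shown here, and the sketch makes no claim about the speedup quoted above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: read record files with a thread pool, not serially. */
class ConcurrentRecordLoadSketch {

  /** Writes n small record files into a fresh temp directory (demo data). */
  static Path writeSampleRecords(int n) {
    try {
      Path dir = Files.createTempDirectory("records");
      for (int i = 0; i < n; i++) {
        Files.write(dir.resolve("record-" + i),
            ("mount-" + i).getBytes(StandardCharsets.UTF_8));
      }
      return dir;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads every record file in dir using a fixed-size thread pool. */
  static List<String> loadAll(Path dir, int threads) {
    List<Path> files;
    try (Stream<Path> listing = Files.list(dir)) {
      files = listing.sorted().collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<String>> tasks = new ArrayList<>();
      for (Path file : files) {
        tasks.add(() -> new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
      }
      List<String> records = new ArrayList<>();
      // invokeAll returns futures in the same order as the submitted tasks.
      for (Future<String> result : pool.invokeAll(tasks)) {
        records.add(result.get());
      }
      return records;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(loadAll(writeSampleRecords(5), 4));
  }
}
```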






[jira] [Updated] (HDFS-16959) RBF: State store cache loading metrics

2023-03-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16959:

Summary: RBF: State store cache loading metrics  (was: RBF: state store 
cache loading metrics)

> RBF: State store cache loading metrics
> --
>
> Key: HDFS-16959
> URL: https://issues.apache.org/jira/browse/HDFS-16959
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> With increasing num of state store records (like mount points), it would be 
> good to be able to get the cache loading metrics like avg time for cache load 
> during refresh, num of times cache is loaded etc.






[jira] [Created] (HDFS-16959) RBF: state store cache loading metrics

2023-03-20 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16959:
---

 Summary: RBF: state store cache loading metrics
 Key: HDFS-16959
 URL: https://issues.apache.org/jira/browse/HDFS-16959
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


With increasing num of state store records (like mount points), it would be 
good to be able to get the cache loading metrics like avg time for cache load 
during refresh, num of times cache is loaded etc.
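
A minimal sketch of the two metrics mentioned above (number of cache loads and average load time), assuming plain atomic counters. `CacheLoadMetricsSketch` and its methods are hypothetical and do not use the Hadoop metrics framework.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counters for cache loads; not the Hadoop metrics2 API. */
class CacheLoadMetricsSketch {
  private final AtomicLong loadCount = new AtomicLong();
  private final AtomicLong totalLoadNanos = new AtomicLong();

  /** Records one completed cache load and how long it took. */
  void recordLoad(long elapsedNanos) {
    loadCount.incrementAndGet();
    totalLoadNanos.addAndGet(elapsedNanos);
  }

  /** Runs the loader and records its duration, even if it throws. */
  void timedLoad(Runnable loader) {
    long start = System.nanoTime();
    try {
      loader.run();
    } finally {
      recordLoad(System.nanoTime() - start);
    }
  }

  long getLoadCount() {
    return loadCount.get();
  }

  /** Average load time in milliseconds, or 0 if nothing was loaded yet. */
  double getAvgLoadMillis() {
    long count = loadCount.get();
    if (count == 0) {
      return 0.0;
    }
    return (double) totalLoadNanos.get() / count / 1_000_000.0;
  }

  public static void main(String[] args) {
    CacheLoadMetricsSketch metrics = new CacheLoadMetricsSketch();
    metrics.timedLoad(() -> { /* stand-in for reloading the cache */ });
    System.out.println(metrics.getLoadCount() + " load(s), avg ms: "
        + metrics.getAvgLoadMillis());
  }
}
```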






[jira] [Commented] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt

2023-03-16 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701416#comment-17701416
 ] 

Viraj Jasani commented on HDFS-16957:
-

Thanks [~elgoiri], created PR [https://github.com/apache/hadoop/pull/5487]

> RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful 
> attempt
> --
>
> Key: HDFS-16957
> URL: https://issues.apache.org/jira/browse/HDFS-16957
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> DFS router admin returns a non-zero status code for an unsuccessful attempt 
> to add or update a mount point. However, the same is not the case for 
> removal of a mount point.
> For instance,
> {code:java}
> bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
> ..
> ..
> Cannot add destination at ns1 /data4
> echo $?
> 255 {code}
> {code:java}
> /hadoop/bin/hdfs dfsrouteradmin -rm /data4
> ..
> ..
> Cannot remove mount point /data4
> echo $?
> 0{code}
> Removal of mount point should stay consistent with other options and return 
> non-zero (unsuccessful) status code.






[jira] [Comment Edited] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt

2023-03-16 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701380#comment-17701380
 ] 

Viraj Jasani edited comment on HDFS-16957 at 3/16/23 8:57 PM:
--

[~ayushtkn] [~goiri] [~elgoiri] [~hexiaoqiao] could you please let me know if 
you agree with this?


was (Author: vjasani):
[~ayushtkn] [~goiri] [~hexiaoqiao] could you please let me know if you agree 
with this?

> RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful 
> attempt
> --
>
> Key: HDFS-16957
> URL: https://issues.apache.org/jira/browse/HDFS-16957
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> DFS router admin returns a non-zero status code for an unsuccessful attempt 
> to add or update a mount point. However, the same is not the case for 
> removal of a mount point.
> For instance,
> {code:java}
> bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
> ..
> ..
> Cannot add destination at ns1 /data4
> echo $?
> 255 {code}
> {code:java}
> /hadoop/bin/hdfs dfsrouteradmin -rm /data4
> ..
> ..
> Cannot remove mount point /data4
> echo $?
> 0{code}
> Removal of mount point should stay consistent with other options and return 
> non-zero (unsuccessful) status code.






[jira] [Commented] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt

2023-03-16 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701380#comment-17701380
 ] 

Viraj Jasani commented on HDFS-16957:
-

[~ayushtkn] [~goiri] [~hexiaoqiao] could you please let me know if you agree 
with this?

> RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful 
> attempt
> --
>
> Key: HDFS-16957
> URL: https://issues.apache.org/jira/browse/HDFS-16957
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> DFS router admin returns a non-zero status code for an unsuccessful attempt 
> to add or update a mount point. However, the same is not the case for 
> removal of a mount point.
> For instance,
> {code:java}
> bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
> ..
> ..
> Cannot add destination at ns1 /data4
> echo $?
> 255 {code}
> {code:java}
> /hadoop/bin/hdfs dfsrouteradmin -rm /data4
> ..
> ..
> Cannot remove mount point /data4
> echo $?
> 0{code}
> Removal of mount point should stay consistent with other options and return 
> non-zero (unsuccessful) status code.






[jira] [Created] (HDFS-16957) RBF: Exit status of dfsrouteradmin -rm should be non-zero for unsuccessful attempt

2023-03-16 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16957:
---

 Summary: RBF: Exit status of dfsrouteradmin -rm should be non-zero 
for unsuccessful attempt
 Key: HDFS-16957
 URL: https://issues.apache.org/jira/browse/HDFS-16957
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Viraj Jasani
Assignee: Viraj Jasani


DFS router admin returns a non-zero status code for an unsuccessful attempt to 
add or update a mount point. However, the same is not the case for removal of 
a mount point.

For instance,
{code:java}
bin/hdfs dfsrouteradmin -add /data4 ns1 /data4
..
..

Cannot add destination at ns1 /data4


echo $?
255 {code}
{code:java}
/hadoop/bin/hdfs dfsrouteradmin -rm /data4
..
..
Cannot remove mount point /data4


echo $?
0{code}
Removal of mount point should stay consistent with other options and return 
non-zero (unsuccessful) status code.
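
A minimal sketch of the intended behavior. It assumes the admin tool signals failure with an exit code of -1; that value is an assumption, although it is consistent with the 255 shown above, since a shell only sees the low 8 bits of the exit code. `ExitStatusSketch` and its methods are hypothetical names.

```java
/** Hypothetical mapping from an operation result to a process exit status. */
class ExitStatusSketch {

  /** Assumed convention: 0 on success, -1 on any unsuccessful attempt. */
  static int exitCodeFor(boolean succeeded) {
    return succeeded ? 0 : -1;
  }

  /** A POSIX shell reports only the low 8 bits of the exit code in $?. */
  static int asSeenByShell(int exitCode) {
    return exitCode & 0xFF;
  }

  public static void main(String[] args) {
    boolean removed = false; // stand-in for a failed -rm of a mount point
    System.out.println("exit status: " + asSeenByShell(exitCodeFor(removed)));
  }
}
```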






[jira] [Created] (HDFS-16953) RBF Mount table store APIs should update cache only if state store record is successfully updated

2023-03-15 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16953:
---

 Summary: RBF Mount table store APIs should update cache only if 
state store record is successfully updated
 Key: HDFS-16953
 URL: https://issues.apache.org/jira/browse/HDFS-16953
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


RBF Mount table state store APIs addMountTableEntry, updateMountTableEntry and 
removeMountTableEntry perform a cache refresh for all routers regardless of the 
actual record update result. If the record fails to get updated in the 
ZooKeeper/file based store implementation, reloading the cache for all routers 
would be unnecessary.

 

For instance, two simultaneous attempts to add the same new mount point could 
lead to failure of the second call if the first call has not yet added the new 
entry by the time the second call retrieves the mount table entries from 
getMountTableEntries, before attempting to call addMountTableEntry.
{code:java}
DEBUG [{cluster}/{ip}:8111] ipc.Client - IPC Client (1826699684) connection to 
nn-0-{ns}.{cluster}/{ip}:8111 from {user}IPC Client (1826699684) connection to 
nn-0-{ns}.{cluster}/{ip}:8111 from {user} sending #1 
org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocol.addMountTableEntry

DEBUG [{cluster}/{ip}:8111 from {user}] ipc.Client - IPC Client (1826699684) 
connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user} got value #1

DEBUG [main] ipc.ProtobufRpcEngine2 - Call: addMountTableEntry took 24ms

DEBUG [{cluster}/{ip}:8111 from {user}] ipc.Client - IPC Client (1826699684) 
connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user}: closed

DEBUG [{cluster}/{ip}:8111 from {user}] ipc.Client - IPC Client (1826699684) 
connection to nn-0-{ns}.{cluster}/{ip}:8111 from {user}: stopped, remaining 
connections 0

TRACE [main] ipc.ProtobufRpcEngine2 - 1: Response <- 
nn-0-{ns}.{cluster}/{ip}:8111: addMountTableEntry {status: false}

Cannot add mount point /data503 {code}
The failure to write new record:
{code:java}
INFO  [IPC Server handler 0 on default port 8111] impl.StateStoreZooKeeperImpl 
- Cannot write record "/hdfs-federation/MountTable/0SLASH0data503", it already 
exists {code}
Since the successful call has already refreshed the cache for all routers, the 
second call that failed should not refresh the cache for all routers again, as 
every router already has the updated records in its cache.
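
A minimal sketch of the proposed behavior, assuming a set of mount points in memory. `ConditionalRefreshSketch`, `addEntry`, and `refreshAllRouters` are hypothetical names, not the actual mount table store code; the refresh is triggered only when the record write succeeds.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical store that refreshes caches only after a successful write. */
class ConditionalRefreshSketch {
  private final Set<String> records = new HashSet<>();
  private int refreshes = 0;

  /** Returns false and skips the refresh if the record already exists. */
  boolean addEntry(String mountPoint) {
    boolean added = records.add(mountPoint);
    if (added) {
      refreshAllRouters();
    }
    return added;
  }

  private void refreshAllRouters() {
    refreshes++; // stands in for the cache reload on every router
  }

  int getRefreshes() {
    return refreshes;
  }

  public static void main(String[] args) {
    ConditionalRefreshSketch store = new ConditionalRefreshSketch();
    System.out.println(store.addEntry("/data503")); // first call succeeds
    System.out.println(store.addEntry("/data503")); // second call fails
    System.out.println("refreshes: " + store.getRefreshes());
  }
}
```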






[jira] [Created] (HDFS-16947) RBF NamenodeHeartbeatService to report error for not being able to register namenode in state store

2023-03-09 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16947:
---

 Summary: RBF NamenodeHeartbeatService to report error for not 
being able to register namenode in state store
 Key: HDFS-16947
 URL: https://issues.apache.org/jira/browse/HDFS-16947
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Namenode heartbeat service should log an error with the full stack trace if it 
cannot register the namenode in the state store. As of today, we only log an 
INFO message.

For the ZooKeeper based impl, this might mean that either a) the curator 
manager is not initialized or b) it fails to write to the znode after 
exhausting retries. For either of these cases, reporting only an INFO log 
might not be good enough and we might have to look for errors elsewhere.

 

Example:
{code:java}
2023-02-20 23:10:33,714 DEBUG [NamenodeHeartbeatService {ns} nn0-0] 
router.NamenodeHeartbeatService - Received service state: ACTIVE from HA 
namenode: {ns}-nn0:nn-0-{ns}.{cluster}:9000
2023-02-20 23:10:33,731 INFO  [NamenodeHeartbeatService {ns} nn0-0] 
impl.MembershipStoreImpl - Inserting new NN registration: 
nn-0.namenode.{cluster}:->{ns}:nn0:nn-0-{ns}.{cluster}:9000-ACTIVE
2023-02-20 23:10:33,731 INFO  [NamenodeHeartbeatService {ns} nn0-0] 
router.NamenodeHeartbeatService - Cannot register namenode in the State Store
 {code}
If we could log the full stack trace:
{code:java}
2023-02-21 00:20:24,691 ERROR [NamenodeHeartbeatService {ns} nn0-0] 
router.NamenodeHeartbeatService - Cannot register namenode in the State Store
org.apache.hadoop.hdfs.server.federation.store.StateStoreUnavailableException: 
State Store driver StateStoreZooKeeperImpl in nn-0.namenode.{cluster} is not 
ready.
        at 
org.apache.hadoop.hdfs.server.federation.store.driver.StateStoreDriver.verifyDriverReady(StateStoreDriver.java:158)
        at 
org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreZooKeeperImpl.putAll(StateStoreZooKeeperImpl.java:235)
        at 
org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreBaseImpl.put(StateStoreBaseImpl.java:74)
        at 
org.apache.hadoop.hdfs.server.federation.store.impl.MembershipStoreImpl.namenodeHeartbeat(MembershipStoreImpl.java:179)
        at 
org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:381)
        at 
org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:317)
        at 
org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.lambda$periodicInvoke$0(NamenodeHeartbeatService.java:244)
...
... {code}
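
A minimal sketch of the difference, assuming the failure surfaces as an exception. It uses java.util.logging only to stay self-contained; `RegistrationLoggingSketch`, `register`, and `CAPTURED` are hypothetical names. The key point is that the exception is passed to the logger, so the stack trace ends up in the log record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical sketch; java.util.logging is an assumption for brevity. */
class RegistrationLoggingSketch {
  static final Logger LOG = Logger.getLogger("RegistrationLoggingSketch");
  /** Records seen by the logger, kept so the effect can be inspected. */
  static final List<LogRecord> CAPTURED = new ArrayList<>();

  static {
    LOG.addHandler(new Handler() {
      @Override public void publish(LogRecord record) { CAPTURED.add(record); }
      @Override public void flush() { }
      @Override public void close() { }
    });
  }

  /** Runs the (hypothetical) state store write; logs the cause on failure. */
  static boolean register(Runnable storeWrite) {
    try {
      storeWrite.run();
      return true;
    } catch (RuntimeException e) {
      // Passing the exception keeps the full stack trace in the log record.
      LOG.log(Level.SEVERE, "Cannot register namenode in the State Store", e);
      return false;
    }
  }

  public static void main(String[] args) {
    register(() -> {
      throw new IllegalStateException("State Store driver is not ready");
    });
    System.out.println(CAPTURED.get(0).getThrown());
  }
}
```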






[jira] [Commented] (HDFS-16941) Path.suffix raises NullPointerException

2023-03-04 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696510#comment-17696510
 ] 

Viraj Jasani commented on HDFS-16941:
-

Oh I meant moving the Jira itself to Hadoop (options: From "More", select 
"Move") :)

But perhaps what you did is also fine. Thanks

> Path.suffix raises NullPointerException
> ---
>
> Key: HDFS-16941
> URL: https://issues.apache.org/jira/browse/HDFS-16941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hadoop-client, hdfs
>Affects Versions: 3.3.2
>Reporter: Patrick Grandjean
>Priority: Minor
>
> Calling the Path.suffix method on root raises a NullPointerException. Tested 
> with hadoop-client-api 3.3.2
> Scenario:
> {code:java}
> import org.apache.hadoop.fs.*
> Path root = new Path("/")
> root.getParent == null  // true
> root.suffix("bar")  // NPE is raised
> {code}
> Stack:
> {code:none}
> 23/03/03 15:13:18 ERROR Uncaught throwable from user code: 
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:104)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at org.apache.hadoop.fs.Path.suffix(Path.java:361)
> {code}






[jira] [Commented] (HDFS-16941) Path.suffix raises NullPointerException

2023-03-04 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696479#comment-17696479
 ] 

Viraj Jasani commented on HDFS-16941:
-

We can also move this Jira to HADOOP as Path is used by all FileSystem 
implementations.

> Path.suffix raises NullPointerException
> ---
>
> Key: HDFS-16941
> URL: https://issues.apache.org/jira/browse/HDFS-16941
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hadoop-client, hdfs
>Affects Versions: 3.3.2
>Reporter: Patrick Grandjean
>Priority: Minor
>
> Calling the Path.suffix method on root raises a NullPointerException. Tested 
> with hadoop-client-api 3.3.2
> Scenario:
> {code:java}
> import org.apache.hadoop.fs.*
> Path root = new Path("/")
> root.getParent == null  // true
> root.suffix("bar")  // NPE is raised
> {code}
> Stack:
> {code:none}
> 23/03/03 15:13:18 ERROR Uncaught throwable from user code: 
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:104)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at org.apache.hadoop.fs.Path.suffix(Path.java:361)
> {code}






[jira] [Created] (HDFS-16938) Utility to trigger heartbeat and wait until BP thread queue is fully processed

2023-03-01 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16938:
---

 Summary: Utility to trigger heartbeat and wait until BP thread 
queue is fully processed
 Key: HDFS-16938
 URL: https://issues.apache.org/jira/browse/HDFS-16938
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


As a follow-up to HDFS-16935, we should provide a utility to trigger a 
heartbeat and wait until the BP thread queue is fully processed. This would 
ensure 100% consistency with respect to the active namenode receiving bad 
block reports from the given datanode. This utility would resolve flakes in 
tests that rely on the namenode's awareness of the bad blocks reported by 
datanodes.
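
A minimal sketch of waiting on a condition instead of sleeping for a fixed time, which is the general idea behind such a utility. `WaitSketch.waitFor` is a hypothetical helper written for this example; it is not the proposed utility and not an existing Hadoop test API, and it does not trigger a heartbeat.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; illustrates "wait until done", not sleep(n). */
class WaitSketch {

  /**
   * Polls the condition every intervalMillis until it holds or the timeout
   * expires. Returns true as soon as the condition holds, false on timeout.
   */
  static boolean waitFor(BooleanSupplier condition, long timeoutMillis,
      long intervalMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    // Stand-in for "the queue has been fully processed".
    boolean done = waitFor(() -> System.currentTimeMillis() - start >= 100,
        2000, 10);
    System.out.println("condition met: " + done);
  }
}
```

On a fast system the call returns as soon as the condition holds, so a test does not pay for the full timeout.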






[jira] [Assigned] (HDFS-16935) TestFsDatasetImpl.testReportBadBlocks brittle

2023-02-24 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HDFS-16935:
---

Assignee: Viraj Jasani

> TestFsDatasetImpl.testReportBadBlocks brittle
> -
>
> Key: HDFS-16935
> URL: https://issues.apache.org/jira/browse/HDFS-16935
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.4.0, 3.3.5, 3.3.9
>Reporter: Steve Loughran
>Assignee: Viraj Jasani
>Priority: Minor
>
> jenkins failure as sleep() time not long enough
> {code}
> Failing for the past 1 build (Since #4 )
> Took 7.4 sec.
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.junit.Assert.assertEquals(Assert.java:633)
> {code}
> assert is after a 3s sleep waiting for reports coming in.
> {code}
>   dataNode.reportBadBlocks(block, dataNode.getFSDataset()
>   .getFsVolumeReferences().get(0));
>   Thread.sleep(3000);   // 3s sleep
>   BlockManagerTestUtil.updateState(cluster.getNamesystem()
>   .getBlockManager());
>   // Verify the bad block has been reported to namenode
>   Assert.assertEquals(1, 
> cluster.getNamesystem().getCorruptReplicaBlocks());  // here
> {code}
> LambdaTestUtils.eventually() should be used around this assert, maybe with an 
> even shorter initial delay so on faster systems, test is faster.






[jira] [Commented] (HDFS-16935) TestFsDatasetImpl.testReportBadBlocks brittle

2023-02-24 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693371#comment-17693371
 ] 

Viraj Jasani commented on HDFS-16935:
-

If we run this test in debug mode, it can be reproduced locally too.

> TestFsDatasetImpl.testReportBadBlocks brittle
> -
>
> Key: HDFS-16935
> URL: https://issues.apache.org/jira/browse/HDFS-16935
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.4.0, 3.3.5, 3.3.9
>Reporter: Steve Loughran
>Priority: Minor
>
> jenkins failure as sleep() time not long enough
> {code}
> Failing for the past 1 build (Since #4 )
> Took 7.4 sec.
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.junit.Assert.assertEquals(Assert.java:633)
> {code}
> assert is after a 3s sleep waiting for reports coming in.
> {code}
>   dataNode.reportBadBlocks(block, dataNode.getFSDataset()
>   .getFsVolumeReferences().get(0));
>   Thread.sleep(3000);   // 3s sleep
>   BlockManagerTestUtil.updateState(cluster.getNamesystem()
>   .getBlockManager());
>   // Verify the bad block has been reported to namenode
>   Assert.assertEquals(1, 
> cluster.getNamesystem().getCorruptReplicaBlocks());  // here
> {code}
> LambdaTestUtils.eventually() should be used around this assert, maybe with an 
> even shorter initial delay so on faster systems, test is faster.






[jira] [Updated] (HDFS-16925) Namenode audit log to only include IP address of client

2023-02-18 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16925:

Description: 
With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to 
InetAddress#getHostAddress, to save host name with IPC Connection object. When 
we perform InetAddress#getHostName, toString() of InetAddress would 
automatically print \{hostName}/\{hostIPAddress} if hostname is already 
resolved:
{code:java}
/**
 * Converts this IP address to a {@code String}. The
 * string returned is of the form: hostname / literal IP
 * address.
 *
 * If the host name is unresolved, no reverse name service lookup
 * is performed. The hostname part will be represented by an empty string.
 *
 * @return  a string representation of this IP address.
 */
public String toString() {
    String hostName = holder().getHostName();
    return ((hostName != null) ? hostName : "")
        + "/" + getHostAddress();
}{code}
 

For namenode audit logs, this means that when a DFS client makes filesystem 
updates, the audit logs would print the host name in addition to the IP 
address.

To maintain compatibility, the purpose of this Jira is to have the audit log 
retrieve only the IP address from InetAddress and print it.
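
A minimal sketch of the difference between the two calls, using only `java.net.InetAddress`. The host name `client.example.com`, the address `10.0.0.1`, and the class `AuditAddressSketch` are made up for illustration; `InetAddress.getByAddress(String, byte[])` is used so that no name lookup is performed.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical sketch comparing toString() with getHostAddress(). */
class AuditAddressSketch {

  /** Builds an address whose host name is already set; no DNS lookup. */
  static InetAddress client() {
    try {
      return InetAddress.getByAddress("client.example.com",
          new byte[] {10, 0, 0, 1});
    } catch (UnknownHostException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Host name and IP address, in the form hostname/literal-IP. */
  static String withToString(InetAddress address) {
    return address.toString();
  }

  /** IP address only, which is what the audit log printed before. */
  static String ipOnly(InetAddress address) {
    return address.getHostAddress();
  }

  public static void main(String[] args) {
    InetAddress address = client();
    System.out.println(withToString(address)); // client.example.com/10.0.0.1
    System.out.println(ipOnly(address));       // 10.0.0.1
  }
}
```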

  was:
With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to 
InetAddress#getHostAddress, to save host name with IPC Connection object. When 
we perform InetAddress#getHostName, toString() of InetAddress would 
automatically print \{hostName}/\{hostIPAddress} if hostname is already 
resolved:
{code:java}
/**
 * Converts this IP address to a {@code String}. The
 * string returned is of the form: hostname / literal IP
 * address.
 *
 * If the host name is unresolved, no reverse name service lookup
 * is performed. The hostname part will be represented by an empty string.
 *
 * @return  a string representation of this IP address.
 */
public String toString() {
    String hostName = holder().getHostName();
    return ((hostName != null) ? hostName : "")
        + "/" + getHostAddress();
}{code}
 

For namenode audit logs, this means that when a DFS client makes filesystem 
updates, the audit logs would print the host name in addition to the IP 
address. We have some tests that perform regex pattern matching to identify 
the log pattern of audit logs; we will have to change them to reflect the 
change in host address.


> Namenode audit log to only include IP address of client
> ---
>
> Key: HDFS-16925
> URL: https://issues.apache.org/jira/browse/HDFS-16925
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to 
> InetAddress#getHostAddress, to save host name with IPC Connection object. 
> When we perform InetAddress#getHostName, toString() of InetAddress would 
> automatically print \{hostName}/\{hostIPAddress} if hostname is already 
> resolved:
> {code:java}
> /**
>  * Converts this IP address to a {@code String}. The
>  * string returned is of the form: hostname / literal IP
>  * address.
>  *
>  * If the host name is unresolved, no reverse name service lookup
>  * is performed. The hostname part will be represented by an empty string.
>  *
>  * @return  a string representation of this IP address.
>  */
> public String toString() {
>     String hostName = holder().getHostName();
>     return ((hostName != null) ? hostName : "")
>         + "/" + getHostAddress();
> }{code}
>  
> For namenode audit logs, this means that when a DFS client makes filesystem 
> updates, the audit logs would print the host name in addition to the IP 
> address.
> To maintain compatibility, the purpose of this Jira is to let the audit log 
> retrieve only the IP address from InetAddress and print it.
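
The difference between the two string forms can be reproduced with the JDK
alone. The sketch below is only an illustration: the class name, the ipOnly()
helper and the "client-host" / 10.0.0.1 values are made up for the example and
are not taken from the Hadoop patch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class AuditAddressSketch {

  /** Hypothetical helper: the address form an audit log line would carry. */
  static String ipOnly(InetAddress addr) {
    return addr.getHostAddress();
  }

  /**
   * Builds an InetAddress that already has a host name attached.
   * getByAddress(host, bytes) performs no DNS lookup, so this mimics an
   * address whose host name has been resolved earlier.
   */
  static InetAddress resolvedClient() {
    try {
      return InetAddress.getByAddress("client-host", new byte[] {10, 0, 0, 1});
    } catch (UnknownHostException e) {
      // Only thrown for an illegal address length, which cannot happen here.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    InetAddress client = resolvedClient();
    System.out.println(client);          // host name and IP joined by "/"
    System.out.println(ipOnly(client));  // IP address only
  }
}
```

Calling getHostAddress() explicitly, rather than relying on toString(), keeps
the logged value independent of whether the host name has been resolved.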






[jira] [Updated] (HDFS-16925) Namenode audit log to only include IP address of client

2023-02-16 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16925:

Summary: Namenode audit log to only include IP address of client  (was: Fix 
regex pattern for namenode audit log tests)

> Namenode audit log to only include IP address of client
> ---
>
> Key: HDFS-16925
> URL: https://issues.apache.org/jira/browse/HDFS-16925
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to 
> InetAddress#getHostAddress, to save host name with IPC Connection object. 
> When we perform InetAddress#getHostName, toString() of InetAddress would 
> automatically print \{hostName}/\{hostIPAddress} if hostname is already 
> resolved:
> {code:java}
> /**
>  * Converts this IP address to a {@code String}. The
>  * string returned is of the form: hostname / literal IP
>  * address.
>  *
>  * If the host name is unresolved, no reverse name service lookup
>  * is performed. The hostname part will be represented by an empty string.
>  *
>  * @return  a string representation of this IP address.
>  */
> public String toString() {
>     String hostName = holder().getHostName();
>     return ((hostName != null) ? hostName : "")
>         + "/" + getHostAddress();
> }{code}
>  
> For namenode audit logs, this means that when a DFS client makes filesystem 
> updates, the audit logs would print the host name in addition to the IP 
> address. We have some tests that perform regex pattern matching to identify 
> the log pattern of audit logs; we will have to change them to reflect the 
> change in host address.






[jira] [Resolved] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-16 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HDFS-16918.
-
Resolution: Won't Fix

> Optionally shut down datanode if it does not stay connected to active namenode
> --
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> When HDFS is deployed on an Envoy proxy setup, network connection issues or 
> packet loss can be observed, depending on the socket timeout configured at 
> Envoy. The Envoy proxies together form a transparent communication mesh in 
> which each app sends and receives packets to and from localhost and is 
> unaware of the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications, in order to identify network issues reliably. However, such a 
> proxy-based setup can sometimes result in socket connection issues between 
> datanode and namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> Hadoop daemons are stopped. If a given datanode does not stay connected to 
> the active namenode in the cluster, i.e. does not receive a heartbeat 
> response in time from the active namenode (even though the active namenode 
> is not terminated), it is not of much use. We should be able to provide 
> configurable behavior such that if a given datanode cannot receive a 
> heartbeat response from the active namenode within a configurable time 
> duration, it terminates itself to avoid impacting the availability SLA. This 
> is specifically helpful when the underlying deployment or observability 
> framework (e.g. K8s) can start up the datanode automatically upon its 
> shutdown (unless it is being restarted as part of a rolling upgrade) and 
> help the newly brought up datanode (in the case of K8s, a new pod with 
> dynamically changing nodes) establish new socket connections to the active 
> and standby namenodes. This should be an opt-in behavior and not the 
> default.






[jira] [Updated] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-16 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16918:

Description: 
When HDFS is deployed on an Envoy proxy setup, network connection issues or 
packet loss can be observed, depending on the socket timeout configured at 
Envoy. The Envoy proxies together form a transparent communication mesh in 
which each app sends and receives packets to and from localhost and is 
unaware of the network topology.

The primary purpose of Envoy is to make the network transparent to 
applications, in order to identify network issues reliably. However, such a 
proxy-based setup can sometimes result in socket connection issues between 
datanode and namenode.

Many deployment frameworks provide auto-start functionality when any of the 
Hadoop daemons are stopped. If a given datanode does not stay connected to 
the active namenode in the cluster, i.e. does not receive a heartbeat 
response in time from the active namenode (even though the active namenode 
is not terminated), it is not of much use. We should be able to provide 
configurable behavior such that if a given datanode cannot receive a 
heartbeat response from the active namenode within a configurable time 
duration, it terminates itself to avoid impacting the availability SLA. This 
is specifically helpful when the underlying deployment or observability 
framework (e.g. K8s) can start up the datanode automatically upon its 
shutdown (unless it is being restarted as part of a rolling upgrade) and 
help the newly brought up datanode (in the case of K8s, a new pod with 
dynamically changing nodes) establish new socket connections to the active 
and standby namenodes. This should be an opt-in behavior and not the 
default.

  was:
While deploying Hdfs on Envoy proxy setup, depending on the socket timeout 
configured at envoy, the network connection issues or packet loss could be 
observed. All of envoys basically form a transparent communication mesh in 
which each app can send and receive packets to and from localhost and is 
unaware of the network topology.

The primary purpose of Envoy is to make the network transparent to 
applications, in order to identify network issues reliably. However, sometimes 
such proxy based setup could result in socket connection issues between 
datanode and namenode.

Many deployment frameworks provide auto-start functionality when any of the 
hadoop daemons are stopped. If a given datanode does not stay connected to 
active namenode in the cluster i.e. does not receive heartbeat response in time 
from active namenode (even though active namenode is not terminated), it would 
not be much useful. We should be able to provide configurable behavior such 
that if a given datanode cannot receive heartbeat response from active namenode 
in configurable time duration, it should terminate itself to avoid impacting 
the availability SLA. This is specifically helpful when the underlying 
deployment or observability framework (e.g. K8S) can start up the datanode 
automatically upon its shutdown (unless it is being restarted as part of 
rolling upgrade) and help the newly brought up datanode (in case of k8s, a new 
pod with dynamically changing nodes) establish new socket connection to active 
and standby namenodes. This should be an opt-in behavior and not default one.

 

In a distributed system, it is essential to have robust fail-fast mechanisms in 
place to prevent issues related to network partitioning. The system must be 
designed to prevent further degradation of availability and consistency in the 
event of a network partition. Several distributed systems offer fail-safe 
approaches, and for some, partition tolerance is critical to the extent that 
even a few seconds of heartbeat loss can trigger the removal of an application 
server instance from the cluster. For instance, a majority of ZooKeeper clients 
utilize ephemeral nodes for this purpose, to make the system reliable, 
fault-tolerant and strongly consistent in the event of a network partition.

From the HDFS architecture viewpoint, it is crucial to understand the critical 
role that the active and observer namenodes play in file system operations. In 
a large-scale cluster, if the datanodes holding the same block (primary and 
replicas) lose connection to both the active and observer namenodes for a 
significant amount of time, delaying the shutdown and restart of such 
datanodes to re-establish the connection with the namenodes (assuming the 
active namenode is alive; this assumption is important for re-establishing the 
connection in the event of a network partition) will further deteriorate the 
availability of the service. This scenario underscores the importance of 
resolving network partitioning.

This is a real use case for hdfs and it is not prudent to assume that every 
deployment or cluster management application must be able to restart datanodes 
based on JM

[jira] [Updated] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-15 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16918:

Description: 
When HDFS is deployed on an Envoy proxy setup, network connection issues or 
packet loss can be observed, depending on the socket timeout configured at 
Envoy. The Envoy proxies together form a transparent communication mesh in 
which each app sends and receives packets to and from localhost and is 
unaware of the network topology.

The primary purpose of Envoy is to make the network transparent to 
applications, in order to identify network issues reliably. However, such a 
proxy-based setup can sometimes result in socket connection issues between 
datanode and namenode.

Many deployment frameworks provide auto-start functionality when any of the 
Hadoop daemons are stopped. If a given datanode does not stay connected to 
the active namenode in the cluster, i.e. does not receive a heartbeat 
response in time from the active namenode (even though the active namenode 
is not terminated), it is not of much use. We should be able to provide 
configurable behavior such that if a given datanode cannot receive a 
heartbeat response from the active namenode within a configurable time 
duration, it terminates itself to avoid impacting the availability SLA. This 
is specifically helpful when the underlying deployment or observability 
framework (e.g. K8s) can start up the datanode automatically upon its 
shutdown (unless it is being restarted as part of a rolling upgrade) and 
help the newly brought up datanode (in the case of K8s, a new pod with 
dynamically changing nodes) establish new socket connections to the active 
and standby namenodes. This should be an opt-in behavior and not the 
default.

 

In a distributed system, it is essential to have robust fail-fast mechanisms in 
place to prevent issues related to network partitioning. The system must be 
designed to prevent further degradation of availability and consistency in the 
event of a network partition. Several distributed systems offer fail-safe 
approaches, and for some, partition tolerance is critical to the extent that 
even a few seconds of heartbeat loss can trigger the removal of an application 
server instance from the cluster. For instance, a majority of ZooKeeper clients 
utilize ephemeral nodes for this purpose, to make the system reliable, 
fault-tolerant and strongly consistent in the event of a network partition.

From the HDFS architecture viewpoint, it is crucial to understand the critical 
role that the active and observer namenodes play in file system operations. In 
a large-scale cluster, if the datanodes holding the same block (primary and 
replicas) lose connection to both the active and observer namenodes for a 
significant amount of time, delaying the shutdown and restart of such 
datanodes to re-establish the connection with the namenodes (assuming the 
active namenode is alive; this assumption is important for re-establishing the 
connection in the event of a network partition) will further deteriorate the 
availability of the service. This scenario underscores the importance of 
resolving network partitioning.

This is a real use case for hdfs and it is not prudent to assume that every 
deployment or cluster management application must be able to restart datanodes 
based on JMX metrics, as this would introduce another application to resolve 
the network partition impact of hdfs. Besides, popular cluster management 
applications are not typically used in all cloud-native environments. Even if these 
cluster management applications are deployed, certain security constraints may 
restrict their access to JMX metrics and prevent them from interfering with 
hdfs operations. The applications that can only trigger alerts for users based 
on set parameters (for instance, missing blocks > 0) are allowed to access JMX 
metrics.

  was:
While deploying Hdfs on Envoy proxy setup, depending on the socket timeout 
configured at envoy, the network connection issues or packet loss could be 
observed. All of envoys basically form a transparent communication mesh in 
which each app can send and receive packets to and from localhost and is 
unaware of the network topology.

The primary purpose of Envoy is to make the network transparent to 
applications, in order to identify network issues reliably. However, sometimes 
such proxy based setup could result in socket connection issues between 
datanode and namenode.

Many deployment frameworks provide auto-start functionality when any of the 
hadoop daemons are stopped. If a given datanode does not stay connected to 
active namenode in the cluster i.e. does not receive heartbeat response in time 
from active namenode (even though active namenode is not terminated), it would 
not be much useful. We should be able to provide configurable behavior such 
that if a given datanode cannot receive heartbeat response from active namen

[jira] [Created] (HDFS-16925) Fix regex pattern for namenode audit log tests

2023-02-15 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16925:
---

 Summary: Fix regex pattern for namenode audit log tests
 Key: HDFS-16925
 URL: https://issues.apache.org/jira/browse/HDFS-16925
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to 
InetAddress#getHostAddress, to save host name with IPC Connection object. When 
we perform InetAddress#getHostName, toString() of InetAddress would 
automatically print \{hostName}/\{hostIPAddress} if hostname is already 
resolved:
{code:java}
/**
 * Converts this IP address to a {@code String}. The
 * string returned is of the form: hostname / literal IP
 * address.
 *
 * If the host name is unresolved, no reverse name service lookup
 * is performed. The hostname part will be represented by an empty string.
 *
 * @return  a string representation of this IP address.
 */
public String toString() {
    String hostName = holder().getHostName();
    return ((hostName != null) ? hostName : "")
        + "/" + getHostAddress();
}{code}
 

For namenode audit logs, this means that when a DFS client makes filesystem 
updates, the audit logs would print the host name in addition to the IP 
address. We have some tests that perform regex pattern matching to identify 
the log pattern of audit logs; we will have to change them to reflect the 
change in host address.






[jira] [Created] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16918:
---

 Summary: Optionally shut down datanode if it does not stay 
connected to active namenode
 Key: HDFS-16918
 URL: https://issues.apache.org/jira/browse/HDFS-16918
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani


When HDFS is deployed on an Envoy proxy setup, network connection issues or 
packet loss can be observed, depending on the socket timeout configured at 
Envoy. The Envoy proxies together form a transparent communication mesh in 
which each app sends and receives packets to and from localhost and is 
unaware of the network topology.

The primary purpose of Envoy is to make the network transparent to 
applications, in order to identify network issues reliably. However, such a 
proxy-based setup can sometimes result in socket connection issues between 
datanode and namenode.

Many deployment frameworks provide auto-start functionality when any of the 
Hadoop daemons are stopped. If a given datanode does not stay connected to 
the active namenode in the cluster, i.e. does not receive a heartbeat 
response in time from the active namenode (even though the active namenode 
is not terminated), it is not of much use. We should be able to provide 
configurable behavior such that if a given datanode cannot receive a 
heartbeat response from the active namenode within a configurable time 
duration, it terminates itself to avoid impacting the availability SLA. This 
is specifically helpful when the underlying deployment or observability 
framework (e.g. K8s) can start up the datanode automatically upon its 
shutdown (unless it is being restarted as part of a rolling upgrade) and 
help the newly brought up datanode (in the case of K8s, a new pod with 
dynamically changing nodes) establish new socket connections to the active 
and standby namenodes. This should be an opt-in behavior and not the 
default.
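
The proposed behavior can be pictured as a small watchdog that compares the
time since the last heartbeat response with a configurable limit. The sketch
below is only an illustration under stated assumptions: the class, its methods
and the "limit <= 0 means disabled" convention are hypothetical and are not
part of Hadoop or of any patch on this issue.

```java
final class HeartbeatWatchdogSketch {

  // Hypothetical setting: maximum tolerated silence from the active
  // namenode. A value <= 0 keeps the feature off, making it opt-in.
  private final long maxSilenceMillis;

  // Time of the last heartbeat response received from the active namenode.
  private volatile long lastResponseMillis;

  HeartbeatWatchdogSketch(long maxSilenceMillis, long startMillis) {
    this.maxSilenceMillis = maxSilenceMillis;
    this.lastResponseMillis = startMillis;
  }

  /** Called only when a heartbeat response was actually received. */
  void onHeartbeatResponse(long nowMillis) {
    lastResponseMillis = nowMillis;
  }

  /** True when the feature is enabled and the silence exceeds the limit. */
  boolean shouldTerminate(long nowMillis) {
    return maxSilenceMillis > 0
        && nowMillis - lastResponseMillis > maxSilenceMillis;
  }

  public static void main(String[] args) {
    HeartbeatWatchdogSketch enabled = new HeartbeatWatchdogSketch(30_000, 0);
    enabled.onHeartbeatResponse(10_000);
    System.out.println(enabled.shouldTerminate(35_000)); // 25s of silence
    System.out.println(enabled.shouldTerminate(45_000)); // 35s of silence

    HeartbeatWatchdogSketch disabled = new HeartbeatWatchdogSketch(0, 0);
    System.out.println(disabled.shouldTerminate(1_000_000)); // feature off
  }
}
```

Passing the clock value in as a parameter keeps the decision logic easy to
test; a real daemon would additionally need to skip the check during a
rolling upgrade, as the description notes.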






[jira] [Updated] (HDFS-16907) BP service actor LastHeartbeat is not sufficient to track realtime connection breaks

2023-02-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16907:

Description: 
BP service actor LastHeartbeat is not sufficient to track realtime connection 
breaks.

Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode 
that it is connected to. However, this is updated even if the connection to the 
namenode is broken.

Suppose the actor thread keeps heartbeating to the namenode and the socket 
connection suddenly breaks. When this happens, for a certain time duration the 
actor thread keeps updating _lastHeartbeatTime_ before even initiating the 
heartbeat connection with the namenode. If the connection cannot be 
established even after RPC retries are exhausted, an IOException is thrown. 
This means that no heartbeat response has been received from the namenode. In 
the loop, the actor thread keeps trying to connect for the heartbeat, and the 
last heartbeat value stays close to 1/2s even though in reality no response is 
being received from the namenode.

 

Sample Exception from the BP service actor thread, during which LastHeartbeat 
stays very low:
{code:java}
2023-02-03 22:34:55,725 WARN  [xyz:9000] datanode.DataNode - IOException in 
offerService
java.io.EOFException: End of File Exception between local host is: "dn-0"; 
destination host is: "nn-1":9000; : java.io.EOFException; For more details see: 
 http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source)
    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
    at org.apache.hadoop.ipc.Client.call(Client.java:1495)
    at org.apache.hadoop.ipc.Client.call(Client.java:1392)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
    at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
    at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884)
    at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code}
Attaching screenshots of how last heartbeat value looks when the above error is 
consistently getting logged.

 

Last heartbeat response time is important to initiate any auto-recovery from 
the datanode. Hence, we should introduce LastHeartbeatResponseTime, which is 
updated only when the BP service actor thread successfully receives a 
response from the namenode.

  was:
Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode 
that it is connected to. However, this is updated even if the connection to the 
namenode is broken.

Suppose, the actor thread keeps heartbeating to namenode and suddenly the 
socket connection is broken. When this happens, until specific time duration, 
the actor thread consistently keeps updating _lastHeartbeatTime_ before even 
initiating heartbeat connection with namenode. If connection cannot be 
established even after RPC retries are exhausted, then IOException is thrown. 
This means that heartbeat response has not been received from the namenode. In 
the loop, the actor thread keeps trying connecting for heartbeat and the last 
heartbeat stays close to 1/2s even though in reality there is no response being 
received from namenode.

 

Sample Exception from the BP service actor thread, during which LastHeartbeat 
stays very low:
{code:java}
2023-02-03 22:34:55,725 WARN  [xyz:9000] datanode.DataNode - IOException in 
offerService
java.io.EOFException: End of File Exception between local host is: "dn-0"; 
destination host is: "nn-1":9000; : java.io.EOFException; For more details see: 
 http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source)
    at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect

[jira] [Updated] (HDFS-16907) Add LastHeartbeatResponseTime for BP service actor

2023-02-03 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16907:

Summary: Add LastHeartbeatResponseTime for BP service actor  (was: BP 
service actor LastHeartbeat is not sufficient to track realtime connection 
breaks)

> Add LastHeartbeatResponseTime for BP service actor
> --
>
> Key: HDFS-16907
> URL: https://issues.apache.org/jira/browse/HDFS-16907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot 2023-02-03 at 6.12.24 PM.png
>
>
> BP service actor LastHeartbeat is not sufficient to track realtime connection 
> breaks.
> Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode 
> that it is connected to. However, this is updated even if the connection to 
> the namenode is broken.
> Suppose the actor thread keeps heartbeating to the namenode and the socket 
> connection suddenly breaks. When this happens, for a certain time duration 
> the actor thread keeps updating _lastHeartbeatTime_ before even initiating 
> the heartbeat connection with the namenode. If the connection cannot be 
> established even after RPC retries are exhausted, an IOException is thrown. 
> This means that no heartbeat response has been received from the namenode. 
> In the loop, the actor thread keeps trying to connect for the heartbeat, and 
> the last heartbeat value stays close to 1/2s even though in reality no 
> response is being received from the namenode.
>  
> Sample Exception from the BP service actor thread, during which LastHeartbeat 
> stays very low:
> {code:java}
> 2023-02-03 22:34:55,725 WARN  [xyz:9000] datanode.DataNode - IOException in 
> offerService
> java.io.EOFException: End of File Exception between local host is: "dn-0"; 
> destination host is: "nn-1":9000; : java.io.EOFException; For more details 
> see:  http://wiki.apache.org/hadoop/EOFException
>     at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1495)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1392)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
>     at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884)
>     at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code}
> Attaching screenshots of how last heartbeat value looks when the above error 
> is consistently getting logged.
>  
> Last heartbeat response time is important to initiate any auto-recovery from 
> the datanode. Hence, we should introduce LastHeartbeatResponseTime, which is 
> updated only when the BP service actor thread successfully receives a 
> response from the namenode.
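
The distinction between the two timestamps can be sketched in a few lines of
plain Java. This is an illustration of the idea only, not the BPServiceActor
code: the class, the NameNodeStub interface and the method names are
hypothetical, and time is passed in explicitly to keep the example
deterministic.

```java
import java.io.EOFException;
import java.io.IOException;

final class HeartbeatTimesSketch {

  /** Hypothetical stand-in for the RPC to the namenode. */
  interface NameNodeStub {
    void sendHeartbeat() throws IOException;
  }

  private long lastHeartbeatTime;          // updated on every attempt
  private long lastHeartbeatResponseTime;  // updated only on success

  void heartbeat(NameNodeStub namenode, long nowMillis) {
    // Recorded before the call, so it advances even when the call fails.
    lastHeartbeatTime = nowMillis;
    try {
      namenode.sendHeartbeat();
      // Reached only if the namenode actually answered.
      lastHeartbeatResponseTime = nowMillis;
    } catch (IOException e) {
      // Connection is broken: the response time is left untouched.
    }
  }

  long millisSinceLastHeartbeat(long nowMillis) {
    return nowMillis - lastHeartbeatTime;
  }

  long millisSinceLastHeartbeatResponse(long nowMillis) {
    return nowMillis - lastHeartbeatResponseTime;
  }

  public static void main(String[] args) {
    HeartbeatTimesSketch actor = new HeartbeatTimesSketch();
    actor.heartbeat(() -> { }, 0L); // healthy heartbeat at t = 0
    for (long t = 1_000; t <= 10_000; t += 1_000) {
      // Every later attempt fails, as in the EOFException trace above.
      actor.heartbeat(() -> {
        throw new EOFException("connection broken");
      }, t);
    }
    System.out.println(actor.millisSinceLastHeartbeat(10_000));
    System.out.println(actor.millisSinceLastHeartbeatResponse(10_000));
  }
}
```

In this run the attempt-based value stays near zero while the response-based
value keeps growing, which is the signal an auto-recovery mechanism would need.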






[jira] [Created] (HDFS-16907) BP service actor LastHeartbeat is not sufficient to track realtime connection breaks

2023-02-03 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16907:
---

 Summary: BP service actor LastHeartbeat is not sufficient to track 
realtime connection breaks
 Key: HDFS-16907
 URL: https://issues.apache.org/jira/browse/HDFS-16907
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani
 Attachments: Screenshot 2023-02-03 at 6.12.24 PM.png

Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode 
that it is connected to. However, this is updated even if the connection to the 
namenode is broken.

Suppose the actor thread keeps heartbeating to the namenode and the socket 
connection suddenly breaks. When this happens, for a certain time duration the 
actor thread keeps updating _lastHeartbeatTime_ before even initiating the 
heartbeat connection with the namenode. If the connection cannot be 
established even after RPC retries are exhausted, an IOException is thrown. 
This means that no heartbeat response has been received from the namenode. In 
the loop, the actor thread keeps trying to connect for the heartbeat, and the 
last heartbeat value stays close to 1/2s even though in reality no response is 
being received from the namenode.

 

Sample Exception from the BP service actor thread, during which LastHeartbeat 
stays very low:
{code:java}
2023-02-03 22:34:55,725 WARN  [xyz:9000] datanode.DataNode - IOException in offerService
java.io.EOFException: End of File Exception between local host is: "dn-0"; destination host is: "nn-1":9000; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
    at org.apache.hadoop.ipc.Client.call(Client.java:1495)
    at org.apache.hadoop.ipc.Client.call(Client.java:1392)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
    at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code}
Attaching a screenshot of how the last heartbeat value looks while the above 
error is being logged consistently.

The last heartbeat response time is important for initiating any auto-recovery 
from the datanode. Hence, we should introduce LastHeartbeatResponseTime, which 
is updated only when the BP service actor thread successfully receives a 
response from the namenode.
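
A minimal sketch of the idea, assuming a hypothetical HeartbeatTracker class 
and HeartbeatRpc interface (neither is taken from BPServiceActor; they only 
illustrate how the two timestamps diverge once the connection breaks):

```java
import java.io.IOException;

/** Hypothetical sketch; not the actual BPServiceActor implementation. */
class HeartbeatTracker {

  /** Stand-in for the heartbeat RPC; throws when the connection is broken. */
  interface HeartbeatRpc {
    void sendHeartbeat() throws IOException;
  }

  // Updated on every attempt, even when the RPC later fails.
  private volatile long lastHeartbeatTime;
  // Updated only after a response was actually received.
  private volatile long lastHeartbeatResponseTime;

  /** Attempts one heartbeat at the given time; returns true on success. */
  boolean heartbeat(HeartbeatRpc rpc, long nowMs) {
    lastHeartbeatTime = nowMs;
    try {
      rpc.sendHeartbeat();
      lastHeartbeatResponseTime = nowMs;
      return true;
    } catch (IOException e) {
      // No response from the namenode: leave the response time untouched.
      return false;
    }
  }

  long getLastHeartbeatTime() {
    return lastHeartbeatTime;
  }

  long getLastHeartbeatResponseTime() {
    return lastHeartbeatResponseTime;
  }

  public static void main(String[] args) {
    HeartbeatTracker tracker = new HeartbeatTracker();
    tracker.heartbeat(() -> { }, 1000L);
    tracker.heartbeat(() -> { throw new IOException("connection broken"); }, 4000L);
    System.out.println("lastHeartbeatTime=" + tracker.getLastHeartbeatTime());
    System.out.println("lastHeartbeatResponseTime="
        + tracker.getLastHeartbeatResponseTime());
  }
}
```

With this split, the attempt time keeps moving on every retry while the 
response time stays at the last successful heartbeat, so any auto-recovery 
check could be based on the latter.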






[jira] [Updated] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice

2023-01-31 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16902:

Description: 
Recently came across a k8s environment where some datanode pods randomly fail 
to stay connected to all namenode pods (e.g. the last heartbeat time sometimes 
stays higher than 2 hr). When a standby namenode becomes active, any datanode 
that has not been heartbeating to it for quite some time is unable to send any 
further block reports, leading to missing replicas immediately after namenode 
failover, which can only be resolved by restarting the datanode pod.

While the issue seems env specific, BPServiceActor's offerService could use 
some logging improvements. It would also be good to expose the namenode status 
through BPServiceActorInfo, to identify any lag on the datanode side in 
recognizing the updated Active namenode status via heartbeats.

  was:
Recently came across a k8s environment where some datanode pods randomly fail 
to stay connected to all namenode pods (e.g. the last heartbeat time sometimes 
stays higher than 2 hr). When a new namenode becomes active, any datanode 
that is not heartbeating to it is unable to send any further block reports, 
sometimes leading to missing replicas, which can only be resolved by 
restarting the datanode pod.

While the issue seems env specific, BPServiceActor's offerService could use 
some logging improvements. It would also be good to expose the namenode status 
through BPServiceActorInfo, to identify any lag on the datanode side in 
recognizing the updated Active namenode status via heartbeats.


> Add Namenode status to BPServiceActor metrics and improve logging in 
> offerservice
> -
>
> Key: HDFS-16902
> URL: https://issues.apache.org/jira/browse/HDFS-16902
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> Recently came across a k8s environment where some datanode pods randomly 
> fail to stay connected to all namenode pods (e.g. the last heartbeat time 
> sometimes stays higher than 2 hr). When a standby namenode becomes active, 
> any datanode that has not been heartbeating to it for quite some time is 
> unable to send any further block reports, leading to missing replicas 
> immediately after namenode failover, which can only be resolved by 
> restarting the datanode pod.
> While the issue seems env specific, BPServiceActor's offerService could use 
> some logging improvements. It would also be good to expose the namenode 
> status through BPServiceActorInfo, to identify any lag on the datanode side 
> in recognizing the updated Active namenode status via heartbeats.






[jira] [Created] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice

2023-01-31 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16902:
---

 Summary: Add Namenode status to BPServiceActor metrics and improve 
logging in offerservice
 Key: HDFS-16902
 URL: https://issues.apache.org/jira/browse/HDFS-16902
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Recently came across a k8s environment where some datanode pods randomly fail 
to stay connected to all namenode pods (e.g. the last heartbeat time sometimes 
stays higher than 2 hr). When a new namenode becomes active, any datanode 
that is not heartbeating to it is unable to send any further block reports, 
sometimes leading to missing replicas, which can only be resolved by 
restarting the datanode pod.

While the issue seems env specific, BPServiceActor's offerService could use 
some logging improvements. It would also be good to expose the namenode status 
through BPServiceActorInfo, to identify any lag on the datanode side in 
recognizing the updated Active namenode status via heartbeats.






[jira] [Created] (HDFS-16891) Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel

2023-01-13 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16891:
---

 Summary: Avoid the overhead of copy-on-write exception list while 
loading inodes sub sections in parallel
 Key: HDFS-16891
 URL: https://issues.apache.org/jira/browse/HDFS-16891
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.3.4
Reporter: Viraj Jasani
Assignee: Viraj Jasani


If we enable parallel loading and persisting of inodes from/to the fs image, we 
get the benefit of improved performance. However, if we encounter any errors 
while loading the INODE_DIR_SUB and INODE_SUB sub-sections, we use a 
copy-on-write list to maintain the list of exceptions. Since our use case does 
not involve iterating over this list while executor threads are adding new 
elements to it, using copy-on-write is a bit of an overhead.

It would be better to synchronize adding new elements to the list rather than 
having the list copy all elements over every time a new element is added.
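
To illustrate the proposal, here is a rough sketch. It is not the actual 
fs image loading code; the class name, the method name and the simulated 
failures are made up for the example. Executor threads append to a 
synchronized ArrayList, and the list is only read after all tasks have 
finished, so no per-add copy of the backing array is needed:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of collecting exceptions from parallel loaders. */
class ParallelLoadSketch {

  /** Loads the given number of sub-sections and returns the failures. */
  static List<IOException> loadInParallel(int subSections, int threads) {
    // Each add() takes a lock; unlike CopyOnWriteArrayList it does not
    // copy the whole backing array on every insertion.
    List<IOException> exceptions =
        Collections.synchronizedList(new ArrayList<>());
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < subSections; i++) {
      final int id = i;
      pool.execute(() -> {
        // Simulated failure: pretend every third sub-section cannot be read.
        if (id % 3 == 0) {
          exceptions.add(new IOException("Failed to load sub-section " + id));
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("Timed out waiting for loaders");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while loading", e);
    }
    // Safe to read here: all writer tasks have completed.
    return exceptions;
  }

  public static void main(String[] args) {
    List<IOException> failures = loadInParallel(9, 4);
    System.out.println("Number of failures: " + failures.size());
  }
}
```

Copy-on-write pays off when readers iterate concurrently with writers; here 
the only read happens after the executor has terminated, which is why plain 
synchronization is sufficient.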






[jira] [Created] (HDFS-16887) Log start and end of phase/step in startup progress

2023-01-11 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16887:
---

 Summary: Log start and end of phase/step in startup progress
 Key: HDFS-16887
 URL: https://issues.apache.org/jira/browse/HDFS-16887
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani


As part of Namenode startup progress, multiple phases, and steps within each 
phase, are instantiated. While the startup progress view can be instantiated 
with the current view of a phase/step, having at least DEBUG logs for startup 
progress would help identify when a particular step of 
LOADING_FSIMAGE/SAVING_CHECKPOINT/LOADING_EDITS started and ended.
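
Something along these lines is what I have in mind. This is only a sketch: the 
class, its methods, the message wording and the step name are hypothetical, 
and it uses java.util.logging (FINE level) just to stay self-contained, 
whereas the real change would use the logger already used by the startup 
progress code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of logging the start and end of each step. */
class StartupProgressLogSketch {

  enum Phase { LOADING_FSIMAGE, LOADING_EDITS, SAVING_CHECKPOINT }

  private static final Logger LOG =
      Logger.getLogger(StartupProgressLogSketch.class.getName());

  // Kept only so the sketch can be inspected without a log handler.
  private final List<String> messages = new ArrayList<>();

  void beginStep(Phase phase, String step) {
    log("Beginning of the step. Phase: " + phase + ", Step: " + step);
  }

  void endStep(Phase phase, String step) {
    log("End of the step. Phase: " + phase + ", Step: " + step);
  }

  private void log(String message) {
    messages.add(message);
    // FINE is the closest java.util.logging level to DEBUG.
    LOG.log(Level.FINE, message);
  }

  List<String> getMessages() {
    return messages;
  }

  public static void main(String[] args) {
    StartupProgressLogSketch progress = new StartupProgressLogSketch();
    progress.beginStep(Phase.LOADING_FSIMAGE, "inodes");
    progress.endStep(Phase.LOADING_FSIMAGE, "inodes");
    progress.getMessages().forEach(System.out::println);
  }
}
```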






[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19

2022-12-26 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652123#comment-17652123
 ] 

Viraj Jasani commented on HDFS-16652:
-

{quote}I can cherry-pick other branches too. Please let me know which all 
branches. only for branch-3.3.?
{quote}
Thank you [~brahmareddy]!

IMHO backporting to branch-3.3 would be great (we already had to keep this 
patch on the 3.3 branch anyway, given the severity of the reported vulnerability).

> Upgrade jquery datatable version references to v1.10.19
> ---
>
> Key: HDFS-16652
> URL: https://issues.apache.org/jira/browse/HDFS-16652
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16652.001.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Upgrade jquery datatable version references in hdfs webapp to v1.10.19






[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

2022-11-01 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627290#comment-17627290
 ] 

Viraj Jasani commented on HDFS-16829:
-

{quote}I think if there is one node only, we can give it to try that we have 
syncBlock always true in such cases
{quote}
If there is only one node, could keeping syncBlock true have a significant 
latency impact?

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: 
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>   at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This is the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (this physical host was also running 
> the region server process, which was the hdfs client) was restarting at the 
> same time. Unfortunately this node was sick and it didn't log anything in 
> either the datanode process or the regionserver process during the time of 
> block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
>  Logs from namenode  -
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for 
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile:  is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -  Logs from nodeB -
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
>  
> {noformat}
> -  Logs from nodeC -
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeB:56486
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> PacketResponder: BP-9588

[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19

2022-09-08 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602007#comment-17602007
 ] 

Viraj Jasani commented on HDFS-16652:
-

[~groot] I am talking about YARN-8854. I have commented on YARN-8854 as well to 
get clarification on title vs commit diff.

This Jira is good; my only request is that it would be good to also backport 
the [PR|https://github.com/apache/hadoop/pull/4562] to branch-3.3.

> Upgrade jquery datatable version references to v1.10.19
> ---
>
> Key: HDFS-16652
> URL: https://issues.apache.org/jira/browse/HDFS-16652
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16652.001.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Upgrade jquery datatable version references in hdfs webapp to v1.10.19






[jira] [Updated] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19

2022-09-08 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16652:

Target Version/s: 3.4.0, 3.3.9

> Upgrade jquery datatable version references to v1.10.19
> ---
>
> Key: HDFS-16652
> URL: https://issues.apache.org/jira/browse/HDFS-16652
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16652.001.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Upgrade jquery datatable version references in hdfs webapp to v1.10.19






[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19

2022-09-08 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601972#comment-17601972
 ] 

Viraj Jasani commented on HDFS-16652:
-

Looks like YARN-8854 title says it upgraded datatable to 1.10.19 but the patch 
upgraded it to 1.10.18. Let me try to clarify on the Jira.

> Upgrade jquery datatable version references to v1.10.19
> ---
>
> Key: HDFS-16652
> URL: https://issues.apache.org/jira/browse/HDFS-16652
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16652.001.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Upgrade jquery datatable version references in hdfs webapp to v1.10.19






[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19

2022-09-08 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601970#comment-17601970
 ] 

Viraj Jasani commented on HDFS-16652:
-

FYI [~apurtell] regarding the jquery datatable vulnerability on the 3.3 release 
line. It seems that HDFS-6407 added datatable 1.10.7 to HDFS and the version 
has not been upgraded for HDFS since. YARN-8854 did upgrade datatable to 
1.10.18, but only for YARN.

> Upgrade jquery datatable version references to v1.10.19
> ---
>
> Key: HDFS-16652
> URL: https://issues.apache.org/jira/browse/HDFS-16652
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16652.001.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Upgrade jquery datatable version references in hdfs webapp to v1.10.19






[jira] [Commented] (HDFS-16652) Upgrade jquery datatable version references to v1.10.19

2022-09-08 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601968#comment-17601968
 ] 

Viraj Jasani commented on HDFS-16652:
-

[~dmmkr] thanks for this work. Are you planning to create a backport PR for 
branch-3.3 as well?

> Upgrade jquery datatable version references to v1.10.19
> ---
>
> Key: HDFS-16652
> URL: https://issues.apache.org/jira/browse/HDFS-16652
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16652.001.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Upgrade jquery datatable version references in hdfs webapp to v1.10.19






[jira] [Assigned] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error

2022-08-04 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HDFS-16702:
---

Assignee: Steve Vaughan  (was: Viraj Jasani)

> MiniDFSCluster should report cause of exception in assertion error
> --
>
> Key: HDFS-16702
> URL: https://issues.apache.org/jira/browse/HDFS-16702
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment: Tests running in the Hadoop dev environment image.
>Reporter: Steve Vaughan
>Assignee: Steve Vaughan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> When the MiniDFSCluster detects that an exception caused an exit, it should 
> include that exception as the cause for the AssertionError that it throws.  
> The current AssertionError simply reports the message "Test resulted in an 
> unexpected exit" and provides a stack trace to the location of the check for 
> an exit exception.
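
For illustration, a small sketch of the difference. The ExitCauseSketch class 
and its RecordedExit exception are hypothetical stand-ins, not MiniDFSCluster 
or ExitUtil code; the only real API relied on is the 
AssertionError(String, Throwable) constructor available since Java 7:

```java
/** Hypothetical sketch; not the actual MiniDFSCluster code. */
class ExitCauseSketch {

  /** Stand-in for the exception recorded when a test triggers an exit. */
  static class RecordedExit extends RuntimeException {
    RecordedExit(String message) {
      super(message);
    }
  }

  /** Current behaviour as described: message only, original failure lost. */
  static AssertionError withoutCause() {
    return new AssertionError("Test resulted in an unexpected exit");
  }

  /** Proposed behaviour: chain the recorded exception as the cause. */
  static AssertionError withCause(Throwable recordedExit) {
    return new AssertionError("Test resulted in an unexpected exit",
        recordedExit);
  }

  public static void main(String[] args) {
    RecordedExit exit = new RecordedExit("simulated fatal error in a daemon");
    AssertionError error = withCause(exit);
    // The stack trace now ends with a "Caused by:" section for the exit.
    error.printStackTrace(System.out);
  }
}
```

With the cause attached, the test report shows where the exit actually 
originated instead of only the location of the exit check.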






[jira] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error

2022-08-01 Thread Viraj Jasani (Jira)


[ https://issues.apache.org/jira/browse/HDFS-16702 ]


Viraj Jasani deleted comment on HDFS-16702:
-

was (Author: vjasani):
In fact, we can make a generic change to ExitException so that its object 
always prints the cause for the ExitException.

> MiniDFSCluster should report cause of exception in assertion error
> --
>
> Key: HDFS-16702
> URL: https://issues.apache.org/jira/browse/HDFS-16702
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment: Tests running in the Hadoop dev environment image.
>Reporter: Steve Vaughan
>Assignee: Viraj Jasani
>Priority: Minor
>
> When the MiniDFSCluster detects that an exception caused an exit, it should 
> include that exception as the cause for the AssertionError that it throws.  
> The current AssertionError simply reports the message "Test resulted in an 
> unexpected exit" and provides a stack trace to the location of the check for 
> an exit exception.






[jira] [Commented] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error

2022-08-01 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573963#comment-17573963
 ] 

Viraj Jasani commented on HDFS-16702:
-

In fact, we can make a generic change to ExitException so that its object 
always prints the cause for the ExitException.

> MiniDFSCluster should report cause of exception in assertion error
> --
>
> Key: HDFS-16702
> URL: https://issues.apache.org/jira/browse/HDFS-16702
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment: Tests running in the Hadoop dev environment image.
>Reporter: Steve Vaughan
>Assignee: Viraj Jasani
>Priority: Minor
>
> When the MiniDFSCluster detects that an exception caused an exit, it should 
> include that exception as the cause for the AssertionError that it throws.  
> The current AssertionError simply reports the message "Test resulted in an 
> unexpected exit" and provides a stack trace to the location of the check for 
> an exit exception.






[jira] [Commented] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error

2022-08-01 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573962#comment-17573962
 ] 

Viraj Jasani commented on HDFS-16702:
-

I encountered this some time back and had a similar thought, but somehow missed 
creating a Jira. May I take this up?

> MiniDFSCluster should report cause of exception in assertion error
> --
>
> Key: HDFS-16702
> URL: https://issues.apache.org/jira/browse/HDFS-16702
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment: Tests running in the Hadoop dev environment image.
>Reporter: Steve Vaughan
>Priority: Minor
>
> When the MiniDFSCluster detects that an exception caused an exit, it should 
> include that exception as the cause for the AssertionError that it throws.  
> The current AssertionError simply reports the message "Test resulted in an 
> unexpected exit" and provides a stack trace to the location of the check for 
> an exit exception.






[jira] [Assigned] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error

2022-08-01 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HDFS-16702:
---

Assignee: Viraj Jasani

> MiniDFSCluster should report cause of exception in assertion error
> --
>
> Key: HDFS-16702
> URL: https://issues.apache.org/jira/browse/HDFS-16702
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment: Tests running in the Hadoop dev environment image.
>Reporter: Steve Vaughan
>Assignee: Viraj Jasani
>Priority: Minor
>
> When the MiniDFSCluster detects that an exception caused an exit, it should 
> include that exception as the cause for the AssertionError that it throws.  
> The current AssertionError simply reports the message "Test resulted in an 
> unexpected exit" and provides a stack trace to the location of the check for 
> an exit exception.






[jira] [Comment Edited] (HDFS-16637) TestHDFSCLI#testAll consistently failing

2022-06-20 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187
 ] 

Viraj Jasani edited comment on HDFS-16637 at 6/20/22 7:43 PM:
--

No worries at all [~jianghuazhu], it happens with everyone, build results are 
sometimes ignored in the hurry and we learn from it later :)

Thanks for your contributions!


was (Author: vjasani):
No worries at all [~jianghuazhu], it happens with everyone, build results are 
sometimes ignored in the hurry and we learn from it later :)

> TestHDFSCLI#testAll consistently failing
> 
>
> Key: HDFS-16637
> URL: https://issues.apache.org/jira/browse/HDFS-16637
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The failure seems to have been caused by the output change introduced by 
> HDFS-16581.
> {code:java}
> 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(146)) - Detailed results:
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(147)) - 
> --
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(156)) - 
> ---
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(157)) -                     Test ID: [629]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(158)) -            Test Description: 
> [printTopology: verifying that the topology map is what we expect]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(159)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(163)) -               Test Commands: [-fs 
> hdfs://localhost:51486 -printTopology]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(167)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(174)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(178)) -                  Comparator: 
> [RegexpAcrossOutputComparator]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(180)) -          Comparision result:   
> [fail]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(182)) -             Expected output:   
> [^Rack: 
> \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)]
> 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(184)) -               Actual output:   
> [Rack: /rack1
>    127.0.0.1:51487 (localhost) In Service
>    127.0.0.1:51491 (localhost) In ServiceRack: /rack2
>    127.0.0.1:51500 (localhost) In Service
>    127.0.0.1:51496 (localhost) In Service
>    127.0.0.1:51504 (localhost) In ServiceRack: /rack3
>    127.0.0.1:51508 (localhost) In ServiceRack: /rack4
>    127.0.0.1:51512 (localhost) In Service
>    127.0.0.1:51516 (localhost) In Service]
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16637) TestHDFSCLI#testAll consistently failing

2022-06-20 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187
 ] 

Viraj Jasani edited comment on HDFS-16637 at 6/20/22 7:43 PM:
--

No worries at all [~jianghuazhu], it happens with everyone, build results are 
sometimes ignored in the hurry and we learn from it later :)


was (Author: vjasani):
No worries at all [~jianghuazhu], it happens with everyone, build results are 
sometimes ignored in the hurry and we learn from it :)

> TestHDFSCLI#testAll consistently failing
> 
>
> Key: HDFS-16637
> URL: https://issues.apache.org/jira/browse/HDFS-16637
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The failure seems to have been caused by output change introduced by 
> HDFS-16581.
> {code:java}
> 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(146)) - Detailed results:
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(147)) - 
> --
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(156)) - 
> ---
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(157)) -                     Test ID: [629]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(158)) -            Test Description: 
> [printTopology: verifying that the topology map is what we expect]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(159)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(163)) -               Test Commands: [-fs 
> hdfs://localhost:51486 -printTopology]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(167)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(174)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(178)) -                  Comparator: 
> [RegexpAcrossOutputComparator]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(180)) -          Comparision result:   
> [fail]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(182)) -             Expected output:   
> [^Rack: 
> \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)]
> 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(184)) -               Actual output:   
> [Rack: /rack1
>    127.0.0.1:51487 (localhost) In Service
>    127.0.0.1:51491 (localhost) In ServiceRack: /rack2
>    127.0.0.1:51500 (localhost) In Service
>    127.0.0.1:51496 (localhost) In Service
>    127.0.0.1:51504 (localhost) In ServiceRack: /rack3
>    127.0.0.1:51508 (localhost) In ServiceRack: /rack4
>    127.0.0.1:51512 (localhost) In Service
>    127.0.0.1:51516 (localhost) In Service]
>  {code}






[jira] [Comment Edited] (HDFS-16637) TestHDFSCLI#testAll consistently failing

2022-06-20 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187
 ] 

Viraj Jasani edited comment on HDFS-16637 at 6/20/22 7:02 PM:
--

No worries at all [~jianghuazhu], it happens with everyone, build results are 
sometimes ignored in the hurry and we learn from it :)


was (Author: vjasani):
No worries at all [~jianghuazhu], this is not carelessness at all, it happens 
with everyone :)

> TestHDFSCLI#testAll consistently failing
> 
>
> Key: HDFS-16637
> URL: https://issues.apache.org/jira/browse/HDFS-16637
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The failure seems to have been caused by output change introduced by 
> HDFS-16581.
> {code:java}
> 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(146)) - Detailed results:
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(147)) - 
> --
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(156)) - 
> ---
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(157)) -                     Test ID: [629]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(158)) -            Test Description: 
> [printTopology: verifying that the topology map is what we expect]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(159)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(163)) -               Test Commands: [-fs 
> hdfs://localhost:51486 -printTopology]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(167)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(174)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(178)) -                  Comparator: 
> [RegexpAcrossOutputComparator]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(180)) -          Comparision result:   
> [fail]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(182)) -             Expected output:   
> [^Rack: 
> \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)]
> 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(184)) -               Actual output:   
> [Rack: /rack1
>    127.0.0.1:51487 (localhost) In Service
>    127.0.0.1:51491 (localhost) In ServiceRack: /rack2
>    127.0.0.1:51500 (localhost) In Service
>    127.0.0.1:51496 (localhost) In Service
>    127.0.0.1:51504 (localhost) In ServiceRack: /rack3
>    127.0.0.1:51508 (localhost) In ServiceRack: /rack4
>    127.0.0.1:51512 (localhost) In Service
>    127.0.0.1:51516 (localhost) In Service]
>  {code}






[jira] [Commented] (HDFS-16637) TestHDFSCLI#testAll consistently failing

2022-06-19 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556187#comment-17556187
 ] 

Viraj Jasani commented on HDFS-16637:
-

No worries at all [~jianghuazhu], this is not carelessness at all, it happens 
with everyone :)

> TestHDFSCLI#testAll consistently failing
> 
>
> Key: HDFS-16637
> URL: https://issues.apache.org/jira/browse/HDFS-16637
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The failure seems to have been caused by output change introduced by 
> HDFS-16581.
> {code:java}
> 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(146)) - Detailed results:
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(147)) - 
> --
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(156)) - 
> ---
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(157)) -                     Test ID: [629]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(158)) -            Test Description: 
> [printTopology: verifying that the topology map is what we expect]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(159)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(163)) -               Test Commands: [-fs 
> hdfs://localhost:51486 -printTopology]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(167)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(174)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(178)) -                  Comparator: 
> [RegexpAcrossOutputComparator]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(180)) -          Comparision result:   
> [fail]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(182)) -             Expected output:   
> [^Rack: 
> \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)]
> 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(184)) -               Actual output:   
> [Rack: /rack1
>    127.0.0.1:51487 (localhost) In Service
>    127.0.0.1:51491 (localhost) In ServiceRack: /rack2
>    127.0.0.1:51500 (localhost) In Service
>    127.0.0.1:51496 (localhost) In Service
>    127.0.0.1:51504 (localhost) In ServiceRack: /rack3
>    127.0.0.1:51508 (localhost) In ServiceRack: /rack4
>    127.0.0.1:51512 (localhost) In Service
>    127.0.0.1:51516 (localhost) In Service]
>  {code}






[jira] [Created] (HDFS-16637) TestHDFSCLI#testAll consistently failing

2022-06-19 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16637:
---

 Summary: TestHDFSCLI#testAll consistently failing
 Key: HDFS-16637
 URL: https://issues.apache.org/jira/browse/HDFS-16637
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Viraj Jasani
Assignee: Viraj Jasani


The failure seems to have been caused by output change introduced by HDFS-16581.
{code:java}
2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(146)) - Detailed results:
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(147)) - 
--
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(156)) - 
---
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(157)) -                     Test ID: [629]
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(158)) -            Test Description: 
[printTopology: verifying that the topology map is what we expect]
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(159)) - 
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(163)) -               Test Commands: [-fs 
hdfs://localhost:51486 -printTopology]
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(167)) - 
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(174)) - 
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(178)) -                  Comparator: 
[RegexpAcrossOutputComparator]
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(180)) -          Comparision result:   [fail]
2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(182)) -             Expected output:   
[^Rack: 
\/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)]
2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO  cli.CLITestHelper 
(CLITestHelper.java:displayResults(184)) -               Actual output:   
[Rack: /rack1
   127.0.0.1:51487 (localhost) In Service
   127.0.0.1:51491 (localhost) In ServiceRack: /rack2
   127.0.0.1:51500 (localhost) In Service
   127.0.0.1:51496 (localhost) In Service
   127.0.0.1:51504 (localhost) In ServiceRack: /rack3
   127.0.0.1:51508 (localhost) In ServiceRack: /rack4
   127.0.0.1:51512 (localhost) In Service
   127.0.0.1:51516 (localhost) In Service]
 {code}
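
The mismatch in the log above can be reproduced in isolation. The sketch below 
is not taken from the Hadoop test code; it assumes the comparator performs an 
unanchored search equivalent to Matcher#find(), and the "old style" output 
string (without the status suffix) is a hypothetical reconstruction of what 
the expected pattern was written for. The class and method names are made up 
for illustration.

```java
import java.util.regex.Pattern;

public class PrintTopologyRegexSketch {
  // Expected pattern copied from the "Expected output" entry in the log.
  public static final String EXPECTED =
      "^Rack: \\/rack1\\s*127\\.0\\.0\\.1:\\d+\\s\\([-.a-zA-Z0-9]+\\)"
          + "\\s*127\\.0\\.0\\.1:\\d+\\s\\([-.a-zA-Z0-9]+\\)";

  // Assumption: the comparator searches the output for the pattern (find()),
  // rather than requiring the whole output to match.
  public static boolean matches(String regex, String output) {
    return Pattern.compile(regex).matcher(output).find();
  }

  public static void main(String[] args) {
    // Hypothetical output shape without a status suffix after the hostname.
    String oldStyle = "Rack: /rack1\n"
        + "   127.0.0.1:51487 (localhost)\n"
        + "   127.0.0.1:51491 (localhost)\n";
    // Output shape seen in the "Actual output" entry of the log.
    String newStyle = "Rack: /rack1\n"
        + "   127.0.0.1:51487 (localhost) In Service\n"
        + "   127.0.0.1:51491 (localhost) In Service\n";

    System.out.println("old style matches: " + matches(EXPECTED, oldStyle));
    System.out.println("new style matches: " + matches(EXPECTED, newStyle));
  }
}
```

After the first "(localhost)" the pattern only allows whitespace before the 
next address, so the added " In Service" text is what makes the search fail.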






[jira] [Created] (HDFS-16634) Dynamically adjust slow peer report size on JMX metrics

2022-06-16 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16634:
---

 Summary: Dynamically adjust slow peer report size on JMX metrics
 Key: HDFS-16634
 URL: https://issues.apache.org/jira/browse/HDFS-16634
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


On a busy cluster, it sometimes takes a while for the "slow node report" of a 
node deleted from the cluster to be removed from the slow peer JSON report in 
the Namenode JMX metrics. In the meantime, the user should be able to browse 
more entries in the report by reconfiguring 
"dfs.datanode.max.nodes.to.report", so that the list size can be adjusted 
without having to bounce the active Namenode just for this purpose.
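
As a rough illustration of the idea (not the Namenode implementation), the 
sketch below keeps the report size in a field that can be changed at runtime, 
so the next report picks up the new limit without a restart. The class, its 
methods and the sample node names are all hypothetical; only the notion of a 
runtime-adjustable report size comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SlowPeerReportSizeSketch {
  // Hypothetical stand-in for the configured maximum number of reported nodes.
  private volatile int maxNodesToReport;

  public SlowPeerReportSizeSketch(int initialMax) {
    this.maxNodesToReport = initialMax;
  }

  // Applies a new limit at runtime; the next report() call uses it.
  public void reconfigure(int newMax) {
    if (newMax < 1) {
      throw new IllegalArgumentException("report size must be positive");
    }
    this.maxNodesToReport = newMax;
  }

  // Returns node names ordered by descending latency, capped at the limit.
  public List<String> report(Map<String, Double> latencyByNode) {
    int limit = maxNodesToReport; // read once so a single report is consistent
    List<Map.Entry<String, Double>> entries =
        new ArrayList<>(latencyByNode.entrySet());
    entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Double> entry : entries) {
      if (result.size() >= limit) {
        break;
      }
      result.add(entry.getKey());
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Double> latencies = new TreeMap<>();
    latencies.put("dn1", 10.0);
    latencies.put("dn2", 30.0);
    latencies.put("dn3", 20.0);

    SlowPeerReportSizeSketch sketch = new SlowPeerReportSizeSketch(2);
    System.out.println(sketch.report(latencies));
    sketch.reconfigure(3); // no restart needed in this sketch
    System.out.println(sketch.report(latencies));
  }
}
```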






[jira] [Assigned] (HDFS-15982) Deleted data using HTTP API should be saved to the trash

2022-06-14 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned HDFS-15982:
---

Assignee: (was: Viraj Jasani)

> Deleted data using HTTP API should be saved to the trash
> 
>
> Key: HDFS-15982
> URL: https://issues.apache.org/jira/browse/HDFS-15982
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs, hdfs-client, httpfs, webhdfs
>Reporter: Bhavik Patel
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot 2021-04-23 at 4.19.42 PM.png, Screenshot 
> 2021-04-23 at 4.36.57 PM.png
>
>  Time Spent: 13h 20m
>  Remaining Estimate: 0h
>
> If we delete data from the Web UI, it should first be moved to the 
> configured/default Trash directory and removed only after the trash interval 
> has elapsed. Currently, data is removed from the system directly. [This 
> behavior should be the same as the CLI cmd.]
> This can be helpful when the user accidentally deletes data from the Web UI.
> Similarly, we should provide a "Skip Trash" option in the HTTP API as well, 
> which should be accessible through the Web UI.






[jira] [Updated] (HDFS-16618) sync_file_range error should include more volume and file info

2022-06-04 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16618:

Priority: Minor  (was: Major)

> sync_file_range error should include more volume and file info
> --
>
> Key: HDFS-16618
> URL: https://issues.apache.org/jira/browse/HDFS-16618
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Minor
>
> Having seen multiple sync_file_range errors recently with Bad file 
> descriptor, it would be good to include more volume stats as well as file 
> offset/length info with the error log to get some more insights.






[jira] [Created] (HDFS-16618) sync_file_range error should include more volume and file info

2022-06-04 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16618:
---

 Summary: sync_file_range error should include more volume and file 
info
 Key: HDFS-16618
 URL: https://issues.apache.org/jira/browse/HDFS-16618
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Having seen multiple sync_file_range errors recently with Bad file descriptor, 
it would be good to include more volume stats as well as file offset/length 
info with the error log to get some more insights.






[jira] [Updated] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits

2022-06-04 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16595:

Release Note: Namenode metrics that represent the Slownode JSON now include 
three important factors (median, median absolute deviation, upper latency 
limit) that can help users determine how urgently a given slownode requires 
manual intervention.

> Slow peer metrics - add median, mad and upper latency limits
> 
>
> Key: HDFS-16595
> URL: https://issues.apache.org/jira/browse/HDFS-16595
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Slow datanode metrics include the slow node and its reporting node details. With 
> HDFS-16582, we added the aggregate latency that is perceived by the reporting 
> nodes.
> In order to get more insights into how the outlier slownode's latencies 
> differ from the rest of the nodes, we should also expose median, median 
> absolute deviation and the calculated upper latency limit details.






[jira] [Created] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits

2022-05-25 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16595:
---

 Summary: Slow peer metrics - add median, mad and upper latency 
limits
 Key: HDFS-16595
 URL: https://issues.apache.org/jira/browse/HDFS-16595
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Slow datanode metrics include the slow node and its reporting node details. With 
HDFS-16582, we added the aggregate latency that is perceived by the reporting 
nodes.

In order to get more insights into how the outlier slownode's latencies differ 
from the rest of the nodes, we should also expose median, median absolute 
deviation and the calculated upper latency limit details.
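
To make the three quantities concrete, here is a small self-contained sketch. 
It is not the HDFS outlier detection code: it uses the plain, unscaled median 
absolute deviation, and the multiplier of 3 and the sample latencies are 
assumptions chosen only for illustration.

```java
import java.util.Arrays;

public class SlowPeerStatsSketch {
  // Median of the values; the input array is not modified.
  public static double median(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    return (n % 2 == 1)
        ? sorted[n / 2]
        : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  }

  // Median absolute deviation: median of |x - median(values)|.
  public static double mad(double[] values) {
    double med = median(values);
    double[] deviations = new double[values.length];
    for (int i = 0; i < values.length; i++) {
      deviations[i] = Math.abs(values[i] - med);
    }
    return median(deviations);
  }

  // Hypothetical upper latency limit: median + multiplier * MAD.
  public static double upperLimit(double[] values, double multiplier) {
    return median(values) + multiplier * mad(values);
  }

  public static void main(String[] args) {
    // Hypothetical per-node latencies; the last node is the outlier.
    double[] latencies = {1.0, 2.0, 2.0, 3.0, 50.0};
    double limit = upperLimit(latencies, 3.0);
    System.out.println("median = " + median(latencies));
    System.out.println("mad = " + mad(latencies));
    System.out.println("upper limit = " + limit);
    for (double latency : latencies) {
      if (latency > limit) {
        System.out.println("outlier latency: " + latency);
      }
    }
  }
}
```

With these sample values the median is 2.0, the MAD is 1.0 and the limit is 
5.0, so only the node at 50.0 exceeds it. Seeing the three numbers next to a 
node's own latency is what lets an operator judge how far off that node is.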






[jira] [Commented] (HDFS-16582) Expose aggregate latency of slow node as perceived by the reporting node

2022-05-17 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538569#comment-17538569
 ] 

Viraj Jasani commented on HDFS-16582:
-

FYI [~stack] 

> Expose aggregate latency of slow node as perceived by the reporting node
> 
>
> Key: HDFS-16582
> URL: https://issues.apache.org/jira/browse/HDFS-16582
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> When a datanode is reported as slow by another node, we expose the slow node 
> as well as the list of nodes reporting it. However, we don't provide the 
> latency of the slow node as reported by the reporting node. Exposing that 
> latency in the metrics would help operators keep track of how far a given 
> slow node lags behind the rest of the nodes in the cluster.
> The operator should be able to gather aggregated latencies of all slow nodes 
> with their reporting nodes in Namenode metrics.






[jira] [Created] (HDFS-16582) Expose aggregate latency of slow node as perceived by the reporting node

2022-05-17 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16582:
---

 Summary: Expose aggregate latency of slow node as perceived by the 
reporting node
 Key: HDFS-16582
 URL: https://issues.apache.org/jira/browse/HDFS-16582
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani


When a datanode is reported as slow by another node, we expose the slow node 
as well as the list of nodes reporting it. However, we don't provide the 
latency of the slow node as reported by the reporting node. Exposing that 
latency in the metrics would help operators keep track of how far a given slow 
node lags behind the rest of the nodes in the cluster.

The operator should be able to gather aggregated latencies of all slow nodes 
with their reporting nodes in Namenode metrics.






[jira] [Updated] (HDFS-16568) dfsadmin -reconfig option to start/query reconfig on all live datanodes

2022-05-05 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16568:

Target Version/s: 3.4.0, 3.3.4  (was: 3.4.0)

> dfsadmin -reconfig option to start/query reconfig on all live datanodes
> ---
>
> Key: HDFS-16568
> URL: https://issues.apache.org/jira/browse/HDFS-16568
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> DFSAdmin provides an option to initiate or query the status of a 
> reconfiguration operation only on a specific host, based on the host:port 
> provided by the user. It would be good to provide the ability to initiate 
> such operations in bulk, on all live datanodes.






[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes

2022-05-05 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16521:

Target Version/s: 3.4.0, 3.3.4  (was: 3.4.0)

> DFS API to retrieve slow datanodes
> --
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Providing a DFS API to retrieve slow nodes would make it possible to add an 
> option to "dfsadmin -report" that lists slow datanode info for operators to 
> review, a filter that is especially useful on larger clusters.
> The other purpose of such an API is to let HDFS downstreamers without direct 
> access to the namenode http port (only the rpc port is accessible) retrieve 
> slownodes.
> Moreover, 
> [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
>  in HBase currently has to rely on its own way of marking and excluding slow 
> nodes while 1) creating pipelines and 2) handling acks, based on factors like 
> the data length of the packet, processing time with last ack timestamp, 
> whether flush to replicas is finished, etc. If it can use a slownode API 
> from HDFS to exclude nodes appropriately while writing a block, much of its 
> own post-ack computation of slow nodes can be _saved_ or _improved_, or, 
> based on further experiments, we could find a _better solution_ for managing 
> slow node detection logic in both HDFS and HBase. However, in order to 
> collect more data points and run more POCs in this area, HDFS should provide 
> an API that lets downstreamers efficiently use slownode info for such 
> critical low-latency use cases (like writing WALs).






[jira] [Created] (HDFS-16568) dfsadmin -reconfig option to start/query reconfig on all live datanodes

2022-05-04 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16568:
---

 Summary: dfsadmin -reconfig option to start/query reconfig on all 
live datanodes
 Key: HDFS-16568
 URL: https://issues.apache.org/jira/browse/HDFS-16568
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani


DFSAdmin provides an option to initiate or query the status of a 
reconfiguration operation only on a specific host, based on the host:port 
provided by the user. It would be good to provide the ability to initiate such 
operations in bulk, on all live datanodes.






[jira] [Updated] (HDFS-16528) Reconfigure slow peer enable for Namenode

2022-05-01 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16528:

Fix Version/s: 3.3.4
   (was: 3.3.0)

> Reconfigure slow peer enable for Namenode
> -
>
> Key: HDFS-16528
> URL: https://issues.apache.org/jira/browse/HDFS-16528
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> HDFS-16396 provides reconfig options for several configs associated with 
> slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 added some 
> slownode-related configs as reconfig options in Namenode.
> The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as a 
> reconfigurable option for Namenode (similar to how HDFS-16396 included it 
> for Datanode).






[jira] [Commented] (HDFS-16528) Reconfigure slow peer enable for Namenode

2022-05-01 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530574#comment-17530574
 ] 

Viraj Jasani commented on HDFS-16528:
-

Thank you for the review [~tomscut] !

> Reconfigure slow peer enable for Namenode
> -
>
> Key: HDFS-16528
> URL: https://issues.apache.org/jira/browse/HDFS-16528
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> HDFS-16396 provides reconfig options for several configs associated with 
> slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 added some 
> slownode-related configs as reconfig options in Namenode.
> The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as a 
> reconfigurable option for Namenode (similar to how HDFS-16396 included it 
> for Datanode).






[jira] [Work started] (HDFS-16528) Reconfigure slow peer enable for Namenode

2022-04-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16528 started by Viraj Jasani.
---
> Reconfigure slow peer enable for Namenode
> -
>
> Key: HDFS-16528
> URL: https://issues.apache.org/jira/browse/HDFS-16528
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> HDFS-16396 provides reconfig options for several configs associated with 
> slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 added some 
> slownode-related configs as reconfig options in Namenode.
> The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as a 
> reconfigurable option for Namenode (similar to how HDFS-16396 included it 
> for Datanode).






[jira] [Updated] (HDFS-16528) Reconfigure slow peer enable for Namenode

2022-04-26 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16528:

Status: Patch Available  (was: In Progress)

> Reconfigure slow peer enable for Namenode
> -
>
> Key: HDFS-16528
> URL: https://issues.apache.org/jira/browse/HDFS-16528
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> HDFS-16396 provides reconfig options for several configs associated with 
> slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 have added some 
> slownodes related configs as the reconfig options in Namenode.
> The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as 
> reconfigurable option for Namenode (similar to how HDFS-16396 has included it 
> for Datanode).






[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes

2022-04-10 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16521:

Description: 
Providing DFS API to retrieve slow nodes would help add an additional option to 
"dfsadmin -report" that lists slow datanodes info for operators to take a look, 
specifically useful filter for larger clusters.

The other purpose of such API is for HDFS downstreamers without direct access 
to namenode http port (only rpc port accessible) to retrieve slownodes.

Moreover, 
[FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
 in HBase currently has to rely on its own way of marking and excluding slow 
nodes while 1) creating pipelines and 2) handling ack, based on factors like 
the data length of the packet, processing time with last ack timestamp, whether 
flush to replicas is finished etc. If it can utilize slownode API from HDFS to 
exclude nodes appropriately while writing block, a lot of its own post-ack 
computation of slow nodes can be _saved_ or _improved_ or based on further 
experiment, we could find _better solution_ to manage slow node detection logic 
both in HDFS and HBase. However, in order to collect more data points and run 
more POC around this area, HDFS should provide API for downstreamers to 
efficiently utilize slownode info for such critical low-latency use-case (like 
writing WALs).

  was:
Providing DFS API to retrieve slow nodes would help add an additional option to 
"dfsadmin -report" that lists slow datanodes info for operators to take a look, 
specifically useful filter for larger clusters.

The other purpose of such API is for HDFS downstreamers without direct access 
to namenode http port (only rpc port accessible) to retrieve slownodes.

Moreover, 
[FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
 in HBase currently has to rely on its own way of marking and excluding slow 
nodes while 1) creating pipelines and 2) handling ack, based on factors like 
the data length of the packet, processing time with last ack timestamp, whether 
flush to replicas is finished etc. If it can utilize slownode API from HDFS to 
exclude nodes appropriately while writing block, a lot of its own post-ack 
computation of slow nodes can be _saved_ or _improved_ or based on further 
experiment, we could find _better solution_ to manage slow node detection logic 
both in HDFS and HBase. However, in order to collect more data points and run 
more POC around this area, at least we should expect HDFS to provide API for 
downstreamers to efficiently utilize slownode info for such critical 
low-latency use-case (like writing WALs).


> DFS API to retrieve slow datanodes
> --
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Providing DFS API to retrieve slow nodes would help add an additional option 
> to "dfsadmin -report" that lists slow datanodes info for operators to take a 
> look, specifically useful filter for larger clusters.
> The other purpose of such API is for HDFS downstreamers without direct access 
> to namenode http port (only rpc port accessible) to retrieve slownodes.
> Moreover, 
> [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
>  in HBase currently has to rely on its own way of marking and excluding slow 
> nodes while 1) creating pipelines and 2) handling ack, based on factors like 
> the data length of the packet, processing time with last ack timestamp, 
> whether flush to replicas is finished etc. If it can utilize slownode API 
> from HDFS to exclude nodes appropriately while writing block, a lot of its 
> own post-ack computation of slow nodes can be _saved_ or _improved_ or based 
> on further experiment, we could find _better solution_ to manage slow node 
> detection logic both in HDFS and HBase. However, in order to collect more 
> data points and run more POC around this area, HDFS should provide API for 
> downstreamers to efficiently utilize slownode info for such critical 
> low-latency use-case (like writing WALs).
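
As a rough illustration of the downstream use case described above, the following Java sketch shows how a writer might filter pipeline candidates against a list of slow nodes. The `SlowNodeSource` interface, the `chooseTargets` method, and the host names are hypothetical stand-ins for illustration only; they are not the HDFS or HBase API, and the real call is whatever the patch on this issue defines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an RPC-backed call that returns the slow datanodes.
interface SlowNodeSource {
  Set<String> fetchSlowNodes();
}

final class SlowNodeFilterSketch {
  // Returns the candidates that are not reported as slow, preserving order.
  // If filtering would leave fewer than minTargets nodes, the original list
  // is returned, so a write is never blocked by the slow-node report alone.
  static List<String> chooseTargets(List<String> candidates,
      SlowNodeSource source, int minTargets) {
    Set<String> slow = source.fetchSlowNodes();
    List<String> healthy = new ArrayList<>();
    for (String node : candidates) {
      if (!slow.contains(node)) {
        healthy.add(node);
      }
    }
    return healthy.size() >= minTargets ? healthy : candidates;
  }

  public static void main(String[] args) {
    // Assumed example data: dn2 is the only node reported as slow.
    SlowNodeSource source = () -> new HashSet<>(Arrays.asList("dn2:9866"));
    List<String> candidates = Arrays.asList("dn1:9866", "dn2:9866", "dn3:9866");
    System.out.println(chooseTargets(candidates, source, 2));
  }
}
```

Falling back to the unfiltered list when too few nodes remain is one possible design choice; a real client might instead retry or widen the candidate set.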




[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes

2022-04-10 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16521:

Description: 
Providing DFS API to retrieve slow nodes would help add an additional option to 
"dfsadmin -report" that lists slow datanodes info for operators to take a look, 
specifically useful filter for larger clusters.

The other purpose of such API is for HDFS downstreamers without direct access 
to namenode http port (only rpc port accessible) to retrieve slownodes.

Moreover, 
[FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
 in HBase currently has to rely on its own way of marking and excluding slow 
nodes while 1) creating pipelines and 2) handling ack, based on factors like 
the data length of the packet, processing time with last ack timestamp, whether 
flush to replicas is finished etc. If it can utilize slownode API from HDFS to 
exclude nodes appropriately while writing block, a lot of its own post-ack 
computation of slow nodes can be _saved_ or _improved_ or based on further 
experiment, we could find _better solution_ to manage slow node detection logic 
both in HDFS and HBase. However, in order to collect more data points and run 
more POC around this area, at least we should expect HDFS to provide API for 
downstreamers to efficiently utilize slownode info for such critical 
low-latency use-case (like writing WALs).

  was:
Providing DFS API to retrieve slow nodes would help add an additional option to 
"dfsadmin -report" that lists slow datanodes info for operators to take a look, 
specifically useful filter for larger clusters.

The other purpose of such API is for HDFS downstreamers without direct access 
to namenode http port (only rpc port accessible) to retrieve slownodes.


> DFS API to retrieve slow datanodes
> --
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Providing DFS API to retrieve slow nodes would help add an additional option 
> to "dfsadmin -report" that lists slow datanodes info for operators to take a 
> look, specifically useful filter for larger clusters.
> The other purpose of such API is for HDFS downstreamers without direct access 
> to namenode http port (only rpc port accessible) to retrieve slownodes.
> Moreover, 
> [FanOutOneBlockAsyncDFSOutput|https://github.com/apache/hbase/blob/master/hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutput.java]
>  in HBase currently has to rely on its own way of marking and excluding slow 
> nodes while 1) creating pipelines and 2) handling ack, based on factors like 
> the data length of the packet, processing time with last ack timestamp, 
> whether flush to replicas is finished etc. If it can utilize slownode API 
> from HDFS to exclude nodes appropriately while writing block, a lot of its 
> own post-ack computation of slow nodes can be _saved_ or _improved_ or based 
> on further experiment, we could find _better solution_ to manage slow node 
> detection logic both in HDFS and HBase. However, in order to collect more 
> data points and run more POC around this area, at least we should expect HDFS 
> to provide API for downstreamers to efficiently utilize slownode info for 
> such critical low-latency use-case (like writing WALs).






[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes

2022-04-08 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16521:

Description: 
Providing DFS API to retrieve slow nodes would help add an additional option to 
"dfsadmin -report" that lists slow datanodes info for operators to take a look, 
specifically useful filter for larger clusters.

The other purpose of such API is for HDFS downstreamers without direct access 
to namenode http port (only rpc port accessible) to retrieve slownodes.

  was:
In order to build some automation around slow datanodes that regularly show up 
in the slow peer tracking report, e.g. decommission such nodes and queue them 
up for external processing and add them back later to the cluster after fixing 
issues etc, we should expose DFS API to retrieve all slow nodes at a given time.

Providing such API would also help add an additional option to "dfsadmin 
-report" that lists slow datanodes info for operators to take a look, 
specifically useful filter for larger clusters.


> DFS API to retrieve slow datanodes
> --
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Providing DFS API to retrieve slow nodes would help add an additional option 
> to "dfsadmin -report" that lists slow datanodes info for operators to take a 
> look, specifically useful filter for larger clusters.
> The other purpose of such API is for HDFS downstreamers without direct access 
> to namenode http port (only rpc port accessible) to retrieve slownodes.






[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes

2022-04-08 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16521:

Target Version/s: 3.4.0

> DFS API to retrieve slow datanodes
> --
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> In order to build some automation around slow datanodes that regularly show 
> up in the slow peer tracking report, e.g. decommission such nodes and queue 
> them up for external processing and add them back later to the cluster after 
> fixing issues etc, we should expose DFS API to retrieve all slow nodes at a 
> given time.
> Providing such API would also help add an additional option to "dfsadmin 
> -report" that lists slow datanodes info for operators to take a look, 
> specifically useful filter for larger clusters.






[jira] [Commented] (HDFS-16481) Provide support to set Http and Rpc ports in MiniJournalCluster

2022-04-07 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519332#comment-17519332
 ] 

Viraj Jasani commented on HDFS-16481:
-

Thanks [~aajisaka]!

> Provide support to set Http and Rpc ports in MiniJournalCluster
> ---
>
> Key: HDFS-16481
> URL: https://issues.apache.org/jira/browse/HDFS-16481
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> We should provide support for clients to set Http and Rpc ports of 
> JournalNodes in MiniJournalCluster.






[jira] [Created] (HDFS-16528) Reconfigure slow peer enable for Namenode

2022-04-01 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16528:
---

 Summary: Reconfigure slow peer enable for Namenode
 Key: HDFS-16528
 URL: https://issues.apache.org/jira/browse/HDFS-16528
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


HDFS-16396 provides reconfig options for several configs associated with 
slownodes in Datanode. Similarly, HDFS-16287 and HDFS-16327 have added some 
slownodes related configs as the reconfig options in Namenode.

The purpose of this Jira is to add DFS_DATANODE_PEER_STATS_ENABLED_KEY as 
reconfigurable option for Namenode (similar to how HDFS-16396 has included it 
for Datanode).
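
To make the idea of a reconfigurable option concrete, here is a minimal, self-contained Java sketch of a flag that can be changed at runtime without a restart. The class `PeerStatsReconfigSketch` and its methods are hypothetical and do not reproduce Hadoop's reconfiguration framework; the property string is written out on the assumption that `DFS_DATANODE_PEER_STATS_ENABLED_KEY` maps to `dfs.datanode.peer.stats.enabled`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical holder for a flag that can be flipped at runtime without a restart.
final class PeerStatsReconfigSketch {
  // Assumed property name behind DFS_DATANODE_PEER_STATS_ENABLED_KEY.
  static final String PEER_STATS_ENABLED_KEY = "dfs.datanode.peer.stats.enabled";

  private final AtomicBoolean peerStatsEnabled;
  private final boolean defaultValue;

  PeerStatsReconfigSketch(boolean defaultValue) {
    this.defaultValue = defaultValue;
    this.peerStatsEnabled = new AtomicBoolean(defaultValue);
  }

  // Applies a new value for the property and returns the effective value.
  // A null newValue reverts the flag to its default.
  String reconfigure(String property, String newValue) {
    if (!PEER_STATS_ENABLED_KEY.equals(property)) {
      throw new IllegalArgumentException("Property is not reconfigurable: " + property);
    }
    boolean effective;
    if (newValue == null) {
      effective = defaultValue;
    } else if ("true".equalsIgnoreCase(newValue) || "false".equalsIgnoreCase(newValue)) {
      effective = Boolean.parseBoolean(newValue);
    } else {
      throw new IllegalArgumentException("Not a boolean: " + newValue);
    }
    peerStatsEnabled.set(effective);
    return Boolean.toString(effective);
  }

  boolean isPeerStatsEnabled() {
    return peerStatsEnabled.get();
  }
}
```

Rejecting unknown properties and malformed values up front keeps a bad reconfiguration request from silently changing behavior on a live process.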






[jira] [Updated] (HDFS-16522) Set Http and Ipc ports for Datanodes in MiniDFSCluster

2022-03-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16522:

Status: Patch Available  (was: In Progress)

> Set Http and Ipc ports for Datanodes in MiniDFSCluster
> --
>
> Key: HDFS-16522
> URL: https://issues.apache.org/jira/browse/HDFS-16522
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We should provide options to set Http and Ipc ports for Datanodes in 
> MiniDFSCluster.






[jira] [Work started] (HDFS-16521) DFS API to retrieve slow datanodes

2022-03-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16521 started by Viraj Jasani.
---
> DFS API to retrieve slow datanodes
> --
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In order to build some automation around slow datanodes that regularly show 
> up in the slow peer tracking report, e.g. decommission such nodes and queue 
> them up for external processing and add them back later to the cluster after 
> fixing issues etc, we should expose DFS API to retrieve all slow nodes at a 
> given time.
> Providing such API would also help add an additional option to "dfsadmin 
> -report" that lists slow datanodes info for operators to take a look, 
> specifically useful filter for larger clusters.






[jira] [Updated] (HDFS-16521) DFS API to retrieve slow datanodes

2022-03-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16521:

Status: Patch Available  (was: In Progress)

> DFS API to retrieve slow datanodes
> --
>
> Key: HDFS-16521
> URL: https://issues.apache.org/jira/browse/HDFS-16521
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In order to build some automation around slow datanodes that regularly show 
> up in the slow peer tracking report, e.g. decommission such nodes and queue 
> them up for external processing and add them back later to the cluster after 
> fixing issues etc, we should expose DFS API to retrieve all slow nodes at a 
> given time.
> Providing such API would also help add an additional option to "dfsadmin 
> -report" that lists slow datanodes info for operators to take a look, 
> specifically useful filter for larger clusters.






[jira] [Work started] (HDFS-16522) Set Http and Ipc ports for Datanodes in MiniDFSCluster

2022-03-27 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16522 started by Viraj Jasani.
---
> Set Http and Ipc ports for Datanodes in MiniDFSCluster
> --
>
> Key: HDFS-16522
> URL: https://issues.apache.org/jira/browse/HDFS-16522
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We should provide options to set Http and Ipc ports for Datanodes in 
> MiniDFSCluster.






[jira] [Created] (HDFS-16522) Set Http and Ipc ports for Datanodes in MiniDFSCluster

2022-03-25 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16522:
---

 Summary: Set Http and Ipc ports for Datanodes in MiniDFSCluster
 Key: HDFS-16522
 URL: https://issues.apache.org/jira/browse/HDFS-16522
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Viraj Jasani
Assignee: Viraj Jasani


We should provide options to set Http and Ipc ports for Datanodes in 
MiniDFSCluster.






[jira] [Created] (HDFS-16521) DFS API to retrieve slow datanodes

2022-03-25 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16521:
---

 Summary: DFS API to retrieve slow datanodes
 Key: HDFS-16521
 URL: https://issues.apache.org/jira/browse/HDFS-16521
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani


In order to build some automation around slow datanodes that regularly show up 
in the slow peer tracking report, e.g. decommission such nodes and queue them 
up for external processing and add them back later to the cluster after fixing 
issues etc, we should expose DFS API to retrieve all slow nodes at a given time.

Providing such API would also help add an additional option to "dfsadmin 
-report" that lists slow datanodes info for operators to take a look, 
specifically useful filter for larger clusters.






[jira] [Updated] (HDFS-16502) Reconfigure Block Invalidate limit

2022-03-12 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16502:

Description: 
Based on the cluster load, it would be helpful to consider tuning block 
invalidate limit (dfs.block.invalidate.limit). The only way we can do this 
without restarting Namenode as of today is by reconfiguring heartbeat interval, 
since the limit is derived as
{code:java}
Math.max(heartbeatInt*20, blockInvalidateLimit){code}
This logic is not straightforward and operators are usually not aware of it 
(lack of documentation). Also, updating heartbeat interval is not desired in 
all cases.

We should provide the ability to alter block invalidation limit without 
affecting heartbeat interval on the live cluster to adjust some load at 
Datanode level.

We should also take this opportunity to keep (heartbeatInterval * 20) 
computation logic in a common method.

  was:
Based on the cluster load, it would be helpful to consider tuning block 
invalidate limit (dfs.block.invalidate.limit). The only way we can do this 
without restarting Namenode as of today is by reconfiguring heartbeat interval, 
since the limit is derived as
{code:java}
Math.max(heartbeatInt*20, blockInvalidateLimit){code}
This logic is not straightforward and operators are usually not aware of it 
(lack of documentation). Also, updating heartbeat interval is not desired in 
all cases.

We should provide the ability to alter block invalidation limit without 
affecting heartbeat interval on the live cluster to adjust some load at 
Datanode level.


> Reconfigure Block Invalidate limit
> --
>
> Key: HDFS-16502
> URL: https://issues.apache.org/jira/browse/HDFS-16502
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> Based on the cluster load, it would be helpful to consider tuning block 
> invalidate limit (dfs.block.invalidate.limit). The only way we can do this 
> without restarting Namenode as of today is by reconfiguring heartbeat 
> interval, since the limit is derived as
> {code:java}
> Math.max(heartbeatInt*20, blockInvalidateLimit){code}
> This logic is not straightforward and operators are usually not aware of it 
> (lack of documentation). Also, updating heartbeat interval is not desired in 
> all cases.
> We should provide the ability to alter block invalidation limit without 
> affecting heartbeat interval on the live cluster to adjust some load at 
> Datanode level.
> We should also take this opportunity to keep (heartbeatInterval * 20) 
> computation logic in a common method.
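
The interaction described above can be sketched in a few lines of Java. This is only an illustration of the `Math.max(heartbeatInt*20, blockInvalidateLimit)` expression quoted in the description, with the (heartbeatInterval * 20) part pulled into a common method as the issue proposes; the class and method names are hypothetical and are not taken from the Hadoop code base.

```java
final class BlockInvalidateLimitSketch {
  // Hypothetical common helper for the (heartbeatInterval * 20) floor.
  static int heartbeatBasedFloor(long heartbeatIntervalSeconds) {
    return (int) Math.min(Integer.MAX_VALUE, heartbeatIntervalSeconds * 20L);
  }

  // Effective limit: the larger of the heartbeat-based floor and the
  // configured value (dfs.block.invalidate.limit in the description).
  static int effectiveLimit(long heartbeatIntervalSeconds, int configuredLimit) {
    return Math.max(heartbeatBasedFloor(heartbeatIntervalSeconds), configuredLimit);
  }

  public static void main(String[] args) {
    // With a 3s heartbeat the floor is 60, so a configured 1000 wins.
    System.out.println(effectiveLimit(3, 1000));
    // With a 60s heartbeat the floor is 1200, which overrides 1000.
    System.out.println(effectiveLimit(60, 1000));
  }
}
```

The second call shows why the coupling surprises operators: changing only the heartbeat interval silently raises the invalidation limit.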





