[jira] [Updated] (HDFS-16923) The getListing RPC will throw NPE if the path does not exist

2023-03-03 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16923:
---
Fix Version/s: 3.3.5

> The getListing RPC will throw NPE if the path does not exist
> 
>
> Key: HDFS-16923
> URL: https://issues.apache.org/jira/browse/HDFS-16923
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5, 3.3.6
>
>
> The getListing RPC throws an NPE if the path does not exist. The stack 
> trace is as follows:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4195)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:1421)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:783)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
>  {code}
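The trace above is consistent with a lookup that returns null for a missing path and is dereferenced afterwards. A minimal sketch of the usual guard follows. It assumes hypothetical names (ListingGuard, lookup, addDirectory) and a map-backed namespace; it is not the FSNamesystem code or the actual patch.

```java
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a namespace; not the real FSNamesystem.
final class ListingGuard {
  private final Map<String, List<String>> dirs = new HashMap<>();

  void addDirectory(String path, String... children) {
    dirs.put(path, Arrays.asList(children));
  }

  // Returns null when the path is unknown, like a lookup that finds no inode.
  private List<String> lookup(String path) {
    return dirs.get(path);
  }

  // Turn the null into a checked exception before anything dereferences it.
  List<String> getListing(String path) throws FileNotFoundException {
    List<String> listing = lookup(path);
    if (listing == null) {
      throw new FileNotFoundException("Path does not exist: " + path);
    }
    return listing;
  }

  public static void main(String[] args) throws Exception {
    ListingGuard ns = new ListingGuard();
    ns.addDirectory("/data", "a", "b");
    System.out.println(ns.getListing("/data"));
    try {
      ns.getListing("/missing");
    } catch (FileNotFoundException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With a guard of this kind the client would see a FileNotFoundException wrapped in the RemoteException rather than a bare NullPointerException.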



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16923) The getListing RPC will throw NPE if the path does not exist

2023-03-03 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16923:
---
Fix Version/s: (was: 3.3.6)

> The getListing RPC will throw NPE if the path does not exist
> 
>
> Key: HDFS-16923
> URL: https://issues.apache.org/jira/browse/HDFS-16923
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> The getListing RPC throws an NPE if the path does not exist. The stack 
> trace is as follows:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4195)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:1421)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:783)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
>  {code}






[jira] [Updated] (HDFS-16923) The getListing RPC will throw NPE if the path does not exist

2023-03-01 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16923:
---
Fix Version/s: 3.4.0
   3.3.6

> The getListing RPC will throw NPE if the path does not exist
> 
>
> Key: HDFS-16923
> URL: https://issues.apache.org/jira/browse/HDFS-16923
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
>
> The getListing RPC throws an NPE if the path does not exist. The stack 
> trace is as follows:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4195)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:1421)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:783)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
>  {code}






[jira] [Resolved] (HDFS-16923) The getListing RPC will throw NPE if the path does not exist

2023-03-01 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16923.

Resolution: Fixed

> The getListing RPC will throw NPE if the path does not exist
> 
>
> Key: HDFS-16923
> URL: https://issues.apache.org/jira/browse/HDFS-16923
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
>
> The getListing RPC throws an NPE if the path does not exist. The stack 
> trace is as follows:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4195)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:1421)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:783)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
>  {code}






[jira] [Updated] (HDFS-16923) The getListing RPC will throw NPE if the path does not exist

2023-03-01 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16923:
---
Priority: Critical  (was: Major)

> The getListing RPC will throw NPE if the path does not exist
> 
>
> Key: HDFS-16923
> URL: https://issues.apache.org/jira/browse/HDFS-16923
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
>
> The getListing RPC throws an NPE if the path does not exist. The stack 
> trace is as follows:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): 
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4195)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:1421)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:783)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
>  {code}






[jira] [Resolved] (HDFS-16764) ObserverNamenode handles addBlock rpc and throws a FileNotFoundException

2023-01-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16764.

Resolution: Fixed

> ObserverNamenode handles addBlock rpc and throws a FileNotFoundException 
> -
>
> Key: HDFS-16764
> URL: https://issues.apache.org/jira/browse/HDFS-16764
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> ObserverNameNode can currently handle the addBlock RPC, but it may 
> throw a FileNotFoundException when its txid is stale.
>  * AddBlock is not a coordinated method, so the Observer will not check the 
> stateId.
>  * AddBlock does the validation with checkOperation(OperationCategory.READ).
> So the Observer can handle the addBlock RPC. If the Observer has not yet 
> replayed the edit that created the file, it will throw a 
> FileNotFoundException during validation.
> The related code is as follows:
> {code:java}
> checkOperation(OperationCategory.READ);
> final FSPermissionChecker pc = getPermissionChecker();
> FSPermissionChecker.setOperationType(operationName);
> readLock();
> try {
>   checkOperation(OperationCategory.READ);
>   r = FSDirWriteFileOp.validateAddBlock(this, pc, src, fileId, clientName,
> previous, onRetryBlock);
> } finally {
>   readUnlock(operationName);
> } {code}
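To illustrate why the READ category matters in the snippet above, here is a simplified sketch of a state check. The names ObserverCheckSketch, HAState and isAllowed are hypothetical, and the rules are a deliberate simplification (for instance, a real Standby has its own rules for reads); it is not the NameNode's implementation.

```java
// Simplified, hypothetical model of HA state checks; not the NameNode's real logic.
final class ObserverCheckSketch {
  enum OperationCategory { READ, WRITE }
  enum HAState { ACTIVE, STANDBY, OBSERVER }

  // An Observer accepts READ operations only; the Active accepts everything.
  static boolean isAllowed(HAState state, OperationCategory op) {
    switch (state) {
      case ACTIVE:
        return true;
      case OBSERVER:
        return op == OperationCategory.READ;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    // A check tagged READ passes on an Observer even if the RPC is logically a write.
    System.out.println(isAllowed(HAState.OBSERVER, OperationCategory.READ));
    System.out.println(isAllowed(HAState.OBSERVER, OperationCategory.WRITE));
  }
}
```

Under this model, any validation guarded only by a READ check can run on an Observer whose namespace does not yet contain the file being written.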






[jira] [Updated] (HDFS-16764) ObserverNamenode handles addBlock rpc and throws a FileNotFoundException

2023-01-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16764:
---
Fix Version/s: 3.3.9
   (was: 3.3.6)

> ObserverNamenode handles addBlock rpc and throws a FileNotFoundException 
> -
>
> Key: HDFS-16764
> URL: https://issues.apache.org/jira/browse/HDFS-16764
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> ObserverNameNode can currently handle the addBlock RPC, but it may 
> throw a FileNotFoundException when its txid is stale.
>  * AddBlock is not a coordinated method, so the Observer will not check the 
> stateId.
>  * AddBlock does the validation with checkOperation(OperationCategory.READ).
> So the Observer can handle the addBlock RPC. If the Observer has not yet 
> replayed the edit that created the file, it will throw a 
> FileNotFoundException during validation.
> The related code is as follows:
> {code:java}
> checkOperation(OperationCategory.READ);
> final FSPermissionChecker pc = getPermissionChecker();
> FSPermissionChecker.setOperationType(operationName);
> readLock();
> try {
>   checkOperation(OperationCategory.READ);
>   r = FSDirWriteFileOp.validateAddBlock(this, pc, src, fileId, clientName,
> previous, onRetryBlock);
> } finally {
>   readUnlock(operationName);
> } {code}






[jira] [Updated] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members

2023-01-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16872:
---
Fix Version/s: 3.3.9
   (was: 3.3.6)

> Fix log throttling by declaring LogThrottlingHelper as static members
> -
>
> Key: HDFS-16872
> URL: https://issues.apache.org/jira/browse/HDFS-16872
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Chengbing Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.9
>
>
> In our production cluster with Observer NameNode enabled, we have plenty of 
> logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The 
> {{LogThrottlingHelper}} doesn't seem to work.
> {noformat}
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688]' to transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to 
> transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 
> 17686250688], ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 
> 1.0, total load time 0.0 ms
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693]' to transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to 
> transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 
> 17686250693], ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 
> 5.0, total load time 1.0 ms
> {noformat}
> After some digging, I found the cause is that {{LogThrottlingHelper}}'s are 
> declared as instance variables of all the enclosing classes, including 
> {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. 
> Therefore the logging frequency will not be limited across different 
> instances. For classes with only a limited number of instances, such as 
> {{FSImage}}, this is fine. For others whose instances are created frequently, 
> such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will 
> result in plenty of logs.
> This can be fixed by declaring {{LogThrottlingHelper}}'s as static members.
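The difference between an instance-level and a static throttler can be sketched as follows. Throttler here is a hypothetical, minimal rate limiter written for this illustration (one log line per period); it is not Hadoop's LogThrottlingHelper, and the loader classes and the 5000 ms period are assumptions.

```java
// Hypothetical illustration of per-instance vs. static throttling state.
final class ThrottleSketch {
  // Minimal throttler: allows at most one log line per period.
  static final class Throttler {
    private final long periodMs;
    private long lastLogMs = Long.MIN_VALUE; // sentinel: nothing logged yet

    Throttler(long periodMs) {
      this.periodMs = periodMs;
    }

    synchronized boolean shouldLog(long nowMs) {
      if (lastLogMs == Long.MIN_VALUE || nowMs - lastLogMs >= periodMs) {
        lastLogMs = nowMs;
        return true;
      }
      return false;
    }
  }

  // Each short-lived instance gets a fresh throttler, so its first call always logs.
  static final class PerInstanceLoader {
    private final Throttler throttler = new Throttler(5000);

    boolean load(long nowMs) {
      return throttler.shouldLog(nowMs);
    }
  }

  // All instances share one throttler, so the limit holds across instances.
  static final class SharedLoader {
    private static final Throttler THROTTLER = new Throttler(5000);

    boolean load(long nowMs) {
      return THROTTLER.shouldLog(nowMs);
    }
  }

  // Creates one loader per millisecond and counts how many log lines get through.
  static int countPerInstance(int loaders) {
    int logged = 0;
    for (int i = 0; i < loaders; i++) {
      if (new PerInstanceLoader().load(i)) {
        logged++;
      }
    }
    return logged;
  }

  static int countShared(int loaders, long startMs) {
    int logged = 0;
    for (int i = 0; i < loaders; i++) {
      if (new SharedLoader().load(startMs + i)) {
        logged++;
      }
    }
    return logged;
  }

  public static void main(String[] args) {
    System.out.println("per-instance: " + countPerInstance(10));
    System.out.println("shared: " + countShared(10, 0));
  }
}
```

In this sketch ten per-instance loaders each emit their first line, while the shared throttler lets only one line through in the same window.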






[jira] [Resolved] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members

2023-01-10 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16872.

Resolution: Fixed

> Fix log throttling by declaring LogThrottlingHelper as static members
> -
>
> Key: HDFS-16872
> URL: https://issues.apache.org/jira/browse/HDFS-16872
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Chengbing Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.6
>
>
> In our production cluster with Observer NameNode enabled, we have plenty of 
> logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The 
> {{LogThrottlingHelper}} doesn't seem to work.
> {noformat}
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688]' to transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to 
> transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 
> 17686250688], ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 
> 1.0, total load time 0.0 ms
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693]' to transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to 
> transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 
> 17686250693], ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 
> 5.0, total load time 1.0 ms
> {noformat}
> After some digging, I found the cause is that {{LogThrottlingHelper}}'s are 
> declared as instance variables of all the enclosing classes, including 
> {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. 
> Therefore the logging frequency will not be limited across different 
> instances. For classes with only a limited number of instances, such as 
> {{FSImage}}, this is fine. For others whose instances are created frequently, 
> such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will 
> result in plenty of logs.
> This can be fixed by declaring {{LogThrottlingHelper}}'s as static members.






[jira] [Updated] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members

2023-01-10 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16872:
---
Fix Version/s: 3.4.0
   3.2.5
   3.3.6

> Fix log throttling by declaring LogThrottlingHelper as static members
> -
>
> Key: HDFS-16872
> URL: https://issues.apache.org/jira/browse/HDFS-16872
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Chengbing Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.6
>
>
> In our production cluster with Observer NameNode enabled, we have plenty of 
> logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The 
> {{LogThrottlingHelper}} doesn't seem to work.
> {noformat}
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688]' to transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to 
> transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 
> 17686250688], ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 
> 1.0, total load time 0.0 ms
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693]' to transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to 
> transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 
> 17686250693], ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 
> 5.0, total load time 1.0 ms
> {noformat}
> After some digging, I found the cause is that {{LogThrottlingHelper}}'s are 
> declared as instance variables of all the enclosing classes, including 
> {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. 
> Therefore the logging frequency will not be limited across different 
> instances. For classes with only a limited number of instances, such as 
> {{FSImage}}, this is fine. For others whose instances are created frequently, 
> such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will 
> result in plenty of logs.
> This can be fixed by declaring {{LogThrottlingHelper}}'s as static members.






[jira] [Resolved] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer

2022-12-21 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16689.

Resolution: Fixed

> Standby NameNode crashes when transitioning to Active with in-progress tailer
> -
>
> Key: HDFS-16689
> URL: https://issues.apache.org/jira/browse/HDFS-16689
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Standby NameNode crashes when transitioning to Active with an in-progress 
> tailer. The error message looks like this:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X 
> when there is a stream available for read: ByteStringEditLog[X, Y], 
> ByteStringEditLog[X, 0]
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
>   ... 36 more
> {code}
> After tracing, I found a critical bug in 
> *EditLogTailer#catchupDuringFailover()* when 
> *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()* 
> tries to replay all missed edits from the JournalNodes with 
> *onlyDurableTxns=true*, so it may be unable to replay any edits when some 
> JournalNodes are abnormal. 
> To reproduce, suppose:
> - There are 2 NameNodes, NN0 and NN1, in Active and Standby state 
> respectively, and 3 JournalNodes, JN0, JN1 and JN2. 
> - NN0 tries to sync 3 edits to the JNs starting at txid 3, but only 
> successfully syncs them to JN1 and JN2. JN0 is abnormal (for example GC, bad 
> network, or restarted).
> - NN1's lastAppliedTxId is 2, and at this moment we try to fail over 
> from NN0 to NN1. 
> - NN1 gets only two responses, from JN0 and JN1, when it selects 
> input streams with *fromTxnId=3* and *onlyDurableTxns=true*, and the txid 
> counts in the responses are 0 and 3 respectively. JN2 is abnormal (for 
> example GC, bad network, or restarted).
> - NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes 
> because *maxAllowedTxns* is 0.
> So I think the Standby NameNode should call *catchupDuringFailover()* with 
> *onlyDurableTxns=false*, so that it can replay all missed edits from the 
> JournalNodes.
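The arithmetic in the scenario above can be sketched as follows. This is a simplified model under the assumption that "durable" means "present on at least a majority of all JournalNodes", computed from whichever responses arrived; the class and method names are hypothetical and this is not the QuorumJournalManager code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of how many transactions may be replayed from JournalNode responses.
final class DurableTxnSketch {
  /**
   * @param responseCounts  number of txns each responding JournalNode reported
   * @param majoritySize    quorum size, e.g. 2 for 3 JournalNodes
   * @param onlyDurableTxns if true, count only txns known to be on a majority
   */
  static int maxAllowedTxns(List<Integer> responseCounts, int majoritySize,
      boolean onlyDurableTxns) {
    if (responseCounts.size() < majoritySize) {
      throw new IllegalStateException("No quorum of responses");
    }
    List<Integer> sorted = new ArrayList<>(responseCounts);
    Collections.sort(sorted); // ascending
    int highest = sorted.get(sorted.size() - 1);
    // The majoritySize-th largest count is the most that a majority is known to hold.
    int durable = sorted.get(sorted.size() - majoritySize);
    return onlyDurableTxns ? durable : highest;
  }

  public static void main(String[] args) {
    List<Integer> responses = List.of(0, 3); // JN0 reports 0 txns, JN1 reports 3
    System.out.println(maxAllowedTxns(responses, 2, true));
    System.out.println(maxAllowedTxns(responses, 2, false));
  }
}
```

With responses of 0 and 3 and a majority size of 2, the durable-only result is 0, which matches the *maxAllowedTxns* value described above, while the non-durable result is 3.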






[jira] [Commented] (HDFS-16853) The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because HADOOP-18324

2022-12-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650959#comment-17650959
 ] 

Erik Krogen commented on HDFS-16853:


[~omalley]  any thoughts on the PR? 

> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> because HADOOP-18324
> ---
>
> Key: HDFS-16853
> URL: https://issues.apache.org/jira/browse/HDFS-16853
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> with the error message "Waiting for cluster to become active". The blocking 
> jstack is as follows:
> {code:java}
> "BP-1618793397-192.168.3.4-1669198559828 heartbeating to 
> localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x
> 7fc1108fa000 nid=0x19303 waiting on condition [0x700017884000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0007430a9ec0> (a 
> java.util.concurrent.SynchronousQueue$TransferQueue)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695)
>         at 
> java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)
>         at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1482)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>         at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClient
> SideTranslatorPB.java:168)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915)
>         at java.lang.Thread.run(Thread.java:748)  {code}
> After looking into the code, I found that this bug was introduced by 
> HADOOP-18324: RpcRequestSender exited without cleaning up the 
> rpcRequestQueue, which left BPServiceActor blocked while sending a request.
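The blocking behaviour in the jstack follows from how java.util.concurrent.SynchronousQueue works: put() waits until another thread takes the element. The sketch below shows the same hand-off with a timed offer() so that it terminates. HandoffSketch and handOff are hypothetical names, and this is not the Hadoop IPC client code.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: a hand-off to a SynchronousQueue only completes if a consumer is taking.
final class HandoffSketch {
  // Returns true if a consumer took the item within the timeout, false otherwise.
  // An untimed put() would instead block until a consumer arrives.
  static boolean handOff(SynchronousQueue<String> queue, String item, long timeoutMs)
      throws InterruptedException {
    return queue.offer(item, timeoutMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws Exception {
    SynchronousQueue<String> queue = new SynchronousQueue<>(true); // fair mode
    Thread consumer = new Thread(() -> {
      try {
        queue.take(); // consumes exactly one item, then exits
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    consumer.start();

    // While the consumer is alive, the hand-off is expected to succeed.
    System.out.println("first hand-off accepted: " + handOff(queue, "heartbeat-1", 1000));
    consumer.join();

    // After the consumer has exited, nobody takes the item and the offer times out.
    System.out.println("second hand-off accepted: " + handOff(queue, "heartbeat-2", 200));
  }
}
```

A producer that uses the untimed put() after its consumer thread has exited would park indefinitely, which is the state the heartbeating thread shows in the trace.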






[jira] [Updated] (HDFS-16764) ObserverNamenode handles addBlock rpc and throws a FileNotFoundException

2022-12-20 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16764:
---
Fix Version/s: 3.4.0
   3.3.6

> ObserverNamenode handles addBlock rpc and throws a FileNotFoundException 
> -
>
> Key: HDFS-16764
> URL: https://issues.apache.org/jira/browse/HDFS-16764
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
>
> ObserverNameNode can currently handle the addBlock RPC, but it may 
> throw a FileNotFoundException when its txid is stale.
>  * AddBlock is not a coordinated method, so the Observer will not check the 
> stateId.
>  * AddBlock does the validation with checkOperation(OperationCategory.READ).
> So the Observer can handle the addBlock RPC. If the Observer has not yet 
> replayed the edit that created the file, it will throw a 
> FileNotFoundException during validation.
> The related code is as follows:
> {code:java}
> checkOperation(OperationCategory.READ);
> final FSPermissionChecker pc = getPermissionChecker();
> FSPermissionChecker.setOperationType(operationName);
> readLock();
> try {
>   checkOperation(OperationCategory.READ);
>   r = FSDirWriteFileOp.validateAddBlock(this, pc, src, fileId, clientName,
>       previous, onRetryBlock);
> } finally {
>   readUnlock(operationName);
> } {code}
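
The failure mode described above can be modeled in a few lines. The following is a simplified, hypothetical model and not Hadoop code: the class ObserverReadSketch and its methods applyEdit, validateUncoordinated, and validateCoordinated are names made up for illustration. It assumes the observer tracks the last transaction ID it has replayed, and that a coordinated call compares this with the state ID the client last saw before touching the namespace.

```java
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Set;

/** Toy model of an observer that may lag behind the active NameNode. */
final class ObserverReadSketch {
  // Last edit-log transaction this (hypothetical) observer has replayed.
  private long lastAppliedTxId = 0L;
  // Paths known to the observer's copy of the namespace.
  private final Set<String> files = new HashSet<>();

  /** Replays a "create file" edit and advances the applied txid. */
  void applyEdit(long txId, String createdPath) {
    files.add(createdPath);
    lastAppliedTxId = txId;
  }

  /** Uncoordinated path: validates against whatever state is present. */
  void validateUncoordinated(String path) throws FileNotFoundException {
    if (!files.contains(path)) {
      throw new FileNotFoundException("File does not exist: " + path);
    }
  }

  /**
   * Coordinated path: refuses to serve until the observer has caught up
   * with the state ID the client has already seen.
   */
  boolean validateCoordinated(String path, long clientStateId)
      throws FileNotFoundException {
    if (lastAppliedTxId < clientStateId) {
      return false; // caller should retry later or go to the active
    }
    validateUncoordinated(path);
    return true;
  }

  public static void main(String[] args) throws Exception {
    ObserverReadSketch observer = new ObserverReadSketch();
    long clientStateId = 7L; // the client saw the create at txid 7

    try {
      observer.validateUncoordinated("/a");
    } catch (FileNotFoundException e) {
      System.out.println("uncoordinated: " + e.getMessage());
    }
    System.out.println("coordinated served: "
        + observer.validateCoordinated("/a", clientStateId));

    observer.applyEdit(7L, "/a"); // the observer catches up
    System.out.println("coordinated served: "
        + observer.validateCoordinated("/a", clientStateId));
  }
}
```

In this model the uncoordinated call fails with a FileNotFoundException on the stale observer, while the coordinated call declines to answer until the create edit has been replayed.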






[jira] [Resolved] (HDFS-16852) Register the shutdown hook only when not in shutdown for KeyProviderCache constructor

2022-12-16 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16852.

Resolution: Fixed

> Register the shutdown hook only when not in shutdown for KeyProviderCache 
> constructor
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.3, 3.3.6
>
>
> When an HDFS client is created, it registers a shutdown hook with 
> ShutdownHookManager. ShutdownHookManager does not allow adding a new 
> shutdown hook once the process is already shutting down; it throws an 
> IllegalStateException instead.
> This behavior is not ideal when a Spark program fails during pre-launch. In 
> that case, Spark calls cleanupStagingDir() during shutdown to clean the 
> staging dir, and cleanupStagingDir() creates a FileSystem object to talk 
> to HDFS. Because this is the first use of a FileSystem object in that 
> process, an HDFS client has to be created, which tries to register the 
> shutdown hook and hits the IllegalStateException. This 
> IllegalStateException masks the actual exception that caused the Spark 
> program to fail during pre-launch.
> We propose to swallow the IllegalStateException in KeyProviderCache and log a 
> warning. The TCP connection between the client and the NameNode should be 
> closed by the OS when the process shuts down. 
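
The proposal above, combined with the check named in the issue title, could look roughly like the sketch below. It is self-contained and does not use Hadoop classes: HookRegistry is a hypothetical stand-in for a hook manager that rejects registrations during shutdown, and registerQuietly is an illustrative helper name, not the method added by the actual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class ShutdownHookSketch {

  /** Hypothetical stand-in for a manager that rejects hooks during shutdown. */
  static final class HookRegistry {
    private final AtomicBoolean shutdownInProgress = new AtomicBoolean(false);
    private final List<Runnable> hooks = new ArrayList<>();

    void beginShutdown() {
      shutdownInProgress.set(true);
    }

    boolean isShutdownInProgress() {
      return shutdownInProgress.get();
    }

    void addHook(Runnable hook) {
      if (shutdownInProgress.get()) {
        throw new IllegalStateException(
            "Shutdown in progress, cannot add a shutdownHook");
      }
      hooks.add(hook);
    }

    int size() {
      return hooks.size();
    }
  }

  /**
   * Registers the hook only when no shutdown is in progress. Returns false
   * and logs a warning instead of letting IllegalStateException escape.
   */
  static boolean registerQuietly(HookRegistry registry, Runnable hook) {
    if (registry.isShutdownInProgress()) {
      System.err.println("WARN: shutdown in progress, hook not registered");
      return false;
    }
    try {
      registry.addHook(hook);
      return true;
    } catch (IllegalStateException e) {
      // Shutdown may start between the check and the call: swallow and warn.
      System.err.println("WARN: " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    HookRegistry registry = new HookRegistry();
    System.out.println(registerQuietly(registry, () -> { }));
    registry.beginShutdown();
    System.out.println(registerQuietly(registry, () -> { }));
  }
}
```

Because the caller never sees the IllegalStateException, the original failure that triggered the shutdown stays visible in the logs.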
> Example stacktrace
> {code:java}
> 13-09-2022 14:39:42 PDT INFO - 22/09/13 21:39:41 ERROR util.Utils: Uncaught 
> exception in thread shutdown-hook-0   
> 13-09-2022 14:39:42 PDT INFO - java.lang.IllegalStateException: Shutdown in 
> progress, cannot add a shutdownHook    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.KeyProviderCache.<init>(KeyProviderCache.java:71)
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.<init>(ClientContext.java:130)
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.ClientContext.get(ClientContext.java:167)
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:383)
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:287)
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:159)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3261)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)       
>    
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.cleanupStagingDir(ApplicationMaster.scala:675)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.deploy.yarn.ApplicationMaster.$anonfun$run$2(ApplicationMaster.scala:259)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)    
>       
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2023)          
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
>   
> 13-09-2022 14:39:42 PDT INFO - at scala.util.Try$.apply(Try.scala:213)        
>   
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>           
> 13-09-2022 14:39:42 PDT INFO - at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHook

[jira] [Updated] (HDFS-16852) Register the shutdown hook only when not in shutdown for KeyProviderCache constructor

2022-12-16 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16852:
---
Fix Version/s: 2.10.3

> Register the shutdown hook only when not in shutdown for KeyProviderCache 
> constructor
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.3, 3.3.6
>
>

[jira] [Updated] (HDFS-16852) Register the shutdown hook only when not in shutdown for KeyProviderCache constructor

2022-12-16 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16852:
---
Fix Version/s: 3.3.6
   (was: 3.3.5)

> Register the shutdown hook only when not in shutdown for KeyProviderCache 
> constructor
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
>

[jira] [Updated] (HDFS-16852) Register the shutdown hook only when not in shutdown for KeyProviderCache constructor

2022-12-16 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16852:
---
Fix Version/s: 3.3.5

> Register the shutdown hook only when not in shutdown for KeyProviderCache 
> constructor
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>

[jira] [Updated] (HDFS-16852) Register the shutdown hook only when not in shutdown for KeyProviderCache constructor

2022-12-16 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16852:
---
Fix Version/s: 3.4.0

> Register the shutdown hook only when not in shutdown for KeyProviderCache 
> constructor
> -
>
> Key: HDFS-16852
> URL: https://issues.apache.org/jira/browse/HDFS-16852
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>

[jira] [Resolved] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-11-30 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16550.

Fix Version/s: 3.4.0
   Resolution: Fixed

> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-04-21-09-54-29-751.png, 
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we ran into a problem while upgrading 
> the JournalNodes.
> Cluster Info: 
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart of the JournalNodes (related config: 
> dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G).
> 2. The cluster runs for a while; edits cache usage keeps increasing until 
> memory is used up.
> 3. The Active NameNode (nn0) shuts down because of "Timed 
> out waiting 12ms for a quorum of nodes to respond".
> 4. nn1 is transitioned to the Active state.
> 5. The new Active NameNode (nn1) also shuts down because of 
> "Timed out waiting 12ms for a quorum of nodes to respond".
> 6. The cluster crashes.
>  
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger 
> than the memory available to the process. If 
> {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * 
> Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
> JournalNode startup, which users can easily overlook. However, after the 
> cluster has run for some time, this is likely to cause the cluster 
> to crash.
>  
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, the {{cache size}} should not be set to a fixed value but to a ratio 
> of the maximum memory, 0.2 by default.
> This avoids the problem of an overly large cache size. In addition, users can 
> adjust the heap size when they need to increase the cache size.
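
The ratio-based sizing suggested above can be sketched as follows. This is an illustration only: the class EditCacheCapacitySketch and the method capacityFromFraction are hypothetical names, and the 0.2 default is taken from the description rather than from any released configuration key.

```java
final class EditCacheCapacitySketch {
  // Default fraction of the maximum heap, as suggested in the description.
  static final double DEFAULT_FRACTION = 0.2;

  /**
   * Derives the cache capacity in bytes from a fraction of the maximum
   * heap, so the cache cannot be configured larger than the heap itself.
   */
  static long capacityFromFraction(double fraction, long maxMemoryBytes) {
    if (!(fraction > 0.0 && fraction < 1.0)) {
      throw new IllegalArgumentException(
          "Cache fraction must be in (0, 1), got " + fraction);
    }
    return (long) (fraction * maxMemoryBytes);
  }

  public static void main(String[] args) {
    long maxMemory = Runtime.getRuntime().maxMemory();
    long capacity = capacityFromFraction(DEFAULT_FRACTION, maxMemory);
    System.out.println("max heap bytes:       " + maxMemory);
    System.out.println("edit cache capacity:  " + capacity);
  }
}
```

With a 1 GiB heap this yields a cache of roughly 205 MiB instead of the 1G fixed value from the scenario above, and raising the heap size raises the cache size with it.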






[jira] [Commented] (HDFS-16832) [SBN READ] Fix NPE when check the block location of empty directory

2022-11-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636809#comment-17636809
 ] 

Erik Krogen commented on HDFS-16832:


Merged to branch-3.3 as well since that has HDFS-16732 also.

> [SBN READ] Fix NPE when check the block location of empty directory
> ---
>
> Key: HDFS-16832
> URL: https://issues.apache.org/jira/browse/HDFS-16832
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> HDFS-16732 introduced a block location check for getListing and 
> getFileInfo. However, checking the block location of an empty directory 
> throws an NPE.
> The exception stack on the Tez client is below:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1492)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1389)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy12.getListing(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678)
>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy13.getListing(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1671)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1212)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1195)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1140)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1136)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1154)
>   at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2054)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:159)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:279)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:270)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:270)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:254)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
> 

[jira] [Resolved] (HDFS-16547) [SBN read] Namenode in safe mode should not be transfered to observer state

2022-11-21 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16547.

Resolution: Fixed

> [SBN read] Namenode in safe mode should not be transfered to observer state
> ---
>
> Key: HDFS-16547
> URL: https://issues.apache.org/jira/browse/HDFS-16547
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Tao Li
> Assignee: Tao Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Currently, when a NameNode is in safe mode (while starting up, or after 
> entering safe mode manually), it can still be transitioned to Observer by 
> command. This Observer may receive many requests and then throw 
> SafeModeException, which causes unnecessary failovers on the client.
> So a NameNode in safe mode should not be transitioned to observer state.
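
The guard described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: the class ObserverTransitionGuard, the HaState enum, and the use of IllegalStateException are all hypothetical stand-ins for whatever the NameNode really uses.

```java
class ObserverTransitionGuard {
    /** Hypothetical stand-in for the HA states named in the issue. */
    enum HaState { ACTIVE, STANDBY, OBSERVER }

    private final boolean inSafeMode;
    private HaState state = HaState.STANDBY;

    ObserverTransitionGuard(boolean inSafeMode) {
        this.inSafeMode = inSafeMode;
    }

    /**
     * Moves to OBSERVER unless the node is in safe mode. The unchecked
     * exception is a placeholder; the real patch may signal refusal differently.
     */
    HaState transitionToObserver() {
        if (inSafeMode) {
            throw new IllegalStateException(
                "NameNode is in safe mode; refusing transition to observer");
        }
        state = HaState.OBSERVER;
        return state;
    }

    HaState getState() {
        return state;
    }

    public static void main(String[] args) {
        System.out.println(new ObserverTransitionGuard(false).transitionToObserver());
        try {
            new ObserverTransitionGuard(true).transitionToObserver();
        } catch (IllegalStateException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

Refusing the transition up front keeps the node in its previous state, so clients never route reads to an observer that can only answer with safe mode errors.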



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16547) [SBN read] Namenode in safe mode should not be transfered to observer state

2022-11-21 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16547:
---
Fix Version/s: 3.4.0

> [SBN read] Namenode in safe mode should not be transfered to observer state
> ---
>
> Key: HDFS-16547
> URL: https://issues.apache.org/jira/browse/HDFS-16547
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Tao Li
> Assignee: Tao Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Currently, when a NameNode is in safe mode (while starting up, or after 
> entering safe mode manually), it can still be transitioned to Observer by 
> command. This Observer may receive many requests and then throw 
> SafeModeException, which causes unnecessary failovers on the client.
> So a NameNode in safe mode should not be transitioned to observer state.






[jira] [Resolved] (HDFS-16832) [SBN READ] Fix NPE when check the block location of empty directory

2022-11-21 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16832.

Resolution: Fixed

> [SBN READ] Fix NPE when check the block location of empty directory
> ---
>
> Key: HDFS-16832
> URL: https://issues.apache.org/jira/browse/HDFS-16832
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> HDFS-16732 introduced a block location check for getListing and 
> getFileInfo. But checking the block locations of an empty directory throws 
> an NPE.
> The exception stack on the Tez client is below:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1492)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1389)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy12.getListing(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678)
>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy13.getListing(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1671)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1212)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1195)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1140)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1136)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1154)
>   at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2054)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:159)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:279)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:270)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:270)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:254)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenable

[jira] [Updated] (HDFS-16832) [SBN READ] Fix NPE when check the block location of empty directory

2022-11-21 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16832:
---
Fix Version/s: 3.3.5

> [SBN READ] Fix NPE when check the block location of empty directory
> ---
>
> Key: HDFS-16832
> URL: https://issues.apache.org/jira/browse/HDFS-16832
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> HDFS-16732 introduced a block location check for getListing and 
> getFileInfo. But checking the block locations of an empty directory throws 
> an NPE.
> The exception stack on the Tez client is below:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1492)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1389)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy12.getListing(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678)
>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy13.getListing(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1671)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1212)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1195)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1140)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1136)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1154)
>   at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2054)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:159)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:279)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:270)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:270)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:254)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenabl

[jira] [Commented] (HDFS-16732) [SBN READ] Avoid get location from observer when the block report is delayed.

2022-11-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636765#comment-17636765
 ] 

Erik Krogen commented on HDFS-16732:


Note that a bug with this change was reported, and now fixed, in HDFS-16832.

> [SBN READ] Avoid get location from observer when the block report is delayed.
> -
>
> Key: HDFS-16732
> URL: https://issues.apache.org/jira/browse/HDFS-16732
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.2.1
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> A Hive on Tez application fails occasionally after the observer is enabled. 
> The log shows the following:
> {code:java}
> 2022-08-18 15:22:06,914 [ERROR] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Vertex Input: namenodeinfo_stg initializer failed, 
> vertex=vertex_1660618571916_4839_1_00 [Map 1]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:329)
>   at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
>   at 
> com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
>   at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
>   at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
>   at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.afterRanInterruptibly(TrustedListenableFutureTask.java:133)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:80)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> org.apache.hadoop.mapred.FileInputFormat.identifyHosts(FileInputFormat.java:748)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplitHostsAndCachedHosts(FileInputFormat.java:714)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:378)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:159)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:279)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:270)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:270)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:254)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
>   ... 4 more {code}
> As described in MAPREDUCE-7082, this exception is thrown when a block is 
> missing, but my cluster had no missing blocks.
> In this example, I found that getListing returns location information. When 
> the observer's block report is delayed, it returns blocks without locations.
> HDFS-13924 was introduced to solve this problem, but it only considers 
> getBlockLocations. 
> On an observer node, every method that may return locations should check 
> whether the locations are empty.
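
The check proposed above could look something like the sketch below. It is a simplified model, not the HDFS code: Block, Entry, and needsRetryOnActive are hypothetical names, and a listing entry whose block list is null stands in for a directory. Skipping such entries is an assumption made here so that the check itself cannot throw an NPE on a directory.

```java
import java.util.Arrays;
import java.util.List;

class ObserverLocationCheck {
    /** Hypothetical model of one block: the hosts currently reporting it. */
    static final class Block {
        final List<String> hosts;

        Block(String... hosts) {
            this.hosts = Arrays.asList(hosts);
        }
    }

    /** Hypothetical model of one listing entry; blocks is null for a directory. */
    static final class Entry {
        final String name;
        final List<Block> blocks;

        Entry(String name, List<Block> blocks) {
            this.name = name;
            this.blocks = blocks;
        }
    }

    /**
     * Returns true if any file in the listing has a block with no known
     * location, in which case the observer should not answer the request itself.
     */
    static boolean needsRetryOnActive(List<Entry> listing) {
        for (Entry entry : listing) {
            if (entry.blocks == null) {
                continue; // directory: nothing to check
            }
            for (Block block : entry.blocks) {
                if (block.hosts.isEmpty()) {
                    return true; // block report has not reached the observer yet
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Entry> listing = Arrays.asList(
            new Entry("emptyDir", null),
            new Entry("part-0", Arrays.asList(new Block("dn1", "dn2"))),
            new Entry("part-1", Arrays.asList(new Block())));
        System.out.println(needsRetryOnActive(listing));
    }
}
```

When the check returns true, the caller would hand the request back to the active NameNode rather than return blocks the client cannot read.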




[jira] [Updated] (HDFS-16832) [SBN READ] Fix NPE when check the block location of empty directory

2022-11-21 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16832:
---
Fix Version/s: 3.4.0

> [SBN READ] Fix NPE when check the block location of empty directory
> ---
>
> Key: HDFS-16832
> URL: https://issues.apache.org/jira/browse/HDFS-16832
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
>
> HDFS-16732 introduced a block location check for getListing and 
> getFileInfo. But checking the block locations of an empty directory throws 
> an NPE.
> The exception stack on the Tez client is below:
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
> java.lang.NullPointerException
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1492)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1389)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy12.getListing(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678)
>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy13.getListing(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1671)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1212)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1195)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1140)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1136)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1154)
>   at 
> org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2054)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:159)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:279)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:270)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:270)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:254)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFuture

[jira] [Commented] (HDFS-13791) Limit logging frequency of edit tail related statements

2022-11-18 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635998#comment-17635998
 ] 

Erik Krogen commented on HDFS-13791:


Hi [~chengbing.liu], good callout. I would be supportive of making them static. 
Thanks for reporting!

> Limit logging frequency of edit tail related statements
> ---
>
> Key: HDFS-13791
> URL: https://issues.apache.org/jira/browse/HDFS-13791
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs, qjm
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13791-HDFS-12943.000.patch, 
> HDFS-13791-HDFS-12943.001.patch, HDFS-13791-HDFS-12943.002.patch, 
> HDFS-13791-HDFS-12943.003.patch, HDFS-13791-HDFS-12943.004.patch, 
> HDFS-13791-HDFS-12943.005.patch, HDFS-13791-HDFS-12943.006.patch
>
>
> There are a number of log statements that occur every time new edits are 
> tailed by a Standby NameNode. When edits are tailing only on the order of 
> every tens of seconds, this is fine. With the work in HDFS-13150, however, 
> edits may be tailed every few milliseconds, which can flood the logs with 
> tailing-related statements. We should throttle it to limit it to printing at 
> most, say, once per 5 seconds.
> We can implement logic similar to that used in HDFS-10713. This may be 
> slightly more tricky since the log statements are distributed across a few 
> classes.
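
The throttling idea above can be sketched as follows. This is only an illustration under stated assumptions: ThrottledLogger and its record method are hypothetical names, not the helper that was eventually committed, and time is passed in explicitly so the behavior is easy to follow.

```java
class ThrottledLogger {
    private final long minIntervalMillis;
    private long lastLoggedMillis;
    private boolean loggedOnce = false;
    private int suppressed = 0;

    ThrottledLogger(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns the line to print, or null if this call falls inside the quiet
     * period. A printed line mentions how many calls were dropped since the
     * previous printed line.
     */
    synchronized String record(String message, long nowMillis) {
        if (loggedOnce && nowMillis - lastLoggedMillis < minIntervalMillis) {
            suppressed++;
            return null;
        }
        String line = suppressed == 0
            ? message
            : message + " (suppressed " + suppressed + " similar messages)";
        loggedOnce = true;
        lastLoggedMillis = nowMillis;
        suppressed = 0;
        return line;
    }

    public static void main(String[] args) {
        ThrottledLogger throttler = new ThrottledLogger(5000); // at most once per 5 seconds
        long[] tailTimes = {0, 3, 6, 4999, 5000};
        for (long now : tailTimes) {
            String line = throttler.record("Loaded edits", now);
            if (line != null) {
                System.out.println(now + " ms: " + line);
            }
        }
    }
}
```

Because the log statements are spread across a few classes, one shared throttler instance (or one per statement group) would have to be reachable from each of them, which is the tricky part the description mentions.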






[jira] [Commented] (HDFS-16771) JN should tersely print logs about NewerTxnIdException

2022-10-19 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620476#comment-17620476
 ] 

Erik Krogen commented on HDFS-16771:


Thanks for catching my mistake [~ferhui] !

> JN should tersely print logs about NewerTxnIdException
> --
>
> Key: HDFS-16771
> URL: https://issues.apache.org/jira/browse/HDFS-16771
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
>
> JournalNode should tersely print some logs about NewerTxnIdException.






[jira] [Commented] (HDFS-16682) [SBN Read] make estimated transactions configurable

2022-09-20 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607344#comment-17607344
 ] 

Erik Krogen commented on HDFS-16682:


I remember at the time we built this feature, [~shv] had reservations about 
exposing these as configurable parameters. Konstantin, do you have any 
commentary here?

One high-level comment is that if we do add configs, they should be namespaced 
to make their purpose more clear as a part of the observer namenode feature 
specifically.

> [SBN Read] make estimated transactions configurable
> ---
>
> Key: HDFS-16682
> URL: https://issues.apache.org/jira/browse/HDFS-16682
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In GlobalStateIdContext, ESTIMATED_TRANSACTIONS_PER_SECOND and 
> ESTIMATED_SERVER_TIME_MULTIPLIER should be configurable.
> Suitable values for these parameters depend on each cluster's load. In 
> addition, making them configurable would help us simulate an observer 
> NameNode that is far behind.
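
As a rough sketch of what configurable, namespaced estimates might look like, the example below reads both values from java.util.Properties. Everything specific here is an assumption for illustration: the key names, the ObserverEstimates class, and the default values are hypothetical and are not taken from the Hadoop source or from any patch.

```java
import java.util.Properties;

class ObserverEstimates {
    // Hypothetical key names, namespaced under an observer-specific prefix.
    static final String TXNS_PER_SECOND_KEY =
        "dfs.namenode.observer.estimated-transactions-per-second";
    static final String TIME_MULTIPLIER_KEY =
        "dfs.namenode.observer.estimated-server-time-multiplier";

    // Placeholder defaults chosen for this example only.
    static final long DEFAULT_TXNS_PER_SECOND = 10_000L;
    static final float DEFAULT_TIME_MULTIPLIER = 10.0f;

    final long txnsPerSecond;
    final float timeMultiplier;

    /** Reads both estimates, falling back to the defaults when a key is absent. */
    ObserverEstimates(Properties conf) {
        this.txnsPerSecond = Long.parseLong(
            conf.getProperty(TXNS_PER_SECOND_KEY, Long.toString(DEFAULT_TXNS_PER_SECOND)));
        this.timeMultiplier = Float.parseFloat(
            conf.getProperty(TIME_MULTIPLIER_KEY, Float.toString(DEFAULT_TIME_MULTIPLIER)));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // A very low rate makes a test observer look far behind.
        conf.setProperty(TXNS_PER_SECOND_KEY, "1");
        ObserverEstimates estimates = new ObserverEstimates(conf);
        System.out.println(estimates.txnsPerSecond + " " + estimates.timeMultiplier);
    }
}
```

Keeping both keys under one observer-specific prefix follows the namespacing suggestion in the comment above, so their purpose is clear from the name alone.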






[jira] [Updated] (HDFS-16771) JN should tersely print logs about NewerTxnIdException

2022-09-15 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16771:
---
Fix Version/s: 3.4.0

> JN should tersely print logs about NewerTxnIdException
> --
>
> Key: HDFS-16771
> URL: https://issues.apache.org/jira/browse/HDFS-16771
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
>
> JournalNode should tersely print some logs about NewerTxnIdException.






[jira] [Resolved] (HDFS-16659) JournalNode should throw NewerTxnIdException if SinceTxId is bigger than HighestWrittenTxId

2022-09-06 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16659.

Fix Version/s: 3.4.0
   Resolution: Fixed

> JournalNode should throw NewerTxnIdException if SinceTxId is bigger than 
> HighestWrittenTxId
> ---
>
> Key: HDFS-16659
> URL: https://issues.apache.org/jira/browse/HDFS-16659
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> JournalNode should throw `CacheMissException` if `sinceTxId` is bigger than 
> `highestWrittenTxId` while handling the `getJournaledEdits` RPC from NNs. 
> In some corner cases the current logic can leave the in-progress 
> EditLogTailer unable to replay any edits from the JournalNodes, so the 
> Observer NameNode cannot handle requests from clients.
> Suppose there are 3 JournalNodes, JN0 ~ JN2.
> * JN0 hits some abnormal condition while the Active NameNode is syncing 10 
> edits starting at txid 11
> * The NameNode ignores the abnormal JN0 and continues to sync edits to JN1 
> and JN2
> * JN0 returns to health
> * The NameNode continues by syncing 10 edits starting at txid 21
> * At this point, edits 11 ~ 30 are not in the cache of JN0
> * The Observer NameNode tries to select an EditLogInputStream through 
> `getJournaledEdits` with sinceTxId 21
> * JN2 hits some abnormal condition that causes a slow response
> The expected result is: the response should contain the 10 edits from txid 21 
> to txid 30, served by JN1 and JN2, because the Active NameNode wrote these 
> edits successfully to JN1 and JN2 and failed to write them to JN0.
> But in the current implementation, the response is [Response(0) from JN0, 
> Response(10) from JN1], because some abnormal condition in JN2, such as GC 
> or a bad network, causes a slow response. So `maxAllowedTxns` will be 0, and 
> the NameNode will not replay any edits.
> As shown above, the root cause is that JournalNode should throw a cache miss 
> exception when `sinceTxId` is greater than `highestWrittenTxId`.
> The buggy code is below:
> {code:java}
> if (sinceTxId > getHighestWrittenTxId()) {
>   // Requested edits that don't exist yet; short-circuit the cache here
>   metrics.rpcEmptyResponses.incr();
>   return GetJournaledEditsResponseProto.newBuilder().setTxnCount(0).build();
> }
> {code}
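
To see why the empty response from JN0 is harmful, the scenario above can be replayed with a small model. This is a sketch under an explicit assumption: the client is taken to pick the largest edit count that a majority of the responders it heard from can serve. JournaledEditsQuorum and maxAllowedTxns are hypothetical names, and the real selection logic may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class JournaledEditsQuorum {
    /**
     * Largest edit count that at least `majority` of the responders can serve.
     * Assumes responseCounts.size() >= majority.
     */
    static int maxAllowedTxns(List<Integer> responseCounts, int majority) {
        List<Integer> sorted = new ArrayList<>(responseCounts);
        Collections.sort(sorted);
        return sorted.get(sorted.size() - majority);
    }

    public static void main(String[] args) {
        int majority = 2; // 3 JournalNodes

        // Current behavior: JN0 answers with 0 edits, JN1 with 10, JN2 is slow.
        // The empty answer counts toward the quorum and drags the result to 0.
        System.out.println(maxAllowedTxns(Arrays.asList(0, 10), majority));

        // If JN0 fails the call instead, its answer is not counted, so the
        // quorum is only reached once JN2 responds with its 10 edits.
        System.out.println(maxAllowedTxns(Arrays.asList(10, 10), majority));
    }
}
```

Under this model the first call yields 0 and the second yields 10, which matches the current and expected results described in the issue.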






[jira] [Resolved] (HDFS-16732) [SBN READ] Avoid get location from observer when the block report is delayed.

2022-08-25 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16732.

Fix Version/s: 3.4.0, 3.3.9
   Resolution: Fixed

Merged PR 4756 to trunk and branch-3.3. Thanks [~zhengchenyu]!

> [SBN READ] Avoid get location from observer when the block report is delayed.
> -
>
> Key: HDFS-16732
> URL: https://issues.apache.org/jira/browse/HDFS-16732
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.2.1
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> A Hive on Tez application fails occasionally after the observer is enabled. 
> The log shows the following:
> {code:java}
> 2022-08-18 15:22:06,914 [ERROR] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Vertex Input: namenodeinfo_stg initializer failed, 
> vertex=vertex_1660618571916_4839_1_00 [Map 1]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:329)
>   at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
>   at 
> com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
>   at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
>   at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
>   at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.afterRanInterruptibly(TrustedListenableFutureTask.java:133)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:80)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> org.apache.hadoop.mapred.FileInputFormat.identifyHosts(FileInputFormat.java:748)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplitHostsAndCachedHosts(FileInputFormat.java:714)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:378)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:159)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:279)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:270)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:270)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:254)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
>   ... 4 more {code}
> As described in MAPREDUCE-7082, this exception is thrown when a block is 
> missing, but my cluster had no missing blocks.
> In this case, I found that getListing returns location information. When the 
> observer's block report is delayed, it returns the block without locations.
> HDFS-13924 was introduced to solve this problem, but it only considers 
> getBlockLocations. 
> On the observer node, every method that may return locations should check 
> whether the locations are empty.
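
The check proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions: `Block`, `RetryOnActive`, and `checkLocations` are hypothetical stand-ins, not the real HDFS classes or the code merged in PR 4756.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types are the real HDFS classes.
final class ObserverLocationCheck {

  // Stand-in for a located block: an id plus the DataNode hosts known for it.
  static final class Block {
    final long id;
    final List<String> hosts;
    Block(long id, List<String> hosts) { this.id = id; this.hosts = hosts; }
  }

  // Stand-in for a retriable error telling the client to retry on the active NameNode.
  static final class RetryOnActive extends RuntimeException {
    RetryOnActive(String message) { super(message); }
  }

  // Rejects a response in which any block has no known location.
  static void checkLocations(List<Block> blocks) {
    for (Block b : blocks) {
      if (b.hosts == null || b.hosts.isEmpty()) {
        throw new RetryOnActive("block " + b.id + " has no locations yet");
      }
    }
  }

  public static void main(String[] args) {
    // A block with known hosts passes the check.
    checkLocations(Arrays.asList(new Block(1, Arrays.asList("dn1", "dn2"))));
    try {
      // A block with an empty host list is rejected.
      checkLocations(Arrays.asList(new Block(2, Collections.<String>emptyList())));
    } catch (RetryOnActive e) {
      System.out.println("retry on active: " + e.getMessage());
    }
  }
}
```

The idea is that any observer-side call that can return locations (getListing included) would run such a check before answering, so the client never sees a block with an empty host array.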




[jira] [Commented] (HDFS-13522) RBF: Support observer node from Router-Based Federation

2022-07-28 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572562#comment-17572562
 ] 

Erik Krogen commented on HDFS-13522:


Thanks for sharing [~xuzq_zander], very interesting!

{quote}
I just feel that Design A will cause the client to carry a lot of useless NS 
TxIds to RBF, because in most scenarios RBF just proxies a request from one 
client to one downstream NS. And as the number of underlying NS increases, the 
client will carry more and more useless NS TxIds to RBF.
{quote}
[~simbadzina] -- I haven't looked carefully at the design but are we currently 
including all downstream NS in the header regardless of whether or not a client 
accesses it? It's a good point that ideally we would only include the state for 
NS which are actually accessed.
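
A minimal sketch of "only include the state for NS which are actually accessed", assuming a client that tracks one last-seen transaction ID per namespace. The class and method names are hypothetical and do not correspond to the RBF implementation or to either design document.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: tracks state only for namespaces this client has touched.
final class AccessedNamespaceState {

  // Last seen transaction id per namespace id, populated lazily on access.
  private final Map<String, Long> lastSeenTxId = new TreeMap<>();

  // Called when a response from namespace nsId reports state txId.
  void recordResponse(String nsId, long txId) {
    lastSeenTxId.merge(nsId, txId, Math::max);
  }

  // State to attach to the next request: accessed namespaces only.
  Map<String, Long> headerState() {
    return new TreeMap<>(lastSeenTxId);
  }

  public static void main(String[] args) {
    AccessedNamespaceState state = new AccessedNamespaceState();
    state.recordResponse("ns1", 10L);
    state.recordResponse("ns1", 7L);   // an older state id does not move the value backwards
    state.recordResponse("ns3", 5L);
    System.out.println(state.headerState());
  }
}
```

With this shape the header grows with the number of namespaces a client uses rather than with the number of namespaces behind the router.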

> RBF: Support observer node from Router-Based Federation
> ---
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-13522.001.patch, HDFS-13522.002.patch, 
> HDFS-13522_WIP.patch, HDFS-13522_proposal_zhengchenyu_v1.pdf, RBF_ Observer 
> support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png, 
> observer_reads_in_rbf_proposal_simbadzina_v1.pdf, 
> observer_reads_in_rbf_proposal_simbadzina_v2.pdf
>
>  Time Spent: 20h 50m
>  Remaining Estimate: 0h
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.






[jira] [Comment Edited] (HDFS-13522) RBF: Support observer node from Router-Based Federation

2022-07-28 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572562#comment-17572562
 ] 

Erik Krogen edited comment on HDFS-13522 at 7/28/22 5:01 PM:
-

Thanks for sharing [~xuzq_zander], very interesting!

{quote}
I just feel that Design A will cause the client to carry a lot of useless NS 
TxIds to RBF, because in most scenarios RBF just proxies a request from one 
client to one downstream NS. And as the number of underlying NS increases, the 
client will carry more and more useless NS TxIds to RBF.
{quote}
[~simbadzina] -- I haven't looked carefully at the design but are we currently 
including all downstream NS in the header regardless of whether or not a client 
accesses it? It's a good point that ideally we would only include the state for 
NS which are actually accessed (even if there are only single-digit NS, still 
there's no point carrying the extra state if it is not used).


was (Author: xkrogen):
Thanks for sharing [~xuzq_zander], very interesting!

{quote}
I just feel that Design A will cause the client to carry a lot of useless NS 
TxIds to RBF, because in most scenarios RBF just proxies a request from one 
client to one downstream NS. And as the number of underlying NS increases, the 
client will carry more and more useless NS TxIds to RBF.
{quote}
[~simbadzina] -- I haven't looked carefully at the design but are we currently 
including all downstream NS in the header regardless of whether or not a client 
accesses it? It's a good point that ideally we would only include the state for 
NS which are actually accessed.

> RBF: Support observer node from Router-Based Federation
> ---
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-13522.001.patch, HDFS-13522.002.patch, 
> HDFS-13522_WIP.patch, HDFS-13522_proposal_zhengchenyu_v1.pdf, RBF_ Observer 
> support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png, 
> observer_reads_in_rbf_proposal_simbadzina_v1.pdf, 
> observer_reads_in_rbf_proposal_simbadzina_v2.pdf
>
>  Time Spent: 20h 50m
>  Remaining Estimate: 0h
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.






[jira] [Commented] (HDFS-16696) NameNode supports a new MsyncRPCServer to handle msync() RPC separately

2022-07-27 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572023#comment-17572023
 ] 

Erik Krogen commented on HDFS-16696:


Generally +1 from me as long as we can get it done without introducing too much 
additional complexity, and keep it off-by-default. If we assume there are many 
clients using always-msync mode (including for Design B laid out in 
HDFS-13522), then latency of msync ops becomes very critical. Even without 
always-msync mode, latency of msync can have an impact on performance, so it's 
still potentially useful.

> NameNode supports a new MsyncRPCServer to handle msync() RPC separately
> ---
>
> Key: HDFS-16696
> URL: https://issues.apache.org/jira/browse/HDFS-16696
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>
> HDFS-12943 introduced Consistent Reads from Standby Node. It uses the msync 
> mechanism to guarantee consistency, so the latency of the msync() RPC is 
> very important, especially for end users who need to call the msync() RPC 
> every time.
> Unfortunately, the NameNode handles msync() RPCs the same way as other RPCs: 
> they are enqueued, wait, and are then handled. So msync() can be blocked by 
> other RPCs, such as setQuota, rename, delete, etc. 
> So we need a new mechanism to guarantee the latency of the msync() RPC.
> For example: 
> * We can support a new MsyncRPCServer in the NameNode to handle msync() RPCs 
> separately.
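
To illustrate why a separate queue helps, here is a toy model, assuming one call queue for msync() and one for everything else. `MsyncRouting` and its methods are hypothetical; this is not how the NameNode RPC server is structured, only a sketch of the queueing argument.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: routes msync() calls to their own queue.
final class MsyncRouting {

  private final Queue<String> generalQueue = new ArrayDeque<>();
  private final Queue<String> msyncQueue = new ArrayDeque<>();

  // Enqueues a call and returns how many calls are ahead of it in its queue.
  int enqueue(String method) {
    Queue<String> target = "msync".equals(method) ? msyncQueue : generalQueue;
    int ahead = target.size();
    target.add(method);
    return ahead;
  }

  public static void main(String[] args) {
    MsyncRouting router = new MsyncRouting();
    router.enqueue("setQuota");
    router.enqueue("rename");
    router.enqueue("delete");
    // The msync call does not wait behind the three calls above.
    System.out.println("calls ahead of msync: " + router.enqueue("msync"));
  }
}
```

In the single-queue case the same msync() would have three calls ahead of it, which is the blocking the description refers to.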






[jira] [Comment Edited] (HDFS-13522) RBF: Support observer node from Router-Based Federation

2022-07-27 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572017#comment-17572017
 ] 

Erik Krogen edited comment on HDFS-13522 at 7/27/22 4:27 PM:
-

[~xuzq_zander] I'm curious to learn more about your use case. I would assume 
that if your namespaces are so finely segmented that you have 100+, then each 
one would be small enough so as to not require the use of CRS. Are you really 
running 100+ namespaces, each of which includes Observer Nodes?

I am still in support of a hybrid Design A + B in the interest of supporting 
both old and new clients, but I am curious about the situation that would lead 
to the issue of very large RPC headers as discussed above.


was (Author: xkrogen):
[~xuzq_zander] I'm curious to learn more about your use case. I would assume 
that if your namespaces are so finely segmented that you have 100+, then each 
one would be small enough so as to not require the use of CRS. Are you really 
running 100+ namespaces, each of which includes Observer Nodes?

> RBF: Support observer node from Router-Based Federation
> ---
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-13522.001.patch, HDFS-13522.002.patch, 
> HDFS-13522_WIP.patch, HDFS-13522_proposal_zhengchenyu_v1.pdf, RBF_ Observer 
> support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png, 
> observer_reads_in_rbf_proposal_simbadzina_v1.pdf, 
> observer_reads_in_rbf_proposal_simbadzina_v2.pdf
>
>  Time Spent: 20h
>  Remaining Estimate: 0h
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.






[jira] [Commented] (HDFS-13522) RBF: Support observer node from Router-Based Federation

2022-07-27 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572017#comment-17572017
 ] 

Erik Krogen commented on HDFS-13522:


[~xuzq_zander] I'm curious to learn more about your use case. I would assume 
that if your namespaces are so finely segmented that you have 100+, then each 
one would be small enough so as to not require the use of CRS. Are you really 
running 100+ namespaces, each of which includes Observer Nodes?

> RBF: Support observer node from Router-Based Federation
> ---
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-13522.001.patch, HDFS-13522.002.patch, 
> HDFS-13522_WIP.patch, HDFS-13522_proposal_zhengchenyu_v1.pdf, RBF_ Observer 
> support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png, 
> observer_reads_in_rbf_proposal_simbadzina_v1.pdf, 
> observer_reads_in_rbf_proposal_simbadzina_v2.pdf
>
>  Time Spent: 20h
>  Remaining Estimate: 0h
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.






[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-14 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522432#comment-17522432
 ] 

Erik Krogen commented on HDFS-16507:


Sounds good, thanks for that context.

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it's purging an edit log that is still in progress.
> According to the analysis, I suspect that the in-progress edit log that is 
> about to be purged (after the SNN checkpoint) is not finalized (see 
> HDFS-14317) before the ANN rolls the edit log itself. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:

[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-13 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521809#comment-17521809
 ] 

Erik Krogen commented on HDFS-16507:


{quote}
Do you mean, if the situation arises that `minTxIdToKeep > curSegmentTxId`, 
the ANN will crash because of the `Preconditions.checkArgument` failure?
{quote}
Yeah, that is my concern.

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it's purging an edit log that is still in progress.
> According to the analysis, I suspect that the in-progress edit log that is 
> about to be purged (after the SNN checkpoint) is not finalized (see 
> HDFS-14317) before the ANN rolls the edit log itself. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>     
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.pro

[jira] [Commented] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-04-12 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521381#comment-17521381
 ] 

Erik Krogen commented on HDFS-16507:


[~tomscut] -- thanks for changing the assert to a 
{{Preconditions.checkArgument()}}. I agree that this makes sense to ensure 
safety of the log.
However if the situation arises that `{{minTxIdToKeep > curSegmentTxId}}`, then 
the check will fail, so the NN will still crash, right? Forgive me if I'm 
misremembering how the NN will handle a failure in 
{{FSEditLog#purgeLogsOlderThan()}}. While I agree that PR#4082 was a good step, 
the underlying issue seems not to be resolved, right?
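
The concern can be shown with a small sketch. It assumes the guard has the shape "the purge point must not pass the start of the current segment"; `purgeLogsOlderThan` here is a hypothetical stand-in, and `checkArgument` is a local helper mimicking Guava's `Preconditions.checkArgument(boolean, Object)` so the example has no dependencies.

```java
// Hypothetical sketch of the guard discussed above; not the actual FSEditLog code.
final class PurgeGuard {

  // Local stand-in for Guava's Preconditions.checkArgument(boolean, Object).
  static void checkArgument(boolean ok, String message) {
    if (!ok) {
      throw new IllegalArgumentException(message);
    }
  }

  // Returns the purge point, or throws if it would pass the current segment start.
  static long purgeLogsOlderThan(long minTxIdToKeep, long curSegmentTxId) {
    checkArgument(minTxIdToKeep <= curSegmentTxId,
        "cannot purge logs older than txid " + minTxIdToKeep
            + " when current segment starts at txid " + curSegmentTxId);
    return minTxIdToKeep;
  }

  public static void main(String[] args) {
    System.out.println(purgeLogsOlderThan(5, 10));
    try {
      purgeLogsOlderThan(11, 10);
    } catch (IllegalArgumentException e) {
      // The check protects the log, but the caller still sees a failure.
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The check keeps the in-progress segment safe, but whoever calls it still has to handle the exception, which is the open question raised in the comment.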

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It looks like it's purging an edit log that is still in progress.
> According to the analysis, I suspect that the in-progress edit log that is 
> about to be purged (after the SNN checkpoint) is not finalized (see 
> HDFS-14317) before the ANN rolls the edit log itself. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     
> org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     
> org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(Fil

[jira] [Commented] (HDFS-16493) [SBN Read]When fast path tail enabled, standby or observer namenode may read uncommitted data

2022-03-04 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501435#comment-17501435
 ] 

Erik Krogen commented on HDFS-16493:


Thanks for reporting [~liutongwei]! I guess this is a continuation of [your 
comment on 
HDFS-13150|https://issues.apache.org/jira/browse/HDFS-13150?focusedCommentId=17408479&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17408479],
 is that correct?

As I said there I don't personally have bandwidth to dig deep into this, but 
from your detailed explanation, it does seem to be a valid issue. I will let 
[~shv] take a closer look.

> [SBN Read]When fast path tail enabled, standby or observer namenode may read 
> uncommitted data
> -
>
> Key: HDFS-16493
> URL: https://issues.apache.org/jira/browse/HDFS-16493
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node, namenode
>Reporter: liutongwei
>Priority: Critical
> Attachments: example.patch
>
>
> Although fast-path tailing uses a quorum read to pull edit logs, it seems 
> that it can read uncommitted data in some corner cases.
> Here is an example. Suppose we have three JNs whose initial state is:
>  
> {code:java}
> epoch 1
> JN1 [1-3](in-progress)
> JN2 [1-3](in-progress)
> JN3 [1-4](in-progress)
> Note that in epoch 1, txid 1-3 were committed and txid 4 was not.
> {code}
> When a failover occurs, suppose the new writer cannot contact JN3 because of 
> a network partition. It finishes the recovery stage and writes a new txid 4 
> in epoch 2, whose value is not equal to JN3's.
>  
> {code:java}
> epoch 2
> JN1 [1-3](finalized) [4-4](in-progress)
> JN2 [1-3](finalized) [4-4](in-progress)
> JN3 [1-4](in-progress)
> Note that on JN3, the value of txid 4 is not equal to that on the other JNs.
> {code}
>  
> Now a reading namenode pulls edits. It contacts JN3 and JN2 and gets a 
> majority response, but the logs it receives have the same length and 
> different content, and there is no further information to decide which log 
> is right. If we choose JN3's, we get metadata corruption.
> There is a test example patch [^example.patch] for running and debugging.
> To fix it, I think we should add the finalized state to 
> {{{}GetJournaledEditsResponseProto{}}}, so we can discard the faulty log.
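
The ambiguity in the epoch 2 example can be reproduced with a toy model. It assumes each JournalNode response is just an ordered list of edits starting at txid 1; `Response` and `sameLengthDifferentContent` are hypothetical names and the edit payloads are made-up labels, not real edit log data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the reader's view in the epoch 2 example above.
final class QuorumReadAmbiguity {

  // A response from one JournalNode: its name and the edits it returned, in txid order.
  static final class Response {
    final String node;
    final List<String> edits;
    Response(String node, List<String> edits) { this.node = node; this.edits = edits; }
  }

  // True when two responses have the same number of edits but disagree on content.
  static boolean sameLengthDifferentContent(Response a, Response b) {
    return a.edits.size() == b.edits.size() && !a.edits.equals(b.edits);
  }

  public static void main(String[] args) {
    // JN2 holds txid 4 as written in epoch 2; JN3 still holds the uncommitted epoch 1 value.
    Response jn2 = new Response("JN2", Arrays.asList("e1", "e2", "e3", "e4-epoch2"));
    Response jn3 = new Response("JN3", Arrays.asList("e1", "e2", "e3", "e4-epoch1"));
    System.out.println("ambiguous: " + sameLengthDifferentContent(jn2, jn3));
  }
}
```

A transaction count alone cannot separate the two responses, which is why the description suggests carrying extra state (the finalized flag) in the response.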



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Resolved] (HDFS-16181) [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when tailEditLog form JN

2021-10-04 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16181.

Fix Version/s: 3.1.5
   3.2.4
   3.3.2
   2.10.2
   3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Thank you [~jianghuazhu]! This is my mistake. I just updated the JIRA status.

> [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when 
> tailEditLog form JN
> -
>
> Key: HDFS-16181
> URL: https://issues.apache.org/jira/browse/HDFS-16181
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4, 3.1.5
>
> Attachments: after.jpg, before.jpg
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I found that with the edit cache turned on in the JN, the
> rpcRequestCacheMissAmount metric is not displayed.



[jira] [Updated] (HDFS-16233) Do not use exception handler to implement copy-on-write for EnumCounters

2021-09-24 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16233:
---
Fix Version/s: 3.3.2
   3.2.3
   2.10.2
   3.4.0
   3.1.5

> Do not use exception handler to implement copy-on-write for EnumCounters
> 
>
> Key: HDFS-16233
> URL: https://issues.apache.org/jira/browse/HDFS-16233
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.2, 3.2.3, 3.3.2, 3.1.5
>
> Attachments: Screen Shot 2021-09-22 at 1.59.59 PM.png, 
> profile_c7_delete_asyncaudit.html
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> HDFS-14547 saves the NameNode heap space occupied by EnumCounters by 
> essentially implementing a copy-on-write strategy.
> At the beginning, all EnumCounters refer to the same ConstEnumCounters to
> save heap space. When one is modified, an exception is thrown and the
> exception handler converts the ConstEnumCounters to an EnumCounters object
> and updates it.
> Using an exception handler for anything more than occasional events is bad
> for performance.
> Proposal: use the instanceof keyword to detect the type of the object and do
> COW accordingly.
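
A minimal sketch of the instanceof-based copy-on-write idea described above. The classes here are simplified stand-ins invented for illustration (`Counters`, `ConstCounters`, `CounterHolder`), not the actual Hadoop EnumCounters code; the point is only that the type check replaces the thrown exception on the write path.

```java
import java.util.Arrays;

// Simplified stand-in for a mutable counter set.
class Counters {
  protected final long[] values;

  Counters(long[] values) {
    this.values = values;
  }

  long get(int i) {
    return values[i];
  }

  void add(int i, long delta) {
    values[i] += delta;
  }
}

// Shared immutable instance that many holders can point at to save heap.
final class ConstCounters extends Counters {
  ConstCounters(long[] values) {
    super(values);
  }

  @Override
  void add(int i, long delta) {
    throw new UnsupportedOperationException("shared instance is immutable");
  }

  Counters mutableCopy() {
    return new Counters(Arrays.copyOf(values, values.length));
  }
}

final class CounterHolder {
  private Counters counters;

  CounterHolder(Counters shared) {
    this.counters = shared;
  }

  void add(int i, long delta) {
    // Copy-on-write: detect the shared constant by type instead of
    // attempting the write and catching the resulting exception.
    if (counters instanceof ConstCounters) {
      counters = ((ConstCounters) counters).mutableCopy();
    }
    counters.add(i, delta);
  }

  long get(int i) {
    return counters.get(i);
  }
}
```

Two holders created from the same `ConstCounters` share it until one of them writes; the writer then gets a private copy and the other holder is unaffected.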



[jira] [Resolved] (HDFS-16233) Do not use exception handler to implement copy-on-write for EnumCounters

2021-09-24 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16233.

Resolution: Fixed




[jira] [Commented] (HDFS-16233) Do not use exception handler to implement copy-on-write for EnumCounters

2021-09-24 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419878#comment-17419878
 ] 

Erik Krogen commented on HDFS-16233:


Merged to {{trunk}} and backported to branch-3.3, branch-3.2, branch-3.1, 
branch-2.10

Thanks [~weichiu]!

> Do not use exception handler to implement copy-on-write for EnumCounters
> 
>
> Key: HDFS-16233
> URL: https://issues.apache.org/jira/browse/HDFS-16233
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screen Shot 2021-09-22 at 1.59.59 PM.png, 
> profile_c7_delete_asyncaudit.html
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> HDFS-14547 saves the NameNode heap space occupied by EnumCounters by 
> essentially implementing a copy-on-write strategy.
> At the beginning, all EnumCounters refer to the same ConstEnumCounters to
> save heap space. When one is modified, an exception is thrown and the
> exception handler converts the ConstEnumCounters to an EnumCounters object
> and updates it.
> Using an exception handler for anything more than occasional events is bad
> for performance.
> Proposal: use the instanceof keyword to detect the type of the object and do
> COW accordingly.



[jira] [Commented] (HDFS-13150) [Edit Tail Fast Path] Allow SbNN to tail in-progress edits from JN via RPC

2021-09-15 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415662#comment-17415662
 ] 

Erik Krogen commented on HDFS-13150:


[~liutongwei] thanks for sharing your concern!

I don't quite remember how epochs interplay with the durability or reuse of 
transaction IDs -- it's been quite a while since I've looked at this area of 
the code. Unfortunately I'm also not actively working on HDFS currently. I took 
a brief look around the JN code in this area to refresh my memory, but I'm 
still missing some details and don't have the time to invest in properly 
understanding your concern.

[~shv], do you have any insight on the concern above?

> [Edit Tail Fast Path] Allow SbNN to tail in-progress edits from JN via RPC
> --
>
> Key: HDFS-13150
> URL: https://issues.apache.org/jira/browse/HDFS-13150
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, hdfs, journal-node, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: edit-tailing-fast-path-design-v0.pdf, 
> edit-tailing-fast-path-design-v1.pdf, edit-tailing-fast-path-design-v2.pdf
>
>
> In the interest of making coordinated/consistent reads easier to complete 
> with low latency, it is advantageous to reduce the time between when a 
> transaction is applied on the ANN and when it is applied on the SbNN. We 
> propose adding a new "fast path" which can be used to tail edits when low 
> latency is desired. We leave the existing tailing logic in place, and fall 
> back to this path on startup, recovery, and when the fast path encounters 
> unrecoverable errors.



[jira] [Commented] (HDFS-16181) [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when tailEditLog form JN

2021-09-15 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415645#comment-17415645
 ] 

Erik Krogen commented on HDFS-16181:


Committed to trunk via PR #3317 and backported to branch-3.3, branch-3.2, 
branch-3.1, branch-2.10

Thanks [~wangzhaohui]!

> [SBN Read] Fix metric of RpcRequestCacheMissAmount can't display when 
> tailEditLog form JN
> -
>
> Key: HDFS-16181
> URL: https://issues.apache.org/jira/browse/HDFS-16181
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Critical
>  Labels: pull-request-available
> Attachments: after.jpg, before.jpg
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I found that with the edit cache turned on in the JN, the
> rpcRequestCacheMissAmount metric is not displayed.



[jira] [Commented] (HDFS-15808) Add metrics for FSNamesystem read/write lock hold long time

2021-02-19 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17287156#comment-17287156
 ] 

Erik Krogen commented on HDFS-15808:


I agree with [~tomscut] here, as long as the metric is marked as a {{COUNTER}}, 
indicating to the metrics system that it is a monotonically increasing counter. 
For example 
[inGraphs|https://engineering.linkedin.com/blog/2017/08/ingraphs--monitoring-and-unexpected-artwork]
 will automatically turn counter-type metrics into delta graphs. It looks like 
the current patch doesn't set the type, meaning it uses {{Type.DEFAULT}}, which 
AFAICT will end up as a {{GAUGE}} (ref 
[here|https://github.com/apache/hadoop/blob/1e3a6efcef2924a7966c44ca63476c853956691d/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/lib/MethodMetric.java#L63]).
 So I think we need to adjust the {{type}} value on the {{@Metric}} annotation.
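
To illustrate the point about the annotation type, here is a self-contained sketch. The `@Metric` annotation below is a reduced stand-in written for this example, not Hadoop's own annotation class, and the DEFAULT-to-GAUGE fallback in `resolve` just mirrors my reading above rather than the exact metrics2 logic. `LockMetrics` and its method names are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Stand-in for a metrics annotation, reduced to the type attribute.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Metric {
  enum Type { DEFAULT, COUNTER, GAUGE }

  Type type() default Type.DEFAULT;
}

// Hypothetical metrics source with one untyped and one typed metric.
class LockMetrics {
  @Metric
  public long getUntypedLongHoldCount() {
    return 0L;
  }

  @Metric(type = Metric.Type.COUNTER)
  public long getTypedLongHoldCount() {
    return 0L;
  }
}

final class MetricTypeResolver {
  // Assumed behavior for this sketch: an unset type falls back to GAUGE.
  static Metric.Type resolve(Class<?> source, String methodName) {
    try {
      Method m = source.getDeclaredMethod(methodName);
      Metric.Type declared = m.getAnnotation(Metric.class).type();
      return declared == Metric.Type.DEFAULT ? Metric.Type.GAUGE : declared;
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException("no such metric method: " + methodName, e);
    }
  }
}
```

Under that assumption, only the method that sets `type = COUNTER` explicitly is reported as a counter; the untyped one ends up as a gauge.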

> Add metrics for FSNamesystem read/write lock hold long time
> ---
>
> Key: HDFS-15808
> URL: https://issues.apache.org/jira/browse/HDFS-15808
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: hdfs, lock, metrics, pull-request-available
> Attachments: lockLongHoldCount
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> To monitor how often read/write locks exceed thresholds, we can add two 
> metrics (ReadLockWarning/WriteLockWarning), which are exposed in JMX.



[jira] [Commented] (HDFS-15808) Add metrics for FSNamesystem read/write lock warnings

2021-02-10 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282570#comment-17282570
 ] 

Erik Krogen commented on HDFS-15808:


Hi [~tomscut], the proposal looks pretty reasonable to me from a quick glance, 
but I'm not working in this area these days so can't allocate time at the 
moment. cc some folks that are hopefully more currently relevant: [~shv] 
[~daryn] [~vinayakumarb] [~kihwal]

One thing I'll say is that I think "LockWarning" is not a great name, maybe 
"LockLongHoldTime" or something similar to be more clear about what it is 
measuring.

> Add metrics for FSNamesystem read/write lock warnings
> -
>
> Key: HDFS-15808
> URL: https://issues.apache.org/jira/browse/HDFS-15808
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: hdfs, lock, metrics, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> To monitor how often read/write locks exceed thresholds, we can add two 
> metrics (ReadLockWarning/WriteLockWarning), which are exposed in JMX.



[jira] [Assigned] (HDFS-10873) Add histograms for FSNamesystemLock Metrics

2021-02-10 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen reassigned HDFS-10873:
--

Assignee: (was: Erik Krogen)

> Add histograms for FSNamesystemLock Metrics
> ---
>
> Key: HDFS-10873
> URL: https://issues.apache.org/jira/browse/HDFS-10873
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Reporter: Erik Krogen
>Priority: Major
>
> It could be useful to have full histograms for how long operations are 
> holding the namesystem lock in addition to just rate information. This will, 
> however, be a large number of emitted metrics and likely require more 
> coordination, so it may be best to have this be separately configurable from 
> more simple namesystem lock metrics. 



[jira] [Commented] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-03 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278150#comment-17278150
 ] 

Erik Krogen commented on HDFS-13609:


Thanks for the detailed explanation and example [~xuzq_zander]! I see now the 
issue. I believe the logic in the snippet you shared is doing the right thing 
-- and is necessary for correctness. I guess there are two issues being 
discussed here:

# Why JN1 is lagging: You're saying this is happening because JN1 wrote some 
txns to its cache, but not onto disk. Can you elaborate on why this causes it 
to lag? It's been a long time since I looked at this code. Regardless, I agree 
that this is definitely a bug, and we should be doing whatever is necessary to 
keep the cache reflective of what eventually got written to disk. I don't know 
if the right approach is to write to the cache after the disk (this may cause 
performance issues?) or to invalidate the cache if the disk write fails.
# Why JN1 lagging is causing broader issues: We use 
{{loggers.waitForWriteQuorum}} which returns as soon as it gets a quorum of 
responses, but in some cases we actually want to wait a bit longer to get more
responses, for example in the case you described where JN1 keeps responding 
with a low txid. I actually have some memory of discussing this back in the 
implementation days with [~shv] and [~vagarychen] but don't remember the 
conclusion -- I think we were waiting to see if this became an issue in 
practice.

It sounds like (1) is a legitimate bug, and (2) is kind of a bug and kind of a 
performance/reliability enhancement. Unfortunately I'm no longer working in 
this area so I can't spend much time beyond providing high-level input, but 
these both sound like good areas for improvement. Perhaps [~shv] can provide
input on whether we're seeing similar issues on our end and whether or not he 
remembers any discussions on these matters.

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targeted at the NameNode-side changes to enable tailing
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



[jira] [Commented] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277250#comment-17277250
 ] 

Erik Krogen commented on HDFS-13609:


Hi [~xuzq_zander], thanks for taking a look.

{quote}
when onlyDurableTxns is false, maxAllowedTxns = responseCounts.get(0)
{quote}
Correct me if I'm wrong but I think you have this backwards. If 
{{onlyDurableTxns}} is false, then {{maxAllowedTxns = highestTxnCount}} which 
is {{responseCounts.get(2)}}

It is when {{onlyDurableTxns}} is true that you get {{responseCounts.get(0)}}. 
In this case, we really do need to take the lowest of the returned values. 
Since we only got 3 responses, we can't make any assumptions about the other 2 
JNs, so just assume they have 0 txns. We only want to take txns that have 
landed on a quorum of JNs (thus becoming durable). Thus since we only got 3 
responses, we have to take the lowest txn that any of those responses are aware 
of. For example if we got back {{(5, 10, 20)}}, then only txns 1-5 are 
available on all 3 JNs we got responses from, so those are the only 
transactions we know are durable. Of course more _might_ be durable if they 
were persisted on the two JNs we didn't get responses from, but we don't know 
that.
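
The selection rule described here can be sketched roughly as follows. This is a simplified illustration and not the QuorumJournalManager code; `TxnCountSelector` and the `majoritySize` parameter are names I made up for the example. It assumes at least `majoritySize` responses were received.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TxnCountSelector {
  // Returns how many txns the reader may consume, given the txn counts
  // reported by the JNs that responded.
  static long maxAllowedTxns(List<Long> responseCounts, int majoritySize,
      boolean onlyDurableTxns) {
    if (majoritySize < 1 || responseCounts.size() < majoritySize) {
      throw new IllegalArgumentException("need at least a quorum of responses");
    }
    List<Long> sorted = new ArrayList<>(responseCounts);
    Collections.sort(sorted);
    if (!onlyDurableTxns) {
      // Any txn seen on some JN is acceptable: take the highest count.
      return sorted.get(sorted.size() - 1);
    }
    // Durable means present on a quorum: take the largest count that at
    // least majoritySize of the responding JNs have reached.
    return sorted.get(sorted.size() - majoritySize);
  }
}
```

With responses {{(5, 10, 20)}} and a majority size of 3, the durable answer is 5 and the non-durable answer is 20, matching the example above.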

Let me know if that clears things up.




[jira] [Commented] (HDFS-14084) Need for more stats in DFSClient

2020-09-28 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203485#comment-17203485
 ] 

Erik Krogen commented on HDFS-14084:


Hey [~sodonnell], thanks for the ping. Internally we decided to go with a 
different approach so I never got around to wrapping this up. Unfortunately I'm 
not working on HDFS these days and don't have the bandwidth to devote to it 
beyond maybe some simple reviews. I've unassigned this ticket from myself.

> Need for more stats in DFSClient
> 
>
> Key: HDFS-14084
> URL: https://issues.apache.org/jira/browse/HDFS-14084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Pranay Singh
>Priority: Minor
> Attachments: HDFS-14084.001.patch, HDFS-14084.002.patch, 
> HDFS-14084.003.patch, HDFS-14084.004.patch, HDFS-14084.005.patch, 
> HDFS-14084.006.patch, HDFS-14084.007.patch, HDFS-14084.008.patch, 
> HDFS-14084.009.patch, HDFS-14084.010.patch, HDFS-14084.011.patch, 
> HDFS-14084.012.patch, HDFS-14084.013.patch, HDFS-14084.014.patch, 
> HDFS-14084.015.patch, HDFS-14084.016.patch, HDFS-14084.017.patch, 
> HDFS-14084.018.patch
>
>
> The usage of HDFS has changed from being a map-reduce filesystem to
> becoming more of a general-purpose filesystem. In most cases the issues are
> with the Namenode, so we have metrics to know the workload or stress on the
> Namenode.
> However, there is a need to collect more statistics for different
> operations/RPCs in DFSClient, to know which RPC operations take longer or
> how frequent each operation is. These statistics can be exposed to the users
> of DFS Client, who can periodically log them or do some sort of flow control
> if the response is slow. This will also help to isolate HDFS issues in a
> mixed environment where, say, Spark, HBase and Impala run together on one
> node. We can check the throughput of different operations across clients and
> isolate problems caused by a noisy neighbor, network congestion or a shared
> JVM.
> We have dealt with several problems from the field for which there is no
> conclusive evidence as to what caused the problem. If we had metrics or stats
> in DFSClient we would be better equipped to solve such complex problems.
> List of jiras for reference:
> -
>  HADOOP-15538 HADOOP-15530 ( client side deadlock)



[jira] [Assigned] (HDFS-14084) Need for more stats in DFSClient

2020-09-28 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen reassigned HDFS-14084:
--

Assignee: (was: Erik Krogen)




[jira] [Commented] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally

2020-05-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113269#comment-17113269
 ] 

Erik Krogen commented on HDFS-15368:


cc [~vagarychen] and [~shv]

> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
> 
>
> Key: HDFS-15368
> URL: https://issues.apache.org/jira/browse/HDFS-15368
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: balancer, test
> Attachments: HDFS-15368.001.patch
>
>
> While working on HDFS-13183, I found that
> TestBalancerWithHANameNodes#testBalancerWithObserver fails occasionally
> because of the following code segment. Consider a cluster with 1 ANN + 1 SBN
> + 2 ONN: when getBlocks is invoked with the Observer Read feature enabled,
> the request can go to either of the two ObserverNNs, based on my observation.
> So verifying only the first ObserverNN and checking its #getBlocks invocation
> count does not always give the expected result.
> {code:java}
>   for (int i = 0; i < cluster.getNumNameNodes(); i++) {
>     // First observer node is at idx 2, or 3 if 2 has been shut down
>     // It should get both getBlocks calls, all other NNs should see 0 calls
>     int expectedObserverIdx = withObserverFailure ? 3 : 2;
>     int expectedCount = (i == expectedObserverIdx) ? 2 : 0;
>     verify(namesystemSpies.get(i), times(expectedCount))
>         .getBlocks(any(), anyLong(), anyLong());
>   }
> {code}
> cc [~xkrogen], [~weichiu]. I am not very familiar with the Observer Read
> feature; could you give some suggestions?



[jira] [Commented] (HDFS-13904) ContentSummary does not always respect processing limit, resulting in long lock acquisitions

2020-05-06 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100870#comment-17100870
 ] 

Erik Krogen commented on HDFS-13904:


Looks like I must have confused {{getReadHoldCount()}} and 
{{getReadLockCount()}}, good catch [~umamaheswararao]. It would appear that you 
are right that the check is indeed valid. In that case I am out of ideas for 
what went wrong.
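
For reference, the difference between the two methods can be seen on a plain {{java.util.concurrent.locks.ReentrantReadWriteLock}}. The sketch below is standalone and assumes the NameNode lock wrappers delegate to these JDK methods, which I have not re-verified; `ReadCountDemo` is just a name for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ReadCountDemo {
  // Returns {hold count of the calling thread, total read lock count}
  // sampled while a second thread also holds the read lock.
  static int[] countsWithSecondReader() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    CountDownLatch otherHasLock = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    Thread other = new Thread(() -> {
      lock.readLock().lock();
      try {
        otherHasLock.countDown();
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        lock.readLock().unlock();
      }
    });
    other.start();

    lock.readLock().lock();
    try {
      otherHasLock.await();
      // getReadHoldCount() counts reentrant holds by this thread only;
      // getReadLockCount() counts read holds across all threads.
      return new int[] {lock.getReadHoldCount(), lock.getReadLockCount()};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      lock.readLock().unlock();
      release.countDown();
      try {
        other.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```

With two readers the calling thread's hold count is still 1 while the total read lock count is 2, so a check of the form {{getReadHoldCount() != 1}} only trips on reentrant holds by the same thread, not on other concurrent readers.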

FYI [~shv] [~vagarychen] since you recently looked into some related issues.

> ContentSummary does not always respect processing limit, resulting in long 
> lock acquisitions
> 
>
> Key: HDFS-13904
> URL: https://issues.apache.org/jira/browse/HDFS-13904
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> HDFS-4995 added a config {{dfs.content-summary.limit}} which allows for an 
> administrator to set a limit on the number of entries processed during a 
> single acquisition of the {{FSNamesystemLock}} during the creation of a 
> content summary. This is useful to prevent very long (multiple seconds) 
> pauses on the NameNode when {{getContentSummary}} is called on large 
> directories.
> However, even on versions with HDFS-4995, we have seen warnings like:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem read 
> lock held for 9398 ms via
> java.lang.Thread.getStackTrace(Thread.java:1552)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:188)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1486)
> org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:109)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:679)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:642)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:656)
> {code}
> happen quite consistently when {{getContentSummary}} was called on a large 
> directory on a heavily-loaded NameNode. Such long pauses completely destroy 
> the performance of the NameNode. We have the limit set to its default of 
> 5000; if it was respected, clearly there would not be a 10-second pause.
> The current {{yield()}} code within {{ContentSummaryComputationContext}} 
> looks like:
> {code}
>   public boolean yield() {
>     // Are we set up to do this?
>     if (limitPerRun <= 0 || dir == null || fsn == null) {
>       return false;
>     }
>     // Have we reached the limit?
>     long currentCount = counts.getFileCount() +
>         counts.getSymlinkCount() +
>         counts.getDirectoryCount() +
>         counts.getSnapshotableDirectoryCount();
>     if (currentCount <= nextCountLimit) {
>       return false;
>     }
>     // Update the next limit
>     nextCountLimit = currentCount + limitPerRun;
>     boolean hadDirReadLock = dir.hasReadLock();
>     boolean hadDirWriteLock = dir.hasWriteLock();
>     boolean hadFsnReadLock = fsn.hasReadLock();
>     boolean hadFsnWriteLock = fsn.hasWriteLock();
>     // sanity check.
>     if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
>         hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
>         fsn.getReadHoldCount() != 1) {
>       // cannot relinquish
>       return false;
>     }
>     // unlock
>     dir.readUnlock();
>     fsn.readUnlock("contentSummary");
>     try {
>       Thread.sleep(sleepMilliSec, sleepNanoSec);
>     } catch (InterruptedException ie) {
>     } finally {
>       // reacquire
>       fsn.readLock();
>       dir.readLock();
>     }
>     yieldCount++;
>     return true;
>   }
> {code}
> We believe that this check in particular is the culprit:
> {code}
> if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
>     hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
>     fsn.getReadHoldCount() != 1) {
>   // cannot relinquish
>   return false;
> }
> {code}
> The content summary computation will only relinquish the lock if it is 
> currently the _only_ holder of the lock. Given the high volume of read 
> requests on a heavily loaded NameNode, especially when unfair locking is 
> enabled, it is likely there may be another holder of the read lock performing 
> some short-lived operation. By refusing to give up the lock in this case, the 
> content summary computation ends up never relinquishing the lock.
> We propose to simply remove the readHoldCount checks from this {{yield()}}. 
> This should alleviate the ca

[jira] [Commented] (HDFS-13904) ContentSummary does not always respect processing limit, resulting in long lock acquisitions

2020-05-04 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099220#comment-17099220
 ] 

Erik Krogen commented on HDFS-13904:


Hi [~umamaheswararao], I'm not actively working on this. I don't believe we 
applied any fix to the NN; instead we focused on migrating users to the 
{{getQuotaUsage()}} API, since it was quota checks which caused the really 
large issues.

Yes, the NN had consistent load throughout (besides some minor blips around 
restarts of course). It indeed was interesting to see the difference across 
restarts. I don't have any good ideas there.

GC pauses were low and consistent with normal behavior.

> ContentSummary does not always respect processing limit, resulting in long 
> lock acquisitions
> 
>
> Key: HDFS-13904
> URL: https://issues.apache.org/jira/browse/HDFS-13904
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> HDFS-4995 added a config {{dfs.content-summary.limit}} which allows for an 
> administrator to set a limit on the number of entries processed during a 
> single acquisition of the {{FSNamesystemLock}} during the creation of a 
> content summary. This is useful to prevent very long (multiple seconds) 
> pauses on the NameNode when {{getContentSummary}} is called on large 
> directories.
> However, even on versions with HDFS-4995, we have seen warnings like:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem read 
> lock held for 9398 ms via
> java.lang.Thread.getStackTrace(Thread.java:1552)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:188)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1486)
> org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:109)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:679)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:642)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:656)
> {code}
> happen quite consistently when {{getContentSummary}} was called on a large 
> directory on a heavily-loaded NameNode. Such long pauses completely destroy 
> the performance of the NameNode. We have the limit set to its default of 
> 5000; if it were respected, clearly there would not be a 10-second pause.
> The current {{yield()}} code within {{ContentSummaryComputationContext}} 
> looks like:
> {code}
>   public boolean yield() {
>     // Are we set up to do this?
>     if (limitPerRun <= 0 || dir == null || fsn == null) {
>       return false;
>     }
>     // Have we reached the limit?
>     long currentCount = counts.getFileCount() +
>         counts.getSymlinkCount() +
>         counts.getDirectoryCount() +
>         counts.getSnapshotableDirectoryCount();
>     if (currentCount <= nextCountLimit) {
>       return false;
>     }
>     // Update the next limit
>     nextCountLimit = currentCount + limitPerRun;
>     boolean hadDirReadLock = dir.hasReadLock();
>     boolean hadDirWriteLock = dir.hasWriteLock();
>     boolean hadFsnReadLock = fsn.hasReadLock();
>     boolean hadFsnWriteLock = fsn.hasWriteLock();
>     // sanity check.
>     if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
>         hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
>         fsn.getReadHoldCount() != 1) {
>       // cannot relinquish
>       return false;
>     }
>     // unlock
>     dir.readUnlock();
>     fsn.readUnlock("contentSummary");
>     try {
>       Thread.sleep(sleepMilliSec, sleepNanoSec);
>     } catch (InterruptedException ie) {
>     } finally {
>       // reacquire
>       fsn.readLock();
>       dir.readLock();
>     }
>     yieldCount++;
>     return true;
>   }
> {code}
> We believe that this check in particular is the culprit:
> {code}
> if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
>     hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
>     fsn.getReadHoldCount() != 1) {
>   // cannot relinquish
>   return false;
> }
> {code}
> The content summary computation will only relinquish the lock if it is 
> currently the _only_ holder of the lock. Given the high volume of read 
> requests on a heavily loaded NameNode, especially when unfair locking is 
> enabled, it is likely there may be another holder of the read lock performing 
> some short-lived operation. By refusing to give up the lock in this case,
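
The sanity check quoted above relies on {{getReadHoldCount()}}. For reference, the following minimal sketch uses the JDK's {{ReentrantReadWriteLock}} directly to show what that counter and {{getReadLockCount()}} report while two threads hold the read lock at once. It is an illustration only: the names {{ReadHoldCountDemo}} and {{observeCounts}} are made up here, and whether {{FSNamesystemLock}} and the directory lock delegate to these exact JDK methods is an assumption that this thread does not confirm.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadHoldCountDemo {

    /**
     * Returns {read holds by the calling thread, read holds across all threads},
     * sampled while a second thread also holds the read lock.
     */
    public static int[] observeCounts() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch otherHolds = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // A second reader that keeps the read lock until told to release it.
        Thread other = new Thread(() -> {
            lock.readLock().lock();
            try {
                otherHolds.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        other.start();

        try {
            otherHolds.await();
            lock.readLock().lock();
            try {
                return new int[] {lock.getReadHoldCount(), lock.getReadLockCount()};
            } finally {
                lock.readLock().unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Let the second reader finish and wait for it.
            release.countDown();
            try {
                other.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        int[] counts = observeCounts();
        System.out.println("holds by this thread: " + counts[0]);
        System.out.println("read holds across threads: " + counts[1]);
    }
}
```

Per the JDK documentation, {{getReadHoldCount()}} counts reentrant holds by the calling thread, while {{getReadLockCount()}} counts read holds across all threads. Which of the two the check relies on is worth keeping in mind when weighing the hypothesis in the description.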

[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing

2020-05-04 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099021#comment-17099021
 ] 

Erik Krogen commented on HDFS-15323:


+1 pretty simple fix, LGTM.

> StandbyNode fails transition to active due to insufficient transaction tailing
> --
>
> Key: HDFS-15323
> URL: https://issues.apache.org/jira/browse/HDFS-15323
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.7.7
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15323-branch-2.10.002.patch, 
> HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch, HDFS-15323.002.patch
>
>
> StandbyNode is asked to {{transitionToActive()}}. If it has fallen too far 
> behind in tailing journal transactions (from QJM), it can crash with 
> {{IllegalStateException}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId

2020-03-23 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065103#comment-17065103
 ] 

Erik Krogen commented on HDFS-14442:


Apologies, I was on leave when pinged earlier. Thanks a lot for fixing this up 
and getting it committed!

> Disagreement between HAUtil.getAddressOfActive and 
> RpcInvocationHandler.getConnectionId
> ---
>
> Key: HDFS-14442
> URL: https://issues.apache.org/jira/browse/HDFS-14442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Assignee: Ravuri Sushma sree
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, 
> HDFS-14442.003.patch, HDFS-14442.004.patch
>
>
> While working on HDFS-14245, we noticed a discrepancy in some proxy-handling 
> code.
> The description of {{RpcInvocationHandler.getConnectionId()}} states:
> {code}
>   /**
>    * Returns the connection id associated with the InvocationHandler instance.
>    * @return ConnectionId
>    */
>   ConnectionId getConnectionId();
> {code}
> It does not make any claims about whether this connection ID will be an 
> active proxy or not. Yet in {{HAUtil}} we have:
> {code}
>   /**
>    * Get the internet address of the currently-active NN. This should rarely be
>    * used, since callers of this method who connect directly to the NN using the
>    * resulting InetSocketAddress will not be able to connect to the active NN if
>    * a failover were to occur after this method has been called.
>    * 
>    * @param fs the file system to get the active address of.
>    * @return the internet address of the currently-active NN.
>    * @throws IOException if an error occurs while resolving the active NN.
>    */
>   public static InetSocketAddress getAddressOfActive(FileSystem fs)
>       throws IOException {
>     if (!(fs instanceof DistributedFileSystem)) {
>       throw new IllegalArgumentException("FileSystem " + fs + " is not a DFS.");
>     }
>     // force client address resolution.
>     fs.exists(new Path("/"));
>     DistributedFileSystem dfs = (DistributedFileSystem) fs;
>     DFSClient dfsClient = dfs.getClient();
>     return RPC.getServerAddress(dfsClient.getNamenode());
>   }
> {code}
> Where the call {{RPC.getServerAddress()}} eventually terminates into 
> {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> 
> {{RPC.getConnectionIdForProxy()}} -> 
> {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making 
> an incorrect assumption that {{RpcInvocationHandler}} will necessarily return 
> an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a 
> counter-example to this, since the current connection ID may be pointing at, 
> for example, an Observer NameNode.
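
The mismatch described above can be reproduced in miniature without any Hadoop classes. The sketch below is a simplified model, not the Hadoop implementation: {{Node}}, {{Handler}}, {{currentAddress()}}, {{activeAddress()}} and the host names are all hypothetical stand-ins chosen for illustration. It only shows why "the address of the handler's current connection" and "the address of the active node" can differ once reads are routed to an observer.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;

public class ActiveAddressSketch {

    public enum State { ACTIVE, STANDBY, OBSERVER }

    /** Hypothetical stand-in for one NameNode endpoint. */
    public static final class Node {
        final InetSocketAddress address;
        final State state;

        Node(String host, int port, State state) {
            // Unresolved on purpose: no DNS lookup is needed for the illustration.
            this.address = InetSocketAddress.createUnresolved(host, port);
            this.state = state;
        }
    }

    /** Hypothetical handler that remembers whichever node served the last call. */
    public static final class Handler {
        private final List<Node> nodes;
        private Node current;

        Handler(List<Node> nodes) {
            this.nodes = nodes;
            this.current = nodes.get(0);
        }

        /** Routes a read to the first observer, if one exists. */
        public void read() {
            for (Node n : nodes) {
                if (n.state == State.OBSERVER) {
                    current = n;
                    return;
                }
            }
        }

        /** Plays the role of getConnectionId(): makes no promise about state. */
        public InetSocketAddress currentAddress() {
            return current.address;
        }

        /** What a caller that needs the active node would have to ask for. */
        public InetSocketAddress activeAddress() {
            for (Node n : nodes) {
                if (n.state == State.ACTIVE) {
                    return n.address;
                }
            }
            throw new IllegalStateException("no active node known");
        }
    }

    /** Builds a three-node example: active, standby, observer. */
    public static Handler demoHandler() {
        return new Handler(Arrays.asList(
            new Node("nn-active", 8020, State.ACTIVE),
            new Node("nn-standby", 8020, State.STANDBY),
            new Node("nn-observer", 8020, State.OBSERVER)));
    }
}
```

After a single {{read()}}, {{currentAddress()}} in this model points at the observer while {{activeAddress()}} still points at the active node, which mirrors the assumption the description says {{HAUtil}} appears to make.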






[jira] [Commented] (HDFS-14519) NameQuota is not update after concat operation, so namequota is wrong

2019-12-17 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998387#comment-16998387
 ] 

Erik Krogen commented on HDFS-14519:


Awesome, thanks [~ayushtkn]!

> NameQuota is not update after concat operation, so namequota is wrong
> -
>
> Key: HDFS-14519
> URL: https://issues.apache.org/jira/browse/HDFS-14519
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-14519-branch-2.10-003.patch, 
> HDFS-14519-branch-2.10-03.patch, HDFS-14519.001.patch, HDFS-14519.002.patch, 
> HDFS-14519.003.patch
>
>







[jira] [Commented] (HDFS-14519) NameQuota is not update after concat operation, so namequota is wrong

2019-12-12 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994870#comment-16994870
 ] 

Erik Krogen commented on HDFS-14519:


Thanks for the update :) I am +1 on the backport as long as the Jenkins report 
comes back clean.

> NameQuota is not update after concat operation, so namequota is wrong
> -
>
> Key: HDFS-14519
> URL: https://issues.apache.org/jira/browse/HDFS-14519
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14519-branch-2.10-003.patch, 
> HDFS-14519-branch-2.10-03.patch, HDFS-14519.001.patch, HDFS-14519.002.patch, 
> HDFS-14519.003.patch
>
>







[jira] [Commented] (HDFS-14519) NameQuota is not update after concat operation, so namequota is wrong

2019-12-12 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994820#comment-16994820
 ] 

Erik Krogen commented on HDFS-14519:


Hi [~ayushtkn], it looks like the {{branch-2.10-003}} patch you uploaded still 
has {{Assert.assertEquals}} so it is failing to compile. Did you attach the 
wrong file perhaps?

> NameQuota is not update after concat operation, so namequota is wrong
> -
>
> Key: HDFS-14519
> URL: https://issues.apache.org/jira/browse/HDFS-14519
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14519-branch-2.10-003.patch, HDFS-14519.001.patch, 
> HDFS-14519.002.patch, HDFS-14519.003.patch
>
>







[jira] [Updated] (HDFS-15044) [Dynamometer] Show the line of audit log when parsing it unsuccessfully

2019-12-12 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15044:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> [Dynamometer] Show the line of audit log when parsing it unsuccessfully
> ---
>
> Key: HDFS-15044
> URL: https://issues.apache.org/jira/browse/HDFS-15044
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: tools
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Updated] (HDFS-15044) [Dynamometer] Show the line of audit log when parsing it unsuccessfully

2019-12-12 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15044:
---
Fix Version/s: 3.3.0

> [Dynamometer] Show the line of audit log when parsing it unsuccessfully
> ---
>
> Key: HDFS-15044
> URL: https://issues.apache.org/jira/browse/HDFS-15044
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: tools
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Commented] (HDFS-15044) [Dynamometer] Show the line of audit log when parsing it unsuccessfully

2019-12-12 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994803#comment-16994803
 ] 

Erik Krogen commented on HDFS-15044:


Just committed this to trunk. Thanks for the contribution [~tasanuma]!

> [Dynamometer] Show the line of audit log when parsing it unsuccessfully
> ---
>
> Key: HDFS-15044
> URL: https://issues.apache.org/jira/browse/HDFS-15044
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: tools
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>







[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-11 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Just committed this to trunk ~ branch-2.10. Thanks for the reviews [~shv]!

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> HDFS-15032.005.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-11 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Fix Version/s: 2.10.1
   3.2.2
   3.1.4
   3.3.0

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> HDFS-15032.005.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-14519) NameQuota is not update after concat operation, so namequota is wrong

2019-12-11 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993718#comment-16993718
 ] 

Erik Krogen commented on HDFS-14519:


Hi [~ayushtkn], just want to check if you still plan to work on this backport?

> NameQuota is not update after concat operation, so namequota is wrong
> -
>
> Key: HDFS-14519
> URL: https://issues.apache.org/jira/browse/HDFS-14519
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14519.001.patch, HDFS-14519.002.patch, 
> HDFS-14519.003.patch
>
>







[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-10 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992999#comment-16992999
 ] 

Erik Krogen commented on HDFS-15032:


I'm not seeing the test failures locally and it looks like the findbugs report 
is not related to this. v5 should be ready for review [~shv]

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> HDFS-15032.005.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Comment Edited] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-10 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992719#comment-16992719
 ] 

Erik Krogen edited comment on HDFS-15032 at 12/10/19 4:35 PM:
--

Thanks for the info [~shv], good to know. I've removed the {{toString()}} stuff 
in v5.

After seeing the Jenkins failure, I experimented and found that the new test 
timed out half of the time (5 of 10 runs) when run on my machine, but it 
succeeded every time when I increased the timeout to 2 minutes. I think it just 
needs longer since there is more overhead involved with the failure handling.

To avoid spurious failures, I increased the timeout for the failure test to 3 
minutes, and for the non-failure observer test to 2 minutes.


was (Author: xkrogen):
Thanks for the info [~shv], good to know. I've removed the {{toString()}} stuff 
in v5.

After seeing the Jenkins failure, I experimented and found that the new test 
timing out half of the time (5 of 10 runs) when run on my machine, but it 
succeeded every time when I increased the timeout to 2 minutes. I think it just 
needs longer since there is more overhead involved with the failure handling.

To avoid spurious failures, I increased the timeout for the failure test to 3 
minutes, and for the non-failure observer test to 2 minutes.

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> HDFS-15032.005.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-10 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992719#comment-16992719
 ] 

Erik Krogen commented on HDFS-15032:


Thanks for the info [~shv], good to know. I've removed the {{toString()}} stuff 
in v5.

After seeing the Jenkins failure, I experimented and found that the new test 
timed out half of the time (5 of 10 runs) when run on my machine, but it 
succeeded every time when I increased the timeout to 2 minutes. I think it just 
needs longer since there is more overhead involved with the failure handling.

To avoid spurious failures, I increased the timeout for the failure test to 3 
minutes, and for the non-failure observer test to 2 minutes.

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> HDFS-15032.005.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-10 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: HDFS-15032.005.patch

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> HDFS-15032.005.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Issue Comment Deleted] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-09 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Comment: was deleted

(was: | (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  8s{color} 
| {color:red} HDFS-15032 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-15032 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28486/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.

)

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> debugger_with_tostring.png, debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-09 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991822#comment-16991822
 ] 

Erik Krogen commented on HDFS-15032:


It looks like Yetus was trying to pick up the image as a patch:
{code}
HDFS-15032 patch is being downloaded at Mon Dec  9 17:47:52 UTC 2019 from
  
https://issues.apache.org/jira/secure/attachment/12988361/debugger_without_tostring.png
 -> Downloaded
{code}
I'm re-attaching v3 as v004 to get Yetus to pick it up (hopefully).

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> debugger_with_tostring.png, debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-09 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: HDFS-15032.004.patch

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, HDFS-15032.004.patch, 
> debugger_with_tostring.png, debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-09 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991806#comment-16991806
 ] 

Erik Krogen commented on HDFS-15032:


Hey [~shv], that's a great question. In many cases, only a reference to the 
proxy is kept, so it is the direct {{toString}} method of the proxy that you 
see. For example, this is what a debugger stopped in 
{{ObserverReadProxyProvider}} looks like without this change:
 !debugger_without_tostring.png! 
You can see that the proxy (which is really a combined proxy) is reporting that 
it is a {{NameNodeProtocolTranslatorPB}}, because it is the {{toString()}} 
method of the first proxy which is being used. This was misleading to me when I 
was trying to investigate this issue, as it led me to believe a plain 
{{NameNodeProtocol}} was showing up where I expected a {{BalancerProtocol}}. 
However with the change, it is more obvious what is going on:
 !debugger_with_tostring.png! 

I see your concern about the performance, however. I've added a v003 patch 
which replaces the string comparison with a call to {{Method.equals()}}, which 
I confirmed internally only does a few reference equality checks:
{code}
public boolean equals(Object obj) {
    if (obj != null && obj instanceof Method) {
        Method other = (Method)obj;
        if ((getDeclaringClass() == other.getDeclaringClass())
            && (getName() == other.getName())) {
            if (!returnType.equals(other.getReturnType()))
                return false;
            return equalParamTypes(parameterTypes, other.parameterTypes);
        }
    }
    return false;
}
{code}
Let me know if that addresses your concerns. If you think it's too risky for 
performance, I'm fine with removing it.
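
For readers unfamiliar with the technique, here is a minimal, self-contained sketch of intercepting {{toString()}} in a {{java.lang.reflect.Proxy}} handler by comparing against a cached {{Method}} with {{Method.equals()}}. It is not the code from the patch: {{ToStringProxySketch}}, {{Greeter}}, {{Handler}} and {{wrap}} are hypothetical names used only for this illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class ToStringProxySketch {

    /** Hypothetical interface standing in for a protocol being proxied. */
    public interface Greeter {
        String greet(String name);
    }

    /** Handler that answers toString() itself and forwards everything else. */
    static final class Handler implements InvocationHandler {
        // Looked up once, so each invocation only pays for Method.equals().
        private static final Method TO_STRING;
        static {
            try {
                TO_STRING = Object.class.getMethod("toString");
            } catch (NoSuchMethodException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        private final Greeter target;
        private final String description;

        Handler(Greeter target, String description) {
            this.target = target;
            this.description = description;
        }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args)
                throws Throwable {
            // toString() on a proxy is dispatched here with Object's Method.
            if (TO_STRING.equals(method)) {
                return description;
            }
            return method.invoke(target, args);
        }
    }

    /** Wraps target in a proxy whose toString() returns description. */
    public static Greeter wrap(Greeter target, String description) {
        return (Greeter) Proxy.newProxyInstance(
            Greeter.class.getClassLoader(),
            new Class<?>[] {Greeter.class},
            new Handler(target, description));
    }

    public static void main(String[] args) {
        Greeter g = wrap(name -> "hello " + name, "CombinedProxy[demo]");
        System.out.println(g);
        System.out.println(g.greet("nn1"));
    }
}
```

Caching the {{Method}} in a static field is a design choice of this sketch; it keeps the per-call cost to the comparison itself rather than a reflective lookup.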

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-09 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: debugger_with_tostring.png

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-09 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: debugger_without_tostring.png

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch, debugger_with_tostring.png, 
> debugger_without_tostring.png
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-09 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: HDFS-15032.003.patch

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch, HDFS-15032.003.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-06 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989926#comment-16989926
 ] 

Erik Krogen commented on HDFS-15032:


[~shv] can you take a look at the v2 patch when you have a chance? I don't 
think the test failures are related.

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-05 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989125#comment-16989125
 ] 

Erik Krogen commented on HDFS-15032:


The new test case was failing due to interference between the two Observer 
Node-based balancer tests. I disabled the FS cache to fix the problem.

I also noticed that this patch caused {{TestBalancerService}} to fail as well. 
I've handled this in v002 patch. Basically the test was timing out because, 
with the default failover attempt count of 15, it would take too long to 
declare failure to communicate with the failing NameNode (now that the 
exception was being properly treated). Reducing the max failover attempts fixed 
the issue.

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-05 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: HDFS-15032.002.patch

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch, 
> HDFS-15032.002.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-05 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: HDFS-15032.001.patch

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch, HDFS-15032.001.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an unavailable NN via ObserverReadProxyProvider

2019-12-04 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Summary: Balancer crashes when it fails to contact an unavailable NN via 
ObserverReadProxyProvider  (was: Balancer crashes when it fails to contact an 
NN via ObserverReadProxyProvider)

> Balancer crashes when it fails to contact an unavailable NN via 
> ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider

2019-12-04 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Attachment: HDFS-15032.000.patch

> Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Updated] (HDFS-15032) Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider

2019-12-04 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-15032:
---
Status: Patch Available  (was: Open)

> Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-15032.000.patch
>
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider

2019-12-04 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988094#comment-16988094
 ] 

Erik Krogen commented on HDFS-15032:


This happened because the {{RpcInvocationHandler}} defined as part of the 
{{ProxyCombiner}} in HDFS-14162 didn't properly handle exceptions of type 
{{InvocationTargetException}} thrown from {{invoke()}}. I'm attaching a v000 
patch which fixes this and adds a test case in 
{{TestBalancerWithHANameNodes}}. I also updated the {{toString()}} method of 
the proxy created by {{ProxyCombiner}} to be more readable in a debugger.
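
As a rough illustration of this failure mode (a self-contained sketch, not the code from the patch), the example below shows why an {{InvocationTargetException}} that escapes an invocation handler surfaces as {{UndeclaredThrowableException}}, and how rethrowing its cause restores the declared exception. The {{Service}} interface, {{FailingService}} and the helper names are hypothetical.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

class UnwrapDemo {
    /** Hypothetical stand-in for an RPC protocol interface. */
    interface Service {
        String ping() throws IOException;
    }

    /** A target that always fails, much as an unreachable NameNode would. */
    static final class FailingService implements Service {
        @Override
        public String ping() throws IOException {
            throw new IOException("Connection refused");
        }
    }

    /** Builds a proxy that forwards calls to the target via reflection. */
    static Service proxyFor(Service target, boolean unwrap) {
        InvocationHandler handler = (proxy, method, args) -> {
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                if (unwrap) {
                    // Rethrow the real cause so callers see the declared exception.
                    throw e.getCause();
                }
                // ping() does not declare this checked exception, so the proxy
                // wraps it in an UndeclaredThrowableException.
                throw e;
            }
        };
        return (Service) Proxy.newProxyInstance(
            Service.class.getClassLoader(), new Class<?>[] {Service.class}, handler);
    }

    /** Returns the simple class name of whatever ping() throws. */
    static String thrownBy(Service s) {
        try {
            s.ping();
            return "none";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(thrownBy(proxyFor(new FailingService(), false))); // UndeclaredThrowableException
        System.out.println(thrownBy(proxyFor(new FailingService(), true)));  // IOException
    }
}
```

Retry logic that inspects the exception type can only act on the second form, since the first hides the {{IOException}} two levels down the cause chain.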

> Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider
> -
>
> Key: HDFS-15032
> URL: https://issues.apache.org/jira/browse/HDFS-15032
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.10.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> When trying to run the Balancer using ObserverReadProxyProvider (to allow it 
> to read from the Observer Node as described in HDFS-14979), if one of the NNs 
> isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-15032) Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider

2019-12-04 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988082#comment-16988082
 ] 

Erik Krogen commented on HDFS-15032:


You'll get an exception like:
{code}
19/12/04 18:17:44 ERROR balancer.Balancer: Exiting balancer due an exception
java.lang.reflect.UndeclaredThrowableException
    at com.sun.proxy.$Proxy4.getHAServiceState(Unknown Source)
    at org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.getHAServiceState(ObserverReadProxyProvider.java:276)
    at org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.changeProxy(ObserverReadProxyProvider.java:262)
    at org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider.access$900(ObserverReadProxyProvider.java:69)
    at org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider$ObserverReadInvocationHandler.invoke(ObserverReadProxyProvider.java:399)
    at com.sun.proxy.$Proxy4.getDatanodeStorageReport(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy4.getDatanodeStorageReport(Unknown Source)
    at org.apache.hadoop.hdfs.server.balancer.NameNodeConnector.getLiveDatanodeStorageReport(NameNodeConnector.java:198)
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher.init(Dispatcher.java:1001)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:602)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:696)
    at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:775)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:918)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.ipc.ProxyCombiner$CombinedProxyInvocationHandler.invoke(ProxyCombiner.java:97)
    ... 23 more
Caused by: java.net.ConnectException: Call From  to  failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:754)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1544)
    at org.apache.hadoop.ipc.Client.call(Client.java:1486)
    at org.apache.hadoop.ipc.Client.call(Client.java:1385)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy15.getHAServiceState(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getHAServiceState(ClientNamenodeProtocolTranslatorPB.java:1630)
    ... 28 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:701)
    at org.apache.hadoo

[jira] [Created] (HDFS-15032) Balancer crashes when it fails to contact an NN via ObserverReadProxyProvider

2019-12-04 Thread Erik Krogen (Jira)
Erik Krogen created HDFS-15032:
--

 Summary: Balancer crashes when it fails to contact an NN via 
ObserverReadProxyProvider
 Key: HDFS-15032
 URL: https://issues.apache.org/jira/browse/HDFS-15032
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer & mover
Affects Versions: 2.10.0
Reporter: Erik Krogen
Assignee: Erik Krogen


When trying to run the Balancer using ObserverReadProxyProvider (to allow it to 
read from the Observer Node as described in HDFS-14979), if one of the NNs 
isn't running, the Balancer will crash.






[jira] [Commented] (HDFS-14519) NameQuota is not update after concat operation, so namequota is wrong

2019-12-03 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987291#comment-16987291
 ] 

Erik Krogen commented on HDFS-14519:


[~RANith] or [~ayushtkn], are you interested in putting together a branch-2 
backport? It seems that the test case needs to be moved.

> NameQuota is not update after concat operation, so namequota is wrong
> -
>
> Key: HDFS-14519
> URL: https://issues.apache.org/jira/browse/HDFS-14519
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14519.001.patch, HDFS-14519.002.patch, 
> HDFS-14519.003.patch
>
>







[jira] [Commented] (HDFS-14969) Fix HDFS client unnecessary failover log printing

2019-12-02 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986173#comment-16986173
 ] 

Erik Krogen commented on HDFS-14969:


I linked HDFS-15024 which describes a similar issue. I think the solution will 
be similar, and the two issues can probably share any new methods added to 
{{FailoverProxyProvider}}.

> Fix HDFS client unnecessary failover log printing
> -
>
> Key: HDFS-14969
> URL: https://issues.apache.org/jira/browse/HDFS-14969
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.1.3
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Minor
>  Labels: multi-sbnn
>
> In a multi-NameNode scenario, suppose there are 3 NNs and the 3rd is the 
> active NN (ANN). If a client starts an RPC with the 1st NN, the failover from 
> the 1st NN to the 2nd NN is silent, but the failover from the 2nd NN to the 
> 3rd NN prints some unnecessary logs. In some scenarios these logs can be very 
> numerous:
> {code:java}
> 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459)
>  ...{code}






[jira] [Commented] (HDFS-15024) [SBN read] In FailoverOnNetworkExceptionRetry , Number of NameNodes as a condition of calculation of sleep time

2019-12-02 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986171#comment-16986171
 ] 

Erik Krogen commented on HDFS-15024:


This seems fairly closely related to HDFS-14969. Both issues are caused by 
assumptions in the code that only two NNs will be present in the conf, even 
though that hasn't been true since HDFS-6440 was committed. HDFS-14969 isn't as 
impactful since it's just logging, but it seems that the solution (expose a way 
to find how many proxies there are, and use that value instead of the 
hard-coded 2, as described in [this comment|https://issues.apache.org/jira/browse/HDFS-14969?focusedCommentId=16972962&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16972962]) 
is similar.

> [SBN read] In FailoverOnNetworkExceptionRetry , Number of NameNodes as a 
> condition of calculation of sleep time
> ---
>
> Key: HDFS-15024
> URL: https://issues.apache.org/jira/browse/HDFS-15024
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.3.0, 3.2.1
>Reporter: huhaiyang
>Priority: Major
> Attachments: HDFS-15024.001.patch, client_error.log
>
>
> When the Observer NameNode (ONN) is enabled, the client configuration lists 
> three NameNodes, for example:
> <property>
>   <name>dfs.ha.namenodes.ns1</name>
>   <value>nn2,nn3,nn1</value>
> </property>
> Currently:
> nn2 is in standby state
> nn3 is in observer state
> nn1 is in active state
> Suppose the user runs an HDFS operation such as:
> ./bin/hadoop --loglevel debug fs 
> -Ddfs.client.failover.proxy.provider.ns1=org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider
>  -mkdir /user/haiyang1/test8
> The msync call has to be served by nn1. The client actually connects to nn2 
> first, which requires a failover. It then connects to nn3, which cannot serve 
> the request either, so another failover is required, but this time the client 
> sleeps for a period before failing over. Only after that sleep does the 
> request reach nn1 and succeed.
> In FailoverOnNetworkExceptionRetry, the current default implementation of 
> getFailoverOrRetrySleepTime computes a sleep time as soon as more than one 
> failover has been performed.
> I think it is more reasonable to use the number of NameNodes as the condition 
> for computing the sleep time. That is, in the current test, the failover 
> after connecting to nn3 should not sleep and should connect directly to the 
> next NN.
> See client_error.log for details
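
The proposal above can be sketched roughly as follows. This is a simplified illustration and not the Hadoop implementation: the method names are hypothetical, the backoff has no random jitter, and the 500 ms base and 15000 ms cap used in {{main}} are assumed values chosen for the example.

```java
class FailoverSleepSketch {
    /** Exponential backoff without jitter: base * 2^failovers, capped at maxMillis. */
    static long backoff(long baseMillis, long maxMillis, int failovers) {
        int shift = Math.min(failovers, 30); // keep the shift well inside long range
        return Math.min(baseMillis << shift, maxMillis);
    }

    /** Behavior as described in the issue: sleep once any failover has already happened. */
    static long currentSleep(long baseMillis, long maxMillis, int failovers) {
        return failovers == 0 ? 0 : backoff(baseMillis, maxMillis, failovers);
    }

    /** Proposed behavior: no sleep until every configured NameNode has been tried once. */
    static long proposedSleep(long baseMillis, long maxMillis, int failovers, int numNameNodes) {
        return failovers < numNameNodes - 1 ? 0 : backoff(baseMillis, maxMillis, failovers);
    }

    public static void main(String[] args) {
        // Three NameNodes, second failover (nn3 -> nn1), i.e. one failover already done.
        System.out.println(currentSleep(500, 15000, 1));     // non-zero: the client sleeps
        System.out.println(proposedSleep(500, 15000, 1, 3)); // zero: go straight to nn1
    }
}
```

With two NameNodes the proposed condition reduces to the current one, which matches the observation that the existing code implicitly assumes exactly two NNs.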






[jira] [Commented] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly

2019-11-26 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983042#comment-16983042
 ] 

Erik Krogen commented on HDFS-14973:


Just committed v005 to branch-2 (and branch-2.10 for good measure, since the 
discussion on the mailing list hasn't finalized yet AFAICT). Thanks for the 
reviews [~shv]!

> Balancer getBlocks RPC dispersal does not function properly
> ---
>
> Key: HDFS-14973
> URL: https://issues.apache.org/jira/browse/HDFS-14973
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1, 2.11.0
>
> Attachments: HDFS-14973-branch-2.003.patch, 
> HDFS-14973-branch-2.004.patch, HDFS-14973-branch-2.005.patch, 
> HDFS-14973.000.patch, HDFS-14973.001.patch, HDFS-14973.002.patch, 
> HDFS-14973.003.patch, HDFS-14973.test.patch
>
>
> In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls 
> issued by the Balancer/Mover more dispersed, to alleviate load on the 
> NameNode, since {{getBlocks}} can be very expensive and the Balancer should 
> not impact normal cluster operation.
> Unfortunately, this functionality does not work as expected, especially 
> when the dispatcher thread count is low. The primary issue is that the delay 
> is applied only to the first N threads that are submitted to the dispatcher's 
> executor, where N is the size of the dispatcher's threadpool, but *not* to 
> the first R threads, where R is the number of allowed {{getBlocks}} QPS 
> (currently hardcoded to 20). For example, if the threadpool size is 100 (the 
> default), threads 0-19 have no delay, 20-99 have increased levels of delay, 
> and 100+ have no delay. As I understand it, the intent of the logic was that 
> the delay applied to the first 100 threads would force the dispatcher 
> executor's threads to all be consumed, thus blocking subsequent (non-delayed) 
> threads until the delay period has expired. However, threads 0-19 can finish 
> very quickly (their work can often be fulfilled in the time it takes to 
> execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), 
> thus opening up 20 new slots in the executor, which are then consumed by 
> non-delayed threads 100-119, and so on. So, although 80 threads have had a 
> delay applied, the non-delay threads rush through in the 20 non-delay slots.
> This problem gets even worse when the dispatcher threadpool size is less than 
> the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no 
> threads ever have a delay applied_, and the feature is not enabled at all.
> This problem wasn't surfaced in the original JIRA because the test 
> incorrectly measured the period across which {{getBlocks}} RPCs were 
> distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} 
> were used to track the time over which the {{getBlocks}} calls were made. 
> However, {{startGetBlocksTime}} was initialized at the time of creation of 
> the {{FSNamesystem}} spy, which is before the mock DataNodes are started. Even 
> worse, the Balancer in this test takes 2 iterations to complete balancing the 
> cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} 
> actually represents:
> {code}
> (time to submit getBlocks RPCs) + (DataNode startup time) + (time for the 
> Dispatcher to complete an iteration of moving blocks)
> {code}
> Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen 
> during the period of initial block fetching.
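
The delay condition described above can be modeled in a few lines. This is a simplified sketch of the behavior as stated in the description, not the Dispatcher code itself; the class and method names are hypothetical, and only whether a task is delayed is modeled, not how long.

```java
class GetBlocksDelaySketch {
    /**
     * True if the task with the given submission index is delayed under the
     * logic described: only indices in [maxQps, threadPoolSize) get a delay.
     */
    static boolean isDelayed(int taskIndex, int threadPoolSize, int maxQps) {
        return taskIndex >= maxQps && taskIndex < threadPoolSize;
    }

    /** Counts how many of the first totalTasks submissions are delayed. */
    static int countDelayed(int totalTasks, int threadPoolSize, int maxQps) {
        int delayed = 0;
        for (int i = 0; i < totalTasks; i++) {
            if (isDelayed(i, threadPoolSize, maxQps)) {
                delayed++;
            }
        }
        return delayed;
    }

    public static void main(String[] args) {
        // Pool of 100 threads, 20 QPS: only tasks 20-99 are delayed.
        System.out.println(countDelayed(200, 100, 20)); // 80
        // Pool of 10 threads, 20 QPS: the range [20, 10) is empty.
        System.out.println(countDelayed(200, 10, 20));  // 0
    }
}
```

Every task with an index at or above the pool size is undelayed, which is why the undelayed tasks can keep flowing through whichever slots free up first.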






[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly

2019-11-26 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-14973:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Balancer getBlocks RPC dispersal does not function properly
> ---
>
> Key: HDFS-14973
> URL: https://issues.apache.org/jira/browse/HDFS-14973
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1, 2.11.0
>
> Attachments: HDFS-14973-branch-2.003.patch, 
> HDFS-14973-branch-2.004.patch, HDFS-14973-branch-2.005.patch, 
> HDFS-14973.000.patch, HDFS-14973.001.patch, HDFS-14973.002.patch, 
> HDFS-14973.003.patch, HDFS-14973.test.patch
>
>
> In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls 
> issued by the Balancer/Mover more dispersed, to alleviate load on the 
> NameNode, since {{getBlocks}} can be very expensive and the Balancer should 
> not impact normal cluster operation.
> Unfortunately, this functionality does not work as expected, especially 
> when the dispatcher thread count is low. The primary issue is that the delay 
> is applied only to the first N threads that are submitted to the dispatcher's 
> executor, where N is the size of the dispatcher's threadpool, but *not* to 
> the first R threads, where R is the number of allowed {{getBlocks}} QPS 
> (currently hardcoded to 20). For example, if the threadpool size is 100 (the 
> default), threads 0-19 have no delay, 20-99 have increased levels of delay, 
> and 100+ have no delay. As I understand it, the intent of the logic was that 
> the delay applied to the first 100 threads would force the dispatcher 
> executor's threads to all be consumed, thus blocking subsequent (non-delayed) 
> threads until the delay period has expired. However, threads 0-19 can finish 
> very quickly (their work can often be fulfilled in the time it takes to 
> execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), 
> thus opening up 20 new slots in the executor, which are then consumed by 
> non-delayed threads 100-119, and so on. So, although 80 threads have had a 
> delay applied, the non-delay threads rush through in the 20 non-delay slots.
> This problem gets even worse when the dispatcher threadpool size is less than 
> the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no 
> threads ever have a delay applied_, and the feature is not enabled at all.
> This problem wasn't surfaced in the original JIRA because the test 
> incorrectly measured the period across which {{getBlocks}} RPCs were 
> distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} 
> were used to track the time over which the {{getBlocks}} calls were made. 
> However, {{startGetBlocksTime}} was initialized at the time of creation of 
> the {{FSNamesystem}} spy, which is before the mock DataNodes are started. Even 
> worse, the Balancer in this test takes 2 iterations to complete balancing the 
> cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} 
> actually represents:
> {code}
> (time to submit getBlocks RPCs) + (DataNode startup time) + (time for the 
> Dispatcher to complete an iteration of moving blocks)
> {code}
> Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen 
> during the period of initial block fetching.
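The delay assignment described above can be modeled with a small sketch. This is not the Dispatcher code itself: the class and method names are hypothetical, and the step function (one extra second of delay per batch of R submissions) is an assumption chosen to match the "increased levels of delay" wording. It only illustrates which submission indices end up delayed for a given threadpool size N and QPS limit R.

```java
/** Hypothetical model of the delay assignment described in the issue; not the actual Dispatcher code. */
class GetBlocksDelaySketch {

  /**
   * Delay, in seconds, for the task submitted at position index.
   * Assumption: the delay grows by one second per batch of maxQps tasks,
   * and only tasks with an index below poolSize are ever delayed.
   */
  static long delaySecondsFor(int index, int poolSize, int maxQps) {
    if (index >= poolSize) {
      return 0; // submissions beyond the pool size get no delay
    }
    return index / maxQps; // first maxQps tasks get 0, the next batch 1, and so on
  }

  /** Counts how many of the first 'tasks' submissions receive a non-zero delay. */
  static int countDelayed(int tasks, int poolSize, int maxQps) {
    int delayed = 0;
    for (int i = 0; i < tasks; i++) {
      if (delaySecondsFor(i, poolSize, maxQps) > 0) {
        delayed++;
      }
    }
    return delayed;
  }

  public static void main(String[] args) {
    // Pool of 100, limit of 20 QPS: only indices 20-99 are delayed.
    System.out.println(countDelayed(500, 100, 20));
    // Pool of 10, limit of 20 QPS: no submission is ever delayed.
    System.out.println(countDelayed(500, 10, 20));
  }
}
```

Under these assumptions, a pool of 100 delays only 80 of 500 submissions, and a pool of 10 delays none, which is the gap the description points out.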






[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly

2019-11-26 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-14973:
---
Fix Version/s: 2.11.0, 2.10.1

> Balancer getBlocks RPC dispersal does not function properly
> ---
>
> Key: HDFS-14973
> URL: https://issues.apache.org/jira/browse/HDFS-14973
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1, 2.11.0
>
> Attachments: HDFS-14973-branch-2.003.patch, 
> HDFS-14973-branch-2.004.patch, HDFS-14973-branch-2.005.patch, 
> HDFS-14973.000.patch, HDFS-14973.001.patch, HDFS-14973.002.patch, 
> HDFS-14973.003.patch, HDFS-14973.test.patch
>
>






[jira] [Commented] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly

2019-11-26 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982721#comment-16982721
 ] 

Erik Krogen commented on HDFS-14973:


Thanks for the review [~shv]! I addressed your comments in 
[^HDFS-14973-branch-2.005.patch]:

# Great idea, thanks.
# I updated it to avoid the use of {{Preconditions}}. It still throws an 
{{IllegalArgumentException}}, as I think this is a much better semantic 
representation of the issue than an {{IOException}}. The description of an IAE: 
"Thrown to indicate that a method has been passed an illegal or inappropriate 
argument."
# Yup, totally makes sense. My mistake. Fixed.
# I had a bug in the implementation of {{setAtomicLongToMinMax()}} that would 
cause the {{FSNamesystem}} spy to hang sometimes. I fixed this.
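Two of the points above can be illustrated with a rough sketch. The patch itself is not shown in this thread, so everything here is an assumption: checkPositive is a hypothetical stand-in for argument validation that throws IllegalArgumentException without Guava's Preconditions, and recordMinMax is a hypothetical helper showing one thread-safe way to track the earliest and latest values in a pair of AtomicLongs. It is not the actual implementation of setAtomicLongToMinMax().

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; names and behavior are assumptions, not code from the patch. */
class ReviewFixSketch {

  /** Validates an argument with a plain check instead of Guava's Preconditions. */
  static int checkPositive(int value, String name) {
    if (value <= 0) {
      throw new IllegalArgumentException(name + " must be positive, got " + value);
    }
    return value;
  }

  /** Atomically folds a sample into a running minimum and a running maximum. */
  static void recordMinMax(AtomicLong min, AtomicLong max, long sample) {
    min.accumulateAndGet(sample, Math::min); // retries internally on contention
    max.accumulateAndGet(sample, Math::max);
  }

  public static void main(String[] args) {
    AtomicLong min = new AtomicLong(Long.MAX_VALUE);
    AtomicLong max = new AtomicLong(Long.MIN_VALUE);
    for (long sample : new long[] {30L, 10L, 20L}) {
      recordMinMax(min, max, sample);
    }
    System.out.println(min.get() + " " + max.get());
    checkPositive(20, "maxQps"); // passes; a value of 0 would throw
  }
}
```

AtomicLong.accumulateAndGet handles the compare-and-set retry loop itself, which avoids hand-written loops that can spin forever if the exit condition is wrong.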

> Balancer getBlocks RPC dispersal does not function properly
> ---
>
> Key: HDFS-14973
> URL: https://issues.apache.org/jira/browse/HDFS-14973
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14973-branch-2.003.patch, 
> HDFS-14973-branch-2.004.patch, HDFS-14973-branch-2.005.patch, 
> HDFS-14973.000.patch, HDFS-14973.001.patch, HDFS-14973.002.patch, 
> HDFS-14973.003.patch, HDFS-14973.test.patch
>
>






[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly

2019-11-26 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-14973:
---
Attachment: HDFS-14973-branch-2.005.patch

> Balancer getBlocks RPC dispersal does not function properly
> ---
>
> Key: HDFS-14973
> URL: https://issues.apache.org/jira/browse/HDFS-14973
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14973-branch-2.003.patch, 
> HDFS-14973-branch-2.004.patch, HDFS-14973-branch-2.005.patch, 
> HDFS-14973.000.patch, HDFS-14973.001.patch, HDFS-14973.002.patch, 
> HDFS-14973.003.patch, HDFS-14973.test.patch
>
>






[jira] [Commented] (HDFS-14884) Add sanity check that zone key equals feinfo key while setting Xattrs

2019-11-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979612#comment-16979612
 ] 

Erik Krogen commented on HDFS-14884:


Changing the assignee back to [~msingh] just to reflect that they did the 
initial work. Thanks for the help here [~weichiu]!

> Add sanity check that zone key equals feinfo key while setting Xattrs
> -
>
> Key: HDFS-14884
> URL: https://issues.apache.org/jira/browse/HDFS-14884
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, hdfs
>Affects Versions: 2.11.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1, 2.11.0
>
> Attachments: HDFS-14884-branch-2.001.patch, HDFS-14884.001.patch, 
> HDFS-14884.002.patch, HDFS-14884.003.patch, hdfs_distcp.patch
>
>
> Currently, it is possible to set an extended attribute where the zone key is 
> not the same as the feinfo key. This Jira will add a precondition check before 
> setting it.






[jira] [Assigned] (HDFS-14884) Add sanity check that zone key equals feinfo key while setting Xattrs

2019-11-21 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen reassigned HDFS-14884:
--

Assignee: Mukul Kumar Singh  (was: Yuval Degani)

> Add sanity check that zone key equals feinfo key while setting Xattrs
> -
>
> Key: HDFS-14884
> URL: https://issues.apache.org/jira/browse/HDFS-14884
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, hdfs
>Affects Versions: 2.11.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1, 2.11.0
>
> Attachments: HDFS-14884-branch-2.001.patch, HDFS-14884.001.patch, 
> HDFS-14884.002.patch, HDFS-14884.003.patch, hdfs_distcp.patch
>
>
> Currently, it is possible to set an extended attribute where the zone key is 
> not the same as the feinfo key. This Jira will add a precondition check before 
> setting it.






[jira] [Commented] (HDFS-14991) Backport HDFS-14346 Better time precision in getTimeDuration to branch-2

2019-11-18 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976991#comment-16976991
 ] 

Erik Krogen commented on HDFS-14991:


I don't think we'll backport HDFS-9847 -- as noted in some of the comments near 
the end of that JIRA conversation, in its current form it's not really a 
compatible backport as it changes some of the default values. I think it's 
probably okay to include the change here, but agreed that we should make both 
the commit message and JIRA description very clearly reflect exactly what was 
committed.

+1 from me assuming you make the checkstyle fix and the description updates.

> Backport HDFS-14346 Better time precision in getTimeDuration to branch-2
> 
>
> Key: HDFS-14991
> URL: https://issues.apache.org/jira/browse/HDFS-14991
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-14991-branch-2.001.patch, 
> HDFS-14991-branch-2.002.patch
>
>
> This is to backport HDFS-14346 to branch-2, as Standby reads in branch-2 
> require the ability to properly specify ms time granularity for edit log 
> tailing.





