[jira] [Commented] (HADOOP-16453) Update how exceptions are handled in NetUtils.java
[ https://issues.apache.org/jira/browse/HADOOP-16453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899570#comment-16899570 ] He Xiaoqiao commented on HADOOP-16453: -- +1 for [^HADOOP-16453.002.patch] from my side. Thanks [~leosun08]. > Update how exceptions are handled in NetUtils.java > -- > > Key: HADOOP-16453 > URL: https://issues.apache.org/jira/browse/HADOOP-16453 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > Attachments: HADOOP-16453.001.patch, HADOOP-16453.002.patch > > > When there is no String constructor for the exception, we log a trace > message. Given that log-and-throw is not a very good approach, I think the > right thing would be to not log it at all, as in HADOOP-16431. > {code:java} > private static <T extends IOException> T wrapWithMessage( > T exception, String msg) throws T { > Class<? extends Throwable> clazz = exception.getClass(); > try { > Constructor<? extends Throwable> ctor = > clazz.getConstructor(String.class); > Throwable t = ctor.newInstance(msg); > return (T)(t.initCause(exception)); > } catch (Throwable e) { > LOG.trace("Unable to wrap exception of type {}: it has no (String) " > + "constructor", clazz, e); > throw exception; > } > } > {code} > *exception stack:* > {code:java} > 19/07/12 11:23:45 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: > HDFS_DELEGATION_TOKEN, Service: ha-hdfs:azorprc-xiaomi, Ident: (token for > sql_prc: HDFS_DELEGATION_TOKEN owner=sql_prc/hadoop@XIAOMI.HADOOP, > renewer=yarn_prc, realUser=, issueDate=1562901814007, maxDate=1594437814007, > sequenceNumber=3349939, masterKeyId=1400)] > 19/07/12 11:23:46 TRACE net.NetUtils: Unable to wrap exception of type class > java.nio.channels.ClosedByInterruptException: it has no (String) constructor > java.lang.NoSuchMethodException: > java.nio.channels.ClosedByInterruptException.<init>(java.lang.String) > at java.lang.Class.getConstructor0(Class.java:3082) > at java.lang.Class.getConstructor(Class.java:1825) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:830) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1559) > at org.apache.hadoop.ipc.Client.call(Client.java:1501) > at org.apache.hadoop.ipc.Client.call(Client.java:1411) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:949) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider$RequestHedgingInvocationHandler$1.call(RequestHedgingProxyProvider.java:143) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 19/07/12 11:23:46
INFO Configuration.deprecation: No unit for > dfs.client.datanode-restart.timeout(30) assuming SECONDS > 19/07/12 11:23:46 INFO Configuration.deprecation: No unit for > dfs.client.datanode-restart.timeout(30) assuming SECONDS > 19/07/12 11:23:46 INFO Configuration.deprecation: No unit for > dfs.client.datanode-restart.timeout(30) assuming SECONDS > 19/07/12 11:23:46 INFO Configuration.deprecation: No unit for > dfs.client.datanode-restart.timeout(30) assuming SECONDS > 19/07/12 11:23:46 INFO Configuration.deprecation: No unit for > dfs.client.datanode-restart.timeout(30) assuming SECONDS > 19/07/12 11:23:46 INFO Configuration.deprecation: No unit for > dfs.client.datanode-restart.timeout(30) assuming SECONDS > 19/07/12 11:23:46 WARN ipc.Client: Exception encountered while connecting to > the server : java.io.InterruptedIOException: Interrupted while waiting for IO > on channel java.nio.channels.SocketChannel[connected > local=/10.118.30.48:34324 remote=/10.69.11.137:11200]. 6 millis timeout > left. > 19/07/12 11:23:48 INFO conf.Configuration: resource-types.xml not found > 19/07/12 11:23:48 INFO resource.ResourceUtils: Unable to find > 'resource-types.xml'. >
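A minimal standalone sketch (not part of the Hadoop source) of why the reflective wrapping in NetUtils#wrapWithMessage fails here: java.nio.channels.ClosedByInterruptException has no (String) constructor, so Class#getConstructor throws NoSuchMethodException and the original exception has to be rethrown unwrapped, which is exactly the code path that produced the trace above.
{code:java}
import java.lang.reflect.Constructor;
import java.nio.channels.ClosedByInterruptException;

public class WrapDemo {
  public static void main(String[] args) {
    ClosedByInterruptException original = new ClosedByInterruptException();
    try {
      // The wrapping trick requires a (String) constructor on the exception class.
      Constructor<? extends Throwable> ctor =
          original.getClass().getConstructor(String.class);
      Throwable wrapped = ctor.newInstance("extra context").initCause(original);
      System.out.println("wrapped: " + wrapped);
    } catch (ReflectiveOperationException e) {
      // ClosedByInterruptException only has a no-arg constructor, so we land here
      // with a NoSuchMethodException and must fall back to the original exception.
      System.out.println("cannot wrap, rethrowing original: " + e);
    }
  }
}
{code}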
[jira] [Commented] (HADOOP-15440) Support kerberos principal name pattern for KerberosAuthenticationHandler
[ https://issues.apache.org/jira/browse/HADOOP-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898846#comment-16898846 ] He Xiaoqiao commented on HADOOP-15440: -- [~eyang], I tried to recall the changes in this patch, and it seems the logic is the same as {{SecurityUtil#getServerPrincipal}}, which cannot be imported here because it belongs to the `hadoop-common` submodule. For the case `test/test/test`, the principal splits into [test,test,test], but `components[1]` does not equal `_HOST`, so nothing is replaced. For the case `test/_HOST/test`, it is replaced with `test/$hostname/test`. {quote}While this works fine for server with single network interface. It can create problems for multi-homed network that getCanonicalHostName doesn't return the desired hostname.{quote} That is true. It seems {{DNS.getHosts}} offers one option; any suggestions? Thanks again. > Support kerberos principal name pattern for KerberosAuthenticationHandler > - > > Key: HADOOP-15440 > URL: https://issues.apache.org/jira/browse/HADOOP-15440 > Project: Hadoop Common > Issue Type: Improvement > Components: security >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15440-trunk.001.patch, HADOOP-15440.002.patch > > > When setting up an HttpFS or KMS server in secure mode, we have to configure the > Kerberos principal for the service. It does not support converting a Kerberos > principal name pattern into a valid Kerberos principal name, whereas > NameNode/DataNode and many other services can, which is confusing for users. > So I propose to replace the hostname pattern with the hostname, which should > be a fully-qualified domain name. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
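A minimal sketch of the substitution rule described above, under the assumption that it mirrors {{SecurityUtil#getServerPrincipal}}; the helper name replaceHostPattern is hypothetical. The host component is replaced only when it is exactly `_HOST`.
{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

public final class PrincipalPatternDemo {
  // Hypothetical helper mirroring the rule from the comment above.
  static String replaceHostPattern(String principal) throws UnknownHostException {
    String[] components = principal.split("[/@]");
    // "test/test/test" splits to [test, test, test]; components[1] != "_HOST",
    // so the principal is returned unchanged.
    if (components.length < 2 || !"_HOST".equals(components[1])) {
      return principal;
    }
    // "test/_HOST/test" becomes "test/<canonical hostname>/test".
    String fqdn = InetAddress.getLocalHost().getCanonicalHostName();
    return principal.replaceFirst("_HOST", fqdn);
  }

  public static void main(String[] args) throws UnknownHostException {
    System.out.println(replaceHostPattern("test/_HOST/test"));
    System.out.println(replaceHostPattern("test/test/test"));
  }
}
{code}
As noted in the quoted concern, getCanonicalHostName is problematic on multi-homed hosts, which is where a DNS.getHosts-based lookup could come in.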
[jira] [Commented] (HADOOP-15440) Support kerberos principal name pattern for KerberosAuthenticationHandler
[ https://issues.apache.org/jira/browse/HADOOP-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898551#comment-16898551 ] He Xiaoqiao commented on HADOOP-15440: -- Thanks [~eyang] for your quick and kind response. These are very valuable suggestions; I will check them as soon as possible. > Support kerberos principal name pattern for KerberosAuthenticationHandler > - > > Key: HADOOP-15440 > URL: https://issues.apache.org/jira/browse/HADOOP-15440 > Project: Hadoop Common > Issue Type: Improvement > Components: security >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15440-trunk.001.patch, HADOOP-15440.002.patch > > > When setting up an HttpFS or KMS server in secure mode, we have to configure the > Kerberos principal for the service. It does not support converting a Kerberos > principal name pattern into a valid Kerberos principal name, whereas > NameNode/DataNode and many other services can, which is confusing for users. > So I propose to replace the hostname pattern with the hostname, which should > be a fully-qualified domain name. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15440) Support kerberos principal name pattern for KerberosAuthenticationHandler
[ https://issues.apache.org/jira/browse/HADOOP-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898548#comment-16898548 ] He Xiaoqiao commented on HADOOP-15440: -- Thanks [~jojochuang] for bringing this issue back. [^HADOOP-15440.002.patch] tries to fix the checkstyle warnings and is pending Jenkins. {quote}please make some examples in the summary so this is easier to understand.{quote} When setting up an HttpFS or KMS server in secure mode, we need to configure the item `httpfs.authentication.kerberos.principal` for the httpfs principal. Since it does not support converting a Kerberos principal name pattern into a valid Kerberos principal name, we have to configure the principal value with the real hostname rather than the hostname pattern `_HOST`, as the following shows; thus we have to prepare different configs for different HttpFS or KMS instances. {code:xml} <property> <name>httpfs.authentication.kerberos.principal</name> <value>HTTP/`hostname`@REALM</value> </property> {code} cc [~jojochuang],[~eyang], [~stev...@iseran.com] Please take a review if you have time. Thanks again. > Support kerberos principal name pattern for KerberosAuthenticationHandler > - > > Key: HADOOP-15440 > URL: https://issues.apache.org/jira/browse/HADOOP-15440 > Project: Hadoop Common > Issue Type: Improvement > Components: security >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15440-trunk.001.patch, HADOOP-15440.002.patch > > > When setting up an HttpFS or KMS server in secure mode, we have to configure the > Kerberos principal for the service. It does not support converting a Kerberos > principal name pattern into a valid Kerberos principal name, whereas > NameNode/DataNode and many other services can, which is confusing for users. > So I propose to replace the hostname pattern with the hostname, which should > be a fully-qualified domain name. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
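For reference, the goal of the patch is to allow a single shared configuration using the `_HOST` pattern instead of per-host values; a hedged sketch (EXAMPLE.COM is a placeholder realm):
{code:xml}
<property>
  <name>httpfs.authentication.kerberos.principal</name>
  <!-- _HOST would be resolved to the local fully-qualified domain name,
       so the same file can be deployed on every HttpFS/KMS instance. -->
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
{code}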
[jira] [Updated] (HADOOP-15440) Support kerberos principal name pattern for KerberosAuthenticationHandler
[ https://issues.apache.org/jira/browse/HADOOP-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15440: - Attachment: HADOOP-15440.002.patch > Support kerberos principal name pattern for KerberosAuthenticationHandler > - > > Key: HADOOP-15440 > URL: https://issues.apache.org/jira/browse/HADOOP-15440 > Project: Hadoop Common > Issue Type: Improvement > Components: security >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15440-trunk.001.patch, HADOOP-15440.002.patch > > > When setting up an HttpFS or KMS server in secure mode, we have to configure the > Kerberos principal for the service. It does not support converting a Kerberos > principal name pattern into a valid Kerberos principal name, whereas > NameNode/DataNode and many other services can, which is confusing for users. > So I propose to replace the hostname pattern with the hostname, which should > be a fully-qualified domain name. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable
[ https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879790#comment-16879790 ] He Xiaoqiao commented on HADOOP-16403: -- Thanks [~LiJinglun] for your response. Did you backport HDFS-6763? I have met this issue once, and it was resolved by applying HDFS-6763. Please try to apply that patch; more discussion is welcome. > Start a new statistical rpc queue and make the Reader's pendingConnection > queue runtime-replaceable > --- > > Key: HADOOP-16403 > URL: https://issues.apache.org/jira/browse/HADOOP-16403 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Jinglun >Priority: Major > Attachments: HADOOP-16403.001.patch, MetricLinkedBlockingQueueTest.pdf > > > I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, so > after the active dies, it takes the standby more than 40s to become active. > Many requests (tcp connect requests and rpc requests) from Datanodes, clients > and zkfc timed out and started retrying. The sudden request flood lasts for > the next 2 minutes, and finally all requests are either handled or run out of > retries. > Adjusting the rpc-related settings might strengthen the NameNode and solve this > problem; the key point is finding the bottleneck. The rpc server can be > described as below: > {noformat} > Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat} > By sampling some failed clients, I found many of them got > ConnectTimeoutException, caused by a tcp connect request that went unanswered > for 20s. I think the reader queue may be full, blocking the listener from > handling new connections. Both slow handlers and slow readers can block the > whole processing pipeline, and I need to know which one it is. I think *a queue > that computes the qps, logs when the queue is full, and can be replaced > easily* will help. > I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. > Using it as the Reader's queue makes the reader queue runtime-swappable > automatically. The qps computation could be done by implementing a subclass > of LinkedBlockingQueue that does the computing while put/take/... happens. > The qps data will show up in JMX. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
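A minimal illustrative sketch (not the attached patch) of the statistical queue idea: a LinkedBlockingQueue subclass that keeps enqueue/dequeue counters, which a metrics thread can sample periodically to derive qps and publish via JMX.
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

public class MetricLinkedBlockingQueue<E> extends LinkedBlockingQueue<E> {
  private final LongAdder enqueued = new LongAdder();
  private final LongAdder dequeued = new LongAdder();

  public MetricLinkedBlockingQueue(int capacity) {
    super(capacity);
  }

  @Override
  public boolean offer(E e) {
    boolean accepted = super.offer(e);
    if (accepted) {
      enqueued.increment();
    } else {
      // The queue is full: this is the signal the description wants logged,
      // since it means the listener/reader pipeline is backing up.
      System.err.println("queue full, rejecting element");
    }
    return accepted;
  }

  @Override
  public E take() throws InterruptedException {
    E e = super.take();
    dequeued.increment();
    return e;
  }

  // Sampled periodically (e.g. by a JMX metrics source) to compute rates.
  public long getEnqueuedCount() { return enqueued.sum(); }
  public long getDequeuedCount() { return dequeued.sum(); }
}
{code}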
[jira] [Commented] (HADOOP-16385) Namenode crashes with "RedundancyMonitor thread received Runtime exception"
[ https://issues.apache.org/jira/browse/HADOOP-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877603#comment-16877603 ] He Xiaoqiao commented on HADOOP-16385: -- Thanks [~elgoiri],[~ayushtkn] for your work. +1 for removing `Preconditions.checkArgument` and avoiding the namenode crash. I am confused about why this situation happened: after applying HADOOP-16028, {{totalInScopeNodes}} should always be greater than or equal to {{availableNodes}} within the scope. Please correct me if I missed something. {code:java} Preconditions.checkArgument( totalInScopeNodes >= availableNodes && availableNodes > 0, String .format("%d should >= %d, and both should be positive.", totalInScopeNodes, availableNodes)); {code} > Namenode crashes with "RedundancyMonitor thread received Runtime exception" > --- > > Key: HADOOP-16385 > URL: https://issues.apache.org/jira/browse/HADOOP-16385 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HADOOP-16385-01.patch, HADOOP-16385-02.patch, > HADOOP-16385-03.patch, HADOOP-16385-HDFS_UT.patch, > HADOOP-16385.branch-3.1.001.patch > > > *Description:* While removing dead nodes, the Namenode went down with the error > "RedundancyMonitor thread received Runtime exception". > *Environment:* > Server OS: UBUNTU > No. of cluster nodes: 1 NN / 225 DNs / 3 ZK / 2 RM / 4850 NMs > 240 machines in total; each machine runs 21 docker containers (1 DN & 20 NMs) > *Steps:* > 1. Total number of containers in running state: ~53000 > 2. Because of the load, machines ran out of memory; the machine was restarted > and all of its docker containers, including NMs and DNs, were started again > 3. At some point the namenode throws the below error while removing a node, and > the NN goes down. > {noformat} > 2019-06-19 05:54:07,262 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-1550/255.255.117.195:23735 > 2019-06-19 05:54:07,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.117.151:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,281 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-4097/255.255.117.151:23735 > 2019-06-19 05:54:07,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.116.213:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,290 ERROR > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor > thread received Runtime exception. > java.lang.IllegalArgumentException: 247 should >= 248, and both should be > positive.
> at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:575) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:552) > at > org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:873) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:770) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:712) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:507) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:425) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargets(BlockPlacementPolicyDefault.java:311) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:290) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:143) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:103) > at > org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1902) > at >
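A hedged, self-contained sketch of the defensive alternative being discussed (not necessarily the committed patch): when the node counts are inconsistent, as in the "247 should >= 248" failure above, fail the single placement attempt instead of throwing an IllegalArgumentException that terminates the RedundancyMonitor daemon thread.
{code:java}
import java.util.List;
import java.util.Random;

public class DefensiveChooseRandom {
  private static final Random RAND = new Random();

  static String chooseRandom(List<String> inScopeNodes, int availableNodes) {
    int total = inScopeNodes.size();
    if (availableNodes <= 0 || total < availableNodes) {
      // Inconsistent bookkeeping: skip this attempt so the caller can retry,
      // rather than crashing the whole daemon thread.
      System.err.printf("Inconsistent counts: total=%d, available=%d%n",
          total, availableNodes);
      return null;
    }
    return inScopeNodes.get(RAND.nextInt(total));
  }

  public static void main(String[] args) {
    System.out.println(chooseRandom(List.of("dn1", "dn2"), 3)); // null, no crash
    System.out.println(chooseRandom(List.of("dn1", "dn2"), 2)); // picks a node
  }
}
{code}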
[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable
[ https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876094#comment-16876094 ] He Xiaoqiao commented on HADOOP-16403: -- Thanks [~LiJinglun], {code:java} it takes the standby more than 40s to become active {code} Could you share which version you deploy and how large the metadata is? 40s seems too long for the transition to active. > Start a new statistical rpc queue and make the Reader's pendingConnection > queue runtime-replaceable > --- > > Key: HADOOP-16403 > URL: https://issues.apache.org/jira/browse/HADOOP-16403 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Jinglun >Priority: Major > Attachments: HADOOP-16403.001.patch > > > I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, so > after the active dies, it takes the standby more than 40s to become active. > Many requests (tcp connect requests and rpc requests) from Datanodes, clients > and zkfc timed out and started retrying. The sudden request flood lasts for > the next 2 minutes, and finally all requests are either handled or run out of > retries. > Adjusting the rpc-related settings might strengthen the NameNode and solve this > problem; the key point is finding the bottleneck. The rpc server can be > described as below: > {noformat} > Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat} > By sampling some failed clients, I found many of them got ConnectException, > caused by a tcp connect request that went unanswered for 20s. I think the > reader queue may be full, blocking the listener from handling new connections. > Both slow handlers and slow readers can block the whole processing pipeline, > and I need to know which one it is. I think *a queue that computes the qps, > logs when the queue is full, and can be replaced easily* will help. > I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. Using > it as the Reader's queue makes the reader queue runtime-swappable automatically. > The qps computation could be done by implementing a subclass of > LinkedBlockingQueue that does the computing while put/take/... happens. The > qps data will show up in JMX. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HADOOP-15918) Namenode gets stuck when deleting large dir in trash
[ https://issues.apache.org/jira/browse/HADOOP-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870559#comment-16870559 ] He Xiaoqiao edited comment on HADOOP-15918 at 6/23/19 2:27 PM: --- Thanks [~Tao Jie] for your contributions. I think this is a very common issue; are you still working on it? Thanks again. was (Author: hexiaoqiao): Thanks [~Tao Jie] for your contributions. I think this is very common issue, and if you are still working in progress? Thanks again. > Namenode gets stuck when deleting large dir in trash > > > Key: HADOOP-15918 > URL: https://issues.apache.org/jira/browse/HADOOP-15918 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.8.2, 3.1.0 >Reporter: Tao Jie >Assignee: Tao Jie >Priority: Major > Attachments: HADOOP-15918.001.patch, HADOOP-15918.002.patch, > HDFS-13769.001.patch, HDFS-13769.002.patch, HDFS-13769.003.patch, > HDFS-13769.004.patch > > > Similar to the situation discussed in HDFS-13671, the Namenode gets stuck for a > long time when deleting a trash dir with a large amount of data. We found this > log in the namenode: > {quote} > 2018-06-08 20:00:59,042 INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(252)) - FSNamesystem write lock held for > 23018 ms via > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1033) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:254) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1567) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2820) > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1047) > {quote} > One simple solution is to avoid deleting a large amount of data in one delete RPC call. > We implemented a TrashPolicy that divides the delete operation into several > delete RPCs, so that each single deletion does not delete too many files. > Any thoughts? [~linyiqun] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15918) Namenode gets stuck when deleting large dir in trash
[ https://issues.apache.org/jira/browse/HADOOP-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870559#comment-16870559 ] He Xiaoqiao commented on HADOOP-15918: -- Thanks [~Tao Jie] for your contributions. I think this is a very common issue; are you still working on it? Thanks again. > Namenode gets stuck when deleting large dir in trash > > > Key: HADOOP-15918 > URL: https://issues.apache.org/jira/browse/HADOOP-15918 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 2.8.2, 3.1.0 >Reporter: Tao Jie >Assignee: Tao Jie >Priority: Major > Attachments: HADOOP-15918.001.patch, HADOOP-15918.002.patch, > HDFS-13769.001.patch, HDFS-13769.002.patch, HDFS-13769.003.patch, > HDFS-13769.004.patch > > > Similar to the situation discussed in HDFS-13671, the Namenode gets stuck for a > long time when deleting a trash dir with a large amount of data. We found this > log in the namenode: > {quote} > 2018-06-08 20:00:59,042 INFO namenode.FSNamesystem > (FSNamesystemLock.java:writeUnlock(252)) - FSNamesystem write lock held for > 23018 ms via > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1033) > org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:254) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1567) > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2820) > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1047) > {quote} > One simple solution is to avoid deleting a large amount of data in one delete RPC call. > We implemented a TrashPolicy that divides the delete operation into several > delete RPCs, so that each single deletion does not delete too many files. > Any thoughts? [~linyiqun] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
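An illustrative sketch of the divide-into-multiple-RPCs idea from the description (hypothetical helper; the real patch works inside a TrashPolicy): list the directory's children and issue a bounded delete per batch, so no single RPC holds the namesystem write lock for too long.
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class BatchedTrashDelete {
  // Delete a large directory by removing its children one RPC at a time,
  // then removing the (now small) directory itself.
  static void deleteInBatches(FileSystem fs, Path dir) throws IOException {
    for (FileStatus child : fs.listStatus(dir)) {
      fs.delete(child.getPath(), true);  // one bounded delete RPC per child
    }
    fs.delete(dir, true);
  }
}
{code}
Each per-child delete is still recursive, so in the worst case a single child can be huge; a production version would presumably recurse until each delete falls below a size threshold.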
[jira] [Comment Edited] (HADOOP-16385) Namenode crashes with "RedundancyMonitor thread received Runtime exception"
[ https://issues.apache.org/jira/browse/HADOOP-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870173#comment-16870173 ] He Xiaoqiao edited comment on HADOOP-16385 at 6/22/19 9:41 AM: --- Thanks [~xuzq_zander] for digging deep. I think HADOOP-16028 may resolve your doubt. Anyway, more discussion is welcome. Thanks [~xuzq_zander] again. was (Author: hexiaoqiao): Thanks [~xuzq_zander] for your deep dig. I think HADOOP-16028 may could solve your doubt. Anyway welcome some more discussion. Thans [~xuzq_zander] again. > Namenode crashes with "RedundancyMonitor thread received Runtime exception" > --- > > Key: HADOOP-16385 > URL: https://issues.apache.org/jira/browse/HADOOP-16385 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > Attachments: HADOOP-16385.branch-3.1.001.patch > > > *Description:* While removing dead nodes, the Namenode went down with the error > "RedundancyMonitor thread received Runtime exception". > *Environment:* > Server OS: UBUNTU > No. of cluster nodes: 1 NN / 225 DNs / 3 ZK / 2 RM / 4850 NMs > 240 machines in total; each machine runs 21 docker containers (1 DN & 20 NMs) > *Steps:* > 1. Total number of containers in running state: ~53000 > 2. Because of the load, machines ran out of memory; the machine was restarted > and all of its docker containers, including NMs and DNs, were started again > 3. At some point the namenode throws the below error while removing a node, and > the NN goes down. > {noformat} > 2019-06-19 05:54:07,262 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-1550/255.255.117.195:23735 > 2019-06-19 05:54:07,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.117.151:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,281 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-4097/255.255.117.151:23735 > 2019-06-19 05:54:07,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.116.213:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,290 ERROR > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor > thread received Runtime exception. > java.lang.IllegalArgumentException: 247 should >= 248, and both should be > positive.
> at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:575) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:552) > at > org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:873) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:770) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:712) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:507) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:425) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargets(BlockPlacementPolicyDefault.java:311) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:290) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:143) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:103) > at > org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1902) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1854) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4842) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4709) > at java.lang.Thread.run(Thread.java:748) > 2019-06-19 05:54:07,296 INFO
[jira] [Commented] (HADOOP-16385) Namenode crashes with "RedundancyMonitor thread received Runtime exception"
[ https://issues.apache.org/jira/browse/HADOOP-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870173#comment-16870173 ] He Xiaoqiao commented on HADOOP-16385: -- Thanks [~xuzq_zander] for digging deep. I think HADOOP-16028 may resolve your doubt. Anyway, more discussion is welcome. Thanks [~xuzq_zander] again. > Namenode crashes with "RedundancyMonitor thread received Runtime exception" > --- > > Key: HADOOP-16385 > URL: https://issues.apache.org/jira/browse/HADOOP-16385 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > Attachments: HADOOP-16385.branch-3.1.001.patch > > > *Description:* While removing dead nodes, the Namenode went down with the error > "RedundancyMonitor thread received Runtime exception". > *Environment:* > Server OS: UBUNTU > No. of cluster nodes: 1 NN / 225 DNs / 3 ZK / 2 RM / 4850 NMs > 240 machines in total; each machine runs 21 docker containers (1 DN & 20 NMs) > *Steps:* > 1. Total number of containers in running state: ~53000 > 2. Because of the load, machines ran out of memory; the machine was restarted > and all of its docker containers, including NMs and DNs, were started again > 3. At some point the namenode throws the below error while removing a node, and > the NN goes down. > {noformat} > 2019-06-19 05:54:07,262 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-1550/255.255.117.195:23735 > 2019-06-19 05:54:07,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.117.151:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,281 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-4097/255.255.117.151:23735 > 2019-06-19 05:54:07,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.116.213:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,290 ERROR > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor > thread received Runtime exception. > java.lang.IllegalArgumentException: 247 should >= 248, and both should be > positive.
> at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:575) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:552) > at > org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:873) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:770) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:712) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:507) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:425) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargets(BlockPlacementPolicyDefault.java:311) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:290) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:143) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:103) > at > org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1902) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1854) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4842) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4709) > at java.lang.Thread.run(Thread.java:748) > 2019-06-19 05:54:07,296 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1: java.lang.IllegalArgumentException: 247 should >= 248, and both > should be positive. > 2019-06-19 05:54:07,298 INFO >
[jira] [Commented] (HADOOP-16385) Namenode crashes with "RedundancyMonitor thread received Runtime exception"
[ https://issues.apache.org/jira/browse/HADOOP-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868828#comment-16868828 ] He Xiaoqiao commented on HADOOP-16385: -- {quote}HADOOP-16028 can be directly cherry-picked to 3.1, Doesn't require a separate patch. I will ping up there.{quote} All right, that makes sense. Thanks. > Namenode crashes with "RedundancyMonitor thread received Runtime exception" > --- > > Key: HADOOP-16385 > URL: https://issues.apache.org/jira/browse/HADOOP-16385 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > Attachments: HADOOP-16385.branch-3.1.001.patch > > > *Description:* While removing dead nodes, the Namenode went down with the error > "RedundancyMonitor thread received Runtime exception". > *Environment:* > Server OS: UBUNTU > No. of cluster nodes: 1 NN / 225 DNs / 3 ZK / 2 RM / 4850 NMs > 240 machines in total; each machine runs 21 docker containers (1 DN & 20 NMs) > *Steps:* > 1. Total number of containers in running state: ~53000 > 2. Because of the load, machines ran out of memory; the machine was restarted > and all of its docker containers, including NMs and DNs, were started again > 3. At some point the namenode throws the below error while removing a node, and > the NN goes down. > {noformat} > 2019-06-19 05:54:07,262 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-1550/255.255.117.195:23735 > 2019-06-19 05:54:07,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.117.151:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,281 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-4097/255.255.117.151:23735 > 2019-06-19 05:54:07,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.116.213:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,290 ERROR > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor > thread received Runtime exception. > java.lang.IllegalArgumentException: 247 should >= 248, and both should be > positive.
> at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:575) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:552) > at > org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:873) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:770) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:712) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:507) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:425) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargets(BlockPlacementPolicyDefault.java:311) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:290) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:143) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:103) > at > org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1902) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1854) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4842) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4709) > at java.lang.Thread.run(Thread.java:748) > 2019-06-19 05:54:07,296 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1: java.lang.IllegalArgumentException: 247 should >= 248, and both > should be positive. > 2019-06-19 05:54:07,298 INFO >
[jira] [Updated] (HADOOP-16385) Namenode crashes with "RedundancyMonitor thread received Runtime exception"
[ https://issues.apache.org/jira/browse/HADOOP-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16385: - Attachment: HADOOP-16385.branch-3.1.001.patch Status: Patch Available (was: Open) Just backported HADOOP-16028 to branch-3.1; pending Jenkins. cc [~ayushtkn], could you help take another review? > Namenode crashes with "RedundancyMonitor thread received Runtime exception" > --- > > Key: HADOOP-16385 > URL: https://issues.apache.org/jira/browse/HADOOP-16385 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > Attachments: HADOOP-16385.branch-3.1.001.patch > > > *Description:* While removing dead nodes, the Namenode went down with the error > "RedundancyMonitor thread received Runtime exception". > *Environment:* > Server OS: UBUNTU > No. of cluster nodes: 1 NN / 225 DNs / 3 ZK / 2 RM / 4850 NMs > 240 machines in total; each machine runs 21 docker containers (1 DN & 20 NMs) > *Steps:* > 1. Total number of containers in running state: ~53000 > 2. Because of the load, machines ran out of memory; the machine was restarted > and all of its docker containers, including NMs and DNs, were started again > 3. At some point the namenode throws the below error while removing a node, and > the NN goes down. > {noformat} > 2019-06-19 05:54:07,262 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-1550/255.255.117.195:23735 > 2019-06-19 05:54:07,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.117.151:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,281 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-4097/255.255.117.151:23735 > 2019-06-19 05:54:07,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.116.213:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,290 ERROR > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor > thread received Runtime exception. > java.lang.IllegalArgumentException: 247 should >= 248, and both should be > positive.
> at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:575) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:552) > at > org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:873) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:770) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:712) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:507) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:425) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargets(BlockPlacementPolicyDefault.java:311) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:290) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:143) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:103) > at > org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1902) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1854) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4842) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4709) > at java.lang.Thread.run(Thread.java:748) > 2019-06-19 05:54:07,296 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1: java.lang.IllegalArgumentException: 247 should >= 248, and both > should be positive. > 2019-06-19 05:54:07,298 INFO >
[jira] [Moved] (HADOOP-16385) Namenode crashes with "RedundancyMonitor thread received Runtime exception"
[ https://issues.apache.org/jira/browse/HADOOP-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao moved HDFS-14584 to HADOOP-16385: - Affects Version/s: (was: 3.1.1) 3.1.1 Key: HADOOP-16385 (was: HDFS-14584) Project: Hadoop Common (was: Hadoop HDFS) > Namenode crashes with "RedundancyMonitor thread received Runtime exception" > --- > > Key: HADOOP-16385 > URL: https://issues.apache.org/jira/browse/HADOOP-16385 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > > *Description:* While removing dead nodes, the Namenode went down with the error > "RedundancyMonitor thread received Runtime exception". > *Environment:* > Server OS: UBUNTU > No. of cluster nodes: 1 NN / 225 DNs / 3 ZK / 2 RM / 4850 NMs > 240 machines in total; each machine runs 21 docker containers (1 DN & 20 NMs) > *Steps:* > 1. Total number of containers in running state: ~53000 > 2. Because of the load, machines ran out of memory; the machine was restarted > and all of its docker containers, including NMs and DNs, were started again > 3. At some point the namenode throws the below error while removing a node, and > the NN goes down. > {noformat} > 2019-06-19 05:54:07,262 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-1550/255.255.117.195:23735 > 2019-06-19 05:54:07,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.117.151:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,281 INFO org.apache.hadoop.net.NetworkTopology: Removing > a node: /rack-4097/255.255.117.151:23735 > 2019-06-19 05:54:07,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > removeDeadDatanode: lost heartbeat from 255.255.116.213:23735, > removeBlocksFromBlockMap true > 2019-06-19 05:54:07,290 ERROR > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor > thread received Runtime exception. > java.lang.IllegalArgumentException: 247 should >= 248, and both should be > positive.
> at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:575) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:552) > at > org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:873) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:770) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:712) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:507) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:425) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargets(BlockPlacementPolicyDefault.java:311) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:290) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:143) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:103) > at > org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1902) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1854) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4842) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4709) > at java.lang.Thread.run(Thread.java:748) > 2019-06-19 05:54:07,296 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1: java.lang.IllegalArgumentException: 247 should >= 248, and both > should be positive. > 2019-06-19 05:54:07,298 INFO > org.apache.hadoop.hdfs.server.common.HadoopAuditLogger.audit: > process=Namenode operation=shutdown
[jira] [Updated] (HADOOP-15414) Job submit not work well on HDFS Federation with Transparent Encryption feature
[ https://issues.apache.org/jira/browse/HADOOP-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15414: - Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~xiaochen]; this was fixed by HADOOP-14445, so closing this issue. > Job submit not work well on HDFS Federation with Transparent Encryption > feature > --- > > Key: HADOOP-15414 > URL: https://issues.apache.org/jira/browse/HADOOP-15414 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Reporter: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15414-trunk.001.patch, > HADOOP-15414-trunk.002.patch > > > When submitting the sample MapReduce job WordCount, which reads/writes paths > under an encryption zone on HDFS Federation in secure mode, to YARN, the task > throws the exception below: > {code:java} > 18/04/26 16:07:26 INFO mapreduce.Job: Task Id : attempt_JOBID_m_TASKID_0, > Status : FAILED > Error: java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt) > at > org.apache.hadoop.crypto.key.kms.KMSClientProvider.createConnection(KMSClientProvider.java:489) > at > org.apache.hadoop.crypto.key.kms.KMSClientProvider.decryptEncryptedKey(KMSClientProvider.java:776) > at > org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388) > at > org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1468) > at > org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1538) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:306) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:300) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:300) > at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161) > at > org.apache.hadoop.fs.viewfs.ChRootedFileSystem.open(ChRootedFileSystem.java:258) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem.open(ViewFileSystem.java:424) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:793) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85) > at > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:552) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:823) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1690) > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168) > Caused by: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt) > at > org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:332) > at > org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:205) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:128) > at >
org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:215) > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.openConnection(DelegationTokenAuthenticatedURL.java:322) > at > org.apache.hadoop.crypto.key.kms.KMSClientProvider$1.run(KMSClientProvider.java:483) > at > org.apache.hadoop.crypto.key.kms.KMSClientProvider$1.run(KMSClientProvider.java:478) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1690) > at > org.apache.hadoop.crypto.key.kms.KMSClientProvider.createConnection(KMSClientProvider.java:478) > ... 21 more > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt) > at > sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) > at >
[jira] [Commented] (HADOOP-16112) Delete the baseTrashPath's subDir leads to don't modify baseTrashPath
[ https://issues.apache.org/jira/browse/HADOOP-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863689#comment-16863689 ] He Xiaoqiao commented on HADOOP-16112: -- [~jzhuge], thanks for your comments. {quote}the new unit test passes without any fix, is it valid? I understand race condition is hard to reproduce.{quote} Right, the new unit test does not actually verify anything, so in my opinion it is not a valid unit test. I would like to state that this case could be expected behavior, maybe not an issue. IMO it is acceptable whether the timestamp is appended to the parent or the child path before mkdir; either way, I do not think we can guarantee consistency on the client side through retries. Please correct me if something is wrong. > Delete the baseTrashPath's subDir leads to don't modify baseTrashPath > - > > Key: HADOOP-16112 > URL: https://issues.apache.org/jira/browse/HADOOP-16112 > Project: Hadoop Common > Issue Type: Bug > Components: common >Affects Versions: 3.2.0 >Reporter: Lisheng Sun >Priority: Major > Attachments: HADOOP-16112.001.patch, HADOOP-16112.002.patch > > > There is a race condition in TrashPolicyDefault#moveToTrash: > {code:java} > try { > if (!fs.mkdirs(baseTrashPath, PERMISSION)) { // create current > LOG.warn("Can't create(mkdir) trash directory: " + baseTrashPath); > return false; > } > } catch (FileAlreadyExistsException e) { > // find the path which is not a directory, and modify baseTrashPath > // & trashPath, then mkdirs > Path existsFilePath = baseTrashPath; > while (!fs.exists(existsFilePath)) { > existsFilePath = existsFilePath.getParent(); > } > // case: another thread deletes existsFilePath here; the result doesn't meet > // expectations. For example, there is > // /user/u_sunlisheng/.Trash/Current/user/u_sunlisheng/b. When deleting > // /user/u_sunlisheng/b/a, if existsFilePath is deleted, the result is > // /user/u_sunlisheng/.Trash/Current/user/u_sunlisheng+timestamp/b/a. > // So when existsFilePath is deleted, don't modify baseTrashPath. > baseTrashPath = new Path(baseTrashPath.toString().replace( > existsFilePath.toString(), existsFilePath.toString() + Time.now())); > trashPath = new Path(baseTrashPath, trashPath.getName()); > // retry, ignore current failure > --i; > continue; > } catch (IOException e) { > LOG.warn("Can't create trash directory: " + baseTrashPath, e); > cause = e; > break; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
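A self-contained sketch (plain strings, no HDFS involved) of the string-replace renaming quoted above, showing how the ancestor that receives the timestamp suffix depends on where the fs.exists probe loop stops; the paths are the ones from this report.
{code:java}
public class TrashRenameRaceDemo {
  public static void main(String[] args) {
    String baseTrashPath =
        "/user/u_sunlisheng/.Trash/Current/user/u_sunlisheng/b";
    long now = 1560000000000L; // stand-in for Time.now()

    // Expected case: the probe stops at the conflicting path .../b itself,
    // so only the deepest component is suffixed.
    String exists = "/user/u_sunlisheng/.Trash/Current/user/u_sunlisheng/b";
    System.out.println(baseTrashPath.replace(exists, exists + now));
    // -> .../Current/user/u_sunlisheng/b1560000000000

    // Race: another thread deletes .../b before the fs.exists check, the loop
    // walks up to .../u_sunlisheng, and a higher ancestor gets the suffix.
    exists = "/user/u_sunlisheng/.Trash/Current/user/u_sunlisheng";
    System.out.println(baseTrashPath.replace(exists, exists + now));
    // -> .../Current/user/u_sunlisheng1560000000000/b
  }
}
{code}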
[jira] [Commented] (HADOOP-16112) Delete the baseTrashPath's subDir leads to don't modify baseTrashPath
[ https://issues.apache.org/jira/browse/HADOOP-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862802#comment-16862802 ] He Xiaoqiao commented on HADOOP-16112: -- [~leosun08], thanks for your report, and sorry for the late comment. From a quick review of the #moveToTrash logic, I think the case you mentioned could be expected behavior, maybe not an issue. [~ferhui] and [~jzhuge] would be better placed to give further reviews. > Delete the baseTrashPath's subDir leads to don't modify baseTrashPath > - > > Key: HADOOP-16112 > URL: https://issues.apache.org/jira/browse/HADOOP-16112 > Project: Hadoop Common > Issue Type: Bug > Components: common >Affects Versions: 3.2.0 >Reporter: Lisheng Sun >Priority: Major > Attachments: HADOOP-16112.001.patch, HADOOP-16112.002.patch > > > There is a race condition in TrashPolicyDefault#moveToTrash: > {code:java} > try { > if (!fs.mkdirs(baseTrashPath, PERMISSION)) { // create current > LOG.warn("Can't create(mkdir) trash directory: " + baseTrashPath); > return false; > } > } catch (FileAlreadyExistsException e) { > // find the path which is not a directory, and modify baseTrashPath > // & trashPath, then mkdirs > Path existsFilePath = baseTrashPath; > while (!fs.exists(existsFilePath)) { > existsFilePath = existsFilePath.getParent(); > } > // case: another thread deletes existsFilePath here; the result doesn't meet > // expectations. For example, there is > // /user/u_sunlisheng/.Trash/Current/user/u_sunlisheng/b. When deleting > // /user/u_sunlisheng/b/a, if existsFilePath is deleted, the result is > // /user/u_sunlisheng/.Trash/Current/user/u_sunlisheng+timestamp/b/a. > // So when existsFilePath is deleted, don't modify baseTrashPath. > baseTrashPath = new Path(baseTrashPath.toString().replace( > existsFilePath.toString(), existsFilePath.toString() + Time.now())); > trashPath = new Path(baseTrashPath, trashPath.getName()); > // retry, ignore current failure > --i; > continue; > } catch (IOException e) { > LOG.warn("Can't create trash directory: " + baseTrashPath, e); > cause = e; > break; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-16112) Delete the baseTrashPath's subDir leads to don't modify baseTrashPath
[ https://issues.apache.org/jira/browse/HADOOP-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860040#comment-16860040 ] He Xiaoqiao commented on HADOOP-16112:
--
[~leosun08], do you mean that the trash directory should be {/user/test/.Trash/Current/user/test/a+timestamp/b} rather than {/user/test/.Trash/Current/user/test+timestamp/a/b}, which is the actual result in some corner cases?
[jira] [Comment Edited] (HADOOP-16112) Delete the baseTrashPath's subDir leads to don't modify baseTrashPath
[ https://issues.apache.org/jira/browse/HADOOP-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859920#comment-16859920 ] He Xiaoqiao edited comment on HADOOP-16112 at 6/10/19 10:55 AM:
--
Sorry, I don't catch the point. I just verified the unit test you attached in [^HADOOP-16112.002.patch], and it runs correctly against branch trunk locally, so I am confused about what this issue means. Would you mind explaining how to reproduce it? For instance (just an example, no more info):
{code:java}
fs -mkdir /user/test
fs -rm -r /user/test
fs -mkdir /user/test/a
fs -rm -r /user/test/b (another thread)
{code}
was (Author: hexiaoqiao):
Sorry, I don't catch the point. I just verified the unit test you attached in [^HADOOP-16112.002.patch], and it runs correctly, so I am confused about what this issue means. Would you mind explaining how to reproduce it? For instance (just an example, no more info):
{code:java}
fs -mkdir /user/test
fs -rm -r /user/test
fs -mkdir /user/test/a
fs -rm -r /user/test/b (another thread)
{code}
[jira] [Commented] (HADOOP-16112) Delete the baseTrashPath's subDir leads to don't modify baseTrashPath
[ https://issues.apache.org/jira/browse/HADOOP-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859920#comment-16859920 ] He Xiaoqiao commented on HADOOP-16112:
--
Sorry, I don't catch the point. I just verified the unit test you attached in [^HADOOP-16112.002.patch], and it runs correctly, so I am confused about what this issue means. Would you mind explaining how to reproduce it? For instance (just an example, no more info):
{code:java}
fs -mkdir /user/test
fs -rm -r /user/test
fs -mkdir /user/test/a
fs -rm -r /user/test/b (another thread)
{code}
[jira] [Commented] (HADOOP-16112) Delete the baseTrashPath's subDir leads to don't modify baseTrashPath
[ https://issues.apache.org/jira/browse/HADOOP-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859749#comment-16859749 ] He Xiaoqiao commented on HADOOP-16112:
--
[~leosun08], thanks for your report. Would you mind describing this issue in detail, including how to reproduce it?
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840523#comment-16840523 ] He Xiaoqiao commented on HADOOP-16161:
--
Thanks [~elgoiri] for the review and commit.
> NetworkTopology#getWeightUsingNetworkLocation return unexpected result
> ----------------------------------------------------------------------
>
> Key: HADOOP-16161
> URL: https://issues.apache.org/jira/browse/HADOOP-16161
> Project: Hadoop Common
> Issue Type: Bug
> Components: net
> Reporter: He Xiaoqiao
> Assignee: He Xiaoqiao
> Priority: Major
> Fix For: 3.3.0
> Attachments: HADOOP-16161.001.patch, HADOOP-16161.002.patch, HADOOP-16161.003.patch, HADOOP-16161.004.patch, HADOOP-16161.005.patch, HADOOP-16161.006.patch, HADOOP-16161.007.patch, HADOOP-16161.008.patch, HADOOP-16161.009.patch
>
> Consider the following scenario:
> 1. There are 4 slaves and a topology like:
>    Rack: /IDC/RACK1
>       hostname1
>       hostname2
>    Rack: /IDC/RACK2
>       hostname3
>       hostname4
> 2. A reader on hostname1 calculates the weight between itself and [hostname1, hostname3, hostname4] via #getWeight; the corresponding values are [0,4,4].
> 3. A reader on a client that is not in the topology, in the same IDC but in no rack of the topology, calculates the weight between itself and [hostname1, hostname3, hostname4] via #getWeightUsingNetworkLocation; the corresponding values are [2,2,2].
> 4. Other readers get similar results.
> The weight result for case #3 is obviously not the expected value; the truth is [4,4,4]. This issue may cause a reader not to follow the intended order: local -> local rack -> remote rack.
> After digging into the implementation, the root cause is that #getWeightUsingNetworkLocation only calculates the distance between racks rather than between hosts.
> I think we should add the constant 2 to correct the weight of #getWeightUsingNetworkLocation.
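For readers following along, a minimal sketch of the correction described above, inferred from this description rather than copied from the committed patch: the rack-only comparison misses the final host-level hop, so the constant 2 is added on top of the rack distance.
{code:java}
// Hedged sketch only. For /IDC/RACK1 vs /IDC/RACK2 the rack distance is
// 2, and +2 yields the expected off-rack weight 4; for the same rack the
// distance is 0, and +2 yields the same-rack weight 2.
static int getWeightUsingNetworkLocationSketch(String readerLoc,
                                               String nodeLoc) {
  String[] reader = readerLoc.split("/");
  String[] node = nodeLoc.split("/");
  int common = 0;
  while (common < Math.min(reader.length, node.length)
      && reader[common].equals(node[common])) {
    common++;
  }
  int rackDistance = (reader.length - common) + (node.length - common);
  return rackDistance + 2; // +2 corrects for the missing host-level hop
}
{code}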
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834419#comment-16834419 ] He Xiaoqiao commented on HADOOP-16161:
--
[~elgoiri], any further comments or suggestions about this fix?
[jira] [Commented] (HADOOP-16254) Add proxy address in IPC connection
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834418#comment-16834418 ] He Xiaoqiao commented on HADOOP-16254:
--
[^HADOOP-16254.004.patch]:
1. Sets the proxy address on the IPC connection, so it does not need to be passed on every call, which reduces RPC load.
2. Exposes the interface via the static method {{Client#setProxyAddress}} in this version, so anyone can set the field; suggestions for safeguards are welcome.
> Add proxy address in IPC connection
> -----------------------------------
>
> Key: HADOOP-16254
> URL: https://issues.apache.org/jira/browse/HADOOP-16254
> Project: Hadoop Common
> Issue Type: New Feature
> Components: ipc
> Reporter: He Xiaoqiao
> Assignee: He Xiaoqiao
> Priority: Major
> Attachments: HADOOP-16254.001.patch, HADOOP-16254.002.patch, HADOOP-16254.004.patch
>
> In order to support data locality in RBF, we need to add a new field for the client hostname to the RPC headers of Router protocol calls. clientHostname represents the hostname of the client and is forwarded by the Router to the Namenode to support data locality. See the [RBF Data Locality Design|https://issues.apache.org/jira/secure/attachment/12965092/RBF%20Data%20Locality%20Design.pdf] in HDFS-13248 and the [maillist vote|http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201904.mbox/%3CCAF3Ajax7hGxvowg4K_HVTZeDqC5H=3bfb7mv5sz5mgvadhv...@mail.gmail.com%3E].
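A minimal sketch of the shape this could take, with names assumed from the comment above rather than verified against the patch: a process-wide setter on {{Client}} whose value is attached once per connection instead of once per call.
{code:java}
// Hypothetical sketch, not the attached patch: the proxy service (e.g. a
// Router) sets the address once at startup; the IPC client then sends it
// in the connection context during connection setup.
public class Client {
  private static volatile String proxyAddress; // null for ordinary clients

  public static void setProxyAddress(String address) {
    proxyAddress = address;
  }

  private void writeConnectionContext(/* builder omitted */) {
    // when building the connection context message (assumed field name):
    // if (proxyAddress != null) { contextBuilder.setProxyAddress(proxyAddress); }
  }
}
{code}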
[jira] [Updated] (HADOOP-16254) Add proxy address in IPC connection
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16254:
--
Attachment: HADOOP-16254.004.patch
[jira] [Updated] (HADOOP-16254) Add proxy address in IPC connection
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16254:
--
Summary: Add proxy address in IPC connection (was: Add clientHostname to RPC header)
[jira] [Commented] (HADOOP-16254) Add clientHostname to RPC header
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819802#comment-16819802 ] He Xiaoqiao commented on HADOOP-16254:
--
Thanks [~daryn] and [~vinayrpet] for your further valuable advice.
{quote}Include complete client's socket address instead of just hostname(i.e. Hostname/IP:port ). This will help in identifying details about particular client if required. Instead of changing the RPC Request header, add the same field in "IpcConnectionContextProto" as suggested by Daryn Sharp in the previous Jira. Definitely don't want the peer address info passed on every call.{quote}
That makes sense to me. I fully agree to carry the complete client socket address and move it into "IpcConnectionContextProto"; that way there is no need to do a DNS resolution in #getRemoteAddress, and it also reduces RPC load.
{quote}I was having deja vu seeing this jira.{quote}
Yes, this ticket originates from HDFS-13248; based on further discussion and mailing list suggestions, I filed this issue.
{quote}Does it allow anyone to spoof addresses? If I didn't miss a safeguard, -1 on this massive security hole.{quote}
Regarding the security vulnerability, I think the exposure is limited. Taking RBF as an example:
1. The Router server will never use this field even if a client sets it.
2. We can reinforce checking at the RPC layer (treating the parameter as legal only if the current user/UGI is a superuser) for the case where a client sets proxyHostname and sends the RPC request directly to the Namenode.
The current patch is just a draft, and further suggestions remain welcome. Thanks all again.
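A sketch of the RPC-layer safeguard suggested in point 2 above, purely illustrative; the helper and parameter names are assumed, not actual Hadoop APIs:
{code:java}
// Hypothetical server-side check during connection setup: only honor a
// client-supplied proxy address when the authenticated caller is a
// superuser (e.g. the Router's service principal).
static String resolveClientAddress(String proxyAddressFromContext,
                                   String socketPeerAddress,
                                   boolean callerIsSuperUser) {
  if (proxyAddressFromContext != null && callerIsSuperUser) {
    // trusted proxy such as a Router: use the forwarded client address
    return proxyAddressFromContext;
  }
  // ordinary clients cannot spoof: fall back to the real socket peer
  return socketPeerAddress;
}
{code}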
[jira] [Commented] (HADOOP-16254) Add clientHostname to RPC header
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819018#comment-16819018 ] He Xiaoqiao commented on HADOOP-16254:
--
Thanks [~elgoiri]. [^HADOOP-16254.002.patch] renames the new field to the more generic 'proxyHostname' following the suggestions, and adds a new test to verify the protocol.
[jira] [Updated] (HADOOP-16254) Add clientHostname to RPC header
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16254:
--
Attachment: HADOOP-16254.002.patch
[jira] [Updated] (HADOOP-16254) Add clientHostname to RPC header
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16254:
--
Attachment: HADOOP-16254.001.patch
Status: Patch Available (was: Open)
Submitted a draft patch; unit tests will follow later. cc [~elgoiri], [~ajisakaa], [~ayushtkn], [~vinayrpet], [~giovanni.fumarola], please take a look at your convenience. Thanks.
[jira] [Updated] (HADOOP-16254) Add clientHostname to RPC header
[ https://issues.apache.org/jira/browse/HADOOP-16254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16254:
--
Description:
In order to support data locality in RBF, we need to add a new field for the client hostname to the RPC headers of Router protocol calls. clientHostname represents the hostname of the client and is forwarded by the Router to the Namenode to support data locality. See the [RBF Data Locality Design|https://issues.apache.org/jira/secure/attachment/12965092/RBF%20Data%20Locality%20Design.pdf] in HDFS-13248 and the [maillist vote|http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201904.mbox/%3CCAF3Ajax7hGxvowg4K_HVTZeDqC5H=3bfb7mv5sz5mgvadhv...@mail.gmail.com%3E].
was:
In order to support data locality in RBF, we need to add a new field for the client hostname to the RPC headers of Router protocol calls. clientHostname represents the hostname of the client and is forwarded by the Router to the Namenode to support data locality. See [^RBF Data Locality Design.pdf] in HDFS-13248 and the [maillist vote|http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201904.mbox/%3CCAF3Ajax7hGxvowg4K_HVTZeDqC5H=3bfb7mv5sz5mgvadhv...@mail.gmail.com%3E].
[jira] [Created] (HADOOP-16254) Add clientHostname to RPC header
He Xiaoqiao created HADOOP-16254:
--
Summary: Add clientHostname to RPC header
Key: HADOOP-16254
URL: https://issues.apache.org/jira/browse/HADOOP-16254
Project: Hadoop Common
Issue Type: New Feature
Components: ipc
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809484#comment-16809484 ] He Xiaoqiao commented on HADOOP-16161:
--
Thanks [~elgoiri] for correcting the {{assertEquals}} usage again; [^HADOOP-16161.009.patch] updates that. Maybe I need to highlight this rule. :)
{quote}assertEquals() should have the expected value as the first parameter and the second to be the one checking{quote}
Thanks again.
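For reference, the convention quoted above in JUnit terms (variable and method names here are illustrative):
{code:java}
// expected value first, actual value second
assertEquals(4, topology.getDistance(reader, node));    // correct order
// assertEquals(topology.getDistance(reader, node), 4); // reversed: avoid
{code}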
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161:
--
Attachment: HADOOP-16161.009.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808867#comment-16808867 ] He Xiaoqiao commented on HADOOP-16161:
--
Thanks [~elgoiri]. [^HADOOP-16161.008.patch] updates the unit test and replaces all #assertThat usages with assertEquals. Pending Jenkins.
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161:
--
Attachment: HADOOP-16161.008.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806728#comment-16806728 ] He Xiaoqiao commented on HADOOP-16161:
--
Thanks [~elgoiri] for your reviews. [^HADOOP-16161.007.patch] adds a new unit test to cover #sortLocatedBlocks.
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161:
--
Attachment: HADOOP-16161.007.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791292#comment-16791292 ] He Xiaoqiao commented on HADOOP-16161:
--
[~elgoiri], thanks for digging deep to find the truth.
{quote}The only issue would be if some code had a dependency on particular values of the weight. However, it looks like the tests are passing with no issues.{quote}
Actually, #getWeightUsingNetworkLocation is only invoked by getBlockLocations, which determines the order of block locations; in a word, #getWeightUsingNetworkLocation is about read locality. However, I could not find any unit test in #blockmanagement that directly verifies the result of #sortLocatedBlocks, so it is understandable that the tests there pass.
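A rough outline of the kind of direct check that appears to be missing. This is illustrative only: the {{blockWithReplicasOn}} helper is assumed, and a real test would need a mini cluster or a mocked DatanodeManager behind it.
{code:java}
// Hypothetical test sketch: sort located blocks for a reader that is in
// the same IDC as one replica but in no rack of the topology, and assert
// that the same-IDC replica sorts ahead of a remote-IDC replica.
@Test
public void testSortLocatedBlocksForExternalReader() throws Exception {
  // replicas on a same-IDC host and a remote-IDC host (assumed helper)
  LocatedBlock block = blockWithReplicasOn("/IDC/RACK1/hostname1",
                                           "/IDC2/RACK9/hostname9");
  datanodeManager.sortLocatedBlocks("reader.in.same.idc",
      Collections.singletonList(block));
  // with the host-level correction, weights differentiate the two hosts
  // instead of both collapsing to the rack-only value
  assertEquals("hostname1", block.getLocations()[0].getHostName());
}
{code}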
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790333#comment-16790333 ] He Xiaoqiao commented on HADOOP-16161:
--
Thanks [~elgoiri]. Do we need one more review?
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788842#comment-16788842 ] He Xiaoqiao commented on HADOOP-16161:
--
Thanks [~elgoiri]. [^HADOOP-16161.006.patch] corrects the formatting of the unit test code. Please help review it at your convenience.
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161:
--
Attachment: HADOOP-16161.006.patch
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161:
--
Attachment: HADOOP-16161.005.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788685#comment-16788685 ] He Xiaoqiao commented on HADOOP-16161:
--
[~elgoiri], thanks.
{quote}One minor thing: the assertEquals() in testGetWeight() are still reversed.{quote}
Right, I updated and reversed the assertEquals parameters in [^HADOOP-16161.005.patch].
{quote}We could also initialize nodeNotInMap right before using it.{quote}
Sorry, I don't get the point: the variable #nodeNotInMap is already initialized in #testGetWeight and #testGetWeightForDepth respectively, before it is used. If you mean initializing it in #setup, that may be infeasible, since the depth of the topology differs between test cases. Please correct me if I am wrong. Thanks again.
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161:
--
Attachment: HADOOP-16161.004.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787499#comment-16787499 ] He Xiaoqiao commented on HADOOP-16161: -- Thanks [~elgoiri] for the reviews. Updated the patch following the comments and uploaded [^HADOOP-16161.004.patch]. > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch, HADOOP-16161.002.patch, HADOOP-16161.003.patch, HADOOP-16161.004.patch
[jira] [Commented] (HADOOP-16119) KMS on Hadoop RPC Engine
[ https://issues.apache.org/jira/browse/HADOOP-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786438#comment-16786438 ] He Xiaoqiao commented on HADOOP-16119: -- [~jojochuang] Thanks for your quick response, and sorry for the fuzzy expression. {quote}Regarding delegation tokens – delegation tokens are stored in zookeeper, and after HADOOP-14445, delegation tokens are shared among KMS instances.{quote} My branch is based on branch-2.7 without HADOOP-14445, so that makes sense to me. If we consider just the community version (including branch trunk), it seems to offer only local storage with the Java KeyStore and no other choice; please correct me if I am wrong. Looking forward to CKTS being open sourced. About the "HA" part, I mean that adding/removing a KMS instance, or an instance fault, is not transparent to the client. The title "HA" may mislead; I think this is also a scalability issue, sorry for that. :) > KMS on Hadoop RPC Engine > > > Key: HADOOP-16119 > URL: https://issues.apache.org/jira/browse/HADOOP-16119 > Project: Hadoop Common > Issue Type: New Feature >Reporter: Jonathan Eagles >Assignee: Wei-Chiu Chuang >Priority: Major > Attachments: Design doc_ KMS v2.pdf > > > Per discussion on common-dev and text copied here for ease of reference. > https://lists.apache.org/thread.html/0e2eeaf07b013f17fad6d362393f53d52041828feec53dcddff04808@%3Ccommon-dev.hadoop.apache.org%3E > {noformat} > Thanks all for the inputs, > To offer additional information (while Daryn is working on his stuff), > optimizing RPC encryption opens up another possibility: migrating KMS > service to use Hadoop RPC. > Today's KMS uses HTTPS + REST API, much like webhdfs. It has very > undesirable performance (a few thousand ops per second) compared to > NameNode. Unfortunately for each NameNode namespace operation you also need > to access KMS too. > Migrating KMS to Hadoop RPC greatly improves its performance (if > implemented correctly), and RPC encryption would be a prerequisite. So > please keep that in mind when discussing the Hadoop RPC encryption > improvements. Cloudera is very interested to help with the Hadoop RPC > encryption project because a lot of our customers are using at-rest > encryption, and some of them are starting to hit KMS performance limit. > This whole "migrating KMS to Hadoop RPC" was Daryn's idea. I heard this > idea in the meetup and I am very thrilled to see this happening because it > is a real issue bothering some of our customers, and I suspect it is the > right solution to address this tech debt. > {noformat}
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786402#comment-16786402 ] He Xiaoqiao commented on HADOOP-16161: -- Thanks [~elgoiri]. Resubmitted [^HADOOP-16161.003.patch] following the review comments; pending Jenkins. > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch, HADOOP-16161.002.patch, HADOOP-16161.003.patch
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161: - Attachment: HADOOP-16161.003.patch > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch, HADOOP-16161.002.patch, HADOOP-16161.003.patch
[jira] [Comment Edited] (HADOOP-16119) KMS on Hadoop RPC Engine
[ https://issues.apache.org/jira/browse/HADOOP-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785610#comment-16785610 ] He Xiaoqiao edited comment on HADOOP-16119 at 3/6/19 1:12 PM: -- [~jojochuang] I would like to list the issues with the current version of KMS that I have met in practice. # Scalability: it is currently difficult to scale KMS instances smoothly, since delegation tokens and all key data are completely isolated between the different KMS instances. # Transparency: the KMS client has to update its configuration even when a single KMS instance is added. # HA: KMS instances form a peer-to-peer architecture, but a client has to try them one by one until success if one of them faults; the cost is very high. # Data Consistency: each KMS instance manages keys in an isolated Java KeyStore, and the KMS client sends a create-key request to all KMS instances serially; if one of them fails for some reason, the create request throws an exception and the keys in the KeyStores of the different instances are no longer consistent; there is also no background check, as far as I know. Some of these are also mentioned as exit criteria in [^Design doc_ KMS v2.pdf] by [~jojochuang]. In a word, I think the core issue is that there is no shared storage between the different instances. I propose to create a pluggable ShareStore (file/DBMS/ZooKeeper) behind the KMS instances and make the KMS stateless. Shared storage seems to work well in RBF, for reference. On the other side, we can also retain the cache mechanism to improve performance. [~jojochuang] please do correct me if I am wrong. was (Author: hexiaoqiao): [~jojochuang] I would like to offer issues about current version of KMS that I meet in practice. # Scalability: now it is difficult to scale KMS instance friendly since delegation token and all key data are isolated between different KMS instances at all. # Transparent: KMSClient has to upgrade the configuration even when add one KMS instance. # HA: it seems that KMS instances is peer-to-peer arch. but client has to try one by one util success if some one fault. the cost is very high. # Data Consistency: KMS instance manages key by Java KeyStore isolated, KMS client request to create key to all KMS instances serially, if one of them failed for some reason, create-request will throw exception and key in KeyStore of different instances will be same completely, also no check background as far as I know. I think the core issue is no share-storage between different instance in one word. I propose to create a plugin ShareStore as file/dbms/zookeeper behind KMS instance, and let KMS stateless. It seems work well using the share storage reference RBF. Another side, we can also retain cache mechanism to improve the performance. [~jojochuang] please do correct me if I am wrong. > KMS on Hadoop RPC Engine > > > Key: HADOOP-16119 > URL: https://issues.apache.org/jira/browse/HADOOP-16119 > Project: Hadoop Common > Issue Type: New Feature >Reporter: Jonathan Eagles >Assignee: Wei-Chiu Chuang >Priority: Major > Attachments: Design doc_ KMS v2.pdf
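To make the shared-storage proposal above more concrete, here is a minimal sketch of what a pluggable store behind stateless KMS instances might look like. The ShareStore name comes from the comment; every method and type below is hypothetical, not an existing Hadoop API:
{code:java}
import java.io.IOException;
import java.util.List;

/**
 * Hypothetical pluggable shared store behind stateless KMS instances.
 * Implementations could be backed by a file, a DBMS, or ZooKeeper, so
 * that key material and delegation-token state are no longer isolated
 * per instance.
 */
public interface ShareStore {

  /** Persist a key version so that every KMS instance sees the same data. */
  void storeKeyVersion(String keyName, byte[] material) throws IOException;

  /** Read back the latest version of a key. */
  byte[] getLatestKeyVersion(String keyName) throws IOException;

  /** List all key names, e.g. for background consistency checks. */
  List<String> listKeyNames() throws IOException;

  /** Store shared delegation-token state so tokens survive instance changes. */
  void storeDelegationToken(String tokenId, byte[] password) throws IOException;
}
{code}
A per-instance read cache in front of such a store would preserve the performance benefit that the existing cache mechanism provides today.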
[jira] [Commented] (HADOOP-16119) KMS on Hadoop RPC Engine
[ https://issues.apache.org/jira/browse/HADOOP-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785610#comment-16785610 ] He Xiaoqiao commented on HADOOP-16119: -- [~jojochuang] I would like to list the issues with the current version of KMS that I have met in practice. # Scalability: it is currently difficult to scale KMS instances smoothly, since delegation tokens and all key data are completely isolated between the different KMS instances. # Transparency: the KMS client has to update its configuration even when a single KMS instance is added. # HA: KMS instances form a peer-to-peer architecture, but a client has to try them one by one until success if one of them faults; the cost is very high. # Data Consistency: each KMS instance manages keys in an isolated Java KeyStore, and the KMS client sends a create-key request to all KMS instances serially; if one of them fails for some reason, the create request throws an exception and the keys in the KeyStores of the different instances are no longer consistent; there is also no background check, as far as I know. In a word, I think the core issue is that there is no shared storage between the different instances. I propose to create a pluggable ShareStore (file/DBMS/ZooKeeper) behind the KMS instances and make the KMS stateless. Shared storage seems to work well in RBF, for reference. On the other side, we can also retain the cache mechanism to improve performance. [~jojochuang] please do correct me if I am wrong. > KMS on Hadoop RPC Engine > > > Key: HADOOP-16119 > URL: https://issues.apache.org/jira/browse/HADOOP-16119 > Project: Hadoop Common > Issue Type: New Feature >Reporter: Jonathan Eagles >Assignee: Wei-Chiu Chuang >Priority: Major > Attachments: Design doc_ KMS v2.pdf
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784754#comment-16784754 ] He Xiaoqiao commented on HADOOP-16161: -- Uploaded patch [^HADOOP-16161.002.patch], adding a more complex topology to test deeper-level nodes. > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch, HADOOP-16161.002.patch
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161: - Attachment: HADOOP-16161.002.patch > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch, HADOOP-16161.002.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784734#comment-16784734 ] He Xiaoqiao commented on HADOOP-16161: -- [~elgoiri], thanks for your continued follow-up. In my test it can cover any depth of topology. I would like to add more unit tests to cover your concerns later. > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784467#comment-16784467 ] He Xiaoqiao commented on HADOOP-16161: -- I would like to offer some more comments on this issue. The following is the complete #getWeightUsingNetworkLocation code (used only for readers that are not datanodes), based on branch trunk. a. Normalize both the reader's and the node's network locations; the result is the *rack location* (marked readerPath and nodePath), i.e. the parent of the reader or datanode, both calculated by the rack-awareness script if one is configured. b. Split both network locations by slash, then take the smaller of the two depths. c. Find the deepest node that is the common ancestor/parent of the network locations from step a. d. Based on step c, calculate the topology distance between readerPath and nodePath. All of the above steps are correct, but the result is the distance between the parent of the reader and the parent of the node, rather than from the reader to the node. So adding a constant +2 can fix this issue, I think. Discussion is welcome, and please help correct me if anything is wrong. {code:java} private static int getWeightUsingNetworkLocation(Node reader, Node node) { //Start off by initializing to Integer.MAX_VALUE int weight = Integer.MAX_VALUE; if(reader != null && node != null) { String readerPath = normalizeNetworkLocationPath( reader.getNetworkLocation()); String nodePath = normalizeNetworkLocationPath( node.getNetworkLocation()); //same rack if(readerPath.equals(nodePath)) { if(reader.getName().equals(node.getName())) { weight = 0; } else { weight = 2; } } else { String[] readerPathToken = readerPath.split(PATH_SEPARATOR_STR); String[] nodePathToken = nodePath.split(PATH_SEPARATOR_STR); int maxLevelToCompare = readerPathToken.length > nodePathToken.length ? nodePathToken.length : readerPathToken.length; int currentLevel = 1; //traverse through the path and calculate the distance while(currentLevel < maxLevelToCompare) { if(!readerPathToken[currentLevel] .equals(nodePathToken[currentLevel])){ break; } currentLevel++; } weight = (readerPathToken.length - currentLevel) + (nodePathToken.length - currentLevel); } } return weight; } {code} > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch
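To make the off-by-two described above concrete, here is a small standalone trace of the method's distance calculation on the topology from the description (the paths and the main method are illustrative, not part of the patch):
{code:java}
public class GetWeightTrace {
  public static void main(String[] args) {
    // Reader outside the topology, same IDC, unknown rack vs. a known rack.
    String[] readerPathToken = "/IDC/RACK9".split("/"); // ["", "IDC", "RACK9"]
    String[] nodePathToken = "/IDC/RACK1".split("/");   // ["", "IDC", "RACK1"]

    int maxLevelToCompare = Math.min(readerPathToken.length, nodePathToken.length);
    int currentLevel = 1;
    while (currentLevel < maxLevelToCompare
        && readerPathToken[currentLevel].equals(nodePathToken[currentLevel])) {
      currentLevel++; // stops at level 2, where "RACK9" != "RACK1"
    }

    // Rack-to-rack distance, as the current code computes it: (3-2)+(3-2) = 2.
    int weight = (readerPathToken.length - currentLevel)
        + (nodePathToken.length - currentLevel);
    System.out.println("rack-to-rack weight = " + weight);        // prints 2
    // The host-to-host distance adds one hop down on each side, hence
    // the proposed "+ 2" correction: 2 + 2 = 4, matching the description.
    System.out.println("host-to-host weight = " + (weight + 2));  // prints 4
  }
}
{code}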
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783600#comment-16783600 ] He Xiaoqiao commented on HADOOP-16161: -- [~elgoiri], I think it is not related to the depth of the topology, since #getWeightUsingNetworkLocation calculates only with the network locations rather than the leaves of the topology. FYI. > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch
[jira] [Moved] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao moved HDFS-14332 to HADOOP-16161: - Component/s: (was: namenode) net Key: HADOOP-16161 (was: HDFS-14332) Project: Hadoop Common (was: Hadoop HDFS) > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch, HDFS-14332.001.patch
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161: - Attachment: (was: HDFS-14332.001.patch) > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch
[jira] [Commented] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783380#comment-16783380 ] He Xiaoqiao commented on HADOOP-16161: -- Moved from the HDFS project to COMMON and renamed the patch. > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch
[jira] [Updated] (HADOOP-16161) NetworkTopology#getWeightUsingNetworkLocation return unexpected result
[ https://issues.apache.org/jira/browse/HADOOP-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-16161: - Attachment: HADOOP-16161.001.patch > NetworkTopology#getWeightUsingNetworkLocation return unexpected result > -- > > Key: HADOOP-16161 > URL: https://issues.apache.org/jira/browse/HADOOP-16161 > Project: Hadoop Common > Issue Type: Bug > Components: net >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-16161.001.patch, HDFS-14332.001.patch
[jira] [Updated] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15864: - Attachment: HADOOP-15864.005.patch Status: Patch Available (was: Reopened) Re-uploaded patch [^HADOOP-15864.005.patch] (same as the last one), just to trigger Jenkins. > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.2 > > Attachments: HADOOP-15864-branch.2.7.001.patch, HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, HADOOP-15864.004.patch, HADOOP-15864.005.patch, HADOOP-15864.branch.2.7.004.patch > > > Job submission and task execution fail if the Standby NameNode domain name can not be resolved on HDFS HA with the DelegationToken feature. > This issue is triggered when creating a {{ConfiguredFailoverProxyProvider}} instance, which invokes {{HAUtil.cloneDelegationTokenForLogicalUri}} in HA mode with security. Since in HDFS HA mode the UGI needs to include a separate token for each NameNode in order to deal with the Active-Standby switch, the two tokens' content is of course the same. > However, #setTokenService in {{HAUtil.cloneDelegationTokenForLogicalUri}} checks whether the address of the NameNode has been resolved or not; if not, it throws an #IllegalArgumentException, and the job submitter / task executor fails. > HDFS-8068 and HADOOP-12125 tried to fix this, but I don't think those two tickets resolve it completely. > Another question many people ask is why the NameNode domain name can not be resolved. I think there are many scenarios, for instance replacing a node after a fault, or a DNS refresh. In any case, a Standby NameNode failure should not impact Hadoop cluster stability, in my opinion. > a. 
code ref: org.apache.hadoop.security.SecurityUtil line373-386 > {code:java} > public static Text buildTokenService(InetSocketAddress addr) { > String host = null; > if (useIpForTokenService) { > if (addr.isUnresolved()) { // host has no ip address > throw new IllegalArgumentException( > new UnknownHostException(addr.getHostName()) > ); > } > host = addr.getAddress().getHostAddress(); > } else { > host = StringUtils.toLowerCase(addr.getHostName()); > } > return new Text(host + ":" + addr.getPort()); > } > {code} > b.exception log ref: > {code:xml} > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Couldn't create proxy provider class > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > at > org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:515) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:170) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:761) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:691) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:150) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2713) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2747) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2729) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385) > at > org.apache.hadoop.fs.viewfs.ChRootedFileSystem.(ChRootedFileSystem.java:106) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem$1.getTargetFileSystem(ViewFileSystem.java:178) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem$1.getTargetFileSystem(ViewFileSystem.java:172) > at org.apache.hadoop.fs.viewfs.InodeTree.createLink(InodeTree.java:303) > at org.apache.hadoop.fs.viewfs.InodeTree.(InodeTree.java:377) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem$1.(ViewFileSystem.java:172) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem.initialize(ViewFileSystem.java:172) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2713) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2747) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2729) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385) > at
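As a small illustration of the failure mode quoted above — an unresolved InetSocketAddress tripping the check in SecurityUtil#buildTokenService — here is a standalone sketch using only JDK types (the hostname is made up):
{code:java}
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class UnresolvedTokenServiceDemo {
  public static void main(String[] args) {
    // A standby NameNode whose DNS entry has disappeared (hypothetical host).
    InetSocketAddress addr =
        InetSocketAddress.createUnresolved("standby-nn.example.com", 8020);

    // This is the condition buildTokenService tests before throwing an
    // IllegalArgumentException wrapping an UnknownHostException.
    if (addr.isUnresolved()) {
      IllegalArgumentException e = new IllegalArgumentException(
          new UnknownHostException(addr.getHostName()));
      System.out.println("job submission would fail with: " + e);
    }
  }
}
{code}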
[jira] [Updated] (HADOOP-15883) Fix WebHdfsFileSystemContract test
[ https://issues.apache.org/jira/browse/HADOOP-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15883: - Resolution: Won't Fix Status: Resolved (was: Patch Available) > Fix WebHdfsFileSystemContract test > -- > > Key: HADOOP-15883 > URL: https://issues.apache.org/jira/browse/HADOOP-15883 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.0.4, 3.3.0, 3.1.2, 3.2.1 >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15883.001.patch > > > HADOOP-15864 fixed a bug where jobs/tasks failed to execute when a server (NameNode, KMS, Timeline) domain name could not be resolved. In doing so it changed the semantics of the HTTP status code in WebHdfsFileSystem; this ticket tracks fixing TestWebHdfsFileSystemContract#testResponseCode.
[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781888#comment-16781888 ] He Xiaoqiao commented on HADOOP-15864: -- Uploaded new patch [^HADOOP-15864.004.patch]; I try to fix this issue with a new configuration option in order to avoid this fix affecting other unit tests. > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15864-branch.2.7.001.patch, HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, HADOOP-15864.004.patch, HADOOP-15864.branch.2.7.004.patch
[jira] [Updated] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15864: - Attachment: HADOOP-15864.004.patch > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15864-branch.2.7.001.patch, HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, HADOOP-15864.004.patch, HADOOP-15864.branch.2.7.004.patch
[jira] [Commented] (HADOOP-16119) KMS on Hadoop RPC Engine
[ https://issues.apache.org/jira/browse/HADOOP-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781284#comment-16781284 ] He Xiaoqiao commented on HADOOP-16119: -- Thanks [~jojochuang], this is interesting work. Since I have deployed KMS to support massive column encryption for a long time, the KMS performance improvement rather appeals to me, and I would like to join and contribute to this work. > KMS on Hadoop RPC Engine > > > Key: HADOOP-16119 > URL: https://issues.apache.org/jira/browse/HADOOP-16119 > Project: Hadoop Common > Issue Type: New Feature >Reporter: Jonathan Eagles >Assignee: Wei-Chiu Chuang >Priority: Major > Attachments: Design doc_ KMS v2.pdf
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739049#comment-16739049 ] He Xiaoqiao commented on HADOOP-15922: -- [~daryn],[~eyang] Thanks for pushing forward this issue, and sorry for the late response. Please let me know if there is something I missed. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, HADOOP-15922.006.patch, HADOOP-15922.007.patch > > > DelegationTokenAuthenticationFilter gets the wrong doAsUser when the proxy user from the client is a complete Kerberos name (e.g., user/hostn...@realm.com; actually this is acceptable), because DelegationTokenAuthenticationFilter does not decode the DOAS parameter in the URL, which is encoded by {{URLEncoder}} at the client. > e.g. taking KMS as an example: > a. KMSClientProvider creates a connection to the KMS server using DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider acts as a doAsUser, KMSClientProvider will put {{doas}} with the URL-encoded user as one parameter of the HTTP request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. When the KMS server receives the request, it does not decode the proxy user. As a result, the KMS server will get the wrong proxy user if this proxy user is a complete Kerberos name or includes some special character, and authentication and authorization exceptions follow from it.
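Since the client encodes the doAs value with URLEncoder, the symmetric fix the description implies is a URLDecoder.decode on the server side. A minimal sketch using JDK APIs only (the method here is illustrative, not the actual filter code from the patch):
{code:java}
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DoAsDecodeSketch {

  /** Decode the doAs query parameter symmetrically to the client's encode. */
  static String decodeDoAs(String rawDoAs) throws UnsupportedEncodingException {
    // Without this, a value like "user%2Fhostname%40REALM.COM" would be
    // matched verbatim against proxy-user ACLs and fail authorization.
    return URLDecoder.decode(rawDoAs, "UTF-8");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(decodeDoAs("user%2Fhostname%40REALM.COM"));
    // prints: user/hostname@REALM.COM
  }
}
{code}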
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734894#comment-16734894 ] He Xiaoqiao commented on HADOOP-15922: -- Checked the failed junit test (hadoop.security.ssl.TestSSLFactory) and found it has failed several times before; it may not be related to this patch. [~eyang], please help to double-check. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, HADOOP-15922.006.patch, HADOOP-15922.007.patch
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734852#comment-16734852 ] He Xiaoqiao commented on HADOOP-15922: -- [~eyang] Thank you for pushing forward HADOOP-15996 and this issue. Updated and uploaded v007, which adapts to the MIT auth_to_local mechanism based on HADOOP-15996. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, > HADOOP-15922.006.patch, HADOOP-15922.007.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.007.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, > HADOOP-15922.006.patch, HADOOP-15922.007.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: (was: HADOOP-15922.007.patch) > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, > HADOOP-15922.006.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.007.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, > HADOOP-15922.006.patch, HADOOP-15922.007.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15959) revert HADOOP-12751
[ https://issues.apache.org/jira/browse/HADOOP-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706764#comment-16706764 ] He Xiaoqiao commented on HADOOP-15959: -- hi [~ste...@apache.org], [~ajisakaa], IIUC, this is a common issue: after the revert of HADOOP-12751, any auth_to_local rule whose result includes '/' or '@' will always fail the check in KerberosName#apply, since it throws a NoMatchingRule exception. {quote}if (result != null && nonSimplePattern.matcher(result).find()) { throw new NoMatchingRule("Non-simple name " + result + " after auth_to_local rule " + this); }{quote} Another case is HADOOP-15922. Please check. +1. > revert HADOOP-12751 > --- > > Key: HADOOP-15959 > URL: https://issues.apache.org/jira/browse/HADOOP-15959 > Project: Hadoop Common > Issue Type: Improvement > Components: security >Affects Versions: 3.2.0, 3.1.1, 2.9.2, 3.0.3, 2.7.7, 2.8.5 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.2.0, 2.7.8, 3.0.4, 3.1.2, 2.8.6, 2.9.3 > > Attachments: HADOOP-15959-001.patch, HADOOP-15959-branch-2-002.patch, > HADOOP-15959-branch-2.7-003.patch > > > HADOOP-12751 doesn't quite work right. Revert. > (this patch is so jenkins can do the test runs) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706737#comment-16706737 ] He Xiaoqiao commented on HADOOP-15922: -- [~eyang] Thanks for your feedback. Actually, if we use 'foo/localhost' as the impersonated user in the unit test, KerberosName cannot pass the check after the revert of HADOOP-12751, since KerberosName#apply checks whether the impersonated user name includes '/' or '@' when applying a rule like 'RULE:[2:$1/$2]', and it throws an exception because of HADOOP-12751. {quote} if (result != null && nonSimplePattern.matcher(result).find()) { throw new NoMatchingRule("Non-simple name " + result + " after auth_to_local rule " + this); }{quote} Before the revert of HADOOP-12751, this check only logged at INFO level and did not throw an exception. IIUC, this is a common issue: with auth_to_local, any rule whose result includes '/' or '@' always throws an exception. FYI. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, > HADOOP-15922.006.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
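A rough reproduction sketch of the check described above, assuming the hadoop-auth KerberosName API; the rule string is the one quoted in the comment, and the principal is illustrative.
{code:java}
import org.apache.hadoop.security.authentication.util.KerberosName;

public class NonSimpleNameRepro {
  public static void main(String[] args) throws Exception {
    // Keep both principal components, so the mapped name still has '/'.
    KerberosName.setRules("RULE:[2:$1/$2]\nDEFAULT");
    KerberosName name = new KerberosName("foo/localhost@REALM.COM");
    // After the revert of HADOOP-12751, the non-simple result throws
    // NoMatchingRule ("Non-simple name ...") instead of being logged
    // at INFO level and returned.
    System.out.println(name.getShortName());
  }
}
{code}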
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706207#comment-16706207 ] He Xiaoqiao commented on HADOOP-15922: -- [~eyang] Updated and re-uploaded v006, which uses the special character '%' in place of '/' in TestKMS#testGetDelegationTokenByProxyUser compared to v005. After the revert of HADOOP-12751, we would have to configure complex auth_to_local rules to pass authentication. In order to check only that the client does not double-encode the doAs user name, I chose another special character, '%', so no auth_to_local rules need to be imported. FYI. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, > HADOOP-15922.006.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
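To show what the test is really guarding against, here is a small sketch of why a '%' in the user name distinguishes a single encode from a double encode; the names are illustrative, not from the patch.
{code:java}
import java.net.URLDecoder;
import java.net.URLEncoder;

public class DoubleEncodeCheck {
  public static void main(String[] args) throws Exception {
    String user = "foo%localhost";
    String once = URLEncoder.encode(user, "UTF-8");  // foo%25localhost
    String twice = URLEncoder.encode(once, "UTF-8"); // foo%2525localhost
    // The server decodes exactly once, so the name survives the round
    // trip only if the client encoded it exactly once.
    System.out.println(URLDecoder.decode(once, "UTF-8").equals(user));  // true
    System.out.println(URLDecoder.decode(twice, "UTF-8").equals(user)); // false
  }
}
{code}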
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.006.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch, > HADOOP-15922.006.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703412#comment-16703412 ] He Xiaoqiao commented on HADOOP-15922: -- [~daryn] Thank you for your correction. I uploaded another patch, v005, following your suggestions. Could you help revert the commit and review the new one? Thanks again. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.005.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Fix For: 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch, HADOOP-15922.005.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690802#comment-16690802 ] He Xiaoqiao commented on HADOOP-15922: -- [~eyang] Thank you for pushing forward this issue. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690402#comment-16690402 ] He Xiaoqiao commented on HADOOP-15922: -- Hi [~eyang] {quote}If the client is changed to proxy from client/host, then hadoop.kms.proxyuser.client.hosts should include host: conf.set("hadoop.kms.proxyuser.client.hosts", "localhost,host"); {quote} In the unit test, user 'client/host' tries to impersonate 'foo/localhost'. Actually, the configuration keys 'hadoop.kms.proxyuser.client.users'/'hadoop.kms.proxyuser.client.hosts' define which users/groups/hosts user 'client' may impersonate, and that is indeed valid. The impersonated user 'foo/localhost' can pass auth since: {code:java} conf.set("hadoop.kms.proxyuser.client.users", "foo/localhost"); conf.set("hadoop.kms.proxyuser.client.hosts", "localhost"); {code} It is not necessary to check groups if the users check passes; ref. org.apache.hadoop.security.authorize.AccessControlList#isUserInList {code:java} public final boolean isUserInList(UserGroupInformation ugi) { if (allAllowed || users.contains(ugi.getShortUserName())) { return true; } else if (!groups.isEmpty()) { for (String group : ugi.getGroups()) { if (groups.contains(group)) { return true; } } } return false; } {code} {quote}I am not sure why KMS doesn't use standard hadoop.proxyuser.client.groups and hadoop.proxyuser.client.hosts.{quote} The configuration prefix 'hadoop.kms' for KMS originates from HADOOP-10433; however, I could not find why KMS uses a non-standard configuration. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
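As a rough sketch of how those keys feed the impersonation check, assuming the ProxyUsers API with the KMS-specific prefix; the addresses and user names are illustrative, and the exact wiring inside KMS may differ.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.ProxyUsers;

public class KmsProxyUserSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(false);
    // 'client' may impersonate exactly 'foo/localhost' from localhost.
    conf.set("hadoop.kms.proxyuser.client.users", "foo/localhost");
    conf.set("hadoop.kms.proxyuser.client.hosts", "localhost");
    ProxyUsers.refreshSuperUserGroupsConfiguration(conf, "hadoop.kms.proxyuser");

    UserGroupInformation realUser = UserGroupInformation.createRemoteUser("client");
    UserGroupInformation proxyUgi =
        UserGroupInformation.createProxyUser("foo/localhost", realUser);
    // Throws AuthorizationException when impersonation is not allowed;
    // the users list matches first, so groups are never consulted here.
    ProxyUsers.authorize(proxyUgi, "127.0.0.1");
    System.out.println("impersonation allowed");
  }
}
{code}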
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689060#comment-16689060 ] He Xiaoqiao commented on HADOOP-15922: -- [~eyang] Thank you for the quick response. It makes sense to me. [^HADOOP-15922.004.patch] fixes it. FYI. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.004.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch, HADOOP-15922.004.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689045#comment-16689045 ] He Xiaoqiao commented on HADOOP-15922: -- [~eyang], thanks for your reviews and comments. [^HADOOP-15922.003.patch] updates the unit test based on Eric's comments: 1. use a proxy user named foo/localhost instead of foo/localh...@realm.com and check only the special character '/'. 2. limit the proxyuser scope for client. 3. I think whether the client principal has a hostname is not the key point for this issue, since the principal is defined at setup; if I missed something, please correct me. 4. rename the test method from doGetDelegationTokenByProxyUser to testGetDelegationTokenByProxyUser. Thanks [~eyang] again. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.003.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch, > HADOOP-15922.003.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687481#comment-16687481 ] He Xiaoqiao commented on HADOOP-15922: -- Submitted v002 patch with a unit test and triggered Jenkins again. Hi [~ste...@apache.org], would you help review this patch? > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.002.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch, HADOOP-15922.002.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Status: Patch Available (was: Open) Submitted v001 patch and triggered Jenkins. A unit test will follow later. > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
[ https://issues.apache.org/jira/browse/HADOOP-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15922: - Attachment: HADOOP-15922.001.patch > DelegationTokenAuthenticationFilter get wrong doAsUser since it does not > decode URL > --- > > Key: HADOOP-15922 > URL: https://issues.apache.org/jira/browse/HADOOP-15922 > Project: Hadoop Common > Issue Type: Bug > Components: common, kms >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15922.001.patch > > > DelegationTokenAuthenticationFilter get wrong doAsUser when proxy user from > client is complete kerberos name (e.g., user/hostn...@realm.com, actually it > is acceptable), because DelegationTokenAuthenticationFilter does not decode > DOAS parameter in URL which is encoded by {{URLEncoder}} at client. > e.g. KMS as example: > a. KMSClientProvider creates connection to KMS Server using > DelegationTokenAuthenticatedURL#openConnection. > b. If KMSClientProvider is a doAsUser, KMSClientProvider will put {{doas}} > with url encoded user as one parameter of http request. > {code:java} > // proxyuser > if (doAs != null) { > extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); > } > {code} > c. when KMS server receives the request, it does not decode the proxy user. > As result, KMS Server will get the wrong proxy user if this proxy user is > complete Kerberos Name or it includes some special character. Some other > authentication and authorization exception will throws next to it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Created] (HADOOP-15922) DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL
He Xiaoqiao created HADOOP-15922: Summary: DelegationTokenAuthenticationFilter get wrong doAsUser since it does not decode URL Key: HADOOP-15922 URL: https://issues.apache.org/jira/browse/HADOOP-15922 Project: Hadoop Common Issue Type: Bug Components: common, kms Reporter: He Xiaoqiao Assignee: He Xiaoqiao DelegationTokenAuthenticationFilter gets the wrong doAsUser when the proxy user from the client is a complete Kerberos name (e.g., user/hostn...@realm.com, which is actually acceptable), because DelegationTokenAuthenticationFilter does not decode the DOAS parameter in the URL, which is encoded by {{URLEncoder}} at the client. Take KMS as an example: a. KMSClientProvider creates a connection to the KMS server using DelegationTokenAuthenticatedURL#openConnection. b. If KMSClientProvider has a doAsUser, KMSClientProvider will put {{doas}} with the URL-encoded user as one parameter of the HTTP request. {code:java} // proxyuser if (doAs != null) { extraParams.put(DO_AS, URLEncoder.encode(doAs, "UTF-8")); } {code} c. When the KMS server receives the request, it does not decode the proxy user. As a result, the KMS server will get the wrong proxy user if this proxy user is a complete Kerberos name or includes some special character. Authentication and authorization exceptions then follow from it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667065#comment-16667065 ] He Xiaoqiao commented on HADOOP-15864: -- {quote}a number of other callers of SecurityUtil.buildTokenService in YARN and MAPREDUCE and none seem to handle a null response value{quote} OMG, I will try to fix this issue and keep compatibility with YARN and the other components in the coming days. Thanks [~jojochuang], [~wilfreds] again. > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, > HADOOP-15864.branch.2.7.004.patch > > > Job submit failure and Task executes failure if Standby NameNode domain name > can not resolved on HDFS HA with DelegationToken feature. > This issue is triggered when create {{ConfiguredFailoverProxyProvider}} > instance which invoke {{HAUtil.cloneDelegationTokenForLogicalUri}} in HA mode > with Security. Since in HDFS HA mode UGI need include separate token for each > NameNode in order to dealing with Active-Standby switch, the double tokens' > content is same of course. > However when #setTokenService in {{HAUtil.cloneDelegationTokenForLogicalUri}} > it checks whether the address of NameNode has been resolved or not, if Not, > throw #IllegalArgumentException upon, then job submitter/ task executor fail. > HDFS-8068 and HADOOP-12125 try to fix it, but I don't think the two tickets > resolve completely. > Another questions many guys consider is why NameNode domain name can not > resolve? I think there are many scenarios, for instance node replace when > meet fault, and refresh DNS sometimes. Anyway, Standby NameNode failure > should not impact Hadoop cluster stability in my opinion. > a. 
code ref: org.apache.hadoop.security.SecurityUtil line373-386 > {code:java} > public static Text buildTokenService(InetSocketAddress addr) { > String host = null; > if (useIpForTokenService) { > if (addr.isUnresolved()) { // host has no ip address > throw new IllegalArgumentException( > new UnknownHostException(addr.getHostName()) > ); > } > host = addr.getAddress().getHostAddress(); > } else { > host = StringUtils.toLowerCase(addr.getHostName()); > } > return new Text(host + ":" + addr.getPort()); > } > {code} > b.exception log ref: > {code:xml} > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Couldn't create proxy provider class > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > at > org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:515) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:170) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:761) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:691) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:150) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2713) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2747) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2729) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385) > at > org.apache.hadoop.fs.viewfs.ChRootedFileSystem.(ChRootedFileSystem.java:106) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem$1.getTargetFileSystem(ViewFileSystem.java:178) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem$1.getTargetFileSystem(ViewFileSystem.java:172) > at org.apache.hadoop.fs.viewfs.InodeTree.createLink(InodeTree.java:303) > at org.apache.hadoop.fs.viewfs.InodeTree.(InodeTree.java:377) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem$1.(ViewFileSystem.java:172) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem.initialize(ViewFileSystem.java:172) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2713) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2747) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2729) > at
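A minimal sketch of the failure mode in the buildTokenService code quoted above, assuming the default hadoop.security.token.service.use_ip=true; the hostname is a placeholder for a standby NameNode whose DNS entry no longer resolves.
{code:java}
import java.net.InetSocketAddress;
import org.apache.hadoop.security.SecurityUtil;

public class UnresolvedTokenService {
  public static void main(String[] args) {
    // Address object built from a hostname that cannot be resolved.
    InetSocketAddress sbn =
        InetSocketAddress.createUnresolved("nn2.example.invalid", 8020);
    try {
      SecurityUtil.buildTokenService(sbn);
    } catch (IllegalArgumentException e) {
      // Wraps UnknownHostException, failing the whole job submission.
      System.out.println("token service failed: " + e.getCause());
    }
  }
}
{code}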
[jira] [Updated] (HADOOP-15883) Fix WebHdfsFileSystemContract test
[ https://issues.apache.org/jira/browse/HADOOP-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15883: - Attachment: HADOOP-15883.001.patch Status: Patch Available (was: Open) Submitted initial patch against trunk and triggered Jenkins. > Fix WebHdfsFileSystemContract test > -- > > Key: HADOOP-15883 > URL: https://issues.apache.org/jira/browse/HADOOP-15883 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.0.4, 3.3.0, 3.1.2, 3.2.1 >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > Attachments: HADOOP-15883.001.patch > > > HADOOP-15864 fix bug about Job/Task execute failure when server (NameNode, > KMS, Timeline) domain name can not resolve. meanwhile it change semantic of > http status code about webhdfsfilesystem, this ticket will trace to fix > TestWebHdfsFileSystemContract#testResponseCode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15883) Fix WebHdfsFileSystemContract test
[ https://issues.apache.org/jira/browse/HADOOP-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1407#comment-1407 ] He Xiaoqiao commented on HADOOP-15883: -- [~jojochuang],[~ayushtkn] As mentioned in HADOOP-15864, after the patch the DataNode no longer hits IllegalArgumentException when creating the DFSClient instance; however, when the DataNode creates the wrapped output stream, it hits an auth exception since no token exists, so the client gets HTTP status code 403. In short, this patch changes the HTTP semantics when the #namenoderpcaddress parameter is absent. Based on the above, I support changing the expected HTTP status code to 403 in the check in TestWebHdfsFileSystemContract#testResponseCode as well. FYI. > Fix WebHdfsFileSystemContract test > -- > > Key: HADOOP-15883 > URL: https://issues.apache.org/jira/browse/HADOOP-15883 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.0.4, 3.3.0, 3.1.2, 3.2.1 >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > > HADOOP-15864 fix bug about Job/Task execute failure when server (NameNode, > KMS, Timeline) domain name can not resolve. meanwhile it change semantic of > http status code about webhdfsfilesystem, this ticket will trace to fix > TestWebHdfsFileSystemContract#testResponseCode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
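To make the semantic change concrete, here is a hypothetical probe of the DataNode WebHDFS endpoint without the namenoderpcaddress parameter; the host, port, and user name are placeholders, and the expected codes are the ones discussed above.
{code:java}
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsCreateProbe {
  public static void main(String[] args) throws Exception {
    // CREATE sent directly to a DataNode, omitting namenoderpcaddress.
    URL url = new URL("http://datanode.example:9864/webhdfs/v1/tmp/f"
        + "?op=CREATE&user.name=foo");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    // Before HADOOP-15864: 400 (IllegalArgumentException while building
    // the DFSClient). After: 403 (auth failure, since no token exists).
    System.out.println(conn.getResponseCode());
  }
}
{code}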
[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1405#comment-1405 ] He Xiaoqiao commented on HADOOP-15864: -- Thanks [~ayushtkn] for the feedback. I rechecked the failing UT (#TestWebHdfsFileSystemContract) and retested on a local machine; it is related to this issue. The main reason: before the patch, {{WebHdfsHandler}} on the DataNode hit {{IllegalArgumentException}} in {{SecurityUtil#buildTokenService}} when creating the DFSClient instance using {{newDfsClient(nnId, confForCreate);}} while handling the {{onCreate}} event, because the client did not pass the #namenoderpcaddress parameter, so the client got HTTP status code 400. After the patch, the DataNode does *NOT* hit {{IllegalArgumentException}} when creating the DFSClient instance; however, when the DataNode creates the wrapped output stream, it hits an auth exception since no token exists, so the client gets HTTP status code 403. In short, this patch changes the HTTP semantics when the #namenoderpcaddress parameter is absent. I have created another ticket, HADOOP-15883, to track this issue. Thanks [~ayushtkn] again, and sorry I did not catch it in time when checking the failing UT in my local environment. > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, > HADOOP-15864.branch.2.7.004.patch > > > Job submit failure and Task executes failure if Standby NameNode domain name > can not resolved on HDFS HA with DelegationToken feature. > This issue is triggered when create {{ConfiguredFailoverProxyProvider}} > instance which invoke {{HAUtil.cloneDelegationTokenForLogicalUri}} in HA mode > with Security. Since in HDFS HA mode UGI need include separate token for each > NameNode in order to dealing with Active-Standby switch, the double tokens' > content is same of course. > However when #setTokenService in {{HAUtil.cloneDelegationTokenForLogicalUri}} > it checks whether the address of NameNode has been resolved or not, if Not, > throw #IllegalArgumentException upon, then job submitter/ task executor fail. > HDFS-8068 and HADOOP-12125 try to fix it, but I don't think the two tickets > resolve completely. > Another questions many guys consider is why NameNode domain name can not > resolve? I think there are many scenarios, for instance node replace when > meet fault, and refresh DNS sometimes. Anyway, Standby NameNode failure > should not impact Hadoop cluster stability in my opinion. > a. 
code ref: org.apache.hadoop.security.SecurityUtil line373-386 > {code:java} > public static Text buildTokenService(InetSocketAddress addr) { > String host = null; > if (useIpForTokenService) { > if (addr.isUnresolved()) { // host has no ip address > throw new IllegalArgumentException( > new UnknownHostException(addr.getHostName()) > ); > } > host = addr.getAddress().getHostAddress(); > } else { > host = StringUtils.toLowerCase(addr.getHostName()); > } > return new Text(host + ":" + addr.getPort()); > } > {code} > b.exception log ref: > {code:xml} > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Couldn't create proxy provider class > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > at > org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:515) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:170) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:761) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:691) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:150) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2713) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2747) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2729) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385) > at > org.apache.hadoop.fs.viewfs.ChRootedFileSystem.(ChRootedFileSystem.java:106) > at > org.apache.hadoop.fs.viewfs.ViewFileSystem$1.getTargetFileSystem(ViewFileSystem.java:178) > at
[jira] [Created] (HADOOP-15883) Fix WebHdfsFileSystemContract test
He Xiaoqiao created HADOOP-15883: Summary: Fix WebHdfsFileSystemContract test Key: HADOOP-15883 URL: https://issues.apache.org/jira/browse/HADOOP-15883 Project: Hadoop Common Issue Type: Bug Affects Versions: 3.0.4, 3.3.0, 3.1.2, 3.2.1 Reporter: He Xiaoqiao Assignee: He Xiaoqiao HADOOP-15864 fixed a bug where Job/Task execution fails when a server (NameNode, KMS, Timeline) domain name cannot be resolved. At the same time it changed the semantics of the HTTP status code in WebHdfsFileSystem; this ticket tracks fixing TestWebHdfsFileSystemContract#testResponseCode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1117#comment-1117 ] He Xiaoqiao commented on HADOOP-15864: -- Thanks [~jojochuang] for your review. {quote}He Xiaoqiao Wei-Chiu Chuang mind giving a check again to hadoop.hdfs.web.TestWebHdfsFileSystemContract{quote} Thanks for your feedback; I will recheck this UT in the next two days. > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Fix For: 3.0.4, 3.3.0, 3.1.2, 3.2.1 > > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, > HADOOP-15864.branch.2.7.004.patch
[jira] [Updated] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15864: - Fix Version/s: 3.3.0 2.7.8 > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Fix For: 2.7.8, 3.3.0 > > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, > HADOOP-15864.branch.2.7.004.patch
[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662429#comment-16662429 ] He Xiaoqiao commented on HADOOP-15864: -- Thanks [~jojochuang] for your suggestion. [^HADOOP-15864.003.patch] is ready for trunk, and I found the unit tests pass. I also renamed v002 to follow the correct format and resubmitted it. FYI. > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, > HADOOP-15864.branch.2.7.004.patch
[jira] [Updated] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Xiaoqiao updated HADOOP-15864: - Attachment: HADOOP-15864.branch.2.7.004.patch > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch, > HADOOP-15864.branch.2.7.004.patch
[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661790#comment-16661790 ] He Xiaoqiao commented on HADOOP-15864: -- [~jojochuang] I checked the failing unit test and it passes on my local machine; I don't think it is related to this patch. Another question: [^HADOOP-15864-branch.2.7.002.patch] is for branch-2.7, but Jenkins applied it to branch-3.3.0. Could you give some suggestions? > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch
[jira] [Commented] (HADOOP-15864) Job submitter / executor fail when SBN domain name can not resolved
[ https://issues.apache.org/jira/browse/HADOOP-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660609#comment-16660609 ] He Xiaoqiao commented on HADOOP-15864: -- Submitted [^HADOOP-15864.003.patch] for trunk. > Job submitter / executor fail when SBN domain name can not resolved > --- > > Key: HADOOP-15864 > URL: https://issues.apache.org/jira/browse/HADOOP-15864 > Project: Hadoop Common > Issue Type: Bug >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Critical > Attachments: HADOOP-15864-branch.2.7.001.patch, > HADOOP-15864-branch.2.7.002.patch, HADOOP-15864.003.patch