[jira] [Created] (HDFS-13468) Add erasure coding metrics into ReadStatistics
Lei (Eddy) Xu created HDFS-13468: Summary: Add erasure coding metrics into ReadStatistics Key: HDFS-13468 URL: https://issues.apache.org/jira/browse/HDFS-13468 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.1, 3.1.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Expose Erasure Coding related metrics for InputStream in ReadStatistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-13350) Negative legacy block ID will confuse Erasure Coding to be considered as striped block
Lei (Eddy) Xu created HDFS-13350: Summary: Negative legacy block ID will confuse Erasure Coding to be considered as striped block Key: HDFS-13350 URL: https://issues.apache.org/jira/browse/HDFS-13350 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu HDFS-4645 changed HDFS block IDs from randomly generated to sequential positive IDs. HDFS EC was later built on the assumption that normal 3x replica block IDs are positive, so EC reuses negative IDs for striped blocks. However, legacy block IDs in the system can still be negative, so we should not rely on a hard-coded check to decide whether a block is striped: {code} public static boolean isStripedBlockID(long id) { return BlockType.fromBlockId(id) == STRIPED; } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
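A minimal sketch of the direction this suggests (illustrative only, not the committed fix): consult the NameNode's own block map before falling back to the ID-range heuristic, so a legacy negative ID that maps to a stored replicated block is not misread as striped. {{blocksMap}} stands in for the real BlockManager state.
{code:java}
// Sketch: decide "striped" from stored NameNode state where possible,
// and only fall back to the ID-range heuristic for unknown blocks.
boolean isStripedBlock(long blockId) {
  BlockInfo stored = blocksMap.getStoredBlock(new Block(blockId));
  if (stored != null) {
    return stored.isStriped();   // trust the stored block's actual type
  }
  return BlockIdManager.isStripedBlockID(blockId);
}
{code}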
[jira] [Resolved] (HDFS-13218) Log audit event only used last EC policy name when add multiple policies from file
[ https://issues.apache.org/jira/browse/HDFS-13218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-13218. -- Resolution: Duplicate Fix Version/s: 3.0.1 3.1.0 Target Version/s: 3.1.0, 3.0.2 Let's work on HDFS-13217. > Log audit event only used last EC policy name when add multiple policies from > file > --- > > Key: HDFS-13218 > URL: https://issues.apache.org/jira/browse/HDFS-13218 > Project: Hadoop HDFS > Issue Type: Improvement > Components: erasure-coding >Affects Versions: 3.1.0 >Reporter: liaoyuxiangqin >Priority: Major > Fix For: 3.1.0, 3.0.1 > > > When I read addErasureCodingPolicies() in the FSNamesystem class of the NameNode, > I found that the following code only uses the last EC policy name for logAuditEvent, > so this audit log cannot track all of the policies when multiple > erasure coding policies are added to the ErasureCodingPolicyManager. Thanks. > {code:java|title=FSNamesystem.java|borderStyle=solid} > try { > checkOperation(OperationCategory.WRITE); > checkNameNodeSafeMode("Cannot add erasure coding policy"); > for (ErasureCodingPolicy policy : policies) { > try { > ErasureCodingPolicy newPolicy = > FSDirErasureCodingOp.addErasureCodingPolicy(this, policy, > logRetryCache); > addECPolicyName = newPolicy.getName(); > responses.add(new AddErasureCodingPolicyResponse(newPolicy)); > } catch (HadoopIllegalArgumentException e) { > responses.add(new AddErasureCodingPolicyResponse(policy, e)); > } > } > success = true; > return responses.toArray(new AddErasureCodingPolicyResponse[0]); > } finally { > writeUnlock(operationName); > if (success) { > getEditLog().logSync(); > } > logAuditEvent(success, operationName,addECPolicyName, null, null); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
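One possible shape of the fix (a sketch based on the code quoted above, not necessarily the patch committed under HDFS-13217): collect every policy name that was successfully added and log the whole list in the audit event.
{code:java}
// Sketch: accumulate all added policy names instead of keeping only the last one.
List<String> addedPolicyNames = new ArrayList<>();
for (ErasureCodingPolicy policy : policies) {
  try {
    ErasureCodingPolicy newPolicy =
        FSDirErasureCodingOp.addErasureCodingPolicy(this, policy, logRetryCache);
    addedPolicyNames.add(newPolicy.getName());
    responses.add(new AddErasureCodingPolicyResponse(newPolicy));
  } catch (HadoopIllegalArgumentException e) {
    responses.add(new AddErasureCodingPolicyResponse(policy, e));
  }
}
// ... later, in the finally block, log every name, e.g. "policyA,policyB":
logAuditEvent(success, operationName, String.join(",", addedPolicyNames), null, null);
{code}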
[jira] [Created] (HDFS-13175) Add more information for checking argument in DiskBalancerVolume
Lei (Eddy) Xu created HDFS-13175: Summary: Add more information for checking argument in DiskBalancerVolume Key: HDFS-13175 URL: https://issues.apache.org/jira/browse/HDFS-13175 Project: Hadoop HDFS Issue Type: Improvement Components: diskbalancer Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu We have seen the following stack in production: {code} Exception in thread "main" java.lang.IllegalArgumentException at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72) at org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerVolume.setUsed(DiskBalancerVolume.java:268) at org.apache.hadoop.hdfs.server.diskbalancer.connectors.DBNameNodeConnector.getVolumeInfoFromStorageReports(DBNameNodeConnector.java:141) at org.apache.hadoop.hdfs.server.diskbalancer.connectors.DBNameNodeConnector.getNodes(DBNameNodeConnector.java:90) at org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerCluster.readClusterInfo(DiskBalancerCluster.java:132) at org.apache.hadoop.hdfs.server.diskbalancer.command.Command.readClusterInfo(Command.java:123) at org.apache.hadoop.hdfs.server.diskbalancer.command.PlanCommand.execute(PlanCommand.java:107) {code} raised from {code} public void setUsed(long dfsUsedSpace) { Preconditions.checkArgument(dfsUsedSpace < this.getCapacity()); this.used = dfsUsedSpace; } {code} However, the DataNode storage reports at that very moment were not captured. We should add more information to the exception message to better diagnose the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
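A minimal sketch of the improvement: pass a message template and the offending values to {{Preconditions.checkArgument}} (Guava supports %s-style templates here), so the exception itself carries the numbers that tripped the check. {{getPath()}} is assumed to be the volume's path accessor.
{code:java}
// Sketch: include the used space, capacity and volume path in the exception,
// so the failing storage report can be diagnosed from the stack trace alone.
public void setUsed(long dfsUsedSpace) {
  Preconditions.checkArgument(dfsUsedSpace < this.getCapacity(),
      "dfsUsedSpace (%s) must be less than capacity (%s) for volume %s",
      dfsUsedSpace, this.getCapacity(), this.getPath());
  this.used = dfsUsedSpace;
}
{code}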
[jira] [Created] (HDFS-13039) StripedBlockReader#createBlockReader leaks socket on IOException
Lei (Eddy) Xu created HDFS-13039: Summary: StripedBlockReader#createBlockReader leaks socket on IOException Key: HDFS-13039 URL: https://issues.apache.org/jira/browse/HDFS-13039 Project: Hadoop HDFS Issue Type: Bug Components: datanode, erasure-coding Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu When running EC on one cluster, the DataNode accumulates millions of {{CLOSE_WAIT}} connections: {code:java} $ grep CLOSE_WAIT lsof.out | wc -l 10358700 // All CLOSE_WAITs belong to the same DataNode process (pid=88527) $ grep CLOSE_WAIT lsof.out | awk '{print $2}' | sort | uniq 88527 {code} As a result, the DN cannot open any file or socket, as shown in the log: {noformat} 2018-01-19 06:47:09,424 WARN io.netty.channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135) at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:563) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:504) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:418) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:390) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145) at java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
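The shape of fix this points to (a sketch only; {{newBlockReaderFromPeer}} is a placeholder for the real {{BlockReaderRemote.newBlockReader(...)}} call): close the peer when reader construction fails, so the IOException path no longer leaks the socket.
{code:java}
// Sketch: release the connected peer on failure instead of leaking it,
// which is what leaves sockets stuck in CLOSE_WAIT.
private BlockReader createBlockReader(long offsetInBlock) {
  Peer peer = null;
  try {
    peer = newConnectedPeer(block, dnAddr, blockToken, source);
    return newBlockReaderFromPeer(peer, offsetInBlock);   // may throw IOException
  } catch (IOException e) {
    LOG.info("Exception while creating remote block reader, datanode {}", source, e);
    if (peer != null) {
      try {
        peer.close();   // close the socket on the failure path
      } catch (IOException ignored) {
      }
    }
    return null;
  }
}
{code}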
[jira] [Created] (HDFS-12994) TestReconstructStripedFile.testNNSendsErasureCodingTasks fails due to socket timeout
Lei (Eddy) Xu created HDFS-12994: Summary: TestReconstructStripedFile.testNNSendsErasureCodingTasks fails due to socket timeout Key: HDFS-12994 URL: https://issues.apache.org/jira/browse/HDFS-12994 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Occasionally, {{testNNSendsErasureCodingTasks}} fails due to socket timeout {code} 2017-12-26 20:35:19,961 [StripedBlockReconstruction-0] INFO datanode.DataNode (StripedBlockReader.java:createBlockReader(132)) - Exception while creating remote block reader, datanode 127.0.0.1:34145 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.newConnectedPeer(StripedBlockReader.java:148) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.createBlockReader(StripedBlockReader.java:123) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.(StripedBlockReader.java:83) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.createReader(StripedReader.java:169) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.initReaders(StripedReader.java:150) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.init(StripedReader.java:133) at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:56) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} while the target datanode is removed in the test: {code} 2017-12-26 20:35:18,710 [Thread-2393] INFO net.NetworkTopology (NetworkTopology.java:remove(219)) - Removing a node: /default-rack/127.0.0.1:34145 {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12953) XORRawDecoder.doDecode throws NullPointerException
Lei (Eddy) Xu created HDFS-12953: Summary: XORRawDecoder.doDecode throws NullPointerException Key: HDFS-12953 URL: https://issues.apache.org/jira/browse/HDFS-12953 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Thanks [~danielpol] report on HDFS-12860. {noformat} 17/11/30 04:19:55 INFO mapreduce.Job: map 0% reduce 0% 17/11/30 04:20:01 INFO mapreduce.Job: Task Id : attempt_1512036058655_0003_m_02_0, Status : FAILED Error: java.lang.NullPointerException at org.apache.hadoop.io.erasurecode.rawcoder.XORRawDecoder.doDecode(XORRawDecoder.java:83) at org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:106) at org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:170) at org.apache.hadoop.hdfs.StripeReader.decodeAndFillBuffer(StripeReader.java:423) at org.apache.hadoop.hdfs.StatefulStripeReader.decode(StatefulStripeReader.java:94) at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:382) at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:318) at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:391) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:813) at java.io.DataInputStream.read(DataInputStream.java:149) at org.apache.hadoop.examples.terasort.TeraInputFormat$TeraRecordReader.nextKeyValue(TeraInputFormat.java:257) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:563) at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80) at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:794) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12927) Update erasure coding doc to address unsupported APIs
Lei (Eddy) Xu created HDFS-12927: Summary: Update erasure coding doc to address unsupported APIs Key: HDFS-12927 URL: https://issues.apache.org/jira/browse/HDFS-12927 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu {{Concat}}, {{truncate}}, {{setReplication}} are not (fully) supported with EC files. We should update the document to address them explicitly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-12923) DFS.concat should throw exception if files have different EC policies.
[ https://issues.apache.org/jira/browse/HDFS-12923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-12923. -- Resolution: Won't Fix Fix Version/s: 3.0.0 Resolved as not an issue. > DFS.concat should throw exception if files have different EC policies. > --- > > Key: HDFS-12923 > URL: https://issues.apache.org/jira/browse/HDFS-12923 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Affects Versions: 3.0.0 >Reporter: Lei (Eddy) Xu >Priority: Critical > Fix For: 3.0.0 > > > {{DFS#concat}} appends blocks from different files to a single file. However, > if these files have different EC policies, or are a mix of replicated and EC > files, the resulting file would be problematic to read, because the EC codec > is defined on the INode instead of on the block. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-12921) DFS.setReplication should throw exception on EC files
[ https://issues.apache.org/jira/browse/HDFS-12921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-12921. -- Resolution: Won't Fix Fix Version/s: 3.0.0 It was a no-op in {{FSDirAttrOp#unprotectedSetReplication()}} > DFS.setReplication should throw exception on EC files > - > > Key: HDFS-12921 > URL: https://issues.apache.org/jira/browse/HDFS-12921 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Affects Versions: 3.0.0-beta1 >Reporter: Lei (Eddy) Xu > Fix For: 3.0.0 > > > This is checked in {{o.a.h.fs.shell.SetReplication#processPath}}; however, > {{DistributedFileSystem#setReplication()}} is also a public API, so we > should move the check to {{DistributedFileSystem}} to prevent calling > this API directly on an EC file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12923) DFS.concat should throw exception if files have different EC policies.
Lei (Eddy) Xu created HDFS-12923: Summary: DFS.concat should throw exception if files have different EC policies. Key: HDFS-12923 URL: https://issues.apache.org/jira/browse/HDFS-12923 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Priority: Critical {{DFS#concat}} appends blocks from different files to a single file. However, if these files have different EC policies, or are a mix of replicated and EC files, the resulting file would be problematic to read, because the EC codec is defined on the INode instead of on the block. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
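A sketch of the guard the report asks for (illustrative only; per the resolution above, it was ultimately not pursued): compare the EC policy of the target with each source before concatenating. {{getErasureCodingPolicy()}} returns null for replicated files, so a mismatch also catches mixing replicated and EC sources.
{code:java}
// Sketch: reject concat when sources and target do not share the same EC policy.
static void verifySameEcPolicy(DistributedFileSystem dfs, Path trg, Path[] srcs)
    throws IOException {
  ErasureCodingPolicy trgPolicy = dfs.getErasureCodingPolicy(trg);
  for (Path src : srcs) {
    ErasureCodingPolicy srcPolicy = dfs.getErasureCodingPolicy(src);
    if (!Objects.equals(trgPolicy, srcPolicy)) {
      throw new HadoopIllegalArgumentException("concat: " + src + " and " + trg
          + " have different erasure coding policies");
    }
  }
}
{code}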
[jira] [Created] (HDFS-12921) DFS.setReplication should throw IOE on EC files
Lei (Eddy) Xu created HDFS-12921: Summary: DFS.setReplication should throw IOE on EC files Key: HDFS-12921 URL: https://issues.apache.org/jira/browse/HDFS-12921 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-beta1 Reporter: Lei (Eddy) Xu This is checked in {{o.a.h.fs.shell.SetReplication#processPath}}; however, {{DistributedFileSystem#setReplication()}} is also a public API, so we should move the check to {{DistributedFileSystem}} to prevent calling this API directly on an EC file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
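A sketch of the client-side check being proposed (illustrative; as noted in the resolution above, the server side already treats the call as a no-op):
{code:java}
// Sketch: fail fast before calling setReplication on an erasure coded file,
// instead of silently relying on the NameNode ignoring the request.
static void checkNotErasureCoded(DistributedFileSystem dfs, Path src)
    throws IOException {
  if (dfs.getErasureCodingPolicy(src) != null) {
    throw new IOException(
        "setReplication is not applicable to erasure coded file: " + src);
  }
}
{code}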
[jira] [Created] (HDFS-12860) TeraSort failed on erasure coding directory
Lei (Eddy) Xu created HDFS-12860: Summary: TeraSort failed on erasure coding directory Key: HDFS-12860 URL: https://issues.apache.org/jira/browse/HDFS-12860 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Running terasort on a cluster with 8 datanodes, 256g data, using RS-3-2-1024k. The terasort benchmark fails with the following stack trace: {code} 17/11/27 14:44:31 INFO mapreduce.Job: map 45% reduce 0% 17/11/27 14:44:33 INFO mapreduce.Job: Task Id : attempt_1510080297865_0160_m_08_0, Status : FAILED Error: java.lang.IllegalArgumentException at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72) at org.apache.hadoop.hdfs.util.StripedBlockUtil$VerticalRange.(StripedBlockUtil.java:701) at org.apache.hadoop.hdfs.util.StripedBlockUtil.getRangesForInternalBlocks(StripedBlockUtil.java:442) at org.apache.hadoop.hdfs.util.StripedBlockUtil.divideOneStripe(StripedBlockUtil.java:311) at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:308) at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:391) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:813) at java.io.DataInputStream.read(DataInputStream.java:149) at org.apache.hadoop.examples.terasort.TeraInputFormat$TeraRecordReader.nextKeyValue(TeraInputFormat.java:257) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562) at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80) at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12840) Creating a replicated file in a EC zone does not correctly serialized in EditLogs
Lei (Eddy) Xu created HDFS-12840: Summary: Creating a replicated file in a EC zone does not correctly serialized in EditLogs Key: HDFS-12840 URL: https://issues.apache.org/jira/browse/HDFS-12840 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-beta1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Blocker When creating a replicated file in an existing EC zone, the edit log does not differentiate it from an EC file. When {{FSEditLogLoader}} replays the edits, this file is treated as an EC file; as a result, it crashes the NN because the blocks of this file are replicated, which does not match the {{INode}}. {noformat} ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddBlockOp [path=/system/balancer.id, penultimateBlock=NULL, lastBlock=blk_1073743259_2455, RpcClientId=, RpcCallId=-2] java.lang.IllegalArgumentException: reportedBlock is not striped at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoStriped.addStorage(BlockInfoStriped.java:118) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.addBlock(DatanodeStorageInfo.java:256) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addStoredBlock(BlockManager.java:3141) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addStoredBlockUnderConstruction(BlockManager.java:3068) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processAndHandleReportedBlock(BlockManager.java:3864) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processQueuedMessages(BlockManager.java:2916) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processQueuedMessagesForBlock(BlockManager.java:2903) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.addNewBlock(FSEditLogLoader.java:1069) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:532) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12819) Setting/Unsetting EC policy shows warning if the directory is not empty
Lei (Eddy) Xu created HDFS-12819: Summary: Setting/Unsetting EC policy shows warning if the directory is not empty Key: HDFS-12819 URL: https://issues.apache.org/jira/browse/HDFS-12819 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor Because the existing data will not be converted when we set or unset an EC policy on a directory, a warning from the CLI would help set the user's expectations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12769) TestReadStripedFileWithDecodingCorruptData and TestReadStripedFileWithDecodingDeletedData timeout in trunk
Lei (Eddy) Xu created HDFS-12769: Summary: TestReadStripedFileWithDecodingCorruptData and TestReadStripedFileWithDecodingDeletedData timeout in trunk Key: HDFS-12769 URL: https://issues.apache.org/jira/browse/HDFS-12769 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-beta1 Reporter: Lei (Eddy) Xu Priority: Major Recently, TestReadStripedFileWithDecodingCorruptData and TestReadStripedFileWithDecodingDeletedData fail frequently. For example, in HDFS-12725. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12613) Native EC coder should implement release() as idempotent function.
Lei (Eddy) Xu created HDFS-12613: Summary: Native EC coder should implement release() as idempotent function. Key: HDFS-12613 URL: https://issues.apache.org/jira/browse/HDFS-12613 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-beta1 Reporter: Lei (Eddy) Xu Recently, we found that the native EC coder crashes the JVM when {{NativeRSDecoder#release()}} is called multiple times (HDFS-12612 and HDFS-12606). We should strengthen the native code implementation to make {{release()}} idempotent as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
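A sketch of what an idempotent {{release()}} can look like on the Java side of the coder (illustrative only; {{destroyImpl}} and {{nativeCoder}} are assumed names for the JNI hook and native handle, and the same guard is needed in the native layer):
{code:java}
// Sketch: only destroy the native context the first time; later calls become
// no-ops instead of double-freeing native memory.
private final AtomicBoolean released = new AtomicBoolean(false);

@Override
public void release() {
  if (released.compareAndSet(false, true)) {
    destroyImpl(nativeCoder);   // JNI call that frees the native (ISA-L) context
    nativeCoder = 0;
  }
}
{code}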
[jira] [Created] (HDFS-12606) JVM crashes when running NNBench on EC enabled.
Lei (Eddy) Xu created HDFS-12606: Summary: JVM crashes when running NNBench on EC enabled. Key: HDFS-12606 URL: https://issues.apache.org/jira/browse/HDFS-12606 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-beta1 Reporter: Lei (Eddy) Xu Priority: Critical When running NNbench on a RS(6,3) directory, JVM crashes double free or corruption: {code} 08:16:29 Running NNBENCH. 08:16:29 WARNING: Use "yarn jar" to launch YARN applications. 08:16:31 NameNode Benchmark 0.4 08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Test Inputs: 08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Test Operation: create_write 08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Start time: 2017-10-04 08:18:31,16 : : 08:18:54 *** Error in `/usr/java/jdk1.8.0_144/bin/java': double free or corruption (out): 0x7ffb55dbfab0 *** 08:18:54 === Backtrace: = 08:18:54 /lib64/libc.so.6(+0x7c619)[0x7ffb5b85f619] 08:18:54 [0x7ffb45017774] 08:18:54 === Memory map: 08:18:54 0040-00401000 r-xp ca:01 276832134 /usr/java/jdk1.8.0_144/bin/java 08:18:54 0060-00601000 rw-p ca:01 276832134 /usr/java/jdk1.8.0_144/bin/java 08:18:54 0173e000-01f91000 rw-p 00:00 0 [heap] 08:18:54 60360-61470 rw-p 00:00 0 08:18:54 61470-72bd0 ---p 00:00 0 08:18:54 72bd0-73a50 rw-p 00:00 0 08:18:54 73a50-7c000 ---p 00:00 0 08:18:54 7c000-7c040 rw-p 00:00 0 08:18:54 7c040-8 ---p 00:00 0 08:18:54 7ffb20174000-7ffb208ab000 rw-p 00:00 0 08:18:54 7ffb208ab000-7ffb20975000 ---p 00:00 0 08:18:54 7ffb20975000-7ffb20b75000 rw-p 00:00 0 08:18:54 7ffb20b75000-7ffb20d75000 rw-p 00:00 0 08:18:54 7ffb20d75000-7ffb20d8a000 r-xp ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 08:18:54 7ffb20d8a000-7ffb20f89000 ---p 00015000 ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 08:18:54 7ffb20f89000-7ffb20f8a000 r--p 00014000 ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 08:18:54 7ffb20f8a000-7ffb20f8b000 rw-p 00015000 ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 08:18:54 7ffb20f8b000-7ffb20fbd000 r-xp ca:01 553654092 /usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so 08:18:54 7ffb20fbd000-7ffb211bc000 ---p 00032000 ca:01 553654092 /usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so 08:18:54 7ffb211bc000-7ffb211c2000 rw-p 00031000 ca:01 553654092 /usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so : : 08:18:54 7ffb5c3fb000-7ffb5c3fc000 r--p 00:00 0 08:18:54 7ffb5c3fc000-7ffb5c3fd000 rw-p 00:00 0 08:18:54 7ffb5c3fd000-7ffb5c3fe000 r--p 00021000 ca:01 637266 /usr/lib64/ld-2.17.so 08:18:54 7ffb5c3fe000-7ffb5c3ff000 rw-p 00022000 ca:01 637266 /usr/lib64/ld-2.17.so 08:18:54 7ffb5c3ff000-7ffb5c40 rw-p 00:00 0 08:18:54 7ffdf8767000-7ffdf8788000 rw-p 00:00 0 [stack] 08:18:54 7ffdf878b000-7ffdf878d000 r-xp 00:00 0 [vdso] 08:18:54 ff60-ff601000 r-xp 00:00 0 [vsyscall] {code} It happens on both {{jdk1.8.0_144}} and {{jdk1.8.0_121}} in our environments. It is highly suspicious due to the native code used in erasure coding, i.e., ISA-L is not thread safe [https://01.org/sites/default/files/documentation/isa-l_open_src_2.10.pdf] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12604) StreamCapability enums are not displayed in javadoc
Lei (Eddy) Xu created HDFS-12604: Summary: StreamCapability enums are not displayed in javadoc Key: HDFS-12604 URL: https://issues.apache.org/jira/browse/HDFS-12604 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.0.0-beta1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor http://hadoop.apache.org/docs/r3.0.0-beta1/api/org/apache/hadoop/fs/StreamCapabilities.html {{StreamCapability#HFLUSH}} and {{StreamCapability#HSYNC}} are not displayed in the doc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12575) Improve test coverage for EC related edit logs ops
Lei (Eddy) Xu created HDFS-12575: Summary: Improve test coverage for EC related edit logs ops Key: HDFS-12575 URL: https://issues.apache.org/jira/browse/HDFS-12575 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: SammiChen HDFS-12569 found that we have little test coverage for the edit log ops of erasure coding. And we've seen the following bug bring down the SNN in our test environments: {code} 6:42:18.177 AM ERROR FSEditLogLoader Encountered exception on operation AddBlockOp [path=/tmp/foo/bar, penultimateBlock=NULL, lastBlock=blk_1073743386_69322, RpcClientId=, RpcCallId=-2] java.lang.IllegalArgumentException: reportedBlock is not striped at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at 6:42:18.190 AM FATAL EditLogTailer Unknown error encountered while tailing edits. Shutting down standby NN. java.io.IOException: java.lang.IllegalArgumentException: reportedBlock is not striped at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:251) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:150) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293) at {code} We should add coverage to verify that these important edit log ops (set/unset policy, enable/remove policies, etc.) are correctly persisted in edit logs, and test scenarios like: * Restart the NN * Replay edits after a checkpoint * Apply edits on the SNN * etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12569) Unset EC policy logs empty payload in edit log
Lei (Eddy) Xu created HDFS-12569: Summary: Unset EC policy logs empty payload in edit log Key: HDFS-12569 URL: https://issues.apache.org/jira/browse/HDFS-12569 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Blocker The edit log generated by {{hdfs ec -unsetPolicy}} contains an {{OP_REMOVE_XATTR}} entry, but the xattr payload is missing:
{code}
<RECORD>
  <OPCODE>OP_REMOVE_XATTR</OPCODE>
  <DATA>
    <TXID>420481</TXID>
    <SRC>/</SRC>
    <RPC_CLIENTID>b098e758-9d7f-48b7-aa91-80ca52133b09</RPC_CLIENTID>
    <RPC_CALLID>0</RPC_CALLID>
  </DATA>
</RECORD>
{code}
As a result, when the Active NN restarts, or the Standby NN replays the edits, this op has no effect. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12523) Thread pools in ErasureCodingWorker do not shutdown
Lei (Eddy) Xu created HDFS-12523: Summary: Thread pools in ErasureCodingWorker do not shutdown Key: HDFS-12523 URL: https://issues.apache.org/jira/browse/HDFS-12523 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu There is no code path in {{ErasureCodingWorker}} to shut down its two thread pools: {{stripedReconstructionPool}} and {{stripedReadPool}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
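A minimal sketch of the missing teardown (the method name is an assumption, not the committed patch): expose a shutdown hook on {{ErasureCodingWorker}} that the DataNode calls when it stops.
{code:java}
// Sketch: stop both pools so their threads do not outlive the DataNode service.
public void shutDown() {
  stripedReconstructionPool.shutdown();   // stop accepting new reconstruction tasks
  stripedReadPool.shutdown();             // stop accepting new striped read tasks
}
{code}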
[jira] [Resolved] (HDFS-12483) Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery
[ https://issues.apache.org/jira/browse/HDFS-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-12483. -- Resolution: Duplicate > Provide a configuration to adjust the weight of EC recovery tasks to adjust > the speed of recovery > - > > Key: HDFS-12483 > URL: https://issues.apache.org/jira/browse/HDFS-12483 > Project: Hadoop HDFS > Issue Type: Improvement > Components: erasure-coding >Affects Versions: 3.0.0-alpha4 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu >Priority: Minor > > The relative speed of EC recovery compared to 3x replica recovery is a > function of the EC codec, number of sources, NIC speed, CPU speed, etc. > Currently EC recovery has a fixed {{xmitsInProgress}} of {{max(# of > sources, # of targets)}}, compared to {{1}} for 3x replica recovery, and the NN > uses {{xmitsInProgress}} to decide how many recovery tasks to schedule to the > DataNode, so we can add a coefficient for users to tune the weight of EC > recovery tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12482) Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery
Lei (Eddy) Xu created HDFS-12482: Summary: Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery Key: HDFS-12482 URL: https://issues.apache.org/jira/browse/HDFS-12482 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor The relative speed of EC recovery compared to 3x replica recovery is a function of the EC codec, number of sources, NIC speed, CPU speed, etc. Currently EC recovery has a fixed {{xmitsInProgress}} of {{max(# of sources, # of targets)}}, compared to {{1}} for 3x replica recovery, and the NN uses {{xmitsInProgress}} to decide how many recovery tasks to schedule to the DataNode, so we can add a coefficient for users to tune the weight of EC recovery tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
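A sketch of how such a coefficient could be applied when charging xmits for one EC reconstruction task (the configuration key name and the surrounding plumbing are illustrative assumptions):
{code:java}
// Sketch: scale the xmit weight of an EC reconstruction task by a configurable
// coefficient; weight 1.0 reproduces today's fixed max(#sources, #targets).
float weight = conf.getFloat("dfs.datanode.ec.reconstruction.xmits.weight", 1.0f);
int xmits = (int) Math.ceil(weight * Math.max(numSources, numTargets));
xmitsInProgress.addAndGet(xmits);   // counted against maxReplicationStreams by the NN
{code}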
[jira] [Created] (HDFS-12483) Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery
Lei (Eddy) Xu created HDFS-12483: Summary: Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery Key: HDFS-12483 URL: https://issues.apache.org/jira/browse/HDFS-12483 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor The relative speed of EC recovery compared to 3x replica recovery is a function of the EC codec, number of sources, NIC speed, CPU speed, etc. Currently EC recovery has a fixed {{xmitsInProgress}} of {{max(# of sources, # of targets)}}, compared to {{1}} for 3x replica recovery, and the NN uses {{xmitsInProgress}} to decide how many recovery tasks to schedule to the DataNode, so we can add a coefficient for users to tune the weight of EC recovery tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12472) Add JUNIT timeout to TestBlockStatsMXBean
Lei (Eddy) Xu created HDFS-12472: Summary: Add JUNIT timeout to TestBlockStatsMXBean Key: HDFS-12472 URL: https://issues.apache.org/jira/browse/HDFS-12472 Project: Hadoop HDFS Issue Type: Improvement Reporter: Lei (Eddy) Xu Priority: Minor Add a JUnit timeout to {{TestBlockStatsMXBean}} so that the test shows up in the test failure report if a timeout occurs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
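For example, a class-level JUnit 4 timeout rule (a sketch of the idea; the actual limit chosen for the test may differ):
{code:java}
// Sketch: a global per-test timeout, so a hung test is reported as a failure
// with a stack trace instead of stalling the whole surefire fork.
@Rule
public Timeout globalTimeout = Timeout.seconds(300);
{code}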
[jira] [Resolved] (HDFS-12439) TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally
[ https://issues.apache.org/jira/browse/HDFS-12439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-12439. -- Resolution: Duplicate Close this one because HDFS-12449 has patch available. > TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally > > > Key: HDFS-12439 > URL: https://issues.apache.org/jira/browse/HDFS-12439 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Affects Versions: 3.0.0-alpha4 >Reporter: Lei (Eddy) Xu > Labels: flaky-test > > With error message: > {code} > Error Message > test timed out after 6 milliseconds > Stacktrace > java.lang.Exception: test timed out after 6 milliseconds > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:917) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1199) > at > org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:842) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at > org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101) > at org.apache.hadoop.hdfs.DFSTestUtil.writeFile(DFSTestUtil.java:835) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.writeFile(TestReconstructStripedFile.java:273) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:461) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:439) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12439) TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally
Lei (Eddy) Xu created HDFS-12439: Summary: TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally Key: HDFS-12439 URL: https://issues.apache.org/jira/browse/HDFS-12439 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu With error message: {code} Error Message test timed out after 6 milliseconds Stacktrace java.lang.Exception: test timed out after 6 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:917) at org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1199) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:842) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101) at org.apache.hadoop.hdfs.DFSTestUtil.writeFile(DFSTestUtil.java:835) at org.apache.hadoop.hdfs.TestReconstructStripedFile.writeFile(TestReconstructStripedFile.java:273) at org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:461) at org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:439) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-12360) TestLeaseRecoveryStriped.testLeaseRecovery failure
[ https://issues.apache.org/jira/browse/HDFS-12360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-12360. -- Resolution: Duplicate > TestLeaseRecoveryStriped.testLeaseRecovery failure > -- > > Key: HDFS-12360 > URL: https://issues.apache.org/jira/browse/HDFS-12360 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Yongjun Zhang > > TestLeaseRecoveryStriped.testLeaseRecovery failed: > {code} > --- > T E S T S > --- > Running org.apache.hadoop.hdfs.TestLeaseRecoveryStriped > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 15.808 sec > <<< FAILURE! - in org.apache.hadoop.hdfs.TestLeaseRecoveryStriped > testLeaseRecovery(org.apache.hadoop.hdfs.TestLeaseRecoveryStriped) Time > elapsed: 15.509 sec <<< FAILURE! > java.lang.AssertionError: failed testCase at i=0, blockLengths=[10485760, > 4194304, 6291456, 10485760, 11534336, 11534336, 6291456, 4194304, 3145728] > java.io.IOException: Failed: the number of failed blocks = 4 > the number of > data blocks = 3 > at > org.apache.hadoop.hdfs.DFSStripedOutputStream.checkStreamers(DFSStripedOutputStream.java:393) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream.handleStreamerFailure(DFSStripedOutputStream.java:411) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream.flushAllInternals(DFSStripedOutputStream.java:1128) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream.checkStreamerFailures(DFSStripedOutputStream.java:628) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:564) > at > org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217) > at > org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:164) > at > org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:145) > at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:79) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:48) > at java.io.DataOutputStream.write(DataOutputStream.java:88) > at > org.apache.hadoop.hdfs.TestLeaseRecoveryStriped.writePartialBlocks(TestLeaseRecoveryStriped.java:182) > at > org.apache.hadoop.hdfs.TestLeaseRecoveryStriped.runTest(TestLeaseRecoveryStriped.java:158) > at > org.apache.hadoop.hdfs.TestLeaseRecoveryStriped.testLeaseRecovery(TestLeaseRecoveryStriped.java:147) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103) > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.hadoop.hdfs.TestLease
[jira] [Created] (HDFS-12412) Remove ErasureCodingWorker.stripedReadPool
Lei (Eddy) Xu created HDFS-12412: Summary: Remove ErasureCodingWorker.stripedReadPool Key: HDFS-12412 URL: https://issues.apache.org/jira/browse/HDFS-12412 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu In {{ErasureCodingWorker}}, {{stripedReconstructionPool}} is used to schedule the EC recovery tasks, while {{stripedReadPool}} is used for the reader threads within each recovery task. We only need one of them to throttle the speed of the recovery process, because each EC recovery task has a fixed number of source readers (i.e., 3 for RS(3,2)). And because of the findings in HDFS-12044, the speed of EC recovery can be throttled by {{stripedReconstructionPool}} with {{xmitsInProgress}}. Moreover, keeping {{stripedReadPool}} makes it difficult for customers to understand and calculate the right balance between {{dfs.datanode.ec.reconstruction.stripedread.threads}}, {{dfs.datanode.ec.reconstruction.stripedblock.threads.size}} and {{maxReplicationStreams}}. For example, a small {{stripedread.threads}} (compared to what {{reconstruction.threads.size}} implies) will unnecessarily limit the speed of recovery, which leads to a larger MTTR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12409) Add metrics of execution time of EC recovery tasks
Lei (Eddy) Xu created HDFS-12409: Summary: Add metrics of execution time of EC recovery tasks Key: HDFS-12409 URL: https://issues.apache.org/jira/browse/HDFS-12409 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor Admins can use more metrics to monitor EC recovery tasks and get insights for tuning recovery performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12351) Explicitly describe the minimal number of DataNodes required to support an EC policy in EC document.
Lei (Eddy) Xu created HDFS-12351: Summary: Explicitly describe the minimal number of DataNodes required to support an EC policy in EC document. Key: HDFS-12351 URL: https://issues.apache.org/jira/browse/HDFS-12351 Project: Hadoop HDFS Issue Type: Improvement Components: documentation, erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor We should explicitly call out the minimum number of DataNodes (i.e., 5 for RS(3,2)) in the EC document, to make it easy to understand for non-storage people. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12349) Improve log message when
Lei (Eddy) Xu created HDFS-12349: Summary: Improve log message when Key: HDFS-12349 URL: https://issues.apache.org/jira/browse/HDFS-12349 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor When an EC output stream cannot allocate enough blocks for the parity blocks, it logs the following warning: {code} if (blocks[i] == null) { LOG.warn("Failed to get block location for parity block, index=" + i); {code} We should clarify the cause in this warning message. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12263) Revise StreamCapabilities doc to describe the API usage and the requirements for customized OutputStream implementation
Lei (Eddy) Xu created HDFS-12263: Summary: Revise StreamCapabilities doc to describe the API usage and the requirements for customized OutputStream implementation Key: HDFS-12263 URL: https://issues.apache.org/jira/browse/HDFS-12263 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu [~busbey] raised the concern that we should call out the expected way to use {{StreamCapabilities}} from the client side. This doc should also describe the rules that any {{FSOutputStream}} implementation must follow. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12260) StreamCapabilities.StreamCapability should be public.
Lei (Eddy) Xu created HDFS-12260: Summary: StreamCapabilities.StreamCapability should be public. Key: HDFS-12260 URL: https://issues.apache.org/jira/browse/HDFS-12260 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Clients should use the {{StreamCapability}} enum instead of a raw string to query the capabilities of an OutputStream, for better type safety, IDE support, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
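For illustration, the difference from the client's point of view (a sketch; {{getValue()}} is assumed to be the accessor exposing the enum's underlying capability string):
{code:java}
// Today: query with a raw string; a typo only shows up at runtime.
boolean canHflush = out.hasCapability("hflush");

// With a public enum: the capability name is type-checked and IDE-discoverable.
boolean canHflushTyped =
    out.hasCapability(StreamCapabilities.StreamCapability.HFLUSH.getValue());
{code}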
[jira] [Created] (HDFS-12251) Add document for StreamCapabilities
Lei (Eddy) Xu created HDFS-12251: Summary: Add document for StreamCapabilities Key: HDFS-12251 URL: https://issues.apache.org/jira/browse/HDFS-12251 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Update filesystem docs to describe the purpose and usage of {{StreamCapabilities}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12234) [SPS] Allow setting Xattr without SPS running.
Lei (Eddy) Xu created HDFS-12234: Summary: [SPS] Allow setting Xattr without SPS running. Key: HDFS-12234 URL: https://issues.apache.org/jira/browse/HDFS-12234 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: HDFS-10285 Reporter: Lei (Eddy) Xu As discussed in HDFS-10285, if this API is to be widely used by downstream projects (e.g., HBase), it should allow the client to call it without first querying the running status of the SPS service; otherwise it introduces a great burden on users of the API. Given the constraints the SPS service has (i.e., it cannot run together with the Mover, and might be disabled by default), the API call should succeed as long as the related xattr is persisted. SPS can run later and catch up. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12233) Add API to unset SPS on a path
Lei (Eddy) Xu created HDFS-12233: Summary: Add API to unset SPS on a path Key: HDFS-12233 URL: https://issues.apache.org/jira/browse/HDFS-12233 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode, namenode Affects Versions: HDFS-10285 Reporter: Lei (Eddy) Xu As discussed in HDFS-10285, we should allow unsetting SPS on a path. For example, a user might mistakenly set SPS on "/" and trigger a significant amount of data movement. Unsetting SPS will allow the user to fix their own mistake. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12221) Replace Xerces in XmlEditsVisitor
Lei (Eddy) Xu created HDFS-12221: Summary: Replace Xerces in XmlEditsVisitor Key: HDFS-12221 URL: https://issues.apache.org/jira/browse/HDFS-12221 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu XmlEditsVisitor should use the XML serialization capabilities built into newer JDKs, to make JAR shading easier (HADOOP-14672). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12215) DataNode#transferBlock does not create its daemon in the xceiver thread group
Lei (Eddy) Xu created HDFS-12215: Summary: DataNode#transferBlock does not create its daemon in the xceiver thread group Key: HDFS-12215 URL: https://issues.apache.org/jira/browse/HDFS-12215 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu As mentioned in HDFS-12044, the daemon started by DataNode#transferBlock is not counted in the xceiver count. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
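A sketch of the change this implies (illustrative; {{dataTransfer}} stands in for the existing {{DataTransfer}} runnable built in transferBlock): create the daemon inside the DataNode's xceiver thread group, as DataXceiver threads already are, so it becomes visible to the xceiver count.
{code:java}
// Sketch: wrap the existing DataTransfer runnable in a Daemon that belongs to
// the DataNode's xceiver thread group, so getXceiverCount() accounts for it.
new Daemon(datanode.threadGroup, dataTransfer).start();
{code}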
[jira] [Created] (HDFS-12208) NN should consider DataNode#xmitInProgress when placing new block
Lei (Eddy) Xu created HDFS-12208: Summary: NN should consider DataNode#xmitInProgress when placing new block Key: HDFS-12208 URL: https://issues.apache.org/jira/browse/HDFS-12208 Project: Hadoop HDFS Issue Type: Improvement Components: block placement, erasure-coding Affects Versions: 3.0.0-alpha4 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor As discussed in HDFS-12044, the NN only considers xceiver counts on the DN when placing new blocks. The NN should also consider background reconstruction work, represented by xmits on the DN. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12072) Provide fairness between EC and non-EC recovery tasks.
Lei (Eddy) Xu created HDFS-12072: Summary: Provide fairness between EC and non-EC recovery tasks. Key: HDFS-12072 URL: https://issues.apache.org/jira/browse/HDFS-12072 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu In {{DatanodeManager#handleHeartbeat}}, it takes up to {{maxTransfer}} reconstruction tasks for non-EC first; only if that request cannot be filled does it take more tasks from the EC reconstruction queue. {code} List pendingList = nodeinfo.getReplicationCommand( maxTransfers); if (pendingList != null) { cmds.add(new BlockCommand(DatanodeProtocol.DNA_TRANSFER, blockPoolId, pendingList)); maxTransfers -= pendingList.size(); } // check pending erasure coding tasks List pendingECList = nodeinfo .getErasureCodeCommand(maxTransfers); if (pendingECList != null) { cmds.add(new BlockECReconstructionCommand( DNA_ERASURE_CODING_RECONSTRUCTION, pendingECList)); } {code} So on a large cluster, if there is constantly a large number of non-EC reconstruction tasks, EC reconstruction tasks never get a chance to run. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
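One illustrative way to add fairness (a sketch, not the committed patch; the backlog accessors are assumptions): split {{maxTransfers}} between the two queues in proportion to their pending work instead of always draining the replication queue first.
{code:java}
// Sketch: give each queue a share of maxTransfers proportional to its backlog,
// so EC reconstruction is not starved by a constant replication backlog.
int replBacklog = nodeinfo.getNumberOfBlocksToBeReplicated();
int ecBacklog = nodeinfo.getNumberOfBlocksToBeErasureCoded();
int total = replBacklog + ecBacklog;
if (total > 0) {
  int replShare = (int) Math.round((double) maxTransfers * replBacklog / total);
  int ecShare = maxTransfers - replShare;
  List pendingList = nodeinfo.getReplicationCommand(replShare);
  List pendingECList = nodeinfo.getErasureCodeCommand(ecShare);
  // build the BlockCommand / BlockECReconstructionCommand from the two lists as before
}
{code}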
[jira] [Created] (HDFS-12065) Fix log format in StripedBlockReconstructor
Lei (Eddy) Xu created HDFS-12065: Summary: Fix log format in StripedBlockReconstructor Key: HDFS-12065 URL: https://issues.apache.org/jira/browse/HDFS-12065 Project: Hadoop HDFS Issue Type: Improvement Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Trivial The {{LOG}} call uses the wrong signature in {{StripedBlockReconstructor}}, which results in the following message being printed without the stack trace:
{code}
Failed to reconstruct striped block: BP-1026491657-172.31.114.203-1498498077419:blk_-9223372036854759232_5065
java.lang.NullPointerException
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
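For illustration, a generic SLF4J sketch (hypothetical class, not the actual StripedBlockReconstructor code) of the kind of signature mistake that loses the stack trace, and the variant that keeps it.
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical class used only to show the two logging signatures.
public class LogSignatureSketch {
  private static final Logger LOG = LoggerFactory.getLogger(LogSignatureSketch.class);

  public static void main(String[] args) {
    Exception e = new NullPointerException();
    String block = "blk_-9223372036854759232_5065";

    // Lossy: the exception is rendered via toString(), so only
    // "java.lang.NullPointerException" appears and the stack trace is gone.
    LOG.warn("Failed to reconstruct striped block: {} {}", block, e.toString());

    // Correct: pass the throwable as the last, un-templated argument so the
    // full stack trace is logged.
    LOG.warn("Failed to reconstruct striped block: {}", block, e);
  }
}
{code}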
[jira] [Created] (HDFS-12044) Mismatch between BlockManager#maxReplicationStreams and ErasureCodingWorker.stripedReconstructionPool pool size causes slow and bursty recovery.
Lei (Eddy) Xu created HDFS-12044: Summary: Mismatch between BlockManager#maxReplicationStreams and ErasureCodingWorker.stripedReconstructionPool pool size causes slow and bursty recovery. Key: HDFS-12044 URL: https://issues.apache.org/jira/browse/HDFS-12044 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu {{ErasureCodingWorker#stripedReconstructionPool}} defaults to {{corePoolSize=2}} and {{maxPoolSize=8}}, and it rejects additional tasks when its queue is full. Problems arise when {{BlockManager#maxReplicationStreams}} is larger than the pool size of {{ErasureCodingWorker#stripedReconstructionPool}}, for example {{maxReplicationStreams=20}} with {{corePoolSize=2, maxPoolSize=8}}. The NN sends up to {{maxTransfer}} reconstruction tasks to the DN on each heartbeat, calculated in {{FSNamesystem}}:
{code}
final int maxTransfer = blockManager.getMaxReplicationStreams() - xmitsInProgress;
{code}
However, at any given time, {{ErasureCodingWorker#stripedReconstructionPool}} only contributes 2 to {{xmitsInProgress}}. So for each 3s heartbeat, the NN sends about {{20-2 = 18}} reconstruction tasks to the DN, and the DN throws most of them away if there are already 8 tasks in its queue. The NN then takes longer to re-detect that these blocks are under-replicated and to schedule new tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
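For illustration, a self-contained sketch of the arithmetic described above, using the numbers from the description; it is not DataNode code, only a way to see how many tasks get dropped per heartbeat.
{code:java}
// Illustrative arithmetic only, using the numbers from the description.
public class EcPoolMismatchSketch {
  public static void main(String[] args) {
    int maxReplicationStreams = 20; // NN-side limit
    int corePoolSize = 2;           // DN-side striped reconstruction pool
    int queueCapacity = 8;          // DN-side bounded task queue

    int xmitsInProgress = corePoolSize; // per the description, the pool only contributes 2
    int maxTransfer = maxReplicationStreams - xmitsInProgress; // 18 tasks per heartbeat

    int accepted = Math.min(maxTransfer, queueCapacity); // the rest are rejected
    System.out.println("sent=" + maxTransfer
        + ", accepted=" + accepted
        + ", dropped=" + (maxTransfer - accepted)); // sent=18, accepted=8, dropped=10
  }
}
{code}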
[jira] [Created] (HDFS-12033) DatanodeManager picking EC recovery tasks should also consider the number of regular replication tasks.
Lei (Eddy) Xu created HDFS-12033: Summary: DatanodeManager picking EC recovery tasks should also consider the number of regular replication tasks. Key: HDFS-12033 URL: https://issues.apache.org/jira/browse/HDFS-12033 Project: Hadoop HDFS Issue Type: Bug Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu In {{DatanodeManager#handleHeartbeat}}, it chooses up to {{maxTransfers}} items from the pending replication list and up to {{maxTransfers}} items from the pending EC list. It should send only {{maxTransfers}} tasks combined to the DN. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-11987) DistributedFileSystem#create and append do not honor CreateFlag.CREATE|APPEND
Lei (Eddy) Xu created HDFS-11987: Summary: DistributedFileSystem#create and append do not honor CreateFlag.CREATE|APPEND Key: HDFS-11987 URL: https://issues.apache.org/jira/browse/HDFS-11987 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 3.0.0-alpha3, 2.8.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu {{DistributedFileSystem#create()}} and {{DistributedFileSystem#append()}} do not honor the expected behavior of {{CreateFlag.CREATE|APPEND}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
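For illustration, a sketch of what a client expects from {{CreateFlag.CREATE|APPEND}} through the public {{FileSystem}} API: create the file when it does not exist, append when it does. The path and buffer size are illustrative.
{code:java}
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CreateAppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/create-append-demo"); // illustrative path

    // Expected semantics of CREATE|APPEND: create the file if it is missing,
    // otherwise open it for append.
    try (FSDataOutputStream out = fs.create(path,
        FsPermission.getFileDefault(),
        EnumSet.of(CreateFlag.CREATE, CreateFlag.APPEND),
        4096,                               // buffer size
        fs.getDefaultReplication(path),
        fs.getDefaultBlockSize(path),
        null)) {                            // no progress callback
      out.writeBytes("created or appended\n");
    }
  }
}
{code}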
[jira] [Created] (HDFS-11975) Provide a system-default EC policy
Lei (Eddy) Xu created HDFS-11975: Summary: Provide a system-default EC policy Key: HDFS-11975 URL: https://issues.apache.org/jira/browse/HDFS-11975 Project: Hadoop HDFS Issue Type: Sub-task Components: erasure-coding Affects Versions: 3.0.0-alpha3 Reporter: Lei (Eddy) Xu Assignee: SammiChen From the usability point of view, it would be nice to be able to specify a system-wide EC policy, e.g., in {{hdfs-site.xml}}. For most users / admins / downstream projects, it should not be necessary to understand the tradeoffs of each EC policy, since choosing one requires knowledge of EC, the actual physical topology of the cluster, and many other factors (e.g., network, cluster size, etc.). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-11651) Add a public API for specifying an EC policy at create time
[ https://issues.apache.org/jira/browse/HDFS-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-11651. -- Resolution: Duplicate This can be addressed in HADOOP-14394, so I am closing this JIRA as a duplicate. > Add a public API for specifying an EC policy at create time > --- > > Key: HDFS-11651 > URL: https://issues.apache.org/jira/browse/HDFS-11651 > Project: Hadoop HDFS > Issue Type: Improvement > Components: erasure-coding >Affects Versions: 3.0.0-alpha4 >Reporter: Andrew Wang > Labels: hdfs-ec-3.0-nice-to-have > > Follow-on work from HDFS-10996. We extended the create builder, but it still > requires casting to DistributedFileSystem to use, thus is not a public API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
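For illustration, a sketch of the casting pattern this JIRA wants to eliminate, assuming the builder surface added by HDFS-10996 ({{createFile}} / {{ecPolicyName}}); the path and policy name are illustrative.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EcCreateCastSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The downcast below is exactly what a public, FileSystem-level API
    // would make unnecessary.
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    try (FSDataOutputStream out = dfs.createFile(new Path("/ec/demo"))
        .ecPolicyName("RS-6-3-1024k") // assumes the builder method from HDFS-10996
        .build()) {
      out.writeBytes("hello\n");
    }
  }
}
{code}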
[jira] [Created] (HDFS-11659) TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten fail due to no
Lei (Eddy) Xu created HDFS-11659: Summary: TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten fail due to no Key: HDFS-11659 URL: https://issues.apache.org/jira/browse/HDFS-11659 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 3.0.0-alpha2, 2.7.3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu The test fails with the following error message:
{code}
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:57377,DS-b4ec61fc-657c-4e2a-9dc3-8d93b7769a2b,DISK], DatanodeInfoWithStorage[127.0.0.1:47448,DS-18bca8d7-048d-4d7f-9594-d2df16096a3d,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:57377,DS-b4ec61fc-657c-4e2a-9dc3-8d93b7769a2b,DISK], DatanodeInfoWithStorage[127.0.0.1:47448,DS-18bca8d7-048d-4d7f-9594-d2df16096a3d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1280)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1354)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1512)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1236)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:721)
{code}
In this case, the DataNode that has been removed cannot be used in the pipeline recovery. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-10638) Modifications to remove the assumption that StorageLocation is associated with java.io.File in Datanode.
[ https://issues.apache.org/jira/browse/HDFS-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-10638. -- Resolution: Fixed Fix Version/s: 3.0.0-alpha2 +1. Thanks for the good work! I re-ran the failed test and it passes on my laptop, so I committed the patch to trunk. > Modifications to remove the assumption that StorageLocation is associated > with java.io.File in Datanode. > > > Key: HDFS-10638 > URL: https://issues.apache.org/jira/browse/HDFS-10638 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: datanode, fs >Reporter: Virajith Jalaparti >Assignee: Virajith Jalaparti > Fix For: 3.0.0-alpha2 > > Attachments: HDFS-10638.001.patch, HDFS-10638.002.patch, > HDFS-10638.003.patch, HDFS-10638.004.patch, HDFS-10638.005.patch > > > Changes to ensure that {{StorageLocation}} need not be associated with a > {{java.io.File}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-10960) TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWritten fails at disk error verification after volume remove
[ https://issues.apache.org/jira/browse/HDFS-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-10960. -- Resolution: Fixed Fix Version/s: 2.9.0 2.8.0 Re-worked to commit 01 patch to branch-2 and branch-2.8. Thanks [~kihwal] and [~manojg] for working closely on this patch. > TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWritten fails at disk error > verification after volume remove > > > Key: HDFS-10960 > URL: https://issues.apache.org/jira/browse/HDFS-10960 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.0.0-alpha2 >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Minor > Fix For: 2.8.0, 2.9.0, 3.0.0-alpha2 > > Attachments: HDFS-10960.01.patch, HDFS-10960.02.patch > > > TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWritten fails occasionally in > the following verification. > {code} > 700 // If an IOException thrown from BlockReceiver#run, it triggers > 701 // DataNode#checkDiskError(). So we can test whether > checkDiskError() is called, > 702 // to see whether there is IOException in BlockReceiver#run(). > 703 assertEquals(lastTimeDiskErrorCheck, dn.getLastDiskErrorCheck()); > 704 > {code} > {noformat} > Error Message > expected:<0> but was:<6498109> > Stacktrace > java.lang.AssertionError: expected:<0> but was:<6498109> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWrittenForDatanode(TestDataNodeHotSwapVolumes.java:703) > at > org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten(TestDataNodeHotSwapVolumes.java:620) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10911) Change edit log OP_UPDATE_BLOCKS to store delta blocks only.
Lei (Eddy) Xu created HDFS-10911: Summary: Change edit log OP_UPDATE_BLOCKS to store delta blocks only. Key: HDFS-10911 URL: https://issues.apache.org/jira/browse/HDFS-10911 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.0.0-alpha1, 2.7.3 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Every time an HDFS client {{close}}s or {{hflush}}es an open file, the NameNode enumerates all of the file's blocks and stores them in the edit log (OP_UPDATE_BLOCKS). This causes problems when a client appends to a large file frequently (e.g., a WAL). Because HDFS is append-only, we only need to store the blocks that have changed (delta blocks) in the edit log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10734) Rename "dfs.ha.tail-edits.rolledits.timeout" to "dfs.ha.log-roll.execution.timeout"
Lei (Eddy) Xu created HDFS-10734: Summary: Rename "dfs.ha.tail-edits.rolledits.timeout" to "dfs.ha.log-roll.execution.timeout" Key: HDFS-10734 URL: https://issues.apache.org/jira/browse/HDFS-10734 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 2.9.0, 3.0.0-alpha2 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor HDFS-4176 introduced {{dfs.ha.tail-edits.rolledits.timeout}}. [~Surendra Singh Lilhore] kindly suggested renaming it to {{dfs.ha.log-roll.execution.timeout}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Reopened] (HDFS-4176) EditLogTailer should call rollEdits with a timeout
[ https://issues.apache.org/jira/browse/HDFS-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu reopened HDFS-4176: - it was not related to HDFS-9659. > EditLogTailer should call rollEdits with a timeout > -- > > Key: HDFS-4176 > URL: https://issues.apache.org/jira/browse/HDFS-4176 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, namenode >Affects Versions: 2.0.2-alpha, 3.0.0-alpha1 >Reporter: Todd Lipcon >Assignee: Lei (Eddy) Xu > Fix For: 3.0.0-alpha1 > > Attachments: namenode.jstack4 > > > When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it > currently does so without a timeout. So, if the active NN has frozen (but not > actually crashed), this call can hang forever. This can then potentially > prevent the standby from becoming active. > This may actually considered a side effect of HADOOP-6762 -- if the RPC were > interruptible, that would also fix the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-4176) EditLogTailer should call rollEdits with a timeout
[ https://issues.apache.org/jira/browse/HDFS-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-4176. - Resolution: Duplicate HDFS-9659 adds timeout to RPC calls, thus fixed this issue. > EditLogTailer should call rollEdits with a timeout > -- > > Key: HDFS-4176 > URL: https://issues.apache.org/jira/browse/HDFS-4176 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha, namenode >Affects Versions: 2.0.2-alpha, 3.0.0-alpha1 >Reporter: Todd Lipcon >Assignee: Lei (Eddy) Xu > Attachments: namenode.jstack4 > > > When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it > currently does so without a timeout. So, if the active NN has frozen (but not > actually crashed), this call can hang forever. This can then potentially > prevent the standby from becoming active. > This may actually considered a side effect of HADOOP-6762 -- if the RPC were > interruptible, that would also fix the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-9310) TestDataNodeHotSwapVolumes fails occasionally
[ https://issues.apache.org/jira/browse/HDFS-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-9310. - Resolution: Duplicate Assignee: Lei (Eddy) Xu Fix Version/s: 2.8.0 It was fixed in HDFS-9137. > TestDataNodeHotSwapVolumes fails occasionally > - > > Key: HDFS-9310 > URL: https://issues.apache.org/jira/browse/HDFS-9310 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.8.0 >Reporter: Arpit Agarwal >Assignee: Lei (Eddy) Xu > Fix For: 2.8.0 > > > TestDataNodeHotSwapVolumes fails occasionally in Jenkins and locally. e.g. > https://builds.apache.org/job/PreCommit-HDFS-Build/13197/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testRemoveVolumeBeingWritten/ > *Error Message* > Timed out waiting for /test to reach 3 replicas > *Stacktrace* > java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 > replicas > at > org.apache.hadoop.hdfs.DFSTestUtil.waitReplication(DFSTestUtil.java:768) > at > org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWrittenForDatanode(TestDataNodeHotSwapVolumes.java:644) > at > org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten(TestDataNodeHotSwapVolumes.java:569) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10600) PlanCommand#getThresholdPercentage should not use throughput value.
Lei (Eddy) Xu created HDFS-10600: Summary: PlanCommand#getThresholdPercentage should not use throughput value. Key: HDFS-10600 URL: https://issues.apache.org/jira/browse/HDFS-10600 Project: Hadoop HDFS Issue Type: Sub-task Components: diskbalancer Affects Versions: 2.9.0, 3.0.0-beta1 Reporter: Lei (Eddy) Xu In {{PlanCommand#getThresholdPercentage}}:
{code}
private double getThresholdPercentage(CommandLine cmd) {
  if ((value <= 0.0) || (value > 100.0)) {
    value = getConf().getDouble(
        DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT,
        DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT_DEFAULT);
  }
  return value;
}
{code}
{{DISK_THROUGHPUT}} is in units of MB, so it does not make sense to return {{throughput}} as a percentage value. Btw, we should use {{THROUGHPUT}} instead of {{THRUPUT}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
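For illustration, a sketch of the direction suggested above: fall back to a percentage-typed default rather than the disk-throughput setting. The configuration key and default below are hypothetical placeholders, not actual DFSConfigKeys constants.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class ThresholdSketch {
  // Hypothetical names used only for illustration; not real DFSConfigKeys.
  static final String PLAN_THRESHOLD_KEY = "dfs.disk.balancer.plan.threshold.percent";
  static final double PLAN_THRESHOLD_DEFAULT = 10.0;

  static double getThresholdPercentage(double value, Configuration conf) {
    if ((value <= 0.0) || (value > 100.0)) {
      // Fall back to a value that is actually a percentage, not a throughput.
      value = conf.getDouble(PLAN_THRESHOLD_KEY, PLAN_THRESHOLD_DEFAULT);
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(getThresholdPercentage(-1.0, new Configuration())); // 10.0
  }
}
{code}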
[jira] [Created] (HDFS-10598) DiskBalancer does not execute multi-steps plan.
Lei (Eddy) Xu created HDFS-10598: Summary: DiskBalancer does not execute multi-steps plan. Key: HDFS-10598 URL: https://issues.apache.org/jira/browse/HDFS-10598 Project: Hadoop HDFS Issue Type: Sub-task Components: diskbalancer Affects Versions: 2.8.0, 3.0.0-beta1 Reporter: Lei (Eddy) Xu Priority: Critical I set up a 3-DN cluster, each DN with 2 small disks. After creating some files to fill HDFS, I added two more small disks to one DN and ran the diskbalancer on that DataNode. The disk usage before running the diskbalancer:
{code}
/dev/loop0 3.9G 2.1G 1.6G 58% /mnt/data1
/dev/loop1 3.9G 2.6G 1.1G 71% /mnt/data2
/dev/loop2 3.9G 17M 3.6G 1% /mnt/data3
/dev/loop3 3.9G 17M 3.6G 1% /mnt/data4
{code}
However, after running the diskbalancer (i.e., {{-query}} shows {{PLAN_DONE}}):
{code}
/dev/loop0 3.9G 1.2G 2.5G 32% /mnt/data1
/dev/loop1 3.9G 2.6G 1.1G 71% /mnt/data2
/dev/loop2 3.9G 953M 2.7G 26% /mnt/data3
/dev/loop3 3.9G 17M 3.6G 1% /mnt/data4
{code}
It is suspicious that in {{DiskBalancerMover#copyBlocks}}, every return path calls {{this.setExitFlag()}}, which prevents {{copyBlocks()}} from being called multiple times from {{DiskBalancer#executePlan}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10567) Improve plan command help message
Lei (Eddy) Xu created HDFS-10567: Summary: Improve plan command help message Key: HDFS-10567 URL: https://issues.apache.org/jira/browse/HDFS-10567 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Lei (Eddy) Xu
{code}
--bandwidth Maximum disk bandwidth to be consumed by diskBalancer. e.g. 10
--maxerror Describes how many errors can be tolerated while copying between a pair of disks.
--outFile to write output to, if not specified defaults will be used.
--plan creates a plan for datanode.
--thresholdPercentage Percentage skew that wetolerate before diskbalancer starts working e.g. 10
--v Print out the summary of the plan on console
{code}
We should
* Put the unit into {{--bandwidth}} or its help message. Is it an integer or a float / double number? This is not clear from the CLI message.
* Give more details about {{--plan}}. It is not clear what the {{}} is for.
* {{--thresholdPercentage}} has the typo {{wetolerate}} in its help message. It also needs to indicate that it is the difference in space utilization between two disks / volumes. Is it an integer or a float / double number?
Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10552) DiskBalancer "-query" results in NPE if no plan for the node
Lei (Eddy) Xu created HDFS-10552: Summary: DiskBalancer "-query" results in NPE if no plan for the node Key: HDFS-10552 URL: https://issues.apache.org/jira/browse/HDFS-10552 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: HDFS-1312 Reporter: Lei (Eddy) Xu Priority: Critical {code} 16/06/20 11:50:16 INFO command.Command: Executing "query plan" command. java.lang.NullPointerException at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$QueryPlanStatusResponseProto$Builder.setPlanID(ClientDatanodeProtocolProtos.java:12782) at org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.queryDiskBalancerPlan(ClientDatanodeProtocolServerSideTranslatorPB.java:340) at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17513) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10551) o.a.h.h.s.diskbalancer.command.Command does not actually verify options as expected.
Lei (Eddy) Xu created HDFS-10551: Summary: o.a.h.h.s.diskbalancer.command.Command does not actually verify options as expected. Key: HDFS-10551 URL: https://issues.apache.org/jira/browse/HDFS-10551 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Lei (Eddy) Xu Priority: Critical In {{diskbalancer.command.Command#verifyCommandOptions}}, the following code does not do what it is expected to do:
{code}
if (!validArgs.containsKey(opt.getArgName())) {
{code}
{{opt.getArgName()}} always returns "arg" instead of the option name, e.g., {{report}} or {{uri}}, which is the parameter it is expected to check. It should use {{opt.getLongOpt()}} to get the option names. The check passes on the branch only because {{opt.getArgName()}} always returns {{"arg"}}, which happens to be in {{validArgs}}. However, I don't think that is the intention of this function. Additionally, in the following code
{code}
validArguments.append("Valid arguments are : %n");
{code}
the {{%n}} has no effect, because {{append()}} does not format it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
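For illustration, a sketch of the intended check (not the actual Command.java patch): compare the option's long name, e.g. {{report}} or {{uri}}, against the valid-argument map instead of {{opt.getArgName()}}.
{code:java}
import java.util.Collections;
import java.util.Map;
import org.apache.commons.cli.Option;

public class VerifyOptionSketch {
  // validArgs is assumed to map long option names (e.g. "report", "uri") to
  // their descriptions, mirroring the validArgs map in Command.
  static void verifyOption(Map<String, String> validArgs, Option opt) {
    if (!validArgs.containsKey(opt.getLongOpt())) {
      throw new IllegalArgumentException("Invalid option: " + opt.getLongOpt());
    }
  }

  public static void main(String[] args) {
    Map<String, String> validArgs =
        Collections.singletonMap("report", "Report volume information");
    verifyOption(validArgs, new Option("r", "report", false, "Report volume information")); // passes
    verifyOption(validArgs, new Option("u", "uri", true, "NameNode URI"));                   // throws
  }
}
{code}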
[jira] [Created] (HDFS-10545) PlanCommand should use -fs instead of -uri to be consistent with other hdfs commands
Lei (Eddy) Xu created HDFS-10545: Summary: PlanCommand should use -fs instead of -uri to be consistent with other hdfs commands Key: HDFS-10545 URL: https://issues.apache.org/jira/browse/HDFS-10545 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: HDFS-1312 Reporter: Lei (Eddy) Xu Priority: Minor PlanCommand currently uses {{-uri}} to specify the NameNode, while all other hdfs commands (e.g., {{hdfs dfsadmin}} and {{hdfs balancer}}) use {{-fs}} to specify the NameNode. It'd be better to use {{-fs}} here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10541) When no actions in plan, error message says "Plan was generated more than 24 hours ago"
Lei (Eddy) Xu created HDFS-10541: Summary: When no actions in plan, error message says "Plan was generated more than 24 hours ago" Key: HDFS-10541 URL: https://issues.apache.org/jira/browse/HDFS-10541 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: HDFS-1312 Reporter: Lei (Eddy) Xu Priority: Minor The message is misleading. Instead, it should make it clear that there are no steps (or no action) to take in this plan - and should probably not error out. {code} 16/06/16 14:56:53 INFO command.Command: Executing "execute plan" command Plan was generated more than 24 hours ago. at org.apache.hadoop.hdfs.server.datanode.DiskBalancer.verifyTimeStamp(DiskBalancer.java:387) at org.apache.hadoop.hdfs.server.datanode.DiskBalancer.verifyPlan(DiskBalancer.java:315) at org.apache.hadoop.hdfs.server.datanode.DiskBalancer.submitPlan(DiskBalancer.java:173) at org.apache.hadoop.hdfs.server.datanode.DataNode.submitDiskBalancerPlan(DataNode.java:3059) at org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.submitDiskBalancerPlan(ClientDatanodeProtocolServerSideTranslatorPB.java:299) at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17509) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) {code} This happens when a plan looks like the following one: {code} {"volumeSetPlans":[],"nodeName":"a.b.c","nodeUUID":null,"port":20001,"timeStamp":0} {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10540) The CLI error message for disk balancer is not enabled is not clear.
Lei (Eddy) Xu created HDFS-10540: Summary: The CLI error message for disk balancer is not enabled is not clear. Key: HDFS-10540 URL: https://issues.apache.org/jira/browse/HDFS-10540 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: HDFS-1312 Reporter: Lei (Eddy) Xu When running the {{hdfs diskbalancer}} against a DN whose disk balancer feature is not enabled, it reports: {code} $ hdfs diskbalancer -plan 127.0.0.1 -uri hdfs://localhost 16/06/16 18:03:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Internal error, Unable to create JSON string. at org.apache.hadoop.hdfs.server.datanode.DiskBalancer.getVolumeNames(DiskBalancer.java:260) at org.apache.hadoop.hdfs.server.datanode.DataNode.getDiskBalancerSetting(DataNode.java:3105) at org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.getDiskBalancerSetting(ClientDatanodeProtocolServerSideTranslatorPB.java:359) at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17515) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) Caused by: org.apache.hadoop.hdfs.server.diskbalancer.DiskBalancerException: Disk Balancer is not enabled. at org.apache.hadoop.hdfs.server.datanode.DiskBalancer.checkDiskBalancerEnabled(DiskBalancer.java:293) at org.apache.hadoop.hdfs.server.datanode.DiskBalancer.getVolumeNames(DiskBalancer.java:251) ... 11 more {code} We should not directly throw IOE to the user. And it should explicitly explain the reason that the operation fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-10496) ExecuteCommand checks planFile in a wrong way
Lei (Eddy) Xu created HDFS-10496: Summary: ExecuteCommand checks planFile in a wrong way Key: HDFS-10496 URL: https://issues.apache.org/jira/browse/HDFS-10496 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: HDFS-1312 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Critical In {{ExecuteCommand#execute}}, it checks the plan file as:
{code}
Preconditions.checkArgument(planFile == null || planFile.isEmpty(),
    "Invalid plan file specified.");
{code}
The condition is inverted, so the check stops the execution precisely when a correct planFile argument is specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
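Presumably the precondition was meant to be the negation of the quoted check, i.e. reject a missing or empty plan file rather than a present one. A minimal sketch:
{code:java}
import com.google.common.base.Preconditions;

public class PlanFileCheckSketch {
  static void checkPlanFile(String planFile) {
    // Reject a missing or empty plan file, not a present one.
    Preconditions.checkArgument(planFile != null && !planFile.isEmpty(),
        "Invalid plan file specified.");
  }

  public static void main(String[] args) {
    checkPlanFile("plan.json"); // passes
    checkPlanFile("");          // throws IllegalArgumentException
  }
}
{code}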
[jira] [Created] (HDFS-10225) DataNode hot swap drives should recognize storage type tags.
Lei (Eddy) Xu created HDFS-10225: Summary: DataNode hot swap drives should recognize storage type tags. Key: HDFS-10225 URL: https://issues.apache.org/jira/browse/HDFS-10225 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.7.2 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu The current hot swap code only differentiates data dirs by their paths. People might want to change the storage types of certain data dirs from the default value in an existing cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
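For illustration, a sketch of the kind of configuration change involved: the same data dir paths, with a storage type tag added to one of them. Paths are illustrative; with the current hot swap code such a change is not recognized because the paths themselves are unchanged.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class StorageTypeSwapSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Before: both data dirs use the default storage type.
    conf.set("dfs.datanode.data.dir", "/data/1/dfs/dn,/data/2/dfs/dn");
    // After: the admin wants the second dir treated as SSD. Only the tag
    // changes; the paths stay the same, so the current hot swap code does
    // not see a difference.
    conf.set("dfs.datanode.data.dir", "[DISK]/data/1/dfs/dn,[SSD]/data/2/dfs/dn");
    System.out.println(conf.get("dfs.datanode.data.dir"));
  }
}
{code}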
[jira] [Resolved] (HDFS-9124) NullPointerException when underreplicated blocks are there
[ https://issues.apache.org/jira/browse/HDFS-9124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-9124. - Resolution: Duplicate Fix Version/s: 2.7.4 It was fixed in HDFS-9574. > NullPointerException when underreplicated blocks are there > -- > > Key: HDFS-9124 > URL: https://issues.apache.org/jira/browse/HDFS-9124 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Syed Akram >Assignee: Syed Akram > Fix For: 2.7.4 > > > 2015-09-22 09:48:47,830 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: dn1:50010:DataXceiver error > processing WRITE_BLOCK operation src: /dn1:42973 dst: /dn2:50010 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:186) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:677) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251) > at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9795) OIV Delimited should show which files are ACL-enabled.
Lei (Eddy) Xu created HDFS-9795: --- Summary: OIV Delimited should show which files are ACL-enabled. Key: HDFS-9795 URL: https://issues.apache.org/jira/browse/HDFS-9795 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 2.7.2 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Trivial In {{hdfs oiv}} delimited output, there is no easy way to see whether a file has ACLs. {{FsShell}} shows a {{+}} in the permission. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9715) Check storage ID uniqueness on datanode startup
Lei (Eddy) Xu created HDFS-9715: --- Summary: Check storage ID uniqueness on datanode startup Key: HDFS-9715 URL: https://issues.apache.org/jira/browse/HDFS-9715 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.7.2 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu We should check storage ID uniqueness on datanode startup. If someone has manually edited the storage ID files, or duplicated a directory (or re-added an old disk), they could end up with a duplicate storage ID and not realize it. The HDFS-7575 fix does generate a storage UUID for each storage, but it does not check the uniqueness of these UUIDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
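For illustration, a self-contained sketch of the startup check being proposed (not the actual DataNode patch): collect every storage ID and fail fast on a duplicate.
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StorageIdCheckSketch {
  static void checkUnique(List<String> storageIds) {
    Set<String> seen = new HashSet<>();
    for (String id : storageIds) {
      if (!seen.add(id)) {
        throw new IllegalStateException("Duplicate storage ID found: " + id);
      }
    }
  }

  public static void main(String[] args) {
    checkUnique(Arrays.asList("DS-1111", "DS-2222")); // passes
    checkUnique(Arrays.asList("DS-1111", "DS-1111")); // throws
  }
}
{code}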
[jira] [Resolved] (HDFS-8860) Remove unused Replica copyOnWrite code
[ https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-8860. - Resolution: Fixed I had a discussion with [~cmccabe] offline and learned that {{ReplicaInfo#unlinkBlock}} was designed for the append workload before HDFS-1700. It was not designed to remove the hardlinks created by {{DN}} upgrade. Since the code that created hardlinks when appending a file is gone, the patch is still valid as dead-code removal. > Remove unused Replica copyOnWrite code > -- > > Key: HDFS-8860 > URL: https://issues.apache.org/jira/browse/HDFS-8860 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.0.0, 2.8.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Fix For: 2.8.0 > > Attachments: HDFS-8860.0.patch > > > {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, > because {{isUnlinked()}} always returns true. > {code} > if (isUnlinked()) { > return false; > } > {code} > Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and > {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink > Lets remove the relevant code to eliminate the confusions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HDFS-8860) Remove unused Replica copyOnWrite code
[ https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu reopened HDFS-8860: - > Remove unused Replica copyOnWrite code > -- > > Key: HDFS-8860 > URL: https://issues.apache.org/jira/browse/HDFS-8860 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.0.0, 2.8.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Fix For: 2.8.0 > > Attachments: HDFS-8860.0.patch > > > {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, > because {{isUnlinked()}} always returns true. > {code} > if (isUnlinked()) { > return false; > } > {code} > Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and > {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink > Lets remove the relevant code to eliminate the confusions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-8860) Remove unused Replica copyOnWrite code
[ https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-8860. - Resolution: Invalid Revert the change. {{FinalizedReplica}} can return a valid {{unlinked}} value. > Remove unused Replica copyOnWrite code > -- > > Key: HDFS-8860 > URL: https://issues.apache.org/jira/browse/HDFS-8860 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.0.0, 2.8.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Fix For: 2.8.0 > > Attachments: HDFS-8860.0.patch > > > {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, > because {{isUnlinked()}} always returns true. > {code} > if (isUnlinked()) { > return false; > } > {code} > Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and > {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink > Lets remove the relevant code to eliminate the confusions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HDFS-8860) Remove unused Replica copyOnWrite code
[ https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu reopened HDFS-8860: - > Remove unused Replica copyOnWrite code > -- > > Key: HDFS-8860 > URL: https://issues.apache.org/jira/browse/HDFS-8860 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.0.0, 2.8.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > Fix For: 2.8.0 > > Attachments: HDFS-8860.0.patch > > > {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, > because {{isUnlinked()}} always returns true. > {code} > if (isUnlinked()) { > return false; > } > {code} > Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and > {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink > Lets remove the relevant code to eliminate the confusions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9312) Fix TestReplication to be FsDataset-agnostic.
Lei (Eddy) Xu created HDFS-9312: --- Summary: Fix TestReplication to be FsDataset-agnostic. Key: HDFS-9312 URL: https://issues.apache.org/jira/browse/HDFS-9312 Project: Hadoop HDFS Issue Type: Improvement Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor {{TestReplication}} uses raw file system access to inject dummy replica files. This makes {{TestReplication}} incompatible with non-file-based dataset implementations. We can fix it by using the existing {{FsDatasetTestUtils}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9292) Make TestFileCorruption independent of the underlying FsDataset implementation.
Lei (Eddy) Xu created HDFS-9292: --- Summary: Make TestFileCorruption independent of the underlying FsDataset implementation. Key: HDFS-9292 URL: https://issues.apache.org/jira/browse/HDFS-9292 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor {{TestFileCorruption}} manipulates the block data by directly accessing the block files on disk. {{MiniDFSCluster}} already offers ways to corrupt data. We can use that to make {{TestFileCorruption}} agnostic to the implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9291) Fix TestInterDatanodeProtocol to be FsDataset-agnostic.
Lei (Eddy) Xu created HDFS-9291: --- Summary: Fix TestInterDatanodeProtocol to be FsDataset-agnostic. Key: HDFS-9291 URL: https://issues.apache.org/jira/browse/HDFS-9291 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS, test Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor {{TestInterDatanodeProtocol}} assumes the fsdataset is {{FsDatasetImpl}}. This JIRA will make it dataset agnostic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9281) Change TestDiskError to not explicitly use File to check block pool existence.
Lei (Eddy) Xu created HDFS-9281: --- Summary: Change TestDiskError to not explicitly use File to check block pool existence. Key: HDFS-9281 URL: https://issues.apache.org/jira/browse/HDFS-9281 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS, test Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor {{TestDiskError}} checks the existence of a block pool by checking that the directories of the file-based block pool exist. However, this does not apply to non-file-based fsdataset implementations. We can fix it by abstracting the checking logic behind {{FsDatasetTestUtils}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9267) TestDiskError should get stored replicas through FsDatasetTestUtils.
Lei (Eddy) Xu created HDFS-9267: --- Summary: TestDiskError should get stored replicas through FsDatasetTestUtils. Key: HDFS-9267 URL: https://issues.apache.org/jira/browse/HDFS-9267 Project: Hadoop HDFS Issue Type: Improvement Components: test Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor {{TestDiskError#testReplicationError}} scans local directories to verify blocks and metadata files, which leaks the details of {{FsDataset}} implementation. This JIRA will abstract the "scanning" operation to {{FsDatasetTestUtils}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9252) Change TestFileTruncate to FsDatasetTestUtils to get block file size and genstamp.
Lei (Eddy) Xu created HDFS-9252: --- Summary: Change TestFileTruncate to FsDatasetTestUtils to get block file size and genstamp. Key: HDFS-9252 URL: https://issues.apache.org/jira/browse/HDFS-9252 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu {{TestFileTruncate}} verifies block size and genstamp by directly accessing the local filesystem, e.g.:
{code}
assertTrue(cluster.getBlockMetadataFile(dn0, newBlock.getBlock()).getName().endsWith(
    newBlock.getBlock().getGenerationStamp() + ".meta"));
{code}
Let's abstract the fsdataset-specific logic behind {{FsDatasetTestUtils}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in tests code.
Lei (Eddy) Xu created HDFS-9251: --- Summary: Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in tests code. Key: HDFS-9251 URL: https://issues.apache.org/jira/browse/HDFS-9251 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, the tests directly create block and metadata files:
{code}
replicaInfo.getBlockFile().createNewFile();
replicaInfo.getMetaFile().createNewFile();
{code}
This leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes to use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9188) Make block corruption related tests FsDataset-agnostic.
Lei (Eddy) Xu created HDFS-9188: --- Summary: Make block corruption related tests FsDataset-agnostic. Key: HDFS-9188 URL: https://issues.apache.org/jira/browse/HDFS-9188 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS, test Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Currently, HDFS runs block corruption tests by directly accessing the files stored in the storage directories, which assumes {{FsDatasetImpl}} is the dataset implementation. However, with work like Ozone (HDFS-7240) and HDFS-8679, there will be different FsDataset implementations, so we need a general way to run whitebox tests such as corrupting blocks and crc files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8860) Remove Replica hardlink / unlink code
Lei (Eddy) Xu created HDFS-8860: --- Summary: Remove Replica hardlink / unlink code Key: HDFS-8860 URL: https://issues.apache.org/jira/browse/HDFS-8860 Project: Hadoop HDFS Issue Type: Improvement Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, because {{isUnlinked()}} always returns true.
{code}
if (isUnlinked()) {
  return false;
}
{code}
Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and {{TestDatanodeRestart#testRecoverReplicas}}, are testing against the unlink behavior. Let's remove the relevant code to eliminate the confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8834) TestReplication#testReplicationWhenBlockCorruption is not valid after HDFS-6482
Lei (Eddy) Xu created HDFS-8834: --- Summary: TestReplication#testReplicationWhenBlockCorruption is not valid after HDFS-6482 Key: HDFS-8834 URL: https://issues.apache.org/jira/browse/HDFS-8834 Project: Hadoop HDFS Issue Type: Test Components: datanode Affects Versions: 2.7.1 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor {{TestReplication#testReplicationWhenBlockCorruption}} assumes the DN has only one level of directories:
{code}
File[] listFiles = participatedNodeDirs.listFiles();
{code}
However, HDFS-6482 changed the layout of block directories to use two levels of directories, which makes the following code invalid (it never runs):
{code}
for (File file : listFiles) {
  if (file.getName().startsWith(Block.BLOCK_FILE_PREFIX)
      && !file.getName().endsWith("meta")) {
    blockFile = file.getName();
    for (File file1 : nonParticipatedNodeDirs) {
      file1.mkdirs();
      new File(file1, blockFile).createNewFile();
      new File(file1, blockFile + "_1000.meta").createNewFile();
    }
    break;
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
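For illustration, a sketch of how the test could locate block files after HDFS-6482, where block files live a couple of directory levels below the data dir: walk the tree recursively instead of calling {{listFiles()}} once. The path in {{main}} is illustrative.
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FindBlockFilesSketch {
  // Walk the directory tree instead of a single listFiles() call, so block
  // files are found regardless of how many subdir levels the layout uses.
  static void collectBlockFiles(File dir, List<File> result) {
    File[] children = dir.listFiles();
    if (children == null) {
      return;
    }
    for (File f : children) {
      if (f.isDirectory()) {
        collectBlockFiles(f, result);
      } else if (f.getName().startsWith("blk_") && !f.getName().endsWith(".meta")) {
        result.add(f); // a block data file
      }
    }
  }

  public static void main(String[] args) {
    List<File> blocks = new ArrayList<>();
    collectBlockFiles(new File("/tmp/dfs/data/current"), blocks); // illustrative path
    System.out.println("found " + blocks.size() + " block files");
  }
}
{code}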
[jira] [Resolved] (HDFS-6672) Regression with hdfs oiv tool
[ https://issues.apache.org/jira/browse/HDFS-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-6672. - Resolution: Fixed Fix Version/s: 2.7.0 Hi, [~cnauroth]. Thanks for bringing this up. I have a few other {{oiv}}-related JIRAs under the umbrella JIRA HDFS-8061. I think we can close this JIRA for now. > Regression with hdfs oiv tool > - > > Key: HDFS-6672 > URL: https://issues.apache.org/jira/browse/HDFS-6672 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu >Priority: Minor > Labels: patch, regression, tools > Fix For: 2.7.0 > > > Because the fsimage format changes from Writeable encoding to ProtocolBuffer, > a new {{OIV}} tool was written. However it lacks a few features existed in > the old {{OIV}} tool, such as a _Delimited_ processor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8712) Remove "public" and "abstract" modifiers in FsVolumeSpi and FsDatasetSpi
Lei (Eddy) Xu created HDFS-8712: --- Summary: Remove "public" and "abstract" modifiers in FsVolumeSpi and FsDatasetSpi Key: HDFS-8712 URL: https://issues.apache.org/jira/browse/HDFS-8712 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Trivial In [Java Language Specification 9.4|http://docs.oracle.com/javase/specs/jls/se7/html/jls-9.html#jls-9.4]:
bq. It is permitted, but discouraged as a matter of style, to redundantly specify the public and/or abstract modifier for a method declared in an interface.
{{FsDatasetSpi}} and {{FsVolumeSpi}} mark methods as public, which causes many warnings in IDEs and {{checkstyle}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
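For illustration, a tiny sketch of the cleanup being proposed, on a hypothetical interface rather than the actual {{FsVolumeSpi}}/{{FsDatasetSpi}} code: interface methods are implicitly public and abstract, so the modifiers can simply be dropped.
{code:java}
// Hypothetical interface; the point is only the modifier style.
public interface VolumeSpiSketch {
  // Discouraged, redundant form:
  //   public abstract String getStorageID();

  // Equivalent declaration without the redundant modifiers:
  String getStorageID();
}
{code}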
[jira] [Created] (HDFS-8617) Throttle DiskChecker#checkDirs() speed.
Lei (Eddy) Xu created HDFS-8617: --- Summary: Throttle DiskChecker#checkDirs() speed. Key: HDFS-8617 URL: https://issues.apache.org/jira/browse/HDFS-8617 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.7.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu As described in HDFS-8564, {{DiskChecker.checkDirs(finalizedDir)}} is causing excessive I/Os because {{finalizedDirs}} might have up to 64K sub-directories (HDFS-6482). This patch proposes to limit the rate of IO operations in {{DiskChecker.checkDirs()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
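For illustration, a self-contained sketch of one way to throttle such a sweep (not the DiskChecker patch itself): cap the number of directory checks per second so that scanning tens of thousands of finalized sub-directories does not saturate the disk.
{code:java}
import java.io.File;

public class ThrottledDirCheckSketch {
  static void checkDirsThrottled(File root, int maxChecksPerSecond)
      throws InterruptedException {
    long sleepMs = 1000L / Math.max(1, maxChecksPerSecond);
    File[] subDirs = root.listFiles(File::isDirectory);
    if (subDirs == null) {
      return;
    }
    for (File dir : subDirs) {
      if (!dir.canRead() || !dir.canWrite() || !dir.canExecute()) {
        System.err.println("Bad directory: " + dir);
      }
      Thread.sleep(sleepMs); // simple rate limit between checks
    }
  }

  public static void main(String[] args) throws InterruptedException {
    checkDirsThrottled(new File("/tmp"), 100); // illustrative path and rate
  }
}
{code}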
[jira] [Created] (HDFS-8582) Spurious failure messages when running datanode reconfiguration
Lei (Eddy) Xu created HDFS-8582: --- Summary: Spurious failure messages when running datanode reconfiguration Key: HDFS-8582 URL: https://issues.apache.org/jira/browse/HDFS-8582 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.7.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor When running a DN reconfig to hotswap some drives, it spits out this output: {noformat} $ hdfs dfsadmin -reconfig datanode localhost:9023 status 15/06/09 14:58:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Reconfiguring status for DataNode[localhost:9023]: started at Tue Jun 09 14:57:37 PDT 2015 and finished at Tue Jun 09 14:57:56 PDT 2015. FAILED: Change property rpc.engine.org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolPB From: "org.apache.hadoop.ipc.ProtobufRpcEngine" To: "" Error: Property rpc.engine.org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolPB is not reconfigurable. FAILED: Change property mapreduce.client.genericoptionsparser.used From: "true" To: "" Error: Property mapreduce.client.genericoptionsparser.used is not reconfigurable. FAILED: Change property rpc.engine.org.apache.hadoop.ipc.ProtocolMetaInfoPB From: "org.apache.hadoop.ipc.ProtobufRpcEngine" To: "" Error: Property rpc.engine.org.apache.hadoop.ipc.ProtocolMetaInfoPB is not reconfigurable. SUCCESS: Change property dfs.datanode.data.dir From: "file:///data/1/user/dfs" To: "file:///data/1/user/dfs,file:///data/2/user/dfs" FAILED: Change property dfs.datanode.startup From: "REGULAR" To: "" Error: Property dfs.datanode.startup is not reconfigurable. FAILED: Change property rpc.engine.org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolPB From: "org.apache.hadoop.ipc.ProtobufRpcEngine" To: "" Error: Property rpc.engine.org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolPB is not reconfigurable. FAILED: Change property rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB From: "org.apache.hadoop.ipc.ProtobufRpcEngine" To: "" Error: Property rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not reconfigurable. FAILED: Change property rpc.engine.org.apache.hadoop.tracing.TraceAdminProtocolPB From: "org.apache.hadoop.ipc.ProtobufRpcEngine" To: "" Error: Property rpc.engine.org.apache.hadoop.tracing.TraceAdminProtocolPB is not reconfigurable. {noformat} These failed messages are spurious and should not be shown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8573) Move create restartMeta logic from BlockReceiver to ReplicaInPipeline
Lei (Eddy) Xu created HDFS-8573: --- Summary: Move create restartMeta logic from BlockReceiver to ReplicaInPipeline Key: HDFS-8573 URL: https://issues.apache.org/jira/browse/HDFS-8573 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.7.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu When the DN performs a quick restart, a {{.restart}} file is created for the {{ReplicaInPipeline}}. This logic should not expose the implementation details inside {{BlockReceiver}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8535) Clarify that dfs usage in dfsadmin -report output includes all block replicas.
Lei (Eddy) Xu created HDFS-8535: --- Summary: Clarify that dfs usage in dfsadmin -report output includes all block replicas. Key: HDFS-8535 URL: https://issues.apache.org/jira/browse/HDFS-8535 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.7.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor Some users get confused about this and think it is just the space used by the files, forgetting about the additional replicas that also take up space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HDFS-8322) Display warning if hadoop fs -ls is showing the local filesystem
[ https://issues.apache.org/jira/browse/HDFS-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu reopened HDFS-8322: - Re-open to put warnings behind an optional configuration. > Display warning if hadoop fs -ls is showing the local filesystem > > > Key: HDFS-8322 > URL: https://issues.apache.org/jira/browse/HDFS-8322 > Project: Hadoop HDFS > Issue Type: Improvement > Components: HDFS >Affects Versions: 2.7.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu >Priority: Minor > Attachments: HDFS-8322.000.patch > > > Using {{LocalFileSystem}} is rarely the intention of running {{hadoop fs > -ls}}. > This JIRA proposes displaying a warning message if hadoop fs -ls is showing > the local filesystem or using default fs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8322) Display warning if hadoop fs -ls is showing the local filesystem
Lei (Eddy) Xu created HDFS-8322: --- Summary: Display warning if hadoop fs -ls is showing the local filesystem Key: HDFS-8322 URL: https://issues.apache.org/jira/browse/HDFS-8322 Project: Hadoop HDFS Issue Type: Improvement Components: HDFS Affects Versions: 2.7.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor Using {{LocalFileSystem}} is rarely the intention of running {{hadoop fs -ls}}. This JIRA proposes displaying a warning message if hadoop fs -ls is showing the local filesystem or using default fs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8306) Generate ACL and Xattr outputs in OIV XML outputs
Lei (Eddy) Xu created HDFS-8306: --- Summary: Generate ACL and Xattr outputs in OIV XML outputs Key: HDFS-8306 URL: https://issues.apache.org/jira/browse/HDFS-8306 Project: Hadoop HDFS Issue Type: Sub-task Affects Versions: 2.7.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor Currently, the {{hdfs oiv}} XML output does not include all fields of the fsimage. This makes inspecting an {{fsimage}} through the XML output less practical, and it prevents recovering an fsimage from the XML file. This JIRA adds ACLs and XAttrs to the XML output as the first step towards the goal described in HDFS-8061. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8051) FsVolumeList#addVolume should release volume reference if not put it into BlockScanner.
Lei (Eddy) Xu created HDFS-8051: --- Summary: FsVolumeList#addVolume should release volume reference if not put it into BlockScanner. Key: HDFS-8051 URL: https://issues.apache.org/jira/browse/HDFS-8051 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu {{FsVolumeList#addVolume()}} passes {{FsVolumeReference}} to blockScanner:
{code}
if (blockScanner != null) {
  blockScanner.addVolumeScanner(ref);
}
{code}
However, if {{blockScanner == null}}, the {{FsVolumeReference}} will not be released. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
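For illustration, a self-contained sketch of the ownership rule the description implies (not the committed patch): whoever acquires a reference must either hand it off or release it, so the reference is closed whenever it is not passed to a scanner.
{code:java}
import java.io.Closeable;
import java.io.IOException;

public class ReferenceHandOffSketch {
  interface VolumeScanner {
    void addVolumeScanner(Closeable ref);
  }

  static void addVolume(VolumeScanner blockScanner, Closeable ref) throws IOException {
    if (blockScanner != null) {
      blockScanner.addVolumeScanner(ref); // the scanner now owns the reference
    } else {
      ref.close();                        // no hand-off: release it here
    }
  }

  public static void main(String[] args) throws IOException {
    Closeable ref = () -> System.out.println("reference released");
    addVolume(null, ref); // prints "reference released"
  }
}
{code}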
[jira] [Resolved] (HDFS-8006) Report removed storages after removing them by DataNode#checkDirs()
[ https://issues.apache.org/jira/browse/HDFS-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu resolved HDFS-8006. - Resolution: Invalid {{DataNode#handleDiskError}} schedules block reports already. > Report removed storages after removing them by DataNode#checkDirs() > --- > > Key: HDFS-8006 > URL: https://issues.apache.org/jira/browse/HDFS-8006 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Lei (Eddy) Xu >Assignee: Lei (Eddy) Xu > > Similar to HDFS-7961, after DN removes storages due to disk errors > (HDFS-7722), DN should send a full block report to NN to remove storages > (HDFS-7960) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8006) Report removed storages after removing them by DataNode#checkDirs()
Lei (Eddy) Xu created HDFS-8006: --- Summary: Report removed storages after removing them by DataNode#checkDirs() Key: HDFS-8006 URL: https://issues.apache.org/jira/browse/HDFS-8006 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Similar to HDFS-7961, after DN removes storages due to disk errors (HDFS-7722), DN should send a full block report to NN to remove storages (HDFS-7960) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7996) After swapping a volume, BlockReceiver reports ReplicaNotFoundException
Lei (Eddy) Xu created HDFS-7996: --- Summary: After swapping a volume, BlockReceiver reports ReplicaNotFoundException Key: HDFS-7996 URL: https://issues.apache.org/jira/browse/HDFS-7996 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Critical When removing a disk from an actively writing DataNode, the BlockReceiver working on the disk throws {{ReplicaNotFoundException}} because the replicas have been removed from memory: {code} 2015-03-26 08:02:43,154 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removed volume: /data/2/dfs/dn/current 2015-03-26 08:02:43,163 INFO org.apache.hadoop.hdfs.server.common.Storage: Removing block level storage: /data/2/dfs/dn/current/BP-51301509-10.20.202.114-1427296597742 2015-03-26 08:02:43,163 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run(): org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append to a non-existent replica BP-51301509-10.20.202.114-1427296597742:blk_1073742979_2160 at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaInfo(FsDatasetImpl.java:615) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1362) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.finalizeBlock(BlockReceiver.java:1281) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1241) at java.lang.Thread.run(Thread.java:745) {code} {{FsVolumeList#removeVolume}} waits for all threads to release their {{FsVolumeReference}} on the volume being removed; however, {{PacketResponder#finalizeBlock()}} calls {code} private void finalizeBlock(long startTime) throws IOException { BlockReceiver.this.close(); final long endTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime() : 0; block.setNumBytes(replicaInfo.getNumBytes()); datanode.data.finalizeBlock(block); {code} The {{FsVolumeReference}} is released in {{BlockReceiver.this.close()}} before {{datanode.data.finalizeBlock(block)}} is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
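One possible shape of a fix, sketched here only to show the ordering problem (not necessarily the committed patch), is to finalize the replica before {{BlockReceiver.this.close()}} releases the volume reference:
{code:java}
private void finalizeBlock(long startTime) throws IOException {
  // Finalize while the receiver still holds its FsVolumeReference, so that
  // FsVolumeList#removeVolume cannot tear the volume down underneath us.
  block.setNumBytes(replicaInfo.getNumBytes());
  datanode.data.finalizeBlock(block);
  // Only now release the reference held by the receiver.
  BlockReceiver.this.close();
  final long endTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime() : 0;
  // ... client trace logging as before ...
}
{code}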
[jira] [Created] (HDFS-7961) Trigger full block report after hot swapping disk
Lei (Eddy) Xu created HDFS-7961: --- Summary: Trigger full block report after hot swapping disk Key: HDFS-7961 URL: https://issues.apache.org/jira/browse/HDFS-7961 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Fix For: 3.0.0, 2.7.0 As discussed in HDFS-7960, the NN cannot remove the data storage metadata from its memory. The DN should trigger a full block report immediately after hot swapping drives. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7960) NameNode should prune storages that are no longer existed on DataNode
Lei (Eddy) Xu created HDFS-7960: --- Summary: NameNode should prune storages that are no longer existed on DataNode Key: HDFS-7960 URL: https://issues.apache.org/jira/browse/HDFS-7960 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu The NameNode should be able to remove storages that are no longer present on a DataNode, for example after drives were hot swapped out on that DataNode or the DN was restarted with fewer data dirs. It seems that once a storage is removed from a DataNode, the blocks on that storage are not reconciled as gone on the NameNode until the NameNode has been restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7917) Use file to replace data dirs in test to simulate a disk failure.
Lei (Eddy) Xu created HDFS-7917: --- Summary: Use file to replace data dirs in test to simulate a disk failure. Key: HDFS-7917 URL: https://issues.apache.org/jira/browse/HDFS-7917 Project: Hadoop HDFS Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor Currently, in several tests, e.g., {{TestDataNodeVolumeFailureXXX}} and {{TestDataNodeHotSwapVolumes}}, we simulate a disk failure by setting a directory's executable permission to false. However, this raises the risk that, if the cleanup code is not executed, the directory cannot be easily removed by the Jenkins job. Since {{DiskChecker#checkDirAccess}} does: {code} private static void checkDirAccess(File dir) throws DiskErrorException { if (!dir.isDirectory()) { throw new DiskErrorException("Not a directory: " + dir.toString()); } checkAccessByFileMethods(dir); } {code} we can replace the DN data directory with a file to achieve the same fault-injection goal, while being safer to clean up under any circumstance. Additionally, as [~cnauroth] suggested: bq. That might even let us enable some of these tests that are skipped on Windows, because Windows allows access for the owner even after permissions have been stripped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
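A rough sketch of the proposed injection (the helper and class names are illustrative): delete the data dir and drop a regular file in its place, so {{DiskChecker#checkDirAccess}} fails with "Not a directory" and cleanup is a plain file delete:
{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileUtil;

// Illustrative helper: simulate a failed disk by replacing the DN data
// directory with an ordinary file instead of stripping its permissions.
public class DataDirFailureInjector {
  static void injectDataDirFailure(File dataDir) throws IOException {
    FileUtil.fullyDelete(dataDir);        // remove the real directory
    if (!dataDir.createNewFile()) {       // leave a plain file in its place
      throw new IOException("Could not create failure marker " + dataDir);
    }
    // DiskChecker#checkDirAccess now throws "Not a directory: ..." for it,
    // and test cleanup only needs to delete a single file.
  }
}
{code}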
[jira] [Created] (HDFS-7908) Use larger value for fs.s3a.connection.timeout and change the unit to seconds.
Lei (Eddy) Xu created HDFS-7908: --- Summary: Use larger value for fs.s3a.connection.timeout and change the unit to seconds. Key: HDFS-7908 URL: https://issues.apache.org/jira/browse/HDFS-7908 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Priority: Minor The default value of {{fs.s3a.connection.timeout}} is {{5}} milliseconds. It causes many {{SocketTimeoutException}}s when uploading large files using {{hadoop fs -put}}. Also, the units for {{fs.s3a.connection.timeout}} and {{fs.s3a.connection.establish.timeout}} are milliseconds. For S3 connections, I do not think sub-second timeout values are necessary, so I suggest changing the time unit to seconds to ease the sysadmin's job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
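For reference, both keys can already be raised per client through {{Configuration}}; the values below are illustrative, not proposed defaults:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3aTimeoutExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Both keys are currently interpreted in milliseconds.
    conf.setInt("fs.s3a.connection.establish.timeout", 30000);  // 30 seconds
    conf.setInt("fs.s3a.connection.timeout", 300000);           // 5 minutes
    // Pass conf to the FileSystem / tool that performs the S3A upload.
  }
}
{code}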