[jira] [Created] (HDFS-13468) Add erasure coding metrics into ReadStatistics

2018-04-17 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-13468:


 Summary: Add erasure coding metrics into ReadStatistics
 Key: HDFS-13468
 URL: https://issues.apache.org/jira/browse/HDFS-13468
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.1, 3.1.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


Expose Erasure Coding related metrics for InputStream in ReadStatistics. 
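A minimal sketch of how a client could consume such metrics once they exist; the EC-specific getter name below is an assumption, not the final API:

{code:java}
// Sketch: read a striped file and inspect ReadStatistics afterwards. The existing
// counters (getTotalBytesRead, etc.) are real; getTotalEcDecodingTimeMillis() is a
// hypothetical name for the new EC metric this JIRA proposes.
try (FSDataInputStream in = fs.open(new Path("/ec/dir/file"))) {
  IOUtils.readFully(in, new byte[4096], 0, 4096);
  ReadStatistics stats = ((HdfsDataInputStream) in).getReadStatistics();
  LOG.info("total bytes read: " + stats.getTotalBytesRead());
  LOG.info("EC decoding time (ms): " + stats.getTotalEcDecodingTimeMillis());
}
{code}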

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-13350) Negative legacy block ID will confuse Erasure Coding to be considered as striped block

2018-03-26 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-13350:


 Summary: Negative legacy block ID will confuse Erasure Coding to 
be considered as striped block
 Key: HDFS-13350
 URL: https://issues.apache.org/jira/browse/HDFS-13350
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


HDFS-4645 changed HDFS block ID generation from random to sequentially increasing 
positive IDs. HDFS EC was later built on the assumption that normal 3x-replica 
block IDs are positive, so EC reuses negative IDs for striped blocks.

However, legacy block IDs in the system can still be negative, so we should not 
rely on a hardcoded check to decide whether a block is striped:

{code}
  public static boolean isStripedBlockID(long id) {
    return BlockType.fromBlockId(id) == STRIPED;
  }
{code}
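One possible direction (a sketch only, not necessarily the committed fix) is to treat the ID-based check as a hint and confirm against the blocks map on the NameNode side:

{code:java}
// Sketch: a hypothetical helper that confirms against the blocks map instead of
// trusting the sign bit alone, since legacy (randomly generated) IDs can be negative.
private static boolean isStripedBlock(BlockManager blockManager, Block block) {
  BlockInfo stored = blockManager.getStoredBlock(block);
  return stored != null && stored.isStriped();
}
{code}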



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-13218) Log audit event only used last EC policy name when add multiple policies from file

2018-03-02 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-13218.
--
  Resolution: Duplicate
   Fix Version/s: 3.0.1
  3.1.0
Target Version/s: 3.1.0, 3.0.2

Let's work on HDFS-13217.

> Log audit event only used last EC policy name when add multiple policies from 
> file 
> ---
>
> Key: HDFS-13218
> URL: https://issues.apache.org/jira/browse/HDFS-13218
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.1.0
>Reporter: liaoyuxiangqin
>Priority: Major
> Fix For: 3.1.0, 3.0.1
>
>
> When reading addErasureCodingPolicies() in the NameNode's FSNamesystem class, I 
> found that the following code only passes the last EC policy name to 
> logAuditEvent, so the audit log cannot track all of the policies added to the 
> ErasureCodingPolicyManager in a single call. Thanks.
> {code:java|title=FSNamesystem.java|borderStyle=solid}
> try {
>   checkOperation(OperationCategory.WRITE);
>   checkNameNodeSafeMode("Cannot add erasure coding policy");
>   for (ErasureCodingPolicy policy : policies) {
> try {
>   ErasureCodingPolicy newPolicy =
>   FSDirErasureCodingOp.addErasureCodingPolicy(this, policy,
>   logRetryCache);
>   addECPolicyName = newPolicy.getName();
>   responses.add(new AddErasureCodingPolicyResponse(newPolicy));
> } catch (HadoopIllegalArgumentException e) {
>   responses.add(new AddErasureCodingPolicyResponse(policy, e));
> }
>   }
>   success = true;
>   return responses.toArray(new AddErasureCodingPolicyResponse[0]);
> } finally {
>   writeUnlock(operationName);
>   if (success) {
> getEditLog().logSync();
>   }
>   logAuditEvent(success, operationName,addECPolicyName, null, null);
> }
> {code}
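For reference, a sketch of the kind of change this implies (illustrative only; HDFS-13217 carries the actual fix): collect every successfully added policy name and log them together.

{code:java}
// Sketch: track every successfully added policy name, then audit-log them all.
List<String> addedPolicyNames = new ArrayList<>();
for (ErasureCodingPolicy policy : policies) {
  try {
    ErasureCodingPolicy newPolicy =
        FSDirErasureCodingOp.addErasureCodingPolicy(this, policy, logRetryCache);
    addedPolicyNames.add(newPolicy.getName());
    responses.add(new AddErasureCodingPolicyResponse(newPolicy));
  } catch (HadoopIllegalArgumentException e) {
    responses.add(new AddErasureCodingPolicyResponse(policy, e));
  }
}
// ... and in the finally block:
logAuditEvent(success, operationName, String.join(",", addedPolicyNames), null, null);
{code}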



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-13175) Add more information for checking argument in DiskBalancerVolume

2018-02-20 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-13175:


 Summary: Add more information for checking argument in 
DiskBalancerVolume
 Key: HDFS-13175
 URL: https://issues.apache.org/jira/browse/HDFS-13175
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: diskbalancer
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


We have seen the following stack trace in production:

{code}
Exception in thread "main" java.lang.IllegalArgumentException
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at 
org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerVolume.setUsed(DiskBalancerVolume.java:268)
at 
org.apache.hadoop.hdfs.server.diskbalancer.connectors.DBNameNodeConnector.getVolumeInfoFromStorageReports(DBNameNodeConnector.java:141)
at 
org.apache.hadoop.hdfs.server.diskbalancer.connectors.DBNameNodeConnector.getNodes(DBNameNodeConnector.java:90)
at 
org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerCluster.readClusterInfo(DiskBalancerCluster.java:132)
at 
org.apache.hadoop.hdfs.server.diskbalancer.command.Command.readClusterInfo(Command.java:123)
at 
org.apache.hadoop.hdfs.server.diskbalancer.command.PlanCommand.execute(PlanCommand.java:107)
{code}

raised from 
{code}
  public void setUsed(long dfsUsedSpace) {
    Preconditions.checkArgument(dfsUsedSpace < this.getCapacity());
    this.used = dfsUsedSpace;
  }
{code}

However, the DataNode storage reports at that moment were not captured. We should 
include more information in the exception message to make such failures easier to diagnose.
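For example, something along these lines (a sketch of the direction only; the message wording and the {{getPath()}} accessor used below are assumptions):

{code:java}
  public void setUsed(long dfsUsedSpace) {
    // Sketch: include the offending values so the failing report shows up in the trace.
    Preconditions.checkArgument(dfsUsedSpace < this.getCapacity(),
        "DiskBalancerVolume.setUsed: dfsUsedSpace %s exceeds capacity %s on volume %s",
        dfsUsedSpace, this.getCapacity(), this.getPath());
    this.used = dfsUsedSpace;
  }
{code}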



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-13039) StripedBlockReader#createBlockReader leaks socket on IOException

2018-01-19 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-13039:


 Summary: StripedBlockReader#createBlockReader leaks socket on 
IOException
 Key: HDFS-13039
 URL: https://issues.apache.org/jira/browse/HDFS-13039
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, erasure-coding
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


When running EC on a cluster, the DataNode accumulates millions of {{CLOSE_WAIT}} 
connections:
{code:java}
$ grep CLOSE_WAIT lsof.out | wc -l
10358700

// All CLOSE_WAITs belong to the same DataNode process (pid=88527)
$ grep CLOSE_WAIT lsof.out | awk '{print $2}' | sort | uniq
88527
{code}

The DN then cannot open any files or sockets, as shown in the log:
{noformat}
2018-01-19 06:47:09,424 WARN io.netty.channel.DefaultChannelPipeline: An 
exceptionCaught() event was fired, and it reached at the tail of the pipeline. 
It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at 
io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135)
at 
io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:563)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:504)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:418)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:390)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
at java.lang.Thread.run(Thread.java:748)
{noformat}
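A likely shape of the fix (sketch only, not the exact committed patch) is to close the connected peer when {{createBlockReader}} fails; {{createReaderFromPeer}} below is hypothetical shorthand for the existing reader construction, and {{dnAddr}} is assumed to be in scope:

{code:java}
Peer peer = null;
try {
  peer = newConnectedPeer(block, dnAddr, blockToken, source);
  return createReaderFromPeer(peer);    // hypothetical: wraps BlockReaderRemote setup
} catch (IOException e) {
  LOG.info("Exception while creating remote block reader, datanode " + source, e);
  // Without this, the connected socket is leaked and lingers in CLOSE_WAIT.
  IOUtils.closeStream(peer);            // org.apache.hadoop.io.IOUtils, null-safe
  return null;
}
{code}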



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12994) TestReconstructStripedFile.testNNSendsErasureCodingTasks fails due to socket timeout

2018-01-08 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12994:


 Summary: TestReconstructStripedFile.testNNSendsErasureCodingTasks 
fails due to socket timeout
 Key: HDFS-12994
 URL: https://issues.apache.org/jira/browse/HDFS-12994
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


Occasionally, {{testNNSendsErasureCodingTasks}} fails due to a socket timeout:

{code}
2017-12-26 20:35:19,961 [StripedBlockReconstruction-0] INFO  datanode.DataNode 
(StripedBlockReader.java:createBlockReader(132)) - Exception while creating 
remote block reader, datanode 127.0.0.1:34145
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.newConnectedPeer(StripedBlockReader.java:148)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.createBlockReader(StripedBlockReader.java:123)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReader.<init>(StripedBlockReader.java:83)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.createReader(StripedReader.java:169)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.initReaders(StripedReader.java:150)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.init(StripedReader.java:133)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:56)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}

because the target datanode has already been removed by the test:

{code}
2017-12-26 20:35:18,710 [Thread-2393] INFO  net.NetworkTopology 
(NetworkTopology.java:remove(219)) - Removing a node: 
/default-rack/127.0.0.1:34145
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12953) XORRawDecoder.doDecode throws NullPointerException

2017-12-20 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12953:


 Summary: XORRawDecoder.doDecode throws NullPointerException
 Key: HDFS-12953
 URL: https://issues.apache.org/jira/browse/HDFS-12953
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu


Thanks [~danielpol] for reporting this on HDFS-12860.

{noformat}
17/11/30 04:19:55 INFO mapreduce.Job: map 0% reduce 0%
17/11/30 04:20:01 INFO mapreduce.Job: Task Id : 
attempt_1512036058655_0003_m_02_0, Status : FAILED
Error: java.lang.NullPointerException
at 
org.apache.hadoop.io.erasurecode.rawcoder.XORRawDecoder.doDecode(XORRawDecoder.java:83)
at 
org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:106)
at 
org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:170)
at 
org.apache.hadoop.hdfs.StripeReader.decodeAndFillBuffer(StripeReader.java:423)
at 
org.apache.hadoop.hdfs.StatefulStripeReader.decode(StatefulStripeReader.java:94)
at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:382)
at 
org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:318)
at 
org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:391)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:813)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.hadoop.examples.terasort.TeraInputFormat$TeraRecordReader.nextKeyValue(TeraInputFormat.java:257)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:563)
at 
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at 
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:794)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12927) Update erasure coding doc to address unsupported APIs

2017-12-14 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12927:


 Summary: Update erasure coding doc to address unsupported APIs
 Key: HDFS-12927
 URL: https://issues.apache.org/jira/browse/HDFS-12927
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


{{Concat}}, {{truncate}}, and {{setReplication}} are not (fully) supported for EC 
files. We should update the documentation to state this explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-12923) DFS.concat should throw exception if files have different EC policies.

2017-12-13 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-12923.
--
   Resolution: Won't Fix
Fix Version/s: 3.0.0

Resolved as not an issue.

> DFS.concat should throw exception if files have different EC policies. 
> ---
>
> Key: HDFS-12923
> URL: https://issues.apache.org/jira/browse/HDFS-12923
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: Lei (Eddy) Xu
>Priority: Critical
> Fix For: 3.0.0
>
>
> {{DFS#concat}} appends blocks from different files to a single file. However, 
> if these files have different EC policies, or mix replicated and EC files, the 
> resulting file would be problematic to read, because the EC codec is defined on 
> the INode instead of on a block.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-12921) DFS.setReplication should throw exception on EC files

2017-12-12 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-12921.
--
   Resolution: Won't Fix
Fix Version/s: 3.0.0

setReplication on an EC file is already a no-op in {{FSDirAttrOp#unprotectedSetReplication()}}.

> DFS.setReplication should throw exception on EC files
> -
>
> Key: HDFS-12921
> URL: https://issues.apache.org/jira/browse/HDFS-12921
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-beta1
>Reporter: Lei (Eddy) Xu
> Fix For: 3.0.0
>
>
> The check currently lives in {{o.a.h.fs.shell.SetReplication#processPath}}; however, 
> {{DistributedFileSystem#setReplication()}} is also a public API, so we should 
> move the check into {{DistributedFileSystem}} to prevent calling this API 
> directly on an EC file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12923) DFS.concat should throw exception if files have different EC policies.

2017-12-12 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12923:


 Summary: DFS.concat should throw exception if files have different 
EC policies. 
 Key: HDFS-12923
 URL: https://issues.apache.org/jira/browse/HDFS-12923
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu
Priority: Critical


{{DFS#concat}} appends blocks from different files to a single file. However, 
if these files have different EC policies, or mix replicated and EC files, the 
resulting file would be problematic to read, because the EC codec is defined on 
the INode instead of on a block.
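A sketch of the guard this implies, written against the public client API ({{dfs}}, {{target}}, and {{srcs}} are assumed variables; placement and wording are illustrative only):

{code:java}
// Sketch: refuse to concat files whose EC policies differ (null means replicated).
ErasureCodingPolicy targetPolicy = dfs.getErasureCodingPolicy(target);
for (Path src : srcs) {
  if (!Objects.equals(targetPolicy, dfs.getErasureCodingPolicy(src))) {
    throw new IOException("concat: " + src + " does not share the EC policy of " + target);
  }
}
dfs.concat(target, srcs);
{code}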





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12921) DFS.setReplication should throw IOE on EC files

2017-12-12 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12921:


 Summary: DFS.setReplication should throw IOE on EC files
 Key: HDFS-12921
 URL: https://issues.apache.org/jira/browse/HDFS-12921
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-beta1
Reporter: Lei (Eddy) Xu


The check currently lives in {{o.a.h.fs.shell.SetReplication#processPath}}; however, 
{{DistributedFileSystem#setReplication()}} is also a public API, so we should 
move the check into {{DistributedFileSystem}} to prevent calling this API 
directly on an EC file.
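A client-side sketch of the intended behavior ({{dfs}} and {{path}} are assumed variables; the real check would live inside {{DistributedFileSystem#setReplication}} itself):

{code:java}
// Sketch: setReplication should be rejected for erasure-coded files.
if (dfs.getErasureCodingPolicy(path) != null) {
  throw new IOException("setReplication is not supported on erasure-coded file " + path);
}
dfs.setReplication(path, (short) 2);
{code}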



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12860) TeraSort failed on erasure coding directory

2017-11-27 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12860:


 Summary: TeraSort failed on erasure coding directory
 Key: HDFS-12860
 URL: https://issues.apache.org/jira/browse/HDFS-12860
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu


Running TeraSort on a cluster with 8 DataNodes and 256 GB of data, using RS-3-2-1024k.

The benchmark fails with the following stack trace:

{code}
17/11/27 14:44:31 INFO mapreduce.Job:  map 45% reduce 0%
17/11/27 14:44:33 INFO mapreduce.Job: Task Id : 
attempt_1510080297865_0160_m_08_0, Status : FAILED
Error: java.lang.IllegalArgumentException
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil$VerticalRange.<init>(StripedBlockUtil.java:701)
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getRangesForInternalBlocks(StripedBlockUtil.java:442)
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.divideOneStripe(StripedBlockUtil.java:311)
at 
org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:308)
at 
org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:391)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:813)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.hadoop.examples.terasort.TeraInputFormat$TeraRecordReader.nextKeyValue(TeraInputFormat.java:257)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562)
at 
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at 
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12840) Creating a replicated file in a EC zone does not correctly serialized in EditLogs

2017-11-20 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12840:


 Summary: Creating a replicated file in a EC zone does not 
correctly serialized in EditLogs
 Key: HDFS-12840
 URL: https://issues.apache.org/jira/browse/HDFS-12840
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-beta1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Blocker


When creating a replicated file in an existing EC zone, the edit log does not 
differentiate it from an EC file. When {{FSEditLogLoader}} replays the edits, the 
file is treated as an EC file; as a result, the NN crashes because the blocks of 
this file are replicated, which does not match the {{INode}}.

{noformat}
ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered 
exception on operation AddBlockOp [path=/system/balancer.id, 
penultimateBlock=NULL, lastBlock=blk_1073743259_2455, RpcClientId=, 
RpcCallId=-2]
java.lang.IllegalArgumentException: reportedBlock is not striped
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoStriped.addStorage(BlockInfoStriped.java:118)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.addBlock(DatanodeStorageInfo.java:256)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addStoredBlock(BlockManager.java:3141)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addStoredBlockUnderConstruction(BlockManager.java:3068)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processAndHandleReportedBlock(BlockManager.java:3864)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processQueuedMessages(BlockManager.java:2916)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processQueuedMessagesForBlock(BlockManager.java:2903)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.addNewBlock(FSEditLogLoader.java:1069)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:532)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12819) Setting/Unsetting EC policy shows warning if the directory is not empty

2017-11-15 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12819:


 Summary: Setting/Unsetting EC policy shows warning if the 
directory is not empty
 Key: HDFS-12819
 URL: https://issues.apache.org/jira/browse/HDFS-12819
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


Because existing data is not converted when we set or unset an EC policy on a 
directory, a warning from the CLI would help set user expectations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12769) TestReadStripedFileWithDecodingCorruptData and TestReadStripedFileWithDecodingDeletedData timeout in trunk

2017-11-02 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12769:


 Summary: TestReadStripedFileWithDecodingCorruptData and 
TestReadStripedFileWithDecodingDeletedData timeout in trunk
 Key: HDFS-12769
 URL: https://issues.apache.org/jira/browse/HDFS-12769
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-beta1
Reporter: Lei (Eddy) Xu
Priority: Major


Recently, TestReadStripedFileWithDecodingCorruptData and 
TestReadStripedFileWithDecodingDeletedData have been failing frequently, for 
example in HDFS-12725.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12613) Native EC coder should implement release() as idempotent function.

2017-10-06 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12613:


 Summary: Native EC coder should implement release() as idempotent 
function.
 Key: HDFS-12613
 URL: https://issues.apache.org/jira/browse/HDFS-12613
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-beta1
Reporter: Lei (Eddy) Xu


Recently, we found that the native EC coder crashes the JVM when 
{{NativeRSDecoder#release()}} is called multiple times (HDFS-12612 and 
HDFS-12606).

We should strengthen the native implementation to make {{release()}} idempotent 
as well.
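A minimal sketch of the idea (the field name, locking, and native method name below are assumptions):

{code:java}
// Sketch: guard the native destroy call so that a second release() is a no-op.
private long nativeCoder;            // handle to the native context, 0 when released

public synchronized void release() {
  if (nativeCoder != 0) {
    destroyImpl(nativeCoder);        // hypothetical native method freeing the ISA-L context
    nativeCoder = 0;                 // mark released so repeated calls are safe
  }
}
{code}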



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12606) JVM crashes when running NNBench on EC enabled.

2017-10-05 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12606:


 Summary: JVM crashes when running NNBench on EC enabled. 
 Key: HDFS-12606
 URL: https://issues.apache.org/jira/browse/HDFS-12606
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-beta1
Reporter: Lei (Eddy) Xu
Priority: Critical


When running NNBench against an RS(6,3) directory, the JVM crashes with a double 
free or corruption error:

{code}
08:16:29 Running NNBENCH.
08:16:29 WARNING: Use "yarn jar" to launch YARN applications.
08:16:31 NameNode Benchmark 0.4
08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Test Inputs: 
08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Test Operation: create_write
08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Start time: 2017-10-04 08:18:31,16
:
:
08:18:54 *** Error in `/usr/java/jdk1.8.0_144/bin/java': double free or 
corruption (out): 0x7ffb55dbfab0 ***
08:18:54 === Backtrace: =
08:18:54 /lib64/libc.so.6(+0x7c619)[0x7ffb5b85f619]
08:18:54 [0x7ffb45017774]
08:18:54 === Memory map: 
08:18:54 0040-00401000 r-xp  ca:01 276832134 
/usr/java/jdk1.8.0_144/bin/java
08:18:54 0060-00601000 rw-p  ca:01 276832134 
/usr/java/jdk1.8.0_144/bin/java
08:18:54 0173e000-01f91000 rw-p  00:00 0 [heap]
08:18:54 60360-61470 rw-p  00:00 0 
08:18:54 61470-72bd0 ---p  00:00 0 
08:18:54 72bd0-73a50 rw-p  00:00 0 
08:18:54 73a50-7c000 ---p  00:00 0 
08:18:54 7c000-7c040 rw-p  00:00 0 
08:18:54 7c040-8 ---p  00:00 0 
08:18:54 7ffb20174000-7ffb208ab000 rw-p  00:00 0 
08:18:54 7ffb208ab000-7ffb20975000 ---p  00:00 0 
08:18:54 7ffb20975000-7ffb20b75000 rw-p  00:00 0 
08:18:54 7ffb20b75000-7ffb20d75000 rw-p  00:00 0 
08:18:54 7ffb20d75000-7ffb20d8a000 r-xp  ca:01 209866 
/usr/lib64/libgcc_s-4.8.5-20150702.so.1
08:18:54 7ffb20d8a000-7ffb20f89000 ---p 00015000 ca:01 209866 
/usr/lib64/libgcc_s-4.8.5-20150702.so.1
08:18:54 7ffb20f89000-7ffb20f8a000 r--p 00014000 ca:01 209866 
/usr/lib64/libgcc_s-4.8.5-20150702.so.1
08:18:54 7ffb20f8a000-7ffb20f8b000 rw-p 00015000 ca:01 209866 
/usr/lib64/libgcc_s-4.8.5-20150702.so.1
08:18:54 7ffb20f8b000-7ffb20fbd000 r-xp  ca:01 553654092 
/usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so
08:18:54 7ffb20fbd000-7ffb211bc000 ---p 00032000 ca:01 553654092 
/usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so
08:18:54 7ffb211bc000-7ffb211c2000 rw-p 00031000 ca:01 553654092 
/usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so
:
:
08:18:54 7ffb5c3fb000-7ffb5c3fc000 r--p  00:00 0 
08:18:54 7ffb5c3fc000-7ffb5c3fd000 rw-p  00:00 0 
08:18:54 7ffb5c3fd000-7ffb5c3fe000 r--p 00021000 ca:01 637266 
/usr/lib64/ld-2.17.so
08:18:54 7ffb5c3fe000-7ffb5c3ff000 rw-p 00022000 ca:01 637266 
/usr/lib64/ld-2.17.so
08:18:54 7ffb5c3ff000-7ffb5c40 rw-p  00:00 0 
08:18:54 7ffdf8767000-7ffdf8788000 rw-p  00:00 0 [stack]
08:18:54 7ffdf878b000-7ffdf878d000 r-xp  00:00 0 [vdso]
08:18:54 ff60-ff601000 r-xp  00:00 0 [vsyscall]
{code}

It happens with both {{jdk1.8.0_144}} and {{jdk1.8.0_121}} in our environments. 

The native code used in erasure coding is highly suspect; in particular, ISA-L is 
not thread safe: 
[https://01.org/sites/default/files/documentation/isa-l_open_src_2.10.pdf]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12604) StreamCapability enums are not displayed in javadoc

2017-10-05 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12604:


 Summary: StreamCapability enums are not displayed in javadoc
 Key: HDFS-12604
 URL: https://issues.apache.org/jira/browse/HDFS-12604
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.0.0-beta1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


http://hadoop.apache.org/docs/r3.0.0-beta1/api/org/apache/hadoop/fs/StreamCapabilities.html

{{StreamCapability#HFLUSH}} and {{StreamCapability#HSYNC}} are not displayed in 
the doc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12575) Improve test coverage for EC related edit logs ops

2017-10-02 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12575:


 Summary: Improve test coverage for EC related edit logs ops
 Key: HDFS-12575
 URL: https://issues.apache.org/jira/browse/HDFS-12575
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: SammiChen


HDFS-12569 found that we have little test coverage for the erasure coding edit 
log ops.

We have also seen the following bug bring down the SNN in our test environments:

{code}
6:42:18.177 AM  ERROR   FSEditLogLoader 
Encountered exception on operation AddBlockOp [path=/tmp/foo/bar, 
penultimateBlock=NULL, lastBlock=blk_1073743386_69322, RpcClientId=, 
RpcCallId=-2]
java.lang.IllegalArgumentException: reportedBlock is not striped
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at 

6:42:18.190 AM  FATAL   EditLogTailer   
Unknown error encountered while tailing edits. Shutting down standby NN.
java.io.IOException: java.lang.IllegalArgumentException: reportedBlock is not 
striped
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:251)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:150)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
at 
{code}
We should add coverage verifying that these important edit log ops (set/unset 
policy, enable/remove policies, etc.) are correctly persisted in the edit logs, 
and test scenarios such as (see the sketch after this list):

* Restart the NN
* Replay edits after a checkpoint
* Apply edits on the SNN
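For example, a minimal sketch (cluster setup, policy name, and the enable call are illustrative) of checking that a set-policy op survives an edit-log replay across a NameNode restart:

{code:java}
// Sketch: set an EC policy, restart the NN so the edits are replayed, and verify
// the policy is still attached to the directory.
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(0).build();
try {
  cluster.waitActive();
  DistributedFileSystem fs = cluster.getFileSystem();
  Path dir = new Path("/ec-edits-test");
  fs.mkdirs(dir);
  fs.enableErasureCodingPolicy("RS-6-3-1024k");  // only needed if not enabled by default
  fs.setErasureCodingPolicy(dir, "RS-6-3-1024k");
  cluster.restartNameNode(true);                 // forces the edit log to be replayed
  fs = cluster.getFileSystem();
  assertEquals("RS-6-3-1024k", fs.getErasureCodingPolicy(dir).getName());
} finally {
  cluster.shutdown();
}
{code}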



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12569) Unset EC policy logs empty payload in edit log

2017-09-29 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12569:


 Summary: Unset EC policy logs empty payload in edit log
 Key: HDFS-12569
 URL: https://issues.apache.org/jira/browse/HDFS-12569
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Blocker


Running {{hdfs ec -unsetPolicy}} generates an {{OP_REMOVE_XATTR}} entry in the 
edit log, but the payload is missing:

{code}
  <RECORD>
    <OPCODE>OP_REMOVE_XATTR</OPCODE>
    <DATA>
      <TXID>420481</TXID>
      <SRC>/</SRC>
      <RPC_CLIENTID>b098e758-9d7f-48b7-aa91-80ca52133b09</RPC_CLIENTID>
      <RPC_CALLID>0</RPC_CALLID>
    </DATA>
  </RECORD>
{code}

As a result, when the Active NN restarts, or the Standby NN replays the edits, 
this op has no effect.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12523) Thread pools in ErasureCodingWorker do not shutdown

2017-09-20 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12523:


 Summary: Thread pools in ErasureCodingWorker do not shutdown
 Key: HDFS-12523
 URL: https://issues.apache.org/jira/browse/HDFS-12523
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu


There is no code path in {{ErasureCodingWorker}} to shut down its two thread 
pools, {{stripedReconstructionPool}} and {{stripedReadPool}}.
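A minimal sketch of the missing path (the method name and where the DataNode would call it are assumptions):

{code:java}
// Sketch: let the DataNode tell the worker to stop its pools when it shuts down.
public void shutDown() {
  stripedReconstructionPool.shutdown();   // stop accepting new reconstruction tasks
  stripedReadPool.shutdown();             // stop accepting new striped-read tasks
}
{code}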



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-12483) Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery

2017-09-19 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-12483.
--
Resolution: Duplicate

> Provide a configuration to adjust the weight of EC recovery tasks to adjust 
> the speed of recovery
> -
>
> Key: HDFS-12483
> URL: https://issues.apache.org/jira/browse/HDFS-12483
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha4
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
>Priority: Minor
>
> The speed of EC recovery relative to 3x replica recovery is a function of the 
> EC codec, the number of sources, NIC speed, CPU speed, etc. Currently EC 
> recovery has a fixed {{xmitsInProgress}} of {{max(# of sources, # of targets)}}, 
> compared to {{1}} for 3x replica recovery, and the NN uses {{xmitsInProgress}} 
> to decide how many recovery tasks to schedule on a DataNode. Thus we can add a 
> coefficient for users to tune the weight of EC recovery tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12482) Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery

2017-09-18 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12482:


 Summary: Provide a configuration to adjust the weight of EC 
recovery tasks to adjust the speed of recovery
 Key: HDFS-12482
 URL: https://issues.apache.org/jira/browse/HDFS-12482
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


The speed of EC recovery relative to 3x replica recovery is a function of the EC 
codec, the number of sources, NIC speed, CPU speed, etc.

Currently EC recovery has a fixed {{xmitsInProgress}} of {{max(# of sources, # of 
targets)}}, compared to {{1}} for 3x replica recovery, and the NN uses 
{{xmitsInProgress}} to decide how many recovery tasks to schedule on a DataNode. 
Thus we can add a coefficient for users to tune the weight of EC recovery tasks; 
a sketch follows below.
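A sketch of the idea (the configuration key, its default, and the surrounding variable names are assumptions, not the final implementation):

{code:java}
// Sketch: scale the xmits consumed by one EC reconstruction task by a user-tunable weight.
float weight = conf.getFloat(
    "dfs.datanode.ec.reconstruction.xmits.weight", 1.0f);   // hypothetical key/default
int xmits = (int) Math.ceil(weight * Math.max(numSources, numTargets));
// xmits is reported to the NN via xmitsInProgress: a weight < 1 lets the DN accept
// more EC tasks, while a weight > 1 throttles them relative to 3x replication work.
{code}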



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12483) Provide a configuration to adjust the weight of EC recovery tasks to adjust the speed of recovery

2017-09-18 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12483:


 Summary: Provide a configuration to adjust the weight of EC 
recovery tasks to adjust the speed of recovery
 Key: HDFS-12483
 URL: https://issues.apache.org/jira/browse/HDFS-12483
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


The speed of EC recovery relative to 3x replica recovery is a function of the EC 
codec, the number of sources, NIC speed, CPU speed, etc.

Currently EC recovery has a fixed {{xmitsInProgress}} of {{max(# of sources, # of 
targets)}}, compared to {{1}} for 3x replica recovery, and the NN uses 
{{xmitsInProgress}} to decide how many recovery tasks to schedule on a DataNode. 
Thus we can add a coefficient for users to tune the weight of EC recovery tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12472) Add JUNIT timeout to TestBlockStatsMXBean

2017-09-15 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12472:


 Summary: Add JUNIT timeout to TestBlockStatsMXBean 
 Key: HDFS-12472
 URL: https://issues.apache.org/jira/browse/HDFS-12472
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Lei (Eddy) Xu
Priority: Minor


Add a JUnit timeout to {{TestBlockStatsMXBean}} so that it shows up in the test 
failure report if a timeout occurs.
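For example, a class-level rule along these lines (the timeout value is illustrative):

{code:java}
// Sketch: a global JUnit 4 timeout so a hung test case is reported as a failure
// instead of stalling the whole surefire run.
@Rule
public Timeout globalTimeout = Timeout.seconds(300);
{code}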




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-12439) TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally

2017-09-14 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-12439.
--
Resolution: Duplicate

Closing this one because HDFS-12449 has a patch available.

> TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally 
> 
>
> Key: HDFS-12439
> URL: https://issues.apache.org/jira/browse/HDFS-12439
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha4
>Reporter: Lei (Eddy) Xu
>  Labels: flaky-test
>
> With error message:
> {code}
> Error Message
> test timed out after 6 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 6 milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:917)
>   at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1199)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:842)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
>   at org.apache.hadoop.hdfs.DFSTestUtil.writeFile(DFSTestUtil.java:835)
>   at 
> org.apache.hadoop.hdfs.TestReconstructStripedFile.writeFile(TestReconstructStripedFile.java:273)
>   at 
> org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:461)
>   at 
> org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12439) TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally

2017-09-12 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12439:


 Summary: TestReconstructStripedFile.testNNSendsErasureCodingTasks 
fails occasionally 
 Key: HDFS-12439
 URL: https://issues.apache.org/jira/browse/HDFS-12439
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu


With error message:

{code}
Error Message

test timed out after 6 milliseconds
Stacktrace

java.lang.Exception: test timed out after 6 milliseconds
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:917)
at 
org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1199)
at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:842)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at org.apache.hadoop.hdfs.DFSTestUtil.writeFile(DFSTestUtil.java:835)
at 
org.apache.hadoop.hdfs.TestReconstructStripedFile.writeFile(TestReconstructStripedFile.java:273)
at 
org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:461)
at 
org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:439)
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-12360) TestLeaseRecoveryStriped.testLeaseRecovery failure

2017-09-12 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-12360.
--
Resolution: Duplicate

> TestLeaseRecoveryStriped.testLeaseRecovery failure
> --
>
> Key: HDFS-12360
> URL: https://issues.apache.org/jira/browse/HDFS-12360
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>
> TestLeaseRecoveryStriped.testLeaseRecovery failed:
> {code}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.TestLeaseRecoveryStriped
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 15.808 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.TestLeaseRecoveryStriped
> testLeaseRecovery(org.apache.hadoop.hdfs.TestLeaseRecoveryStriped)  Time 
> elapsed: 15.509 sec  <<< FAILURE!
> java.lang.AssertionError: failed testCase at i=0, blockLengths=[10485760, 
> 4194304, 6291456, 10485760, 11534336, 11534336, 6291456, 4194304, 3145728]
> java.io.IOException: Failed: the number of failed blocks = 4 > the number of 
> data blocks = 3
>   at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream.checkStreamers(DFSStripedOutputStream.java:393)
>   at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream.handleStreamerFailure(DFSStripedOutputStream.java:411)
>   at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream.flushAllInternals(DFSStripedOutputStream.java:1128)
>   at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream.checkStreamerFailures(DFSStripedOutputStream.java:628)
>   at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:564)
>   at 
> org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
>   at 
> org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:164)
>   at 
> org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:145)
>   at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:79)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:48)
>   at java.io.DataOutputStream.write(DataOutputStream.java:88)
>   at 
> org.apache.hadoop.hdfs.TestLeaseRecoveryStriped.writePartialBlocks(TestLeaseRecoveryStriped.java:182)
>   at 
> org.apache.hadoop.hdfs.TestLeaseRecoveryStriped.runTest(TestLeaseRecoveryStriped.java:158)
>   at 
> org.apache.hadoop.hdfs.TestLeaseRecoveryStriped.testLeaseRecovery(TestLeaseRecoveryStriped.java:147)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.hadoop.hdfs.TestLease

[jira] [Created] (HDFS-12412) Remove ErasureCodingWorker.stripedReadPool

2017-09-08 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12412:


 Summary: Remove ErasureCodingWorker.stripedReadPool
 Key: HDFS-12412
 URL: https://issues.apache.org/jira/browse/HDFS-12412
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


{{ErasureCodingWorker}} uses {{stripedReconstructionPool}} to schedule EC recovery 
tasks, and {{stripedReadPool}} for the reader threads within each recovery task. 
We only need one of them to throttle the speed of the recovery process, because 
each EC recovery task has a fixed number of source readers (e.g., 3 for RS(3,2)). 
And given the findings in HDFS-12044, the speed of EC recovery can be throttled 
by {{stripedReconstructionPool}} together with {{xmitsInProgress}}.

Moreover, keeping {{stripedReadPool}} makes it difficult for customers to 
understand and calculate the right balance between 
{{dfs.datanode.ec.reconstruction.stripedread.threads}}, 
{{dfs.datanode.ec.reconstruction.stripedblock.threads.size}} and 
{{maxReplicationStreams}}. For example, a {{stripedread.threads}} value that is 
small compared to what {{stripedblock.threads.size}} implies will unnecessarily 
limit the speed of recovery, leading to a larger MTTR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12409) Add metrics of execution time of EC recovery tasks

2017-09-08 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12409:


 Summary: Add metrics of execution time of EC recovery tasks
 Key: HDFS-12409
 URL: https://issues.apache.org/jira/browse/HDFS-12409
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


Admins could use more metrics to monitor EC recovery tasks and get insight into 
tuning recovery performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12351) Explicitly describe the minimal number of DataNodes required to support an EC policy in EC document.

2017-08-24 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12351:


 Summary: Explicitly describe the minimal number of DataNodes 
required to support an EC policy in EC document.
 Key: HDFS-12351
 URL: https://issues.apache.org/jira/browse/HDFS-12351
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation, erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


We should explicitly call out the minimal number of DataNodes required (i.e., 5 
for RS(3,2)) in the EC document, to make it easy to understand for non-storage people.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12349) Improve log message when

2017-08-24 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12349:


 Summary: Improve log message when 
 Key: HDFS-12349
 URL: https://issues.apache.org/jira/browse/HDFS-12349
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


When an EC output stream cannot allocate enough block locations for its parity 
blocks, it logs the following warning:
{code}
    if (blocks[i] == null) {
      LOG.warn("Failed to get block location for parity block, index=" + i);
    }
{code}

We should clarify the cause in this warning message; a possible rewording is sketched below.
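For example (suggested wording only; {{ecPolicy}} and {{excludedNodes}} are assumed to be in scope at that point in the code):

{code:java}
    if (blocks[i] == null) {
      // Sketch: point at the likely cause rather than only the parity index.
      LOG.warn("Cannot allocate parity block (index=" + i + ", policy="
          + ecPolicy.getName() + "). Not enough datanodes? Excluded nodes: "
          + excludedNodes);
    }
{code}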



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12263) Revise StreamCapacities doc to describe the API usage and the requirements for customized OutputStream implemetation

2017-08-04 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12263:


 Summary: Revise StreamCapacities doc to describe the API usage and 
the requirements for customized OutputStream implemetation
 Key: HDFS-12263
 URL: https://issues.apache.org/jira/browse/HDFS-12263
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


[~busbey] raised the concern that we should call out the expected way for clients 
to use {{StreamCapabilities}}. The doc should also describe the rules that any 
{{FSOutputStream}} implementation must follow.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12260) StreamCapabilities.StreamCapability should be public.

2017-08-04 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12260:


 Summary: StreamCapabilities.StreamCapability should be public.
 Key: HDFS-12260
 URL: https://issues.apache.org/jira/browse/HDFS-12260
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu


Clients should use the {{StreamCapability}} enum instead of a raw string to query 
the capabilities of an OutputStream, for better type safety, IDE support, etc.
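A sketch of the intended call pattern once the enum is public (this assumes the existing {{hasCapability(String)}} method and the enum's {{getValue()}} accessor):

{code:java}
// Sketch: query hflush support through the enum rather than the raw "hflush" string.
FSDataOutputStream out = fs.create(new Path("/tmp/capability-check"));
if (out.hasCapability(StreamCapabilities.StreamCapability.HFLUSH.getValue())) {
  out.hflush();   // the underlying stream advertises hflush support
}
{code}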



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12251) Add document for StreamCapabilities

2017-08-02 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12251:


 Summary: Add document for StreamCapabilities
 Key: HDFS-12251
 URL: https://issues.apache.org/jira/browse/HDFS-12251
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


Update filesystem docs to describe the purpose and usage of 
{{StreamCapabilities}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12234) [SPS] Allow setting Xattr without SPS running.

2017-07-31 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12234:


 Summary: [SPS] Allow setting Xattr without SPS running.
 Key: HDFS-12234
 URL: https://issues.apache.org/jira/browse/HDFS-12234
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: HDFS-10285
Reporter: Lei (Eddy) Xu


As discussed in HDFS-10285, if this API is to be widely used by downstream 
projects (e.g., HBase), clients should be able to call it without first querying 
the running status of the SPS service; otherwise the API is too burdensome to use.

Given the constraints of the SPS service (i.e., it cannot run together with the 
Mover, and it might be disabled by default), the API call should succeed as long 
as the related xattr is persisted. SPS can run later to catch up.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12233) Add API to unset SPS on a path

2017-07-31 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12233:


 Summary: Add API to unset SPS on a path
 Key: HDFS-12233
 URL: https://issues.apache.org/jira/browse/HDFS-12233
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Affects Versions: HDFS-10285
Reporter: Lei (Eddy) Xu


As discussed in HDFS-10285, we should allow to unset SPS on a path.

For example, a user might mistakenly set SPS on "/" and trigger a significant 
amount of data movement. Unsetting SPS would allow the user to fix that mistake.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12221) Replace xcerces in XmlEditsVisitor

2017-07-28 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12221:


 Summary: Replace xcerces in XmlEditsVisitor 
 Key: HDFS-12221
 URL: https://issues.apache.org/jira/browse/HDFS-12221
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu


XmlEditsVisitor should use the XML capabilities of newer JDKs instead of Xerces, 
to make JAR shading easier (HADOOP-14672).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12215) DataNode#transferBlock does not create its daemon in the xceiver thread group

2017-07-28 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12215:


 Summary: DataNode#transferBlock does not create its daemon in the 
xceiver thread group
 Key: HDFS-12215
 URL: https://issues.apache.org/jira/browse/HDFS-12215
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


As mentioned in HDFS-12044, the daemon created by DataNode#transferBlock is not 
counted towards the xceiver count.
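
A rough sketch of the idea ({{Daemon}} is {{org.apache.hadoop.util.Daemon}}; the thread group and transfer Runnable below are illustrative placeholders):

{code}
import org.apache.hadoop.util.Daemon;

// Starting the transfer thread inside the xceiver ThreadGroup means
// ThreadGroup#activeCount(), which backs the DataNode's xceiver count,
// will also reflect ongoing transferBlock work.
ThreadGroup xceiverGroup = new ThreadGroup("dataXceiverServer");
Runnable transferTask = () -> {
  // placeholder: stream the block replica to the target DataNodes
};
new Daemon(xceiverGroup, transferTask).start();
{code}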





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12208) NN should consider DataNode#xmitInProgress when placing new block

2017-07-27 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12208:


 Summary: NN should consider DataNode#xmitInProgress when placing 
new block
 Key: HDFS-12208
 URL: https://issues.apache.org/jira/browse/HDFS-12208
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: block placement, erasure-coding
Affects Versions: 3.0.0-alpha4
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


As discussed in HDFS-12044, the NN only considers the xceiver count on a DN 
when placing new blocks. The NN should also consider background reconstruction 
work, represented by the xmits in progress on the DN.
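
A hedged sketch of the intent (the helper and parameter names are placeholders, not the actual {{BlockPlacementPolicy}} API):

{code}
// Illustrative helper: combine the xceiver count and the xmits in progress
// into a single load figure when deciding whether a node is too busy to
// receive a new block.
static boolean isTooBusy(int xceiverCount, int xmitsInProgress, int maxLoad) {
  int effectiveLoad = xceiverCount + xmitsInProgress;
  return effectiveLoad > maxLoad;   // exclude the node from target selection
}
{code}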



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12072) Provide fairness between EC and non-EC recovery tasks.

2017-06-29 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12072:


 Summary: Provide fairness between EC and non-EC recovery tasks.
 Key: HDFS-12072
 URL: https://issues.apache.org/jira/browse/HDFS-12072
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


In {{DatanodeManager#handleHeartbeat}}, it takes up to {{maxTransfers}} 
reconstruction tasks for non-EC blocks first; only if that request cannot be 
fulfilled does it take more tasks from the EC reconstruction queue.

{code}
List<BlockTargetPair> pendingList = nodeinfo.getReplicationCommand(
    maxTransfers);
if (pendingList != null) {
  cmds.add(new BlockCommand(DatanodeProtocol.DNA_TRANSFER, blockPoolId,
      pendingList));
  maxTransfers -= pendingList.size();
}
// check pending erasure coding tasks
List<BlockECReconstructionInfo> pendingECList = nodeinfo
    .getErasureCodeCommand(maxTransfers);
if (pendingECList != null) {
  cmds.add(new BlockECReconstructionCommand(
      DNA_ERASURE_CODING_RECONSTRUCTION, pendingECList));
}
{code}

So on a large cluster, if there is a constantly large backlog of non-EC 
reconstruction tasks, EC reconstruction tasks never get a chance to run.
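
One possible shape of a fix, sketched with placeholder names (not the actual patch): reserve part of the per-heartbeat budget for EC tasks instead of letting replication tasks consume it all.

{code}
// Illustrative only: split the per-heartbeat budget so EC reconstruction
// cannot be starved by a constant stream of regular replication tasks.
// Returns {replicationShare, ecShare}.
static int[] splitTransferBudget(int maxTransfers, int pendingRepl, int pendingEC) {
  int ecShare = Math.min(pendingEC, Math.max(1, maxTransfers / 2));
  int replShare = Math.min(pendingRepl, maxTransfers - ecShare);
  // hand any unused replication budget back to the EC queue
  ecShare = Math.min(pendingEC, maxTransfers - replShare);
  return new int[] { replShare, ecShare };
}
{code}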



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12065) Fix log format in StripedBlockReconstructor

2017-06-28 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12065:


 Summary: Fix log format in StripedBlockReconstructor
 Key: HDFS-12065
 URL: https://issues.apache.org/jira/browse/HDFS-12065
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Trivial


The {{LOG}} call uses the wrong signature in {{StripedBlockReconstructor}}, 
which results in the following message being logged without the stack trace:

{code}
Failed to reconstruct striped block: 
BP-1026491657-172.31.114.203-1498498077419:blk_-9223372036854759232_5065
java.lang.NullPointerException
{code}
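
For reference, this is the classic SLF4J pitfall that matches the symptom (a hedged illustration, not necessarily the exact HDFS code): when the Throwable is consumed by a {} placeholder, only its toString() is printed; passing it as an extra last argument logs the full stack trace.

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogSignatureDemo {
  private static final Logger LOG = LoggerFactory.getLogger(LogSignatureDemo.class);

  public static void main(String[] args) {
    Exception e = new NullPointerException();
    // Wrong: the exception fills the {} placeholder, so no stack trace is logged.
    LOG.warn("Failed to reconstruct striped block: {}", e);
    // Right: the Throwable is the extra last argument, so the full stack is logged.
    LOG.warn("Failed to reconstruct striped block: {}", "blk_-9223372036854759232_5065", e);
  }
}
{code}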



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12044) Mismatch between BlockManager#maxReplicationStreams and ErasureCodingWorker.stripedReconstructionPool pool size causes slow and burst recovery.

2017-06-26 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12044:


 Summary: Mismatch between BlockManager#maxReplicationStreams and 
ErasureCodingWorker.stripedReconstructionPool pool size causes slow and burst 
recovery. 
 Key: HDFS-12044
 URL: https://issues.apache.org/jira/browse/HDFS-12044
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu


{{ErasureCodingWorker#stripedReconstructionPool}} uses {{corePoolSize=2}} and 
{{maxPoolSize=8}} by default, and it rejects additional tasks if its queue is 
full.

When {{BlockManager#maxReplicationStreams}} is larger than the 
{{corePoolSize}}/{{maxPoolSize}} of {{ErasureCodingWorker#stripedReconstructionPool}} 
(for example, {{maxReplicationStreams=20}} with {{corePoolSize=2, maxPoolSize=8}}), 
the NN sends up to {{maxTransfer}} reconstruction tasks to the DN for each 
heartbeat, which is calculated in {{FSNamesystem}}:

{code}
final int maxTransfer = blockManager.getMaxReplicationStreams() - 
xmitsInProgress;
{code}

However, at any given time, {{ErasureCodingWorker#stripedReconstructionPool}} 
only accounts for 2 {{xmitsInProgress}}. So for each heartbeat (every 3s), the 
NN will send about {{20 - 2 = 18}} reconstruction tasks to the DN, and the DN 
throws most of them away if there are already 8 tasks in the queue. The NN then 
needs longer to re-discover that these blocks are under-replicated and schedule 
new tasks.
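
To make the mismatch concrete, here is a minimal sketch of such a pool (illustrative values, not the actual {{ErasureCodingWorker}} code):

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A pool with corePoolSize=2, maxPoolSize=8 and a small bounded queue will
// reject most of the ~18 tasks a heartbeat can deliver when
// maxReplicationStreams=20, wasting the NameNode's scheduling work.
ThreadPoolExecutor stripedReconstructionPool = new ThreadPoolExecutor(
    2, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>(8));
{code}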






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-12033) DatanodeManager picking EC recovery tasks should also consider the number of regular replication tasks.

2017-06-23 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-12033:


 Summary: DatanodeManager picking EC recovery tasks should also 
consider the number of regular replication tasks.
 Key: HDFS-12033
 URL: https://issues.apache.org/jira/browse/HDFS-12033
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


In {{DatanodeManager#handleHeartbeat}}, it takes up to {{maxTransfers}} items 
from the pending replication list and up to {{maxTransfers}} items from the 
pending EC list.

It should send at most {{maxTransfers}} tasks combined to the DN.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-11987) DistributedFileSystem#create and append do not honor CreateFlag.CREATE|APPEND

2017-06-16 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-11987:


 Summary: DistributedFileSystem#create and append do not honor 
CreateFlag.CREATE|APPEND
 Key: HDFS-11987
 URL: https://issues.apache.org/jira/browse/HDFS-11987
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.0.0-alpha3, 2.8.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


{{DistributedFileSystem#create()}} and {{DistributedFileSystem#append()}} do 
not honor the expected behavior of {{CreateFlag.CREATE|APPEND}}.
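
Expected semantics: with {{CREATE|APPEND}} the call should create the file if it does not exist, and append to it otherwise. A minimal caller-side sketch (path, replication and block size are illustrative):

{code}
import java.io.IOException;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public static void createOrAppend() throws IOException {
  FileSystem fs = FileSystem.get(new Configuration());
  // Create the file when absent, append when present.
  try (FSDataOutputStream out = fs.create(new Path("/tmp/create-append-demo"),
      FsPermission.getFileDefault(),
      EnumSet.of(CreateFlag.CREATE, CreateFlag.APPEND),
      4096, (short) 3, 128L * 1024 * 1024, null)) {
    out.writeBytes("appended or created\n");
  }
}
{code}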





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-11975) Provide a system-default EC policy

2017-06-14 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-11975:


 Summary: Provide a system-default EC policy
 Key: HDFS-11975
 URL: https://issues.apache.org/jira/browse/HDFS-11975
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: erasure-coding
Affects Versions: 3.0.0-alpha3
Reporter: Lei (Eddy) Xu
Assignee: SammiChen


From the usability point of view, it'd be nice to be able to specify a 
system-wide EC policy, i.e., in {{hdfs-site.xml}}. For most users / admins / 
downstream projects, it is not necessary to know the tradeoffs of the EC 
policy, considering that it requires knowledge of EC, the actual physical 
topology of the clusters, and many other factors (e.g., network, cluster size, 
etc.).

 





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-11651) Add a public API for specifying an EC policy at create time

2017-06-05 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-11651.
--
Resolution: Duplicate

This can be addressed in HADOOP-14394, so I am closing this JIRA as a duplicate.

> Add a public API for specifying an EC policy at create time
> ---
>
> Key: HDFS-11651
> URL: https://issues.apache.org/jira/browse/HDFS-11651
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.0.0-alpha4
>Reporter: Andrew Wang
>  Labels: hdfs-ec-3.0-nice-to-have
>
> Follow-on work from HDFS-10996. We extended the create builder, but it still 
> requires casting to DistributedFileSystem to use, thus is not a public API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-11659) TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten fail due to no

2017-04-17 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-11659:


 Summary: TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten 
fail due to no 
 Key: HDFS-11659
 URL: https://issues.apache.org/jira/browse/HDFS-11659
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0-alpha2, 2.7.3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


The test fails after the following error messages:

{code}
java.io.IOException: Failed to replace a bad datanode on the existing pipeline 
due to no more good datanodes being available to try. (Nodes: 
current=[DatanodeInfoWithStorage[127.0.0.1:57377,DS-b4ec61fc-657c-4e2a-9dc3-8d93b7769a2b,DISK],
 
DatanodeInfoWithStorage[127.0.0.1:47448,DS-18bca8d7-048d-4d7f-9594-d2df16096a3d,DISK]],
 
original=[DatanodeInfoWithStorage[127.0.0.1:57377,DS-b4ec61fc-657c-4e2a-9dc3-8d93b7769a2b,DISK],
 
DatanodeInfoWithStorage[127.0.0.1:47448,DS-18bca8d7-048d-4d7f-9594-d2df16096a3d,DISK]]).
 The current failed datanode replacement policy is DEFAULT, and a client may 
configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' 
in its configuration.
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1280)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1354)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1512)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1236)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:721)
{code}

In this case, the DataNode that has been removed cannot be used in the pipeline 
recovery.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-10638) Modifications to remove the assumption that StorageLocation is associated with java.io.File in Datanode.

2016-10-25 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-10638.
--
   Resolution: Fixed
Fix Version/s: 3.0.0-alpha2

+1. Thanks for the good work!

I re-ran the failed test and it passes on my laptop, so I committed the patch to 
trunk.

> Modifications to remove the assumption that StorageLocation is associated 
> with java.io.File in Datanode.
> 
>
> Key: HDFS-10638
> URL: https://issues.apache.org/jira/browse/HDFS-10638
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, fs
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
> Fix For: 3.0.0-alpha2
>
> Attachments: HDFS-10638.001.patch, HDFS-10638.002.patch, 
> HDFS-10638.003.patch, HDFS-10638.004.patch, HDFS-10638.005.patch
>
>
> Changes to ensure that {{StorageLocation}} need not be associated with a 
> {{java.io.File}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-10960) TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWritten fails at disk error verification after volume remove

2016-10-18 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-10960.
--
   Resolution: Fixed
Fix Version/s: 2.9.0
   2.8.0

Reworked and committed the 01 patch to branch-2 and branch-2.8.

Thanks [~kihwal] and [~manojg] for working closely on this patch.



> TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWritten fails at disk error 
> verification after volume remove
> 
>
> Key: HDFS-10960
> URL: https://issues.apache.org/jira/browse/HDFS-10960
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.0.0-alpha2
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Minor
> Fix For: 2.8.0, 2.9.0, 3.0.0-alpha2
>
> Attachments: HDFS-10960.01.patch, HDFS-10960.02.patch
>
>
> TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWritten fails occasionally in 
> the following verification.
> {code}
>   700 // If an IOException thrown from BlockReceiver#run, it triggers
>   701 // DataNode#checkDiskError(). So we can test whether 
> checkDiskError() is called,
>   702 // to see whether there is IOException in BlockReceiver#run().
>   703 assertEquals(lastTimeDiskErrorCheck, dn.getLastDiskErrorCheck());
>   704 
> {code}
> {noformat}
> Error Message
> expected:<0> but was:<6498109>
> Stacktrace
> java.lang.AssertionError: expected:<0> but was:<6498109>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWrittenForDatanode(TestDataNodeHotSwapVolumes.java:703)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten(TestDataNodeHotSwapVolumes.java:620)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10911) Change edit log OP_UPDATE_BLOCKS to store delta blocks only.

2016-09-27 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10911:


 Summary: Change edit log OP_UPDATE_BLOCKS to store delta blocks 
only.
 Key: HDFS-10911
 URL: https://issues.apache.org/jira/browse/HDFS-10911
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.0.0-alpha1, 2.7.3
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


Every time an HDFS client {{close}}s or {{hflush}}es an open file, the NameNode 
enumerates all of its blocks and stores them into the edit log (OP_UPDATE_BLOCKS).

It causes problems when a client appends to a large file frequently (e.g., a WAL).

Because HDFS is append-only, we can store only the blocks that have been 
changed (delta blocks) in the edit log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10734) Rename "dfs.ha.tail-edits.rolledits.timeout" to "dfs.ha.log-roll.execution.timeout"

2016-08-08 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10734:


 Summary: Rename "dfs.ha.tail-edits.rolledits.timeout" to 
"dfs.ha.log-roll.execution.timeout"
 Key: HDFS-10734
 URL: https://issues.apache.org/jira/browse/HDFS-10734
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 2.9.0, 3.0.0-alpha2
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


HDFS-4176 introduced {{dfs.ha.tail-edits.rolledits.timeout}}. 
[~Surendra Singh Lilhore] kindly suggested renaming it to 
{{dfs.ha.log-roll.execution.timeout}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-4176) EditLogTailer should call rollEdits with a timeout

2016-07-26 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu reopened HDFS-4176:
-

It was not related to HDFS-9659.

> EditLogTailer should call rollEdits with a timeout
> --
>
> Key: HDFS-4176
> URL: https://issues.apache.org/jira/browse/HDFS-4176
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 2.0.2-alpha, 3.0.0-alpha1
>Reporter: Todd Lipcon
>Assignee: Lei (Eddy) Xu
> Fix For: 3.0.0-alpha1
>
> Attachments: namenode.jstack4
>
>
> When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it 
> currently does so without a timeout. So, if the active NN has frozen (but not 
> actually crashed), this call can hang forever. This can then potentially 
> prevent the standby from becoming active.
> This may actually considered a side effect of HADOOP-6762 -- if the RPC were 
> interruptible, that would also fix the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-4176) EditLogTailer should call rollEdits with a timeout

2016-07-26 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-4176.
-
Resolution: Duplicate

HDFS-9659 adds a timeout to RPC calls, thus fixing this issue.

> EditLogTailer should call rollEdits with a timeout
> --
>
> Key: HDFS-4176
> URL: https://issues.apache.org/jira/browse/HDFS-4176
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Affects Versions: 2.0.2-alpha, 3.0.0-alpha1
>Reporter: Todd Lipcon
>Assignee: Lei (Eddy) Xu
> Attachments: namenode.jstack4
>
>
> When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it 
> currently does so without a timeout. So, if the active NN has frozen (but not 
> actually crashed), this call can hang forever. This can then potentially 
> prevent the standby from becoming active.
> This may actually considered a side effect of HADOOP-6762 -- if the RPC were 
> interruptible, that would also fix the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-9310) TestDataNodeHotSwapVolumes fails occasionally

2016-07-07 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-9310.
-
   Resolution: Duplicate
 Assignee: Lei (Eddy) Xu
Fix Version/s: 2.8.0

It was fixed in HDFS-9137.

> TestDataNodeHotSwapVolumes fails occasionally
> -
>
> Key: HDFS-9310
> URL: https://issues.apache.org/jira/browse/HDFS-9310
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Arpit Agarwal
>Assignee: Lei (Eddy) Xu
> Fix For: 2.8.0
>
>
> TestDataNodeHotSwapVolumes fails occasionally in Jenkins and locally. e.g. 
> https://builds.apache.org/job/PreCommit-HDFS-Build/13197/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testRemoveVolumeBeingWritten/
> *Error Message*
> Timed out waiting for /test to reach 3 replicas
> *Stacktrace*
> java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 
> replicas
>   at 
> org.apache.hadoop.hdfs.DFSTestUtil.waitReplication(DFSTestUtil.java:768)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWrittenForDatanode(TestDataNodeHotSwapVolumes.java:644)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten(TestDataNodeHotSwapVolumes.java:569)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10600) PlanCommand#getThrsholdPercentage should not use throughput value.

2016-07-05 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10600:


 Summary: PlanCommand#getThrsholdPercentage should not use 
throughput value.
 Key: HDFS-10600
 URL: https://issues.apache.org/jira/browse/HDFS-10600
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: diskbalancer
Affects Versions: 2.9.0, 3.0.0-beta1
Reporter: Lei (Eddy) Xu


In {{PlanCommand#getThresholdPercentage}}

{code}
 private double getThresholdPercentage(CommandLine cmd) {

if ((value <= 0.0) || (value > 100.0)) {
  value = getConf().getDouble(
  DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT,
  DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT_DEFAULT);
}
return value;
  }
{code}

{{DISK_THROUGHPUT}} has the unit of "MB", so it does not make sense to return 
{{throughput}} as a percentage value.

Btw, we should use {{THROUGHPUT}} instead of {{THRUPUT}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10598) DiskBalancer does not execute multi-steps plan.

2016-07-05 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10598:


 Summary: DiskBalancer does not execute multi-steps plan.
 Key: HDFS-10598
 URL: https://issues.apache.org/jira/browse/HDFS-10598
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: diskbalancer
Affects Versions: 2.8.0, 3.0.0-beta1
Reporter: Lei (Eddy) Xu
Priority: Critical


I set up a 3-DN cluster, each DN with 2 small disks.  After creating some 
files to fill HDFS, I added two more small disks to one DN and ran the 
diskbalancer on that DataNode.

The disk usage before running diskbalancer:

{code}
/dev/loop0  3.9G  2.1G  1.6G 58%  /mnt/data1
/dev/loop1  3.9G  2.6G  1.1G 71%  /mnt/data2
/dev/loop2  3.9G  17M  3.6G 1%  /mnt/data3
/dev/loop3  3.9G  17M  3.6G 1%  /mnt/data4
{code}

However, after running diskbalancer (i.e., {{-query}} shows {{PLAN_DONE}})

{code}
/dev/loop0  3.9G  1.2G  2.5G 32%  /mnt/data1
/dev/loop1  3.9G  2.6G  1.1G 71%  /mnt/data2
/dev/loop2  3.9G  953M  2.7G 26%  /mnt/data3
/dev/loop3  3.9G  17M  3.6G 1%   /mnt/data4
{code}

It is suspicious that in {{DiskBalancerMover#copyBlocks}}, every return path 
calls {{this.setExitFlag}}, which prevents {{copyBlocks()}} from being called 
multiple times from {{DiskBalancer#executePlan}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10567) Improve plan command help message

2016-06-22 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10567:


 Summary: Improve plan command help message
 Key: HDFS-10567
 URL: https://issues.apache.org/jira/browse/HDFS-10567
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Lei (Eddy) Xu


{code}
--bandwidth <arg>            Maximum disk bandwidth to be consumed by
                             diskBalancer. e.g. 10
--maxerror <arg>             Describes how many errors can be
                             tolerated while copying between a pair
                             of disks.
--out <arg>                  File to write output to, if not
                             specified defaults will be used.
--plan <arg>                 creates a plan for datanode.
--thresholdPercentage <arg>  Percentage skew that wetolerate before
                             diskbalancer starts working e.g. 10
--v                          Print out the summary of the plan on
                             console
{code}

We should 
* Put the unit into {{--bandwidth}}, or its help message. Is it an integer or a 
float / double number? This is not clear from the CLI message.
* Give more details about {{--plan}}. It is not clear what the {{<arg>}} is for.
* {{--thresholdPercentage}} has the typo {{wetolerate}} in its help message. 
It also needs to indicate that it is the difference in space utilization 
between two disks / volumes. Is it an integer or a float / double number?

Thanks.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10552) DiskBalancer "-query" results in NPE if no plan for the node

2016-06-20 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10552:


 Summary: DiskBalancer "-query" results in NPE if no plan for the 
node
 Key: HDFS-10552
 URL: https://issues.apache.org/jira/browse/HDFS-10552
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: HDFS-1312
Reporter: Lei (Eddy) Xu
Priority: Critical


{code}
16/06/20 11:50:16 INFO command.Command: Executing "query plan" command.
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$QueryPlanStatusResponseProto$Builder.setPlanID(ClientDatanodeProtocolProtos.java:12782)
at 
org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.queryDiskBalancerPlan(ClientDatanodeProtocolServerSideTranslatorPB.java:340)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17513)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10551) o.a.h.h.s.diskbalancer.command.Command does not actually verify options as expected.

2016-06-20 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10551:


 Summary: o.a.h.h.s.diskbalancer.command.Command does not actually 
verify options as expected.
 Key: HDFS-10551
 URL: https://issues.apache.org/jira/browse/HDFS-10551
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Lei (Eddy) Xu
Priority: Critical


In {{diskbalancer.command.Command#verifyCommandOptions}}, the following code 
does not do what it is expected to do:

{code}
if (!validArgs.containsKey(opt.getArgName())) {
{code}

opt.getArgName() always returns "arg" instead of, e.g., {{report}} or {{uri}}, 
which is the expected parameter to check.

It should use {{opt.getLongOpt()}} to get the option names. It happens to pass 
on the branch because {{opt.getArgName()}} always returns {{"arg"}}, which is 
accidentally in {{validArgs}}. However, I don't think that is the intention of 
this function.
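
For reference, a small commons-cli illustration of the difference (the option definition below is hypothetical):

{code}
import org.apache.commons.cli.Option;

// getArgName() is only the display name of the option's argument ("arg" here,
// per the behavior observed above), while getLongOpt() is the real option
// name that verifyCommandOptions() should be validating.
Option report = new Option(null, "report", true, "report node information");
System.out.println(report.getArgName());  // "arg"
System.out.println(report.getLongOpt());  // "report"
{code}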

Additionally, in the following code

{code}
validArguments.append("Valid arguments are : %n");
{code}

This {{%n}} is never expanded, because the string is not passed through a formatter.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10545) PlanCommand should use -fs instead of -uri to be consistent with other hdfs commands

2016-06-17 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10545:


 Summary: PlanCommand should use -fs instead of -uri to be 
consistent with other hdfs commands
 Key: HDFS-10545
 URL: https://issues.apache.org/jira/browse/HDFS-10545
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: HDFS-1312
Reporter: Lei (Eddy) Xu
Priority: Minor


PlanCommand currently uses {{-uri}} to specify NameNode, while in all other 
hdfs commands (i.e., {{hdfs dfsadmin}} and {{hdfs balancer}})) they use {{-fs}} 
to specify NameNode.

It'd be better to use {{-fs}} here.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10541) When no actions in plan, error message says "Plan was generated more than 24 hours ago"

2016-06-17 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10541:


 Summary: When no actions in plan, error message says "Plan was 
generated more than 24 hours ago"
 Key: HDFS-10541
 URL: https://issues.apache.org/jira/browse/HDFS-10541
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: HDFS-1312
Reporter: Lei (Eddy) Xu
Priority: Minor


The message is misleading. Instead, it should make it clear that there are no 
steps (or no action) to take in this plan - and should probably not error out.

{code}
16/06/16 14:56:53 INFO command.Command: Executing "execute plan" command
Plan was generated more than 24 hours ago.
at 
org.apache.hadoop.hdfs.server.datanode.DiskBalancer.verifyTimeStamp(DiskBalancer.java:387)
at 
org.apache.hadoop.hdfs.server.datanode.DiskBalancer.verifyPlan(DiskBalancer.java:315)
at 
org.apache.hadoop.hdfs.server.datanode.DiskBalancer.submitPlan(DiskBalancer.java:173)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.submitDiskBalancerPlan(DataNode.java:3059)
at 
org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.submitDiskBalancerPlan(ClientDatanodeProtocolServerSideTranslatorPB.java:299)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17509)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
{code}

This happens when a plan looks like the following one:
{code}
{"volumeSetPlans":[],"nodeName":"a.b.c","nodeUUID":null,"port":20001,"timeStamp":0}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10540) The CLI error message for disk balancer is not enabled is not clear.

2016-06-16 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10540:


 Summary: The CLI error message for disk balancer is not enabled is 
not clear.
 Key: HDFS-10540
 URL: https://issues.apache.org/jira/browse/HDFS-10540
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: HDFS-1312
Reporter: Lei (Eddy) Xu


When running the {{hdfs diskbalancer}} against a DN whose disk balancer feature 
is not enabled, it reports:

{code}
$ hdfs diskbalancer -plan 127.0.0.1 -uri hdfs://localhost
16/06/16 18:03:29 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Internal error, Unable to create JSON string.
at 
org.apache.hadoop.hdfs.server.datanode.DiskBalancer.getVolumeNames(DiskBalancer.java:260)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.getDiskBalancerSetting(DataNode.java:3105)
at 
org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.getDiskBalancerSetting(ClientDatanodeProtocolServerSideTranslatorPB.java:359)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17515)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
Caused by: org.apache.hadoop.hdfs.server.diskbalancer.DiskBalancerException: 
Disk Balancer is not enabled.
at 
org.apache.hadoop.hdfs.server.datanode.DiskBalancer.checkDiskBalancerEnabled(DiskBalancer.java:293)
at 
org.apache.hadoop.hdfs.server.datanode.DiskBalancer.getVolumeNames(DiskBalancer.java:251)
... 11 more
{code}


We should not surface the raw IOException to the user; the CLI should 
explicitly explain why the operation failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10496) ExecuteCommand checks planFile in a wrong way

2016-06-07 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10496:


 Summary: ExecuteCommand checks planFile in a wrong way
 Key: HDFS-10496
 URL: https://issues.apache.org/jira/browse/HDFS-10496
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode
Affects Versions: HDFS-1312
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Critical


In {{ExecuteCommand#execute}}, it checks the plan file as 

{code}
 Preconditions.checkArgument(planFile == null || planFile.isEmpty(),
"Invalid plan file specified.");
{code}

The condition is inverted, so it stops the execution even when a correct 
planFile argument is specified.
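
The intended check is presumably the negation, along these lines (a sketch of the obvious fix, not the committed patch):

{code}
import com.google.common.base.Preconditions;

// Fail when the plan file is missing or empty, not when it is present.
Preconditions.checkArgument(planFile != null && !planFile.isEmpty(),
    "Invalid plan file specified.");
{code}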



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-10225) DataNode hot swap drives should recognize storage type tags.

2016-03-28 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-10225:


 Summary: DataNode hot swap drives should recognize storage type 
tags. 
 Key: HDFS-10225
 URL: https://issues.apache.org/jira/browse/HDFS-10225
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.7.2
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


The current hot swap code only differentiates data dirs by their paths. People 
might want to change the storage types of certain data dirs from the default 
value in an existing cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-9124) NullPointerException when underreplicated blocks are there

2016-02-12 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-9124.
-
   Resolution: Duplicate
Fix Version/s: 2.7.4

It was fixed in HDFS-9574.

> NullPointerException when underreplicated blocks are there
> --
>
> Key: HDFS-9124
> URL: https://issues.apache.org/jira/browse/HDFS-9124
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Syed Akram
>Assignee: Syed Akram
> Fix For: 2.7.4
>
>
> 2015-09-22 09:48:47,830 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: dn1:50010:DataXceiver error 
> processing WRITE_BLOCK operation  src: /dn1:42973 dst: /dn2:50010
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:186)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:677)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9795) OIV Delimited should show which files are ACL-enabled.

2016-02-11 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9795:
---

 Summary: OIV Delimited should show which files are ACL-enabled.
 Key: HDFS-9795
 URL: https://issues.apache.org/jira/browse/HDFS-9795
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: tools
Affects Versions: 2.7.2
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Trivial


In {{hdfs oiv}} delimited output, there is no easy way to see whether a file 
has ACLs. 

{{FsShell}} shows a {{+}} in the permission.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9715) Check storage ID uniqueness on datanode startup

2016-01-27 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9715:
---

 Summary: Check storage ID uniqueness on datanode startup
 Key: HDFS-9715
 URL: https://issues.apache.org/jira/browse/HDFS-9715
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.7.2
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


We should check storage ID uniqueness on datanode startup. If someone has 
manually edited the storage ID files, or has duplicated a directory (or 
re-added an old disk), they could end up with a duplicate storage ID and not 
realize it. 

The HDFS-7575 fix does generate a storage UUID for each storage, but it does 
not check the uniqueness of these UUIDs.
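
A minimal sketch of the startup check (placeholder method, not the actual DataNode code):

{code}
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Fail fast if two configured storage directories report the same storage ID.
static void checkStorageIdUniqueness(Collection<String> storageIds) {
  Set<String> seen = new HashSet<>();
  for (String id : storageIds) {
    if (!seen.add(id)) {
      throw new IllegalStateException("Duplicate storage ID found: " + id);
    }
  }
}
{code}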



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8860) Remove unused Replica copyOnWrite code

2015-12-14 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-8860.
-
Resolution: Fixed

I had a discussion with [~cmccabe] offline and learned that 
{{ReplicaInfo#unlinkBlock}} was designed for the append workload before 
HDFS-1700. It was not designed to remove the hardlinks created by {{DN}} upgrade.

Since the code that created hardlinks when appending a file is gone, the patch 
is still valid as dead-code removal.



> Remove unused Replica copyOnWrite code
> --
>
> Key: HDFS-8860
> URL: https://issues.apache.org/jira/browse/HDFS-8860
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.8.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Fix For: 2.8.0
>
> Attachments: HDFS-8860.0.patch
>
>
> {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, 
> because {{isUnlinked()}} always returns true.
> {code}
> if (isUnlinked()) {
>   return false;
> }
> {code}
> Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and 
> {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink 
> Lets remove the relevant code to eliminate the confusions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-8860) Remove unused Replica copyOnWrite code

2015-12-14 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu reopened HDFS-8860:
-

> Remove unused Replica copyOnWrite code
> --
>
> Key: HDFS-8860
> URL: https://issues.apache.org/jira/browse/HDFS-8860
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.8.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Fix For: 2.8.0
>
> Attachments: HDFS-8860.0.patch
>
>
> {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, 
> because {{isUnlinked()}} always returns true.
> {code}
> if (isUnlinked()) {
>   return false;
> }
> {code}
> Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and 
> {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink 
> Lets remove the relevant code to eliminate the confusions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8860) Remove unused Replica copyOnWrite code

2015-12-10 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-8860.
-
Resolution: Invalid

Revert the change.  {{FinalizedReplica}} can return a valid {{unlinked}} value.

> Remove unused Replica copyOnWrite code
> --
>
> Key: HDFS-8860
> URL: https://issues.apache.org/jira/browse/HDFS-8860
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.8.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Fix For: 2.8.0
>
> Attachments: HDFS-8860.0.patch
>
>
> {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, 
> because {{isUnlinked()}} always returns true.
> {code}
> if (isUnlinked()) {
>   return false;
> }
> {code}
> Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and 
> {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink 
> Lets remove the relevant code to eliminate the confusions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-8860) Remove unused Replica copyOnWrite code

2015-12-10 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu reopened HDFS-8860:
-

> Remove unused Replica copyOnWrite code
> --
>
> Key: HDFS-8860
> URL: https://issues.apache.org/jira/browse/HDFS-8860
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.8.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Fix For: 2.8.0
>
> Attachments: HDFS-8860.0.patch
>
>
> {{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, 
> because {{isUnlinked()}} always returns true.
> {code}
> if (isUnlinked()) {
>   return false;
> }
> {code}
> Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and 
> {{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink 
> Lets remove the relevant code to eliminate the confusions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9312) Fix TestReplication to be FsDataset-agnostic.

2015-10-26 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9312:
---

 Summary: Fix TestReplication to be FsDataset-agnostic.
 Key: HDFS-9312
 URL: https://issues.apache.org/jira/browse/HDFS-9312
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


{{TestReplication}} uses raw file system access to inject dummy replica files. 
This makes {{TestReplication}} incompatible with non-filesystem dataset 
implementations.

We can fix it by using existing {{FsDatasetTestUtils}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9292) Make TestFileCorruption independent to underlying FsDataset Implementation.

2015-10-22 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9292:
---

 Summary: Make TestFileCorruption independent to underlying 
FsDataset Implementation.
 Key: HDFS-9292
 URL: https://issues.apache.org/jira/browse/HDFS-9292
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


{{TestFileCorruption}} manipulates the block data by directly accessing the 
block files on disk.  {{MiniDFSCluster}} already offers ways to corrupt data. 
We can use that to make {{TestFileCorruption}} agnostic to the dataset 
implementation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9291) Fix TestInterDatanodeProtocol to be FsDataset-agnostic.

2015-10-22 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9291:
---

 Summary: Fix TestInterDatanodeProtocol to be FsDataset-agnostic.
 Key: HDFS-9291
 URL: https://issues.apache.org/jira/browse/HDFS-9291
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


{{TestInterDatanodeProtocol}} assumes the fsdataset is {{FsDatasetImpl}}. 

This JIRA will make it dataset agnostic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9281) Change TestDiskError to not explicitly use File to check block pool existence.

2015-10-21 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9281:
---

 Summary: Change TestDiskError to not explicitly use File to check 
block pool existence.
 Key: HDFS-9281
 URL: https://issues.apache.org/jira/browse/HDFS-9281
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


{{TestDiskError}} checks the existence of a block pool by checking that the 
directories of the file-based block pool exist. However, this does not apply 
to non-file-based fsdataset implementations. 

We can fix it by abstracting the checking logic behind {{FsDatasetTestUtils}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9267) TestDiskError should get stored replicas through FsDatasetTestUtils.

2015-10-19 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9267:
---

 Summary: TestDiskError should get stored replicas through 
FsDatasetTestUtils.
 Key: HDFS-9267
 URL: https://issues.apache.org/jira/browse/HDFS-9267
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


{{TestDiskError#testReplicationError}} scans local directories to verify blocks 
and metadata files, which leaks the details of the {{FsDataset}} implementation. 

This JIRA will abstract the "scanning" operation to {{FsDatasetTestUtils}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9252) Change TestFileTruncate to FsDatasetTestUtils to get block file size and genstamp.

2015-10-15 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9252:
---

 Summary: Change TestFileTruncate to FsDatasetTestUtils to get 
block file size and genstamp.
 Key: HDFS-9252
 URL: https://issues.apache.org/jira/browse/HDFS-9252
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


{{TestFileTruncate}} verifies block size and genstamp by directly accessing the 
 local filesystem, e.g.:

{code}
assertTrue(cluster.getBlockMetadataFile(dn0,
   newBlock.getBlock()).getName().endsWith(
   newBlock.getBlock().getGenerationStamp() + ".meta"));
{code}

Let's abstract the fsdataset-specific logic behind {{FsDatasetTestUtils}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9251) Refactor TestWriteToReplica and TestFsDatasetImpl to avoid explicitly creating Files in tests code.

2015-10-15 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9251:
---

 Summary: Refactor TestWriteToReplica and TestFsDatasetImpl to 
avoid explicitly creating Files in tests code.
 Key: HDFS-9251
 URL: https://issues.apache.org/jira/browse/HDFS-9251
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


In {{TestWriteToReplica}} and {{TestFsDatasetImpl}}, tests directly create 
block and metadata files:

{code}
replicaInfo.getBlockFile().createNewFile();
replicaInfo.getMetaFile().createNewFile();
{code}

It leaks the implementation details of {{FsDatasetImpl}}. This JIRA proposes to 
use {{FsDatasetImplTestUtils}} (HDFS-9188) to create replicas. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9188) Make block corruption related tests FsDataset-agnostic.

2015-10-01 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-9188:
---

 Summary: Make block corruption related tests FsDataset-agnostic. 
 Key: HDFS-9188
 URL: https://issues.apache.org/jira/browse/HDFS-9188
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


Currently, HDFS does block corruption tests by directly accessing the files 
stored in the storage directories, which assumes {{FsDatasetImpl}} is the 
dataset implementation. However, with work like Ozone (HDFS-7240) and 
HDFS-8679, there will be different FsDataset implementations. 

So we need a general way to run whitebox tests like corrupting blocks and crc 
files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8860) Remove Replica hardlink / unlink code

2015-08-05 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8860:
---

 Summary: Remove Replica hardlink / unlink code
 Key: HDFS-8860
 URL: https://issues.apache.org/jira/browse/HDFS-8860
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


{{ReplicaInfo#unlinkBlock()}} is effectively disabled by the following code, 
because {{isUnlinked()}} always returns true.

{code}
if (isUnlinked()) {
  return false;
}
{code}

Several test cases, e.g., {{TestFileAppend#testCopyOnWrite}} and 
{{TestDatanodeRestart#testRecoverReplicas}} are testing against the unlink 
Lets remove the relevant code to eliminate the confusions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8834) TestReplication#testReplicationWhenBlockCorruption is not valid after HDFS-6482

2015-07-28 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8834:
---

 Summary: TestReplication#testReplicationWhenBlockCorruption is not 
valid after HDFS-6482
 Key: HDFS-8834
 URL: https://issues.apache.org/jira/browse/HDFS-8834
 Project: Hadoop HDFS
  Issue Type: Test
  Components: datanode
Affects Versions: 2.7.1
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


{{TestReplication#testReplicationWhenBlockCorruption}} assumes the DN has one 
level of directories:
{code}
File[] listFiles = participatedNodeDirs.listFiles();
{code}

However, HDFS-6482 changed the layout of block directories to use two-level 
directories, which makes the following code invalid (it never runs).

{code}
for (File file : listFiles) {
if (file.getName().startsWith(Block.BLOCK_FILE_PREFIX)
&& !file.getName().endsWith("meta")) {
  blockFile = file.getName();
  for (File file1 : nonParticipatedNodeDirs) {
file1.mkdirs();
new File(file1, blockFile).createNewFile();
new File(file1, blockFile + "_1000.meta").createNewFile();
  }
  break;
}
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-6672) Regression with hdfs oiv tool

2015-07-08 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-6672.
-
   Resolution: Fixed
Fix Version/s: 2.7.0

Hi, [~cnauroth]. Thanks for bringing this up.

I have a few other {{oiv}} related JIRAs under umbrella JIRA HDFS-8061. I think 
we can close this JIRA for now.

> Regression with hdfs oiv tool
> -
>
> Key: HDFS-6672
> URL: https://issues.apache.org/jira/browse/HDFS-6672
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
>Priority: Minor
>  Labels: patch, regression, tools
> Fix For: 2.7.0
>
>
> Because the fsimage format changes from Writeable encoding to ProtocolBuffer, 
> a new {{OIV}} tool was written. However it lacks a few features existed in 
> the old {{OIV}} tool, such as a _Delimited_ processor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8712) Remove "public" and "abstract" modifiers in FsVolumeSpi and FsDatasetSpi

2015-07-02 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8712:
---

 Summary: Remove "public" and "abstract" modifiers in FsVolumeSpi 
and FsDatasetSpi
 Key: HDFS-8712
 URL: https://issues.apache.org/jira/browse/HDFS-8712
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Trivial


In [Java Language Specification 
9.4|http://docs.oracle.com/javase/specs/jls/se7/html/jls-9.html#jls-9.4]:

bq. It is permitted, but discouraged as a matter of style, to redundantly 
specify the public and/or abstract modifier for a method declared in an 
interface.

{{FsDatasetSpi}} and {{FsVolumeSpi}} mark methods as public, which causes many 
warnings in IDEs and {{checkstyle}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8617) Throttle DiskChecker#checkDirs() speed.

2015-06-17 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8617:
---

 Summary: Throttle DiskChecker#checkDirs() speed.
 Key: HDFS-8617
 URL: https://issues.apache.org/jira/browse/HDFS-8617
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS
Affects Versions: 2.7.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


As described in HDFS-8564,  {{DiskChecker.checkDirs(finalizedDir)}} is causing 
excessive I/Os because {{finalizedDirs}} might have up to 64K sub-directories 
(HDFS-6482).

This patch proposes to limit the rate of IO operations in 
{{DiskChecker.checkDirs()}}. 
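
A rough sketch of the throttling idea (the rate of 100 directory checks per second is purely illustrative, and {{checkDir}} is a placeholder for the existing per-directory check):

{code}
import java.io.File;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Pace the directory checks so a volume with up to 64K subdirectories does
// not generate an unbounded burst of disk I/O.
static void throttledCheckDirs(List<File> dirs) throws InterruptedException {
  final long minNanosPerCheck = TimeUnit.SECONDS.toNanos(1) / 100;
  for (File dir : dirs) {
    long start = System.nanoTime();
    checkDir(dir);  // placeholder for the real DiskChecker check
    long elapsed = System.nanoTime() - start;
    if (elapsed < minNanosPerCheck) {
      TimeUnit.NANOSECONDS.sleep(minNanosPerCheck - elapsed);
    }
  }
}
{code}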



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8582) Spurious failure messages when running datanode reconfiguration

2015-06-11 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8582:
---

 Summary: Spurious failure messages when running datanode 
reconfiguration
 Key: HDFS-8582
 URL: https://issues.apache.org/jira/browse/HDFS-8582
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS
Affects Versions: 2.7.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


When running a DN reconfig to hotswap some drives, it spits out this output:

{noformat}
$ hdfs dfsadmin -reconfig datanode localhost:9023 status
15/06/09 14:58:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Reconfiguring status for DataNode[localhost:9023]: started at Tue Jun 09 
14:57:37 PDT 2015 and finished at Tue Jun 09 14:57:56 PDT 2015.
FAILED: Change property 
rpc.engine.org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolPB
From: "org.apache.hadoop.ipc.ProtobufRpcEngine"
To: ""
Error: Property 
rpc.engine.org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolPB is not 
reconfigurable.
FAILED: Change property mapreduce.client.genericoptionsparser.used
From: "true"
To: ""
Error: Property mapreduce.client.genericoptionsparser.used is not 
reconfigurable.
FAILED: Change property rpc.engine.org.apache.hadoop.ipc.ProtocolMetaInfoPB
From: "org.apache.hadoop.ipc.ProtobufRpcEngine"
To: ""
Error: Property rpc.engine.org.apache.hadoop.ipc.ProtocolMetaInfoPB is 
not reconfigurable.
SUCCESS: Change property dfs.datanode.data.dir
From: "file:///data/1/user/dfs"
To: "file:///data/1/user/dfs,file:///data/2/user/dfs"
FAILED: Change property dfs.datanode.startup
From: "REGULAR"
To: ""
Error: Property dfs.datanode.startup is not reconfigurable.
FAILED: Change property 
rpc.engine.org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolPB
From: "org.apache.hadoop.ipc.ProtobufRpcEngine"
To: ""
Error: Property 
rpc.engine.org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolPB is not 
reconfigurable.
FAILED: Change property 
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB
From: "org.apache.hadoop.ipc.ProtobufRpcEngine"
To: ""
Error: Property 
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not 
reconfigurable.
FAILED: Change property 
rpc.engine.org.apache.hadoop.tracing.TraceAdminProtocolPB
From: "org.apache.hadoop.ipc.ProtobufRpcEngine"
To: ""
Error: Property 
rpc.engine.org.apache.hadoop.tracing.TraceAdminProtocolPB is not reconfigurable.
{noformat}

These failed messages are spurious and should not be shown.
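
One possible way to suppress them (the helper and its names are illustrative, 
not the actual fix) is to report only properties that both changed and are 
actually reconfigurable:

{code}
import java.util.Objects;
import java.util.Set;

class ReconfigStatusFilter {
  /** Report a property only if it changed and the DN can actually apply the change. */
  static boolean shouldReport(String key, String oldVal, String newVal,
                              Set<String> reconfigurableProperties) {
    return !Objects.equals(oldVal, newVal)
        && reconfigurableProperties.contains(key);
  }
}
{code}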



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8573) Move create restartMeta logic from BlockReceiver to ReplicaInPipeline

2015-06-10 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8573:
---

 Summary: Move create restartMeta logic from BlockReceiver to 
ReplicaInPipeline
 Key: HDFS-8573
 URL: https://issues.apache.org/jira/browse/HDFS-8573
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS
Affects Versions: 2.7.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


When a DN restarts quickly, a {{.restart}} file is created for the 
{{ReplicaInPipeline}}. This logic should not expose implementation details 
in {{BlockReceiver}}.
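
A hedged sketch of the proposed encapsulation (names are illustrative; the 
real classes carry more state):

{code}
import java.io.File;
import java.io.IOException;

class ReplicaInPipelineSketch {
  private final File blockFile;

  ReplicaInPipelineSketch(File blockFile) {
    this.blockFile = blockFile;
  }

  /** Create the ".restart" marker next to the block file, hiding the path logic here. */
  File createRestartMeta() throws IOException {
    File restartMeta = new File(blockFile.getParentFile(),
        blockFile.getName() + ".restart");
    if (!restartMeta.exists() && !restartMeta.createNewFile()) {
      throw new IOException("Failed to create restart meta " + restartMeta);
    }
    return restartMeta;
  }
}
{code}

With something like this, BlockReceiver would only ask the replica for its 
restart meta instead of constructing the file name itself.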



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8535) Clarify that dfs usage in dfsadmin -report output includes all block replicas.

2015-06-04 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8535:
---

 Summary: Clarify that dfs usage in dfsadmin -report output 
includes all block replicas.
 Key: HDFS-8535
 URL: https://issues.apache.org/jira/browse/HDFS-8535
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS
Affects Versions: 2.7.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


Some users get confused about this and think it is just the space used by the 
files, forgetting about the additional replicas that also take up space.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-8322) Display warning if hadoop fs -ls is showing the local filesystem

2015-05-26 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu reopened HDFS-8322:
-

Re-opening to put the warning behind an optional configuration.

> Display warning if hadoop fs -ls is showing the local filesystem
> 
>
> Key: HDFS-8322
> URL: https://issues.apache.org/jira/browse/HDFS-8322
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS
>Affects Versions: 2.7.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
>Priority: Minor
> Attachments: HDFS-8322.000.patch
>
>
> Using {{LocalFileSystem}} is rarely the intention of running {{hadoop fs 
> -ls}}.
> This JIRA proposes displaying a warning message if hadoop fs -ls is showing 
> the local filesystem or using default fs.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8322) Display warning if hadoop fs -ls is showing the local filesystem

2015-05-04 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8322:
---

 Summary: Display warning if hadoop fs -ls is showing the local 
filesystem
 Key: HDFS-8322
 URL: https://issues.apache.org/jira/browse/HDFS-8322
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS
Affects Versions: 2.7.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


Using {{LocalFileSystem}} is rarely the intention of running {{hadoop fs -ls}}.

This JIRA proposes displaying a warning message if hadoop fs -ls is showing the 
local filesystem or using default fs.  
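
A minimal sketch of the check (the warning text and method name are 
hypothetical):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

class LocalFsWarning {
  /** Warn when "hadoop fs -ls" resolves the path to the local filesystem. */
  static void maybeWarn(Path path, Configuration conf) throws IOException {
    FileSystem fs = path.getFileSystem(conf);
    if (fs instanceof LocalFileSystem) {
      System.err.println("Warning: listing the local filesystem;"
          + " fs.defaultFS may not be pointing at HDFS.");
    }
  }
}
{code}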




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8306) Generate ACL and Xattr outputs in OIV XML outputs

2015-04-30 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8306:
---

 Summary: Generate ACL and Xattr outputs in OIV XML outputs
 Key: HDFS-8306
 URL: https://issues.apache.org/jira/browse/HDFS-8306
 Project: Hadoop HDFS
  Issue Type: Sub-task
Affects Versions: 2.7.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


Currently, the {{hdfs oiv}} XML output does not include all fields of the 
fsimage. This makes inspecting an {{fsimage}} via its XML output less 
practical, and it also prevents recovering an fsimage from the XML file.

This JIRA is adding ACL and XAttrs in the XML outputs as the first step to 
achieve the goal described in HDFS-8061.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8051) FsVolumeList#addVolume should release volume reference if not put it into BlockScanner.

2015-04-02 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8051:
---

 Summary: FsVolumeList#addVolume should release volume reference if 
not put it into BlockScanner.
 Key: HDFS-8051
 URL: https://issues.apache.org/jira/browse/HDFS-8051
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


{{FsVolumeList#addVolume()}} passes {{FsVolumeReference}} to blockScanner:

{code}
 if (blockScanner != null) {
  blockScanner.addVolumeScanner(ref);
}
{code}

However, if {{blockScanner == null}}, the {{FsVolumeReference}} will not be 
released. 
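
A minimal sketch of the fix, assuming {{FsVolumeReference}} is {{Closeable}} 
(as it is on trunk, and with {{org.apache.hadoop.io.IOUtils}} already imported 
in this file): release the reference when no volume scanner takes ownership.

{code}
if (blockScanner != null) {
  blockScanner.addVolumeScanner(ref);
} else {
  // No scanner takes ownership, so release the reference here.
  IOUtils.cleanup(null, ref);
}
{code}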



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8006) Report removed storages after removing them by DataNode#checkDirs()

2015-03-27 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu resolved HDFS-8006.
-
Resolution: Invalid

{{DataNode#handleDiskError}} schedules block reports already.

> Report removed storages after removing them by DataNode#checkDirs()
> ---
>
> Key: HDFS-8006
> URL: https://issues.apache.org/jira/browse/HDFS-8006
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
>
> Similar to HDFS-7961,  after DN removes storages due to disk errors 
> (HDFS-7722), DN should send a full block report to NN to remove storages 
> (HDFS-7960)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8006) Report removed storages after removing them by DataNode#checkDirs()

2015-03-27 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-8006:
---

 Summary: Report removed storages after removing them by 
DataNode#checkDirs()
 Key: HDFS-8006
 URL: https://issues.apache.org/jira/browse/HDFS-8006
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


Similar to HDFS-7961,  after DN removes storages due to disk errors 
(HDFS-7722), DN should send a full block report to NN to remove storages 
(HDFS-7960)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7996) After swapping a volume, BlockReceiver reports ReplicaNotFoundException

2015-03-26 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-7996:
---

 Summary: After swapping a volume, BlockReceiver reports 
ReplicaNotFoundException
 Key: HDFS-7996
 URL: https://issues.apache.org/jira/browse/HDFS-7996
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Critical


When removing a disk from an actively writing DataNode, the BlockReceiver 
working on the disk throws {{ReplicaNotFoundException}} because the replicas 
have been removed from memory:

{code}
2015-03-26 08:02:43,154 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removed 
volume: /data/2/dfs/dn/current
2015-03-26 08:02:43,163 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Removing block level storage: 
/data/2/dfs/dn/current/BP-51301509-10.20.202.114-1427296597742
2015-03-26 08:02:43,163 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in BlockReceiver.run():
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append 
to a non-existent replica 
BP-51301509-10.20.202.114-1427296597742:blk_1073742979_2160
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaInfo(FsDatasetImpl.java:615)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1362)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.finalizeBlock(BlockReceiver.java:1281)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1241)
at java.lang.Thread.run(Thread.java:745)
{code}

{{FsVolumeList#removeVolume}} waits for all threads to release the 
{{FsVolumeReference}} on the volume being removed; however, 
{{PacketResponder#finalizeBlock()}} calls

{code}
private void finalizeBlock(long startTime) throws IOException {
  BlockReceiver.this.close();
  final long endTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime()
  : 0;
  block.setNumBytes(replicaInfo.getNumBytes());
  datanode.data.finalizeBlock(block);
{code}

The {{FsVolumeReference}} was released in {{BlockReceiver.this.close()}} before 
calling {{datanode.data.finalizeBlock(block)}}.
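
One hedged way to address this (an illustrative reordering, not necessarily 
the committed patch) is to finalize the block while the receiver, and thus the 
volume reference, is still open:

{code}
private void finalizeBlock(long startTime) throws IOException {
  final long endTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime() : 0;
  block.setNumBytes(replicaInfo.getNumBytes());
  datanode.data.finalizeBlock(block);  // volume reference is still held here
  BlockReceiver.this.close();          // release the reference only afterwards
  // ... client trace logging elided, as in the snippet above ...
}
{code}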



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7961) Trigger full block report after hot swapping disk

2015-03-19 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-7961:
---

 Summary: Trigger full block report after hot swapping disk
 Key: HDFS-7961
 URL: https://issues.apache.org/jira/browse/HDFS-7961
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
 Fix For: 3.0.0, 2.7.0


As discussed in HDFS-7960, the NN cannot remove the data storage metadata from 
its memory on its own.

The DN should trigger a full block report immediately after hot-swapping 
drives.
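
A hedged sketch of what the DN could do once the reconfiguration completes 
({{triggerBlockReport}} and {{BlockReportOptions}} exist in 2.7+; wiring them 
into the hot-swap path is the proposal here, not existing code):

{code}
import java.io.IOException;
import org.apache.hadoop.hdfs.client.BlockReportOptions;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

class HotSwapReport {
  /** Ask the DN to send a full (non-incremental) block report after a hot swap. */
  static void reportAfterHotSwap(DataNode dn) throws IOException {
    dn.triggerBlockReport(
        new BlockReportOptions.Factory().setIncremental(false).build());
  }
}
{code}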



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7960) NameNode should prune storages that are no longer existed on DataNode

2015-03-19 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-7960:
---

 Summary: NameNode should prune storages that are no longer existed 
on DataNode
 Key: HDFS-7960
 URL: https://issues.apache.org/jira/browse/HDFS-7960
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu


The NameNode should be able to remove storages that are no longer present on a 
DataNode, for example because drives were hot-swapped out or the DN restarted 
with fewer data dirs. Currently, once a storage is removed from a DataNode, 
the blocks on that storage are not reconciled as gone on the NameNode until 
the NameNode is restarted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7917) Use file to replace data dirs in test to simulate a disk failure.

2015-03-11 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-7917:
---

 Summary: Use file to replace data dirs in test to simulate a disk 
failure. 
 Key: HDFS-7917
 URL: https://issues.apache.org/jira/browse/HDFS-7917
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


Currently, in several tests, e.g., {{TestDataNodeVolumeFailureXXX}} and 
{{TestDataNodeHotSwapVolumes}}, we simulate a disk failure by setting a 
directory's executable permission to false. However, this raises the risk 
that, if the cleanup code is not executed, the directory cannot be easily 
removed by the Jenkins job.

Since in {{DiskChecker#checkDirAccess}}:

{code}
private static void checkDirAccess(File dir) throws DiskErrorException {
if (!dir.isDirectory()) {
  throw new DiskErrorException("Not a directory: "
   + dir.toString());
}

checkAccessByFileMethods(dir);
  }
{code}

We can replace the DN data directory with a file to achieve the same fault 
injection goal, while being safer to clean up in any circumstance. 
Additionally, as [~cnauroth] suggested: 

bq. That might even let us enable some of these tests that are skipped on 
Windows, because Windows allows access for the owner even after permissions 
have been stripped.
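
A hedged sketch of the replacement fault-injection helper (names are 
illustrative; assumes commons-io, which Hadoop already depends on):

{code}
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

class DiskFailureSimulator {
  /** Simulate a disk failure: the data dir becomes a regular file, so isDirectory() fails. */
  static void injectFailure(File dataDir) throws IOException {
    FileUtils.deleteDirectory(dataDir);
    if (!dataDir.createNewFile()) {
      throw new IOException("Could not create file at " + dataDir);
    }
  }

  /** Undo the injection so the test can clean up normally. */
  static void restore(File dataDir) throws IOException {
    if (!dataDir.delete() || !dataDir.mkdirs()) {
      throw new IOException("Could not restore directory " + dataDir);
    }
  }
}
{code}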





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7908) Use larger value for fs.s3a.connection.timeout and change the unit to seconds.

2015-03-09 Thread Lei (Eddy) Xu (JIRA)
Lei (Eddy) Xu created HDFS-7908:
---

 Summary: Use larger value for fs.s3a.connection.timeout and change 
the unit to seconds.
 Key: HDFS-7908
 URL: https://issues.apache.org/jira/browse/HDFS-7908
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Lei (Eddy) Xu
Priority: Minor


The default value of {{fs.s3a.connection.timeout}} is {{5}} milliseconds. 
It causes many {{SocketTimeoutException}}s when uploading large files using 
{{hadoop fs -put}}. 

Also, the units for {{fs.s3a.connection.timeout}} and 
{{fs.s3a.connection.establish.timeout}} are milliseconds. For S3 connections, 
sub-second timeout values are unnecessary, so I suggest changing the time unit 
to seconds to ease the sys admin's job.
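
For illustration, the keys can already be overridden via {{Configuration}} 
(values below are arbitrary examples in the current millisecond unit, not 
proposed defaults):

{code}
import org.apache.hadoop.conf.Configuration;

public class S3aTimeoutExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("fs.s3a.connection.timeout", 200000);           // milliseconds in 2.6
    conf.setInt("fs.s3a.connection.establish.timeout", 50000);  // milliseconds in 2.6
  }
}
{code}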



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

