[jira] [Commented] (HDFS-7866) Erasure coding: NameNode manages multiple erasure coding policies

2016-03-04 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179541#comment-15179541
 ] 

Rui Li commented on HDFS-7866:
--

Thanks Zhe for your comments.

1. We're using 11 bits to store the policy ID, which means it could exceed a 
byte (the replication factor is also a short); see the packing sketch below. 
What do you think?

2. This is to make the tests use the new policy so we can find any potential 
issues. If there are no further comments, I'll revert it to RS-6-3 in the next 
update. I'd also like to hear your opinion on randomly choosing a policy in 
the tests. You can refer to my previous 
[discussions|https://issues.apache.org/jira/browse/HDFS-7866?focusedCommentId=15171320&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15171320]
 with Kai.

3. I agree we can do this in a follow-on. It's essentially hacky because we 
allow the user to specify a replication factor when creating a file but then 
overwrite it to store the EC policy ID. Fixing this would require changing 
the API, which I think is unacceptable.
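
For illustration, here is a minimal sketch of the kind of bit packing point 1 
refers to; the layout and mask are assumptions for the example, not the actual 
patch:

{code}
// Sketch only: an 11-bit policy ID (0..2047) cannot fit in a byte, which is
// why the 16-bit replication short is under discussion. Layout is hypothetical.
public class PolicyIdPacking {
  private static final int POLICY_ID_BITS = 11;
  private static final int POLICY_ID_MASK = (1 << POLICY_ID_BITS) - 1; // 0x7FF

  static short packPolicyId(int policyId) {
    return (short) (policyId & POLICY_ID_MASK);
  }

  static int unpackPolicyId(short replicationField) {
    return replicationField & POLICY_ID_MASK;
  }

  public static void main(String[] args) {
    short stored = packPolicyId(1500);          // 1500 > 255: a byte overflows
    System.out.println(unpackPolicyId(stored)); // 1500
  }
}
{code}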

> Erasure coding: NameNode manages multiple erasure coding policies
> -
>
> Key: HDFS-7866
> URL: https://issues.apache.org/jira/browse/HDFS-7866
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Kai Zheng
>Assignee: Rui Li
> Attachments: HDFS-7866-v1.patch, HDFS-7866-v2.patch, 
> HDFS-7866-v3.patch, HDFS-7866.10.patch, HDFS-7866.11.patch, 
> HDFS-7866.4.patch, HDFS-7866.5.patch, HDFS-7866.6.patch, HDFS-7866.7.patch, 
> HDFS-7866.8.patch, HDFS-7866.9.patch
>
>
> This is to extend the NameNode to load, list and sync predefined EC schemas 
> in an authorized and controlled approach. The provided facilities will be 
> used to implement DFSAdmin commands so an admin can list available EC 
> schemas and choose some of them for target EC zones.





[jira] [Created] (HDFS-9903) File could be created, but still not found when path is sort of Unicode

2016-03-04 Thread Alexander Shorin (JIRA)
Alexander Shorin created HDFS-9903:
--

 Summary: File could be created, but still not found when path is 
sort of Unicode
 Key: HDFS-9903
 URL: https://issues.apache.org/jira/browse/HDFS-9903
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: webhdfs
Affects Versions: 2.6.0
 Environment: {code}
>>> import requests
>>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=MKDIRS')

>>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=GETFILESTATUS')

>>> resp = 
>>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=CREATE',
>>>  allow_redirects=False)
>>> resp

>>> loc = resp.headers['location']
>>> resp = requests.put(loc, data='bug')
>>> resp

>>> resp = 
>>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=GETFILESTATUS')
>>> resp.content
'{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File
 does not exist: /tmp/bug/\xe1\xbf\xb9/test.txt"}}'
>>> resp = requests.put(loc, data='bug')
>>> resp

>>> resp.content
'{"RemoteException":{"exception":"FileAlreadyExistsException","javaClassName":"org.apache.hadoop.fs.FileAlreadyExistsException","message":"/tmp/bug/?/test.txt
 for client 127.0.0.1 already exists\\n\\tat 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2782)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)\\n\\tat
 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)\\n\\tat
 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\\n\\tat
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)\\n\\tat
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)\\n\\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)\\n\\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)\\n\\tat 
java.security.AccessController.doPrivileged(Native Method)\\n\\tat 
javax.security.auth.Subject.doAs(Subject.java:415)\\n\\tat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)\\n\\tat
 org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)\\n"}}'
{code}

Things to notice:
1. Although we receive HTTP 201 Created on file creation, the file cannot be 
found via the API, even though it physically exists and really was created.
2. The GETFILESTATUS response for the file is, again, not the best JSON, but 
Python can parse the UTF-8 bytes in it.
3. A second attempt to create the file at the same location mangles the 
Unicode path in the error message (see the encoding check below).
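
For what it's worth, the percent-encoded segment round-trips cleanly under 
UTF-8, so a charset mismatch somewhere in the server path is one plausible 
explanation; a quick standalone check (an illustrative sketch, not part of the 
original report):

{code}
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodingCheck {
  public static void main(String[] args) throws Exception {
    String encoded = "%E1%BF%B9";  // the path segment from the session above
    String decoded = URLDecoder.decode(encoded, "UTF-8");
    // One code point: U+1FF9 (GREEK CAPITAL LETTER OMICRON WITH OXIA).
    System.out.println(Integer.toHexString(decoded.codePointAt(0))); // 1ff9
    // A UTF-8 round-trip reproduces the original encoding.
    System.out.println(URLEncoder.encode(decoded, "UTF-8"));         // %E1%BF%B9
    // Decoding the same bytes as ISO-8859-1 instead yields 3 mojibake chars,
    // the kind of mismatch that could make the created file unfindable.
    System.out.println(URLDecoder.decode(encoded, "ISO-8859-1").length()); // 3
  }
}
{code}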

Reporter: Alexander Shorin








[jira] [Updated] (HDFS-9903) File could be created, but still not found when path is sort of Unicode

2016-03-04 Thread Alexander Shorin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Shorin updated HDFS-9903:
---
Description: 
{code}
>>> import requests
>>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=MKDIRS')

>>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=GETFILESTATUS')

>>> resp = 
>>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=CREATE',
>>>  allow_redirects=False)
>>> resp

>>> loc = resp.headers['location']
>>> resp = requests.put(loc, data='bug')
>>> resp

>>> resp = 
>>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=GETFILESTATUS')
>>> resp.content
'{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File
 does not exist: /tmp/bug/\xe1\xbf\xb9/test.txt"}}'
>>> resp = requests.put(loc, data='bug')
>>> resp

>>> resp.content
'{"RemoteException":{"exception":"FileAlreadyExistsException","javaClassName":"org.apache.hadoop.fs.FileAlreadyExistsException","message":"/tmp/bug/?/test.txt
 for client 127.0.0.1 already exists\\n\\tat 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2782)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)\\n\\tat
 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)\\n\\tat
 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\\n\\tat
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)\\n\\tat
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)\\n\\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)\\n\\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)\\n\\tat 
java.security.AccessController.doPrivileged(Native Method)\\n\\tat 
javax.security.auth.Subject.doAs(Subject.java:415)\\n\\tat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)\\n\\tat
 org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)\\n"}}'
{code}

Things to notice:
1. Although we receive HTTP 201 Created on file creation, the file cannot be 
found via the API, even though it physically exists and really was created.
2. The GETFILESTATUS response for the file is, again, not the best JSON, but 
Python can parse the UTF-8 bytes in it.
3. A second attempt to create the file at the same location mangles the 
Unicode path in the error message.


> File could be created, but still not found when path is sort of Unicode
> ---
>
> Key: HDFS-9903
> URL: https://issues.apache.org/jira/browse/HDFS-9903
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.6.0
> Environment: {code}
> >>> import requests
> >>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=MKDIRS')
> 
> >>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=GETFILESTATUS')
> 
> >>> resp = 
> >>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=CREATE',
> >>>  allow_redirects=False)
> >>> resp
> 
> >>> loc = resp.headers['location']
> >>> resp = requests.put(loc, data='bug')
> >>> resp
> 
> >>> resp = 
> >>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=GETFILESTATUS')
> >>> resp.content
> '{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File
>  does not exist: /tmp/bug/\xe1\xbf\xb9/test.txt"}}'
> >>> resp = requests.put(loc, data='bug')
> >>> resp
> 
> >>> resp.content
> '{"RemoteException":{"exception":"FileAlreadyExistsException","javaClassName":"org.apache.hadoop.fs.FileAlreadyExistsException","message":"/tmp/bug/?/test.txt
>  for client 127.0.0.1 already exists\\n\\tat 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2782)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(N

[jira] [Updated] (HDFS-9903) File can be created, but still couldn't be found when path is sort of Unicode

2016-03-04 Thread Alexander Shorin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Shorin updated HDFS-9903:
---
Summary: File can be created, but still couldn't be found when path is sort 
of Unicode  (was: File could be created, but still not found when path is sort 
of Unicode)

> File can be created, but still couldn't be found when path is sort of Unicode
> -
>
> Key: HDFS-9903
> URL: https://issues.apache.org/jira/browse/HDFS-9903
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Alexander Shorin
>
> {code}
> >>> import requests
> >>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=MKDIRS')
> 
> >>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=GETFILESTATUS')
> 
> >>> resp = 
> >>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=CREATE',
> >>>  allow_redirects=False)
> >>> resp
> 
> >>> loc = resp.headers['location']
> >>> resp = requests.put(loc, data='bug')
> >>> resp
> 
> >>> resp = 
> >>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=GETFILESTATUS')
> >>> resp.content
> '{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File
>  does not exist: /tmp/bug/\xe1\xbf\xb9/test.txt"}}'
> >>> resp = requests.put(loc, data='bug')
> >>> resp
> 
> >>> resp.content
> '{"RemoteException":{"exception":"FileAlreadyExistsException","javaClassName":"org.apache.hadoop.fs.FileAlreadyExistsException","message":"/tmp/bug/?/test.txt
>  for client 127.0.0.1 already exists\\n\\tat 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2782)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)\\n\\tat
>  
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)\\n\\tat
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\\n\\tat
>  
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)\\n\\tat
>  org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)\\n\\tat 
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)\\n\\tat 
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)\\n\\tat 
> java.security.AccessController.doPrivileged(Native Method)\\n\\tat 
> javax.security.auth.Subject.doAs(Subject.java:415)\\n\\tat 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)\\n\\tat
>  org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)\\n"}}'
> {code}
> Things to notice:
> 1. Although we receive HTTP 201 Created on file creation, the file cannot be 
> found via the API, even though it physically exists and really was created.
> 2. The GETFILESTATUS response for the file is, again, not the best JSON, but 
> Python can parse the UTF-8 bytes in it.
> 3. A second attempt to create the file at the same location mangles the 
> Unicode path in the error message.





[jira] [Updated] (HDFS-9903) File could be created, but still not found when path is sort of Unicode

2016-03-04 Thread Alexander Shorin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Shorin updated HDFS-9903:
---
Environment: (was: {code}
>>> import requests
>>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=MKDIRS')

>>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=GETFILESTATUS')

>>> resp = 
>>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=CREATE',
>>>  allow_redirects=False)
>>> resp

>>> loc = resp.headers['location']
>>> resp = requests.put(loc, data='bug')
>>> resp

>>> resp = 
>>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=GETFILESTATUS')
>>> resp.content
'{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File
 does not exist: /tmp/bug/\xe1\xbf\xb9/test.txt"}}'
>>> resp = requests.put(loc, data='bug')
>>> resp

>>> resp.content
'{"RemoteException":{"exception":"FileAlreadyExistsException","javaClassName":"org.apache.hadoop.fs.FileAlreadyExistsException","message":"/tmp/bug/?/test.txt
 for client 127.0.0.1 already exists\\n\\tat 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2782)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)\\n\\tat
 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)\\n\\tat
 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)\\n\\tat
 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\\n\\tat
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)\\n\\tat
 org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)\\n\\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)\\n\\tat 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)\\n\\tat 
java.security.AccessController.doPrivileged(Native Method)\\n\\tat 
javax.security.auth.Subject.doAs(Subject.java:415)\\n\\tat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)\\n\\tat
 org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)\\n"}}'
{code}

Things to notice:
1. Although we receive HTTP 201 Created on file creation, the file cannot be 
found via the API, even though it physically exists and really was created.
2. The GETFILESTATUS response for the file is, again, not the best JSON, but 
Python can parse the UTF-8 bytes in it.
3. A second attempt to create the file at the same location mangles the 
Unicode path in the error message.
)

> File could be created, but still not found when path is sort of Unicode
> ---
>
> Key: HDFS-9903
> URL: https://issues.apache.org/jira/browse/HDFS-9903
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Alexander Shorin
>
> {code}
> >>> import requests
> >>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=MKDIRS')
> 
> >>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9?user.name=test&op=GETFILESTATUS')
> 
> >>> resp = 
> >>> requests.put('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=CREATE',
> >>>  allow_redirects=False)
> >>> resp
> 
> >>> loc = resp.headers['location']
> >>> resp = requests.put(loc, data='bug')
> >>> resp
> 
> >>> resp = 
> >>> requests.get('http://localhost:50070/webhdfs/v1/tmp/bug/%E1%BF%B9/test.txt?user.name=test&op=GETFILESTATUS')
> >>> resp.content
> '{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File
>  does not exist: /tmp/bug/\xe1\xbf\xb9/test.txt"}}'
> >>> resp = requests.put(loc, data='bug')
> >>> resp
> 
> >>> resp.content
> '{"RemoteException":{"exception":"FileAlreadyExistsException","javaClassName":"org.apache.hadoop.fs.FileAlreadyExistsException","message":"/tmp/bug/?/test.txt
>  for client 127.0.0.1 already exists\\n\\tat 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2782)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)\\n\\tat
>  
> org.apache.hadoop.hdfs.server.name

[jira] [Commented] (HDFS-9478) Reason for failing ipc.FairCallQueue contruction should be thrown

2016-03-04 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179665#comment-15179665
 ] 

Ajith S commented on HDFS-9478:
---

I think the *RuntimeException* thrown by *FairCallQueue* is wrapped in an 
*InvocationTargetException*, so the *RuntimeException* catch block is bypassed 
and the exception is instead caught by the following *Exception* catch block.
Reference: 
http://docs.oracle.com/javase/8/docs/api/java/lang/reflect/Constructor.html#newInstance-java.lang.Object...-
_InvocationTargetException - if the underlying constructor throws an exception._
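
A tiny standalone demonstration of that wrapping behavior (a sketch with a 
throwaway class, not Hadoop code):

{code}
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

public class WrapDemo {
  static class Failing {
    Failing() { throw new RuntimeException("real reason"); }
  }

  public static void main(String[] args) throws Exception {
    Constructor<Failing> ctor = Failing.class.getDeclaredConstructor();
    try {
      ctor.newInstance();
    } catch (RuntimeException e) {
      System.out.println("never reached: newInstance wraps it");
    } catch (InvocationTargetException e) {
      // The constructor's RuntimeException is only available as the cause.
      System.out.println("cause: " + e.getCause().getMessage()); // real reason
    }
  }
}
{code}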

> Reason for failing ipc.FairCallQueue contruction should be thrown
> -
>
> Key: HDFS-9478
> URL: https://issues.apache.org/jira/browse/HDFS-9478
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Archana T
>Assignee: Ajith S
>Priority: Minor
> Attachments: HDFS-9478.patch
>
>
> When FairCallQueue construction fails, the NN fails to start, throwing a 
> RuntimeException without giving any reason why it failed.
> 2015-11-30 17:45:26,661 INFO org.apache.hadoop.ipc.FairCallQueue: 
> FairCallQueue is in use with 4 queues.
> 2015-11-30 17:45:26,665 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Registered Hadoop:service=ipc.65110,name=DecayRpcScheduler
> 2015-11-30 17:45:26,666 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.RuntimeException: org.apache.hadoop.ipc.FairCallQueue could not be 
> constructed.
> at 
> org.apache.hadoop.ipc.CallQueueManager.createCallQueueInstance(CallQueueManager.java:96)
> at org.apache.hadoop.ipc.CallQueueManager.(CallQueueManager.java:55)
> at org.apache.hadoop.ipc.Server.(Server.java:2241)
> at org.apache.hadoop.ipc.RPC$Server.(RPC.java:942)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:534)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509)
> at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:784)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.(NameNodeRpcServer.java:346)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:750)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:687)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:889)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:872)
> For example, the reason for the above failure could have been:
> 1. The weights were not equal to the number of queues configured.
> 2. decay-scheduler.thresholds was not in sync with the number of queues.





[jira] [Commented] (HDFS-9478) Reason for failing ipc.FairCallQueue contruction should be thrown

2016-03-04 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179669#comment-15179669
 ] 

Ajith S commented on HDFS-9478:
---

I guess we had better refactor 
*org.apache.hadoop.ipc.CallQueueManager.createCallQueueInstance* to check for 
an _InvocationTargetException_ and rethrow its cause. Any thoughts?
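
One possible shape for that refactor, as a sketch under assumed names rather 
than the actual patch:

{code}
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

public final class CallQueueFactorySketch {
  /** Sketch: unwrap InvocationTargetException so the constructor's real
   *  failure reason reaches the "could not be constructed" exception. */
  public static Object construct(Class<?> clazz, Object... args)
      throws ReflectiveOperationException {
    // Assumes the class has at least one constructor matching args.
    Constructor<?> ctor = clazz.getDeclaredConstructors()[0];
    try {
      return ctor.newInstance(args);
    } catch (InvocationTargetException e) {
      throw new RuntimeException(
          clazz.getName() + " could not be constructed", e.getCause());
    }
  }
}
{code}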

> Reason for failing ipc.FairCallQueue contruction should be thrown
> -
>
> Key: HDFS-9478
> URL: https://issues.apache.org/jira/browse/HDFS-9478
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Archana T
>Assignee: Ajith S
>Priority: Minor
> Attachments: HDFS-9478.patch
>
>
> When FairCallQueue construction fails, the NN fails to start, throwing a 
> RuntimeException without giving any reason why it failed.
> 2015-11-30 17:45:26,661 INFO org.apache.hadoop.ipc.FairCallQueue: 
> FairCallQueue is in use with 4 queues.
> 2015-11-30 17:45:26,665 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Registered Hadoop:service=ipc.65110,name=DecayRpcScheduler
> 2015-11-30 17:45:26,666 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.RuntimeException: org.apache.hadoop.ipc.FairCallQueue could not be 
> constructed.
> at 
> org.apache.hadoop.ipc.CallQueueManager.createCallQueueInstance(CallQueueManager.java:96)
> at org.apache.hadoop.ipc.CallQueueManager.(CallQueueManager.java:55)
> at org.apache.hadoop.ipc.Server.(Server.java:2241)
> at org.apache.hadoop.ipc.RPC$Server.(RPC.java:942)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:534)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509)
> at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:784)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.(NameNodeRpcServer.java:346)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:750)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:687)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:889)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:872)
> For example, the reason for the above failure could have been:
> 1. The weights were not equal to the number of queues configured.
> 2. decay-scheduler.thresholds was not in sync with the number of queues.





[jira] [Commented] (HDFS-9875) HDFS client requires compromising permission when running under JVM security manager

2016-03-04 Thread Costin Leau (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179789#comment-15179789
 ] 

Costin Leau commented on HDFS-9875:
---

Sure.
I've forked the 2.8 branch on github; change is available as a 
[PR|https://github.com/costin/hadoop/pull/1], 
[diff|https://patch-diff.githubusercontent.com/raw/costin/hadoop/pull/1.diff] 
and 
[patch|https://patch-diff.githubusercontent.com/raw/costin/hadoop/pull/1.patch].

Thanks,

> HDFS client requires compromising permission when running under JVM security 
> manager
> 
>
> Key: HDFS-9875
> URL: https://issues.apache.org/jira/browse/HDFS-9875
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client, security
>Affects Versions: 2.7.2
> Environment: Linux
>Reporter: Costin Leau
>
> The HDFS _client_ requires dangerous permissions, in particular _execute_ on 
> _all files_, despite only trying to connect to an HDFS cluster.
> A full list (for both Hadoop 1 and 2) is available here along with the 
> places in code where they occur.
> While it is understandable for some permissions to be used, requiring 
> {{FilePermission <<ALL FILES>> execute}} simply to initialize a class field 
> [Shell|https://github.com/apache/hadoop/blob/0fa54d45b1cf8a29f089f64d24f35bd221b4803f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java#L728]
>  which in the end is not used (since it's just a client) simply *compromises* 
> the entire security system.
> To make matters worse, the code is executed to initialize a field, so if the 
> permission is not granted, the VM fails with {{InitializationError}}, which 
> is unrecoverable.
> Ironically enough, on Windows this problem does not appear, since the code 
> simply bypasses the check and initializes the field with a fallback value 
> ({{false}}).
> A quick fix would be to take into account that the JVM {{SecurityManager}} 
> might be active and the permission not granted, or that the external process 
> fails, and use a fallback value.
> A proper, long-term fix would be to minimize the use of permissions in the 
> hdfs client, since they are simply not required. A client should be as light 
> as possible and not have the server's requirements leak onto it.
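
A minimal sketch of the suggested quick fix, assuming a static boolean field 
whose probe may be denied; the names here are hypothetical, not Shell's actual 
code:

{code}
public class SafeInitSketch {
  // Initialized via a method so a denied permission cannot abort class
  // initialization with an unrecoverable initializer error.
  public static final boolean FEATURE_AVAILABLE = detectFeature();

  private static boolean detectFeature() {
    try {
      return probeExternalProcess(); // hypothetical probe that may be denied
    } catch (RuntimeException e) {   // includes SecurityException
      return false;                  // the fallback value Windows already gets
    }
  }

  private static boolean probeExternalProcess() {
    // Stand-in for the real check (e.g. launching a helper process).
    return true;
  }
}
{code}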





[jira] [Commented] (HDFS-9875) HDFS client requires compromising permission when running under JVM security manager

2016-03-04 Thread Costin Leau (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179791#comment-15179791
 ] 

Costin Leau commented on HDFS-9875:
---

Not sure if it helps, but I've opened a 
[PR|https://github.com/apache/hadoop/pull/82] against the Hadoop clone on 
Github as well.
Cheers,

> HDFS client requires compromising permission when running under JVM security 
> manager
> 
>
> Key: HDFS-9875
> URL: https://issues.apache.org/jira/browse/HDFS-9875
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client, security
>Affects Versions: 2.7.2
> Environment: Linux
>Reporter: Costin Leau
>
> The HDFS _client_ requires dangerous permissions, in particular _execute_ on 
> _all files_, despite only trying to connect to an HDFS cluster.
> A full list (for both Hadoop 1 and 2) is available here along with the 
> places in code where they occur.
> While it is understandable for some permissions to be used, requiring 
> {{FilePermission <<ALL FILES>> execute}} simply to initialize a class field 
> [Shell|https://github.com/apache/hadoop/blob/0fa54d45b1cf8a29f089f64d24f35bd221b4803f/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java#L728]
>  which in the end is not used (since it's just a client) simply *compromises* 
> the entire security system.
> To make matters worse, the code is executed to initialize a field, so if the 
> permission is not granted, the VM fails with {{InitializationError}}, which 
> is unrecoverable.
> Ironically enough, on Windows this problem does not appear, since the code 
> simply bypasses the check and initializes the field with a fallback value 
> ({{false}}).
> A quick fix would be to take into account that the JVM {{SecurityManager}} 
> might be active and the permission not granted, or that the external process 
> fails, and use a fallback value.
> A proper, long-term fix would be to minimize the use of permissions in the 
> hdfs client, since they are simply not required. A client should be as light 
> as possible and not have the server's requirements leak onto it.





[jira] [Commented] (HDFS-7648) Verify the datanode directory layout

2016-03-04 Thread David Watzke (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179806#comment-15179806
 ] 

David Watzke commented on HDFS-7648:


Hi guys. I ran into trouble because I used 
https://github.com/killerwhile/volume-balancer with Hadoop 2.6.0 and it messed 
up my datadirs, because that software makes invalid assumptions about which 
directory moves it can do. Now the DN logs are filled with these:

WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: I/O error while 
finding block BP-680964103-77.234.46.18-1375882473930:blk_5822441067008155275_0 
on volume /data/19/cdfs/dn

What can I do to fix this? I don't know which dirs were moved or from where, 
but is there a reasonable way out of this? For example, editing the VERSION 
file back to a previous layout version while the DN is down, so that it fixes 
the layout by itself: would that work?

Please note that I've lost the other replica due to a filesystem error, so I 
can't just ignore it. This is literally my only option to recover some missing 
blocks.

Thanks

> Verify the datanode directory layout
> 
>
> Key: HDFS-7648
> URL: https://issues.apache.org/jira/browse/HDFS-7648
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Rakesh R
> Attachments: HDFS-7648-3.patch, HDFS-7648-4.patch, HDFS-7648-5.patch, 
> HDFS-7648.patch, HDFS-7648.patch
>
>
> HDFS-6482 changed datanode layout to use block ID to determine the directory 
> to store the block.  We should have some mechanism to verify it.  Either 
> DirectoryScanner or block report generation could do the check.





[jira] [Updated] (HDFS-9868) add reading source cluster with HA access mode feature for DistCp

2016-03-04 Thread NING DING (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

NING DING updated HDFS-9868:

Attachment: HDFS-9868.3.patch

> add reading source cluster with HA access mode feature for DistCp
> -
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: distcp
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Attachments: HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch
>
>
> Normally the HDFS cluster is HA enabled. Copying huge data with DistCp can 
> take a long time, and if the source cluster switches its active namenode 
> during the copy, DistCp will fail. This patch lets DistCp read source 
> cluster files in HA access mode. A source cluster configuration file needs 
> to be specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar 
> hdfs://nn2:8020/bar/foo
> {code}





[jira] [Commented] (HDFS-9868) add reading source cluster with HA access mode feature for DistCp

2016-03-04 Thread NING DING (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179952#comment-15179952
 ] 

NING DING commented on HDFS-9868:
-

I added a test case for switching the active namenode in HDFS-9868.3.patch.
Please review. Thank you.

> add reading source cluster with HA access mode feature for DistCp
> -
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: distcp
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Attachments: HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch
>
>
> Normally the HDFS cluster is HA enabled. Copying huge data with DistCp can 
> take a long time, and if the source cluster switches its active namenode 
> during the copy, DistCp will fail. This patch lets DistCp read source 
> cluster files in HA access mode. A source cluster configuration file needs 
> to be specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar 
> hdfs://nn2:8020/bar/foo
> {code}





[jira] [Commented] (HDFS-9868) add reading source cluster with HA access mode feature for DistCp

2016-03-04 Thread NING DING (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179954#comment-15179954
 ] 

NING DING commented on HDFS-9868:
-

Hi, [~jojochuang]
I added a test case for switching the active namenode in HDFS-9868.3.patch.
Please review. Thank you.

> add reading source cluster with HA access mode feature for DistCp
> -
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: distcp
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Attachments: HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch
>
>
> Normally the HDFS cluster is HA enabled. Copying huge data with DistCp can 
> take a long time, and if the source cluster switches its active namenode 
> during the copy, DistCp will fail. This patch lets DistCp read source 
> cluster files in HA access mode. A source cluster configuration file needs 
> to be specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar 
> hdfs://nn2:8020/bar/foo
> {code}





[jira] [Commented] (HDFS-9868) add reading source cluster with HA access mode feature for DistCp

2016-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179991#comment-15179991
 ] 

Hadoop QA commented on HDFS-9868:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 16s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 13s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 13s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 42s {color} 
| {color:red} hadoop-tools_hadoop-distcp-jdk1.8.0_74 with JDK v1.8.0_74 
generated 2 new + 4 unchanged - 0 fixed = 6 total (was 4) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 13s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 15s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 58s {color} 
| {color:red} hadoop-tools_hadoop-distcp-jdk1.7.0_95 with JDK v1.7.0_95 
generated 2 new + 3 unchanged - 0 fixed = 5 total (was 3) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 15s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s 
{color} | {color:red} hadoop-tools/hadoop-distcp: patch generated 3 new + 174 
unchanged - 0 fixed = 177 total (was 174) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 21s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
1s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
41s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 11s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 12s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 32s 
{color} | {color:green} hadoop-distcp in the patch passed with JDK v1.8.0_74. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 48s 
{color} | {color:green} hadoop-distcp in the patch passed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
19s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 30m 40s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791

[jira] [Commented] (HDFS-9871) "Bytes Being Moved" -ve(-1 B) when cluster was already balanced.

2016-03-04 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180060#comment-15180060
 ] 

Rushabh S Shah commented on HDFS-9871:
--

lgtm. +1 (non-binding)

> "Bytes Being Moved" -ve(-1 B) when cluster was already balanced.
> 
>
> Key: HDFS-9871
> URL: https://issues.apache.org/jira/browse/HDFS-9871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
> Attachments: HDFS-9871-002.patch, HDFS-9871.patch
>
>
> Run the balancer when there are no {{over}}- or {{under}}-utilized nodes.
> {noformat}
> 16/02/29 02:39:40 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/**.120:50076
> 16/02/29 02:39:40 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/**.121:50076
> 16/02/29 02:39:40 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/**.122:50076
> 16/02/29 02:39:41 INFO balancer.Balancer: 0 over-utilized: []
> 16/02/29 02:39:41 INFO balancer.Balancer: 0 underutilized: []
> The cluster is balanced. Exiting...
> Feb 29, 2016 2:40:57 AM   0  0 B 0 B  
>  -1 B
> {noformat}





[jira] [Updated] (HDFS-7166) SbNN Web UI shows #Under replicated blocks and #pending deletion blocks

2016-03-04 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-7166:
--
Attachment: HDFS-7166.001.patch

Rev01:
The SbNN will not show corrupted blocks, under-replicated blocks, or missing 
blocks. In addition, it should also not show blocks that are pending deletion, 
according to {{TestStandbyBlockManagement}}.

I tested this patch on a CDH cluster with HA and it works as expected.

> SbNN Web UI shows #Under replicated blocks and #pending deletion blocks
> ---
>
> Key: HDFS-7166
> URL: https://issues.apache.org/jira/browse/HDFS-7166
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: Juan Yu
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-7166.001.patch
>
>
> I believe that's a regression of HDFS-5333.
> According to HDFS-2901 and HDFS-6178, the Standby Namenode doesn't compute 
> replication queues, so we shouldn't show under-replicated/missing blocks or 
> corrupt files.





[jira] [Updated] (HDFS-7166) SbNN Web UI shows #Under replicated blocks and #pending deletion blocks

2016-03-04 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-7166:
--
Status: Patch Available  (was: Open)

> SbNN Web UI shows #Under replicated blocks and #pending deletion blocks
> ---
>
> Key: HDFS-7166
> URL: https://issues.apache.org/jira/browse/HDFS-7166
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: Juan Yu
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-7166.001.patch
>
>
> I believe that's a regression of HDFS-5333.
> According to HDFS-2901 and HDFS-6178, the Standby Namenode doesn't compute 
> replication queues, so we shouldn't show under-replicated/missing blocks or 
> corrupt files.





[jira] [Commented] (HDFS-7166) SbNN Web UI shows #Under replicated blocks and #pending deletion blocks

2016-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180103#comment-15180103
 ] 

Hadoop QA commented on HDFS-7166:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 23s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
28s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 1m 11s {color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791488/HDFS-7166.001.patch |
| JIRA Issue | HDFS-7166 |
| Optional Tests |  asflicense  |
| uname | Linux ff0f744c6dc1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / cbd3132 |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/14718/console |
| Powered by | Apache Yetus 0.3.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> SbNN Web UI shows #Under replicated blocks and #pending deletion blocks
> ---
>
> Key: HDFS-7166
> URL: https://issues.apache.org/jira/browse/HDFS-7166
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: Juan Yu
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-7166.001.patch
>
>
> I believe that's a regression of HDFS-5333.
> According to HDFS-2901 and HDFS-6178, the Standby Namenode doesn't compute 
> replication queues, so we shouldn't show under-replicated/missing blocks or 
> corrupt files.





[jira] [Assigned] (HDFS-9902) dfs.datanode.du.reserved should be difference between StorageType DISK and RAM_DISK

2016-03-04 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula reassigned HDFS-9902:
--

Assignee: Brahma Reddy Battula

> dfs.datanode.du.reserved should be difference between StorageType DISK and 
> RAM_DISK
> ---
>
> Key: HDFS-9902
> URL: https://issues.apache.org/jira/browse/HDFS-9902
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.7.2
>Reporter: Pan Yuxuan
>Assignee: Brahma Reddy Battula
>
> Hadoop now supports different storage types (DISK, SSD, ARCHIVE and 
> RAM_DISK), but they share one configuration, dfs.datanode.du.reserved.
> The DISK size may be several TB while the RAM_DISK size may be only several 
> tens of GB.
> The problem is that when I configure DISK and RAM_DISK (tmpfs) in the same 
> DN and set dfs.datanode.du.reserved to 10GB, a lot of RAM_DISK capacity is 
> wasted.
> Since the usage of RAM_DISK can be 100%, I don't want the 
> dfs.datanode.du.reserved configured for DISK to impact the usage of tmpfs.
> So can we add a new configuration for RAM_DISK, or just skip this 
> configuration for RAM_DISK?
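
One possible shape for such a configuration; the per-storage-type key below is 
hypothetical, for illustration only, not an existing 2.7.2 property:

{code:xml}
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- 10 GB reserved on DISK volumes -->
  <value>10737418240</value>
</property>
<property>
  <!-- hypothetical per-storage-type override for RAM_DISK (tmpfs) -->
  <name>dfs.datanode.du.reserved.ram_disk</name>
  <value>0</value>
</property>
{code}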





[jira] [Created] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-04 Thread Kihwal Lee (JIRA)
Kihwal Lee created HDFS-9904:


 Summary: testCheckpointCancellationDuringUpload occasionally fails 
 Key: HDFS-9904
 URL: https://issues.apache.org/jira/browse/HDFS-9904
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 2.7.3
Reporter: Kihwal Lee


The failure was at the end of the test case, where the txid of the standby 
(former active) is checked. Since the checkpoint/upload was canceled, the 
standby is not supposed to have the new checkpoint. Looking at the test log, 
that was still the case, but the standby then checkpointed on its own and 
bumped up the txid right before the check was performed.





[jira] [Commented] (HDFS-9904) testCheckpointCancellationDuringUpload occasionally fails

2016-03-04 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180154#comment-15180154
 ] 

Kihwal Lee commented on HDFS-9904:
--

The stack trace from the test failure.
{noformat}
java.lang.AssertionError: expected:<0> but was:<106>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testCheckpointCancellationDuringUpload(TestStandbyCheckpoints.java:328)
{noformat}

We could set DFS_NAMENODE_CHECKPOINT_TXNS_KEY differently on the first NN to 
keep it from checkpointing on its own when it becomes a standby.
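
A minimal sketch of that idea; the key name is from DFSConfigKeys, the 
threshold value is arbitrary, and the per-NN wiring of the test cluster is 
omitted:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;

public class CheckpointTestConfSketch {
  /** Build a conf for the first NN with a checkpoint threshold so high that
   *  it will not checkpoint on its own after transitioning to standby. */
  static Configuration firstNnConf(Configuration base) {
    Configuration conf = new Configuration(base);
    conf.setLong(DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_TXNS_KEY, 1000000L);
    return conf;
  }
}
{code}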

> testCheckpointCancellationDuringUpload occasionally fails 
> --
>
> Key: HDFS-9904
> URL: https://issues.apache.org/jira/browse/HDFS-9904
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.3
>Reporter: Kihwal Lee
>
> The failure was at the end of the test case, where the txid of the standby 
> (former active) is checked. Since the checkpoint/upload was canceled, the 
> standby is not supposed to have the new checkpoint. Looking at the test log, 
> that was still the case, but the standby then checkpointed on its own and 
> bumped up the txid right before the check was performed.





[jira] [Created] (HDFS-9905) TestWebHdfsTimeouts fails occasionally

2016-03-04 Thread Kihwal Lee (JIRA)
Kihwal Lee created HDFS-9905:


 Summary: TestWebHdfsTimeouts fails occasionally
 Key: HDFS-9905
 URL: https://issues.apache.org/jira/browse/HDFS-9905
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: t, test
Affects Versions: 2.7.3
Reporter: Kihwal Lee


When checking for a timeout, the test does get a {{SocketTimeoutException}}, 
but the message sometimes does not contain "connect timed out". Since the 
original exception is not logged, we do not know the details.
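
A sketch of the kind of change that would surface the details, with logging 
added around an assertion via Hadoop's GenericTestUtils helper; the names here 
are illustrative, not the test's actual code:

{code}
import java.net.SocketTimeoutException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.test.GenericTestUtils;

public class TimeoutAssertSketch {
  private static final Log LOG = LogFactory.getLog(TimeoutAssertSketch.class);

  static void checkConnectTimeout(SocketTimeoutException e) {
    // Log the full original exception first, so an unexpected message is
    // diagnosable from the test output instead of a bare ComparisonFailure.
    LOG.info("Original exception", e);
    GenericTestUtils.assertExceptionContains("connect timed out", e);
  }
}
{code}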





[jira] [Commented] (HDFS-9888) TestBalancer#testBalancerWithKeytabs should reset KerberosName in test case setup

2016-03-04 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180223#comment-15180223
 ] 

Xiao Chen commented on HDFS-9888:
-

Yes Zhe, HDFS-3016 is not in branch-2.
Additionally, since {{TestBalancer#testBalancerWithKeytabs}} was added by 
HDFS-9804, which is not in branch-2, I think trunk is enough for this patch. 
I've added a link to that jira so that if we backport it later, we'll bring 
this along as well. Thanks!
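
For reference, the reset in a test setup would look roughly like this; a 
sketch assuming the public reset hook this JIRA added to KerberosName:

{code}
import org.apache.hadoop.security.authentication.util.KerberosName;
import org.junit.Before;

public class BalancerKerberosSetupSketch {
  @Before
  public void setup() throws Exception {
    // Re-derive the default realm from the test's Kerberos configuration
    // instead of the realm captured at class-initialization time.
    KerberosName.resetDefaultRealm();
    // ... remaining Kerberos and cluster setup ...
  }
}
{code}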

> TestBalancer#testBalancerWithKeytabs should reset KerberosName in test case 
> setup
> -
>
> Key: HDFS-9888
> URL: https://issues.apache.org/jira/browse/HDFS-9888
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Minor
> Attachments: HDFS-9888.01.patch
>
>
> In some local environments, {{TestBalancer#testBalancerWithKeytabs}} may 
> fail. Specifically, running it by itself passes, but running the 
> {{TestBalancer}} suite always fails. This is due to:
> # Kerberos setup is done in the test case setup
> # the static variable {{KerberosName#defaultRealm}} is set at class 
> initialization, before the {{testBalancerWithKeytabs}} setup
> # the local default realm is different from the test case default realm
> This is mostly an environment-specific problem, but let's not make such an 
> assumption in the test.





[jira] [Updated] (HDFS-9905) TestWebHdfsTimeouts fails occasionally

2016-03-04 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-9905:
-
Component/s: (was: t)

> TestWebHdfsTimeouts fails occasionally
> --
>
> Key: HDFS-9905
> URL: https://issues.apache.org/jira/browse/HDFS-9905
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.3
>Reporter: Kihwal Lee
>
> When checking for a timeout, the test does get a {{SocketTimeoutException}}, 
> but the message sometimes does not contain "connect timed out". Since the 
> original exception is not logged, we do not know the details.





[jira] [Commented] (HDFS-9905) TestWebHdfsTimeouts fails occasionally

2016-03-04 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180225#comment-15180225
 ] 

Kihwal Lee commented on HDFS-9905:
--

This is from a precommit for HDFS-9239.

{noformat}
Running org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts
Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.611 sec <<< 
FAILURE! - in org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts
testAuthUrlReadTimeout[timeoutSource=ConnectionFactory](org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts)
  Time elapsed: 0.083 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected: but 
was:
at org.junit.Assert.assertEquals(Assert.java:115)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts.testAuthUrlReadTimeout(TestWebHdfsTimeouts.java:195)
{noformat}

We also saw this from our own build.

{noformat}
org.junit.ComparisonFailure: expected: 
but was:
at org.junit.Assert.assertEquals(Assert.java:115)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts.testTwoStepWriteConnectTimeout(TestWebHdfsTimeouts.java:206)
{noformat}

> TestWebHdfsTimeouts fails occasionally
> --
>
> Key: HDFS-9905
> URL: https://issues.apache.org/jira/browse/HDFS-9905
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.3
>Reporter: Kihwal Lee
>
> When checking for a timeout, the test does get a {{SocketTimeoutException}}, 
> but the message sometimes does not contain "connect timed out". Since the 
> original exception is not logged, we do not know the details.





[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180227#comment-15180227
 ] 

Kihwal Lee commented on HDFS-9239:
--

Filed HDFS-9905 for TestWebHdfsTimeouts.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9888) Allow reseting KerberosName in unit tests

2016-03-04 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HDFS-9888:

Summary: Allow reseting KerberosName in unit tests  (was: 
TestBalancer#testBalancerWithKeytabs should reset KerberosName in test case 
setup)

> Allow reseting KerberosName in unit tests
> -
>
> Key: HDFS-9888
> URL: https://issues.apache.org/jira/browse/HDFS-9888
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Minor
> Attachments: HDFS-9888.01.patch
>
>
> In some local environments, {{TestBalancer#testBalancerWithKeytabs}} may 
> fail. Specifically, running it by itself passes, but running the whole 
> {{TestBalancer}} suite always fails. This is due to:
> # Kerberos setup is done in the test case setup
> # the static variable {{KerberosName#defaultRealm}} is set during class 
> initialization - before the {{testBalancerWithKeytabs}} setup runs
> # the local default realm is different from the test case default realm
> This is mostly an environment-specific problem, but let's not make such an 
> assumption in the test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9888) Allow reseting KerberosName in unit tests

2016-03-04 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HDFS-9888:

  Resolution: Fixed
Hadoop Flags: Reviewed
   Fix Version/s: 3.0.0
Target Version/s: 3.0.0
  Status: Resolved  (was: Patch Available)

Thanks Xiao. I just committed the patch to trunk. Also updated the JIRA title 
since we added a public method to non-test code.
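For context, a minimal sketch of the pattern (class and method names here are illustrative, not necessarily the committed API): cache the realm once, and expose a test-only hook that recomputes it after test setup has overridden the Kerberos configuration.

{code}
public final class KerberosNameSketch {
  private static String defaultRealm =
      System.getProperty("java.security.krb5.realm", "");

  /** Test-only hook: re-derive the cached realm after setup overrides it. */
  public static synchronized void resetDefaultRealm() {
    defaultRealm = System.getProperty("java.security.krb5.realm", "");
  }

  public static synchronized String getDefaultRealm() {
    return defaultRealm;
  }
}
{code}

A test setup method would set the realm (for example via a MiniKDC-generated krb5.conf) and then call the reset hook before building principals.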

> Allow reseting KerberosName in unit tests
> -
>
> Key: HDFS-9888
> URL: https://issues.apache.org/jira/browse/HDFS-9888
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-9888.01.patch
>
>
> In some local environments, {{TestBalancer#testBalancerWithKeytabs}} may 
> fail. Specifically, running it by itself passes, but running the whole 
> {{TestBalancer}} suite always fails. This is due to:
> # Kerberos setup is done in the test case setup
> # the static variable {{KerberosName#defaultRealm}} is set during class 
> initialization - before the {{testBalancerWithKeytabs}} setup runs
> # the local default realm is different from the test case default realm
> This is mostly an environment-specific problem, but let's not make such an 
> assumption in the test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9888) Allow reseting KerberosName in unit tests

2016-03-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180255#comment-15180255
 ] 

Hudson commented on HDFS-9888:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9424 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9424/])
HDFS-9888. Allow reseting KerberosName in unit tests. Contributed by (zhz: rev 
3e8099a45a4cfd4c5c0e3dce4370514cb2c90da9)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer.java
* 
hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/util/KerberosName.java


> Allow reseting KerberosName in unit tests
> -
>
> Key: HDFS-9888
> URL: https://issues.apache.org/jira/browse/HDFS-9888
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-9888.01.patch
>
>
> In some local environments, {{TestBalancer#testBalancerWithKeytabs}} may 
> fail. Specifically, running it by itself passes, but running the whole 
> {{TestBalancer}} suite always fails. This is due to:
> # Kerberos setup is done in the test case setup
> # the static variable {{KerberosName#defaultRealm}} is set during class 
> initialization - before the {{testBalancerWithKeytabs}} setup runs
> # the local default realm is different from the test case default realm
> This is mostly an environment-specific problem, but let's not make such an 
> assumption in the test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Elliott Clark (JIRA)
Elliott Clark created HDFS-9906:
---

 Summary: Remove spammy log spew when a datanode is restarted
 Key: HDFS-9906
 URL: https://issues.apache.org/jira/browse/HDFS-9906
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.7.2
Reporter: Elliott Clark


{code}
WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request 
received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 268435456
{code}

This happens way too often to add any useful information. We should either 
move this to a different log level or only warn once per machine.
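A hedged sketch of the second option (names hypothetical, not existing BlockManager code): keep the first redundant-request warning per node at WARN and demote the rest to DEBUG.

{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RedundantBlockLogSketch {
  private final Set<String> warnedNodes = ConcurrentHashMap.newKeySet();

  void logRedundant(String block, String node, long size) {
    String msg = "BLOCK* addStoredBlock: Redundant addStoredBlock request "
        + "received for " + block + " on node " + node + " size " + size;
    if (warnedNodes.add(node)) {
      System.err.println("WARN  " + msg);  // first occurrence for this node
    } else {
      System.out.println("DEBUG " + msg);  // subsequent occurrences demoted
    }
  }
}
{code}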



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9888) Allow reseting KerberosName in unit tests

2016-03-04 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180281#comment-15180281
 ] 

Xiao Chen commented on HDFS-9888:
-

Thanks very much [~zhz]!

> Allow reseting KerberosName in unit tests
> -
>
> Key: HDFS-9888
> URL: https://issues.apache.org/jira/browse/HDFS-9888
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-9888.01.patch
>
>
> In some local environments, {{TestBalancer#testBalancerWithKeytabs}} may 
> fail. Specifically, running it by itself passes, but running the whole 
> {{TestBalancer}} suite always fails. This is due to:
> # Kerberos setup is done in the test case setup
> # the static variable {{KerberosName#defaultRealm}} is set during class 
> initialization - before the {{testBalancerWithKeytabs}} setup runs
> # the local default realm is different from the test case default realm
> This is mostly an environment-specific problem, but let's not make such an 
> assumption in the test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9888) Allow reseting KerberosName in unit tests

2016-03-04 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180302#comment-15180302
 ] 

Xiao Chen commented on HDFS-9888:
-

FYI, I just noticed that the commit to trunk has a bunch of whitespace changes 
in both files. Nothing functionally different from patch 01 here, though.

> Allow reseting KerberosName in unit tests
> -
>
> Key: HDFS-9888
> URL: https://issues.apache.org/jira/browse/HDFS-9888
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-9888.01.patch
>
>
> In some local environments, {{TestBalancer#testBalancerWithKeytabs}} may 
> fail. Specifically, running it by itself passes, but running the whole 
> {{TestBalancer}} suite always fails. This is due to:
> # Kerberos setup is done in the test case setup
> # the static variable {{KerberosName#defaultRealm}} is set during class 
> initialization - before the {{testBalancerWithKeytabs}} setup runs
> # the local default realm is different from the test case default realm
> This is mostly an environment-specific problem, but let's not make such an 
> assumption in the test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7285) Erasure Coding Support inside HDFS

2016-03-04 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HDFS-7285:

Release Note: 

HDFS now provides native support for erasure coding (EC) to store data more 
efficiently. Each individual directory can be configured with an EC policy with 
command {{hdfs erasurecode -setPolicy}}. When a file is created, it will 
inherit the EC policy from its nearest ancestor to determine how its blocks are 
stored. Compared with 3-way replication, the default EC policy saves 50% of 
storage space for configured directories, while tolerating more storage 
failures.

To support small files, the current phase of HDFS-EC stores blocks in a 
_striped_ layout, where a logical file block is divided into small units (64KB 
by default) and distributed to a set of {{DataNodes}}. This enables parallel 
I/O but also decreases data locality. Therefore, the cluster environment and 
I/O workloads should be considered before configuring EC policies.
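
As a quick sanity check on the 50% figure above, here is the arithmetic as a runnable sketch (assuming the default RS-6-3 policy: 6 data cells plus 3 parity cells per stripe):

{code}
public class EcOverheadSketch {
  public static void main(String[] args) {
    double replication = 3.0;     // 3-way replication: 3x raw bytes per logical byte
    double rs63 = (6 + 3) / 6.0;  // RS-6-3 striping: 1.5x raw bytes per logical byte
    System.out.println(rs63 / replication); // 0.5 -> half the storage footprint
  }
}
{code}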

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Fix For: 3.0.0
>
> Attachments: Compare-consolidated-20150824.diff, 
> Consolidated-20150707.patch, Consolidated-20150806.patch, 
> Consolidated-20150810.patch, ECAnalyzer.py, ECParser.py, 
> HDFS-7285-Consolidated-20150911.patch, HDFS-7285-initial-PoC.patch, 
> HDFS-7285-merge-consolidated-01.patch, 
> HDFS-7285-merge-consolidated-trunk-01.patch, 
> HDFS-7285-merge-consolidated.trunk.03.patch, 
> HDFS-7285-merge-consolidated.trunk.04.patch, 
> HDFS-EC-Merge-PoC-20150624.patch, HDFS-EC-merge-consolidated-01.patch, 
> HDFS-bistriped.patch, HDFSErasureCodingDesign-20141028.pdf, 
> HDFSErasureCodingDesign-20141217.pdf, HDFSErasureCodingDesign-20150204.pdf, 
> HDFSErasureCodingDesign-20150206.pdf, HDFSErasureCodingPhaseITestPlan.pdf, 
> HDFSErasureCodingSystemTestPlan-20150824.pdf, 
> HDFSErasureCodingSystemTestReport-20150826.pdf, fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without 
> sacrificing data reliability, compared to the existing HDFS 3-replica 
> approach. For example, if we use a 10+4 Reed-Solomon coding, we can tolerate 
> the loss of 4 blocks, with a storage overhead of only 40%. This makes EC a 
> quite attractive alternative for big data storage, particularly for cold 
> data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contributed packages in HDFS but was removed in Hadoop 2.0 for 
> maintenance reasons. Its drawbacks are: 1) it sits on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are not intended to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, making it self-contained and 
> independently maintained. This design lays the EC feature on top of the 
> storage type support and is designed to be compatible with existing HDFS 
> features like caching, snapshots, encryption, and high availability. This 
> design will also support different EC coding schemes, implementations and 
> policies for different deployment scenarios. By utilizing advanced libraries 
> (e.g. the Intel ISA-L library), an implementation can greatly improve the 
> performance of EC encoding/decoding and make the EC solution even more 
> attractive. We will post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9882) Add heartbeatsTotal in Datanode metrics

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180348#comment-15180348
 ] 

Arpit Agarwal commented on HDFS-9882:
-

Thanks, that makes sense. This is a good find.

Do you think it's a better idea to fix heartbeat handling to remove expensive 
operations?

> Add heartbeatsTotal in Datanode metrics
> ---
>
> Key: HDFS-9882
> URL: https://issues.apache.org/jira/browse/HDFS-9882
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 2.7.2
>Reporter: Hua Liu
>Assignee: Hua Liu
>Priority: Minor
> Attachments: 
> 0001-HDFS-9882.Add-heartbeatsTotal-in-Datanode-metrics.patch, 
> 0002-HDFS-9882.Add-heartbeatsTotal-in-Datanode-metrics.patch
>
>
> Heartbeat latency only reflects the time spent generating reports and 
> sending them to the NN. When heartbeats are delayed due to processing 
> commands, this latency does not help investigation. I would like to propose 
> adding another metric counter to show the total time. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9882) Add heartbeatsTotal in Datanode metrics

2016-03-04 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180364#comment-15180364
 ] 

Inigo Goiri commented on HDFS-9882:
---

Just for the record, this happened on Windows, where the Hadoop code might not 
be that optimized.

Not sure if we can remove those operations; it might be a little too deep a 
change.
For now, our internal solution has been to move these operations to a separate 
thread and have the heartbeat thread just check their results.
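
To make that workaround concrete, a hedged sketch (names hypothetical, not our internal code): the expensive checks run on a dedicated thread and the heartbeat path only reads a cached result.

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatCheckSketch {
  private volatile boolean lastCheckHealthy = true;
  private final ScheduledExecutorService pool =
      Executors.newSingleThreadScheduledExecutor();

  void start() {
    // run the expensive per-volume checks off the heartbeat path
    pool.scheduleWithFixedDelay(
        () -> lastCheckHealthy = expensiveDiskCheck(), 0, 60, TimeUnit.SECONDS);
  }

  boolean healthyForHeartbeat() {
    return lastCheckHealthy; // cheap volatile read on the perf-sensitive path
  }

  private boolean expensiveDiskCheck() {
    return true; // placeholder for the real disk/volume checks
  }
}
{code}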

> Add heartbeatsTotal in Datanode metrics
> ---
>
> Key: HDFS-9882
> URL: https://issues.apache.org/jira/browse/HDFS-9882
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 2.7.2
>Reporter: Hua Liu
>Assignee: Hua Liu
>Priority: Minor
> Attachments: 
> 0001-HDFS-9882.Add-heartbeatsTotal-in-Datanode-metrics.patch, 
> 0002-HDFS-9882.Add-heartbeatsTotal-in-Datanode-metrics.patch
>
>
> Heartbeat latency only reflects the time spent generating reports and 
> sending them to the NN. When heartbeats are delayed due to processing 
> commands, this latency does not help investigation. I would like to propose 
> adding another metric counter to show the total time. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7285) Erasure Coding Support inside HDFS

2016-03-04 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-7285:
--
Release Note: 

HDFS now provides native support for erasure coding (EC) to store data more 
efficiently. Each individual directory can be configured with an EC policy with 
command `hdfs erasurecode -setPolicy`. When a file is created, it will inherit 
the EC policy from its nearest ancestor directory to determine how its blocks 
are stored. Compared to 3-way replication, the default EC policy saves 50% of 
storage space while also tolerating more storage failures.

To support small files, the current phase of HDFS-EC stores blocks in a 
_striped_ layout, where a logical file block is divided into small units (64KB 
by default) and distributed to a set of DataNodes. This enables parallel I/O 
but also decreases data locality. Therefore, the cluster environment and I/O 
workloads should be considered before configuring EC policies.

  was:

HDFS now provides native support for erasure coding (EC) to store data more 
efficiently. Each individual directory can be configured with an EC policy with 
command {{hdfs erasurecode -setPolicy}}. When a file is created, it will 
inherit the EC policy from its nearest ancestor to determine how its blocks are 
stored. Compared with 3-way replication, the default EC policy saves 50% of 
storage space for configured directories, while tolerating more storage 
failures.

To support small files, the currently phase of HDFS-EC stores blocks in 
_striped_ layout, where a logical file block is divided into small units (64KB 
by default) and distributed to a set of {{DataNodes}}. This enables parallel 
I/O but also decreases data locality. Therefore, the cluster environment and 
I/O workloads should be considered before configuring EC policies.


> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Fix For: 3.0.0
>
> Attachments: Compare-consolidated-20150824.diff, 
> Consolidated-20150707.patch, Consolidated-20150806.patch, 
> Consolidated-20150810.patch, ECAnalyzer.py, ECParser.py, 
> HDFS-7285-Consolidated-20150911.patch, HDFS-7285-initial-PoC.patch, 
> HDFS-7285-merge-consolidated-01.patch, 
> HDFS-7285-merge-consolidated-trunk-01.patch, 
> HDFS-7285-merge-consolidated.trunk.03.patch, 
> HDFS-7285-merge-consolidated.trunk.04.patch, 
> HDFS-EC-Merge-PoC-20150624.patch, HDFS-EC-merge-consolidated-01.patch, 
> HDFS-bistriped.patch, HDFSErasureCodingDesign-20141028.pdf, 
> HDFSErasureCodingDesign-20141217.pdf, HDFSErasureCodingDesign-20150204.pdf, 
> HDFSErasureCodingDesign-20150206.pdf, HDFSErasureCodingPhaseITestPlan.pdf, 
> HDFSErasureCodingSystemTestPlan-20150824.pdf, 
> HDFSErasureCodingSystemTestReport-20150826.pdf, fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without 
> sacrificing data reliability, compared to the existing HDFS 3-replica 
> approach. For example, if we use a 10+4 Reed-Solomon coding, we can tolerate 
> the loss of 4 blocks, with a storage overhead of only 40%. This makes EC a 
> quite attractive alternative for big data storage, particularly for cold 
> data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contributed packages in HDFS but was removed in Hadoop 2.0 for 
> maintenance reasons. Its drawbacks are: 1) it sits on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are not intended to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, making it self-contained and 
> independently maintained. This design lays the EC feature on top of the 
> storage type support and is designed to be compatible with existing HDFS 
> features like caching, snapshots, encryption, and high availability. This 
> design will also support different EC coding schemes, implementations and 
> policies for different deployment scenarios. By utilizing advanced libraries 
> (e.g. the Intel ISA-L library), an implementation can greatly improve the 
> performance of EC encoding/decoding and make the EC solution even more 
> attractive. We will post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180374#comment-15180374
 ] 

Chris Nauroth commented on HDFS-9239:
-

The test failures are unrelated.  The remaining style warnings are not worth 
addressing.  [~szetszwo], would you please take a look at patch v003 and my 
comments that go with it?  Thank you.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7866) Erasure coding: NameNode manages multiple erasure coding policies

2016-03-04 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180373#comment-15180373
 ] 

Zhe Zhang commented on HDFS-7866:
-

bq. We're using 11 bits to store the policy ID, which means it could exceed a 
byte (replication factor is also a short). What do you think?
Right, on the {{INodeFile}} level we can't save space by switching to {{byte}}. 
But by constraining the ID to be a byte we are saving memory at the block 
level. I don't think we will need to support more than 256 EC policies in the 
foreseeable future; so maybe set it to a byte now and consider changing to a 
short when there's a clear need? Pinging [~drankye] and [~andrew.wang] for more 
insights here.
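
For scale, the arithmetic behind the two widths (illustrative only, not code from the patch):

{code}
public class PolicyIdWidthSketch {
  public static void main(String[] args) {
    System.out.println(1 << 8);  // 256: distinct IDs an unsigned byte can name
    System.out.println(1 << 11); // 2048: distinct IDs the current 11-bit field can name
  }
}
{code}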

bq. I also would like to know your opinions about randomly choosing a policy in 
the tests.
Very interesting idea. I think we should do a follow-on {{test}} JIRA. And yes 
I think we are safe to switch back to RS-6-3 in the next rev.

bq. I think essentially it's hacky because we allow user to specify the 
replication factor when creating a file but then overwrite it to store the EC 
policy ID. Fixing this requires changing the API, which I think is unacceptable.
I don't think there's a compatibility issue here, because we are only changing 
the {{INodeFile}} constructor with the {{isStriped}} flag. I think we can just 
change the boolean to a byte, representing the EC policy ID. But given the size 
of the current patch I recommend we leave it as a follow-on.

> Erasure coding: NameNode manages multiple erasure coding policies
> -
>
> Key: HDFS-7866
> URL: https://issues.apache.org/jira/browse/HDFS-7866
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Kai Zheng
>Assignee: Rui Li
> Attachments: HDFS-7866-v1.patch, HDFS-7866-v2.patch, 
> HDFS-7866-v3.patch, HDFS-7866.10.patch, HDFS-7866.11.patch, 
> HDFS-7866.4.patch, HDFS-7866.5.patch, HDFS-7866.6.patch, HDFS-7866.7.patch, 
> HDFS-7866.8.patch, HDFS-7866.9.patch
>
>
> This is to extend NameNode to load, list and sync predefine EC schemas in 
> authorized and controlled approach. The provided facilities will be used to 
> implement DFSAdmin commands so admin can list available EC schemas, then 
> could choose some of them for target EC zones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9882) Add heartbeatsTotal in Datanode metrics

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180384#comment-15180384
 ] 

Arpit Agarwal commented on HDFS-9882:
-

Disk operations can be slow on any platform if the disk is loaded or bad. So I 
think it is a good idea to move those operations out of the heartbeat 
processing path, which is perf-sensitive. Would you consider filing a separate 
Jira to fix the {{checkBlock}} issue described by [~hualiu]?

Meanwhile we can also add this new metric. Can you rename it to something like 
{{HeartbeatTotalTime}} and describe it in Metrics.md? 

> Add heartbeatsTotal in Datanode metrics
> ---
>
> Key: HDFS-9882
> URL: https://issues.apache.org/jira/browse/HDFS-9882
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 2.7.2
>Reporter: Hua Liu
>Assignee: Hua Liu
>Priority: Minor
> Attachments: 
> 0001-HDFS-9882.Add-heartbeatsTotal-in-Datanode-metrics.patch, 
> 0002-HDFS-9882.Add-heartbeatsTotal-in-Datanode-metrics.patch
>
>
> Heartbeat latency only reflects the time spent generating reports and 
> sending them to the NN. When heartbeats are delayed due to processing 
> commands, this latency does not help investigation. I would like to propose 
> adding another metric counter to show the total time. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9126) namenode crash in fsimage download/transfer

2016-03-04 Thread Brian P Spallholtz (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180393#comment-15180393
 ] 

Brian P Spallholtz commented on HDFS-9126:
--

Any resolution to this? 

> namenode crash in fsimage download/transfer
> ---
>
> Key: HDFS-9126
> URL: https://issues.apache.org/jira/browse/HDFS-9126
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
> Environment: OS: CentOS 6.5 (Final)
> Apache Hadoop: 2.6.0
> NameNode HA with 5 JournalNodes
>Reporter: zengyongping
>Priority: Critical
>
> In our production Hadoop cluster, when the active namenode begins to 
> download/transfer the fsimage from the standby namenode, the zkfc monitoring 
> the health of the NameNode sometimes hits a socket timeout, judges the active 
> namenode to be in the SERVICE_NOT_RESPONDING state, triggers an HA failover, 
> and fences the old active namenode.
> zkfc logs:
> 2015-09-24 11:44:44,739 WARN org.apache.hadoop.ha.HealthMonitor: 
> Transport-level exception trying to monitor health of NameNode at 
> hostname1/192.168.10.11:8020: Call From hostname1/192.168.10.11 to 
> hostname1:8020 failed on socket timeout exception: 
> java.net.SocketTimeoutException: 45000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/192.168.10.11:22614 remote=hostname1/192.168.10.11:8020]; For more 
> details see:  http://wiki.apache.org/hadoop/SocketTimeout
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.HealthMonitor: Entering 
> state SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: Local 
> service NameNode at hostname1/192.168.10.11:8020 entered state: 
> SERVICE_NOT_RESPONDING
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ZKFailoverController: 
> Quitting master election for NameNode at hostname1/192.168.10.11:8020 and 
> marking that fencing is necessary
> 2015-09-24 11:44:44,740 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Yielding from election
> 2015-09-24 11:44:44,761 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x54d81348fe503e3 closed
> 2015-09-24 11:44:44,761 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x54d81348fe503e3
> 2015-09-24 11:44:44,764 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> namenode logs:
> 2015-09-24 11:43:34,074 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 
> 192.168.10.12
> 2015-09-24 11:43:34,074 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
> 2015-09-24 11:43:34,075 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 
> 2317430129
> 2015-09-24 11:43:34,253 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 
> 272988 Total time for transactions(ms): 5502 Number of transactions batched 
> in Syncs: 146274 Number of syncs: 32375 SyncTimes(ms): 274465 319599
> 2015-09-24 11:43:46,005 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: 
> Rescanning after 3 milliseconds
> 2015-09-24 11:44:21,054 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReplicationMonitor timed out blk_1185804191_112164210
> 2015-09-24 11:44:36,076 INFO 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits 
> file 
> /software/data/hadoop-data/hdfs/namenode/current/edits_inprogress_02317430129
>  -> 
> /software/data/hadoop-data/hdfs/namenode/current/edits_02317430129-02317703116
> 2015-09-24 11:44:36,077 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 
> 2317703117
> 2015-09-24 11:45:38,008 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 
> Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 
> Number of syncs: 0 SyncTimes(ms): 0 61585
> 2015-09-24 11:45:38,009 INFO 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 222.88s 
> at 63510.29 KB/s
> 2015-09-24 11:45:38,009 INFO 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file 
> fsimage.ckpt_02317430128 size 14495092105 bytes.
> 2015-09-24 11:45:38,416 WARN 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 
> 192.168.10.13:8485 failed to write txns 2317703117-2317703117. Will try to 
> write to this JN again after the next log roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 44 is 
> less than the last promised epoch 45
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
> at 
> 

[jira] [Updated] (HDFS-9891) Ozone: Add container transport client

2016-03-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9891:

Status: Open  (was: Patch Available)

> Ozone: Add container transport client
> -
>
> Key: HDFS-9891
> URL: https://issues.apache.org/jira/browse/HDFS-9891
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Attachments: HDFS-9891-HDFS-7240.001.patch
>
>
> Add the ozone container transport client -- that makes it easy to talk to the server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9891) Ozone: Add container transport client

2016-03-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9891:

Status: Patch Available  (was: Open)

> Ozone: Add container transport client
> -
>
> Key: HDFS-9891
> URL: https://issues.apache.org/jira/browse/HDFS-9891
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Attachments: HDFS-9891-HDFS-7240.001.patch
>
>
> Add the ozone container transport client -- that makes it easy to talk to the server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8356) Document missing properties in hdfs-default.xml

2016-03-04 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated HDFS-8356:
-
Attachment: HDFS-8356.006.patch

- Implement all of [~ajisakaa]'s feedback
- Add dfs.blockreport.incremental.intervalMsec property from HDFS-9710
- Add dfs.namenode.edits.asynclogging from HDFS-7964
- Remove waivers for dfs.client.failover.* properties due to HDFS-8084 making 
them accessible

> Document missing properties in hdfs-default.xml
> ---
>
> Key: HDFS-8356
> URL: https://issues.apache.org/jira/browse/HDFS-8356
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.7.0
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability, test
> Attachments: HDFS-8356.001.patch, HDFS-8356.002.patch, 
> HDFS-8356.003.patch, HDFS-8356.004.patch, HDFS-8356.005.patch, 
> HDFS-8356.006.patch
>
>
> The following properties are currently not defined in hdfs-default.xml. These 
> properties should either be
> A) documented in hdfs-default.xml OR
> B) listed as an exception (with comments, e.g. for internal use) in the 
> TestHdfsConfigFields unit test
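
A hedged sketch of option B above (the field and property names below are assumed for illustration; check the actual {{TestHdfsConfigFields}} for the exact shape):

{code}
import java.util.HashSet;
import java.util.Set;

public class ConfigFieldsSketch {
  // properties that are intentionally undocumented, e.g. internal-use only
  Set<String> configurationPropsToSkipCompare = new HashSet<String>();

  void initializeExclusions() {
    // keep the rationale next to each waived entry
    configurationPropsToSkipCompare.add("dfs.example.internal.property");
  }
}
{code}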



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8356) Document missing properties in hdfs-default.xml

2016-03-04 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180403#comment-15180403
 ] 

Ray Chiang commented on HDFS-8356:
--

Thanks for all the feedback so far [~ajisakaa].  I've caught up on all the 
issues you've found so far, plus a few more that have since gotten added to 
trunk.

> Document missing properties in hdfs-default.xml
> ---
>
> Key: HDFS-8356
> URL: https://issues.apache.org/jira/browse/HDFS-8356
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.7.0
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability, test
> Attachments: HDFS-8356.001.patch, HDFS-8356.002.patch, 
> HDFS-8356.003.patch, HDFS-8356.004.patch, HDFS-8356.005.patch, 
> HDFS-8356.006.patch
>
>
> The following properties are currently not defined in hdfs-default.xml. These 
> properties should either be
> A) documented in hdfs-default.xml OR
> B) listed as an exception (with comments, e.g. for internal use) in the 
> TestHdfsConfigFields unit test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-9907:
---

 Summary: Exclude Ozone protobuf-generated classes from Findbugs 
analysis.
 Key: HDFS-9907
 URL: https://issues.apache.org/jira/browse/HDFS-9907
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial


Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
protobuf-generated classes with warnings.  These warnings aren't relevant, 
because we don't directly control the code generated by protoc.  We can exclude 
these classes in the Findbugs configuration, just like we do for other existing 
protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9907:

Status: Patch Available  (was: Open)

> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9907:

Attachment: HDFS-9907-HDFS-7240.001.patch

Attaching patch v001.  I expect we'll see a drop in Findbugs warnings when 
pre-commit runs this.  [~anu], could you please review?

> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9907:

Component/s: build

> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180408#comment-15180408
 ] 

Anu Engineer commented on HDFS-9907:


[~cnauroth] Thanks for the patch. +1, pending jenkins.

> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Anu Engineer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer updated HDFS-9907:
---
Issue Type: Sub-task  (was: Improvement)
Parent: HDFS-7240

> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8786) Erasure coding: DataNode should transfer striped blocks before being decommissioned

2016-03-04 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180415#comment-15180415
 ] 

Jing Zhao commented on HDFS-8786:
-

I think we can use this jira to fix ErasureCodingWork first. The sorting logic 
can be done as a follow-on. Please let me know if you have further questions, 
[~rakeshr]. 

> Erasure coding: DataNode should transfer striped blocks before being 
> decommissioned
> ---
>
> Key: HDFS-8786
> URL: https://issues.apache.org/jira/browse/HDFS-8786
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Zhe Zhang
>Assignee: Rakesh R
> Attachments: HDFS-8786-001.patch, HDFS-8786-002.patch, 
> HDFS-8786-003.patch, HDFS-8786-draft.patch
>
>
> Per [discussion | 
> https://issues.apache.org/jira/browse/HDFS-8697?focusedCommentId=14609004&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14609004]
>  under HDFS-8697, it's too expensive to reconstruct block groups for decomm 
> purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9873) Ozone: Add container transport server

2016-03-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180429#comment-15180429
 ] 

Chris Nauroth commented on HDFS-9873:
-

Hi [~anu].  Nice work!

Do you think {{XceiverServer#start}} can be simplified so that it doesn't need 
to submit a background thread to bootstrap Netty?  With this background thread, 
server startup happens asynchronously with respect to callers of {{start}}.  
That might cause some tricky non-deterministic behavior, such as for tests that 
want a guarantee that the server is fully up and running before sending test 
requests.  Let me know your thoughts.
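
For illustration, a hedged sketch of the synchronous alternative (field names assumed; this is not the patch code):

{code}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class SyncStartSketch {
  private final EventLoopGroup bossGroup = new NioEventLoopGroup();
  private final EventLoopGroup workerGroup = new NioEventLoopGroup();
  private Channel channel;

  public void start(int port) throws InterruptedException {
    ServerBootstrap b = new ServerBootstrap();
    b.group(bossGroup, workerGroup)
        .channel(NioServerSocketChannel.class)
        .childHandler(new ChannelInitializer<SocketChannel>() {
          @Override
          protected void initChannel(SocketChannel ch) {
            // pipeline setup elided
          }
        });
    // block until the bind completes, so callers of start() see a live port
    channel = b.bind(port).sync().channel();
  }
}
{code}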

Would you please check the indentation in {{TestContainerServer}}?

Thank you.

> Ozone: Add container transport server
> -
>
> Key: HDFS-9873
> URL: https://issues.apache.org/jira/browse/HDFS-9873
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ozone
>Affects Versions: HDFS-7240
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Fix For: HDFS-7240
>
> Attachments: HDFS-9873-HDFS-7240.001.patch, 
> HDFS-9873-HDFS-7240.002.patch
>
>
> Add server part of the container transport



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180433#comment-15180433
 ] 

Hadoop QA commented on HDFS-9907:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
1s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
36s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 1m 16s {color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791520/HDFS-9907-HDFS-7240.001.patch
 |
| JIRA Issue | HDFS-9907 |
| Optional Tests |  asflicense  xml  |
| uname | Linux 50581ad82679 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | HDFS-7240 / 1244d8f |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/14720/console |
| Powered by | Apache Yetus 0.3.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Anu Engineer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer updated HDFS-9907:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

I have committed this patch. Thanks for fixing this [~cnauroth]

> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9907) Exclude Ozone protobuf-generated classes from Findbugs analysis.

2016-03-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180446#comment-15180446
 ] 

Chris Nauroth commented on HDFS-9907:
-

[~anu], thank you for committing!

> Exclude Ozone protobuf-generated classes from Findbugs analysis.
> 
>
> Key: HDFS-9907
> URL: https://issues.apache.org/jira/browse/HDFS-9907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: build
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-9907-HDFS-7240.001.patch
>
>
> Pre-commit runs on the HDFS-7240 feature branch are currently flagging Ozone 
> protobuf-generated classes with warnings.  These warnings aren't relevant, 
> because we don't directly control the code generated by protoc.  We can 
> exclude these classes in the Findbugs configuration, just like we do for 
> other existing protobuf-generated classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9908) Datanode should tolerate disk failure during NN handshake

2016-03-04 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-9908:
-

 Summary: Datanode should tolerate disk failure during NN handshake
 Key: HDFS-9908
 URL: https://issues.apache.org/jira/browse/HDFS-9908
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.5.0
 Environment: CDH5.3.3
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
Priority: Critical


DN may treat a disk failure exception as an NN handshake exception.

During the NN handshake, the DN initializes block pools. It creates a lock file 
per disk and then scans the volumes. However, if the scan throws an exception 
due to a disk failure, the DN concludes that the NN is inconsistent with the 
local storage. As a result, it attempts to reconnect to the NN.

However, at this point the DN has not deleted its lock files on the disks. When 
it reconnects to the NN, it thinks the same disks are already in use, so the 
handshake fails again because no disk can be used (due to locking), and this 
repeats. It happens even if the DN has multiple disks and only one of them 
fails: the DN cannot connect to the NN despite having just one failing disk. 
Note that it is possible to successfully create a lock file on a disk and then 
hit an error scanning that disk.

We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, and 
we still see the same code in 3.0.0 trunk). The root cause is that the DN 
treats an internal error (a single disk failure) as an external error (NN 
handshake) and we should fix it.

{code:title=DataNode.java}
/**
   * One of the Block Pools has successfully connected to its NN.
   * This initializes the local storage for that block pool,
   * checks consistency of the NN's cluster ID, etc.
   * 
   * If this is the first block pool to register, this also initializes
   * the datanode-scoped storage.
   * 
   * @param bpos Block pool offer service
   * @throws IOException if the NN is inconsistent with the local storage.
   */
  void initBlockPool(BPOfferService bpos) throws IOException {
NamespaceInfo nsInfo = bpos.getNamespaceInfo();
if (nsInfo == null) {
  throw new IOException("NamespaceInfo not found: Block pool " + bpos
  + " should have retrieved namespace info before initBlockPool.");
}

setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

// Register the new block pool with the BP manager.
blockPoolManager.addBlockPool(bpos);

// In the case that this is the first block pool to connect, initialize
// the dataset, block scanners, etc.
initStorage(nsInfo);

// Exclude failed disks before initializing the block pools to avoid startup
// failures.
checkDiskError();

data.addBlockPool(nsInfo.getBlockPoolID(), conf);  <- this line throws 
disk error exception
blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
initDirectoryScanner(conf);
  }
{code}

{{FsVolumeList#addBlockPool}} is the source of exception.
{code:title=FsVolumeList.java}
  void addBlockPool(final String bpid, final Configuration conf) throws 
IOException {
long totalStartTime = Time.monotonicNow();

final List<IOException> exceptions = Collections.synchronizedList(
new ArrayList<IOException>());
List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
for (final FsVolumeImpl v : volumes) {
  Thread t = new Thread() {
public void run() {
  try (FsVolumeReference ref = v.obtainReference()) {
FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
" on volume " + v + "...");
long startTime = Time.monotonicNow();
v.addBlockPool(bpid, conf);
long timeTaken = Time.monotonicNow() - startTime;
FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
" on " + v + ": " + timeTaken + "ms");
  } catch (ClosedChannelException e) {
// ignore.
  } catch (IOException ioe) {
FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
". Will throw later.", ioe);
exceptions.add(ioe);
  }
}
  };
  blockPoolAddingThreads.add(t);
  t.start();
}
for (Thread t : blockPoolAddingThreads) {
  try {
t.join();
  } catch (InterruptedException ie) {
throw new IOException(ie);
  }
}
if (!exceptions.isEmpty()) {
  throw exceptions.get(0); <- here's the origin of the exception
}

long totalTimeTaken = Time.monotonicNow() - totalStartTime;
FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
bpid + ": " + totalTimeTaken + "ms");
  }
{code}
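
One possible direction, as a hedged sketch (not a committed fix; {{removeVolumeFromDataset}} is a hypothetical helper): replace the unconditional rethrow at the end of {{addBlockPool}} so a partial failure only drops the failed volumes.

{code}
// sketch: tail of addBlockPool, tolerating partial failure
if (!exceptions.isEmpty()) {
  if (exceptions.size() == volumes.size()) {
    throw exceptions.get(0); // every disk failed; nothing left to serve
  }
  for (IOException ioe : exceptions) {
    FsDatasetImpl.LOG.warn("Removing volume that failed to add block pool "
        + bpid + "; continuing with the remaining volumes.", ioe);
  }
  removeVolumeFromDataset(exceptions); // hypothetical: detach the failed volumes
}
{code}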



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9908) Datanode should tolerate disk failure during NN handshake

2016-03-04 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-9908:
--
Description: 
DN may treat a disk failure exception as an NN handshake exception, and this 
can prevent a DN from joining a cluster even if most of its disks are healthy.

During the NN handshake, the DN initializes block pools. It creates a lock file 
per disk and then scans the volumes. However, if the scan throws an exception 
due to a disk failure, the DN concludes that the NN is inconsistent with the 
local storage (see {{DataNode#initBlockPool}}). As a result, it attempts to 
reconnect to the NN.

However, at this point the DN has not deleted its lock files on the disks. When 
it reconnects to the NN, it thinks the same disks are already in use, so the 
handshake fails again because no disk can be used (due to locking), and this 
repeats. It happens even if the DN has multiple disks and only one of them 
fails: the DN cannot connect to the NN despite having just one failing disk. 
Note that it is possible to successfully create a lock file on a disk and then 
hit an error scanning that disk.

We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, and 
we still see the same bug in the 3.0.0 trunk branch). The root cause is that 
the DN treats an internal error (a single disk failure) as an external one (NN 
handshake failure) and we should fix it.

{code:title=DataNode.java}
/**
   * One of the Block Pools has successfully connected to its NN.
   * This initializes the local storage for that block pool,
   * checks consistency of the NN's cluster ID, etc.
   * 
   * If this is the first block pool to register, this also initializes
   * the datanode-scoped storage.
   * 
   * @param bpos Block pool offer service
   * @throws IOException if the NN is inconsistent with the local storage.
   */
  void initBlockPool(BPOfferService bpos) throws IOException {
NamespaceInfo nsInfo = bpos.getNamespaceInfo();
if (nsInfo == null) {
  throw new IOException("NamespaceInfo not found: Block pool " + bpos
  + " should have retrieved namespace info before initBlockPool.");
}

setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

// Register the new block pool with the BP manager.
blockPoolManager.addBlockPool(bpos);

// In the case that this is the first block pool to connect, initialize
// the dataset, block scanners, etc.
initStorage(nsInfo);

// Exclude failed disks before initializing the block pools to avoid startup
// failures.
checkDiskError();

data.addBlockPool(nsInfo.getBlockPoolID(), conf);  <- this line throws 
disk error exception
blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
initDirectoryScanner(conf);
  }
{code}

{{FsVolumeList#addBlockPool}} is the source of exception.
{code:title=FsVolumeList.java}
  void addBlockPool(final String bpid, final Configuration conf) throws 
IOException {
long totalStartTime = Time.monotonicNow();

final List<IOException> exceptions = Collections.synchronizedList(
new ArrayList<IOException>());
List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
for (final FsVolumeImpl v : volumes) {
  Thread t = new Thread() {
public void run() {
  try (FsVolumeReference ref = v.obtainReference()) {
FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
" on volume " + v + "...");
long startTime = Time.monotonicNow();
v.addBlockPool(bpid, conf);
long timeTaken = Time.monotonicNow() - startTime;
FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
" on " + v + ": " + timeTaken + "ms");
  } catch (ClosedChannelException e) {
// ignore.
  } catch (IOException ioe) {
FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
". Will throw later.", ioe);
exceptions.add(ioe);
  }
}
  };
  blockPoolAddingThreads.add(t);
  t.start();
}
for (Thread t : blockPoolAddingThreads) {
  try {
t.join();
  } catch (InterruptedException ie) {
throw new IOException(ie);
  }
}
if (!exceptions.isEmpty()) {
  throw exceptions.get(0); // <- here's the origin of the exception
}

long totalTimeTaken = Time.monotonicNow() - totalStartTime;
FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
bpid + ": " + totalTimeTaken + "ms");
  }
{code}
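
A minimal sketch of the kind of tolerance this issue argues for, reusing the 
names from the snippet above (this is only illustrative, not the committed 
patch): rethrow only when every volume failed to scan, and otherwise leave the 
bad disks to the normal failed-volume handling 
({{dfs.datanode.failed.volumes.tolerated}}).

{code:title=Sketch of tolerating partial scan failures}
// Illustrative replacement for the unconditional rethrow at the end of
// addBlockPool(); not the actual fix for this issue.
if (!exceptions.isEmpty()) {
  if (exceptions.size() == volumes.size()) {
    // Every volume failed to scan, so nothing is usable: fail fast.
    throw exceptions.get(0);
  }
  // Some volumes scanned fine: log the failures and continue, letting the
  // existing failed-volume machinery decide whether the DN should stay up.
  FsDatasetImpl.LOG.warn(exceptions.size() + " volume(s) failed to scan block "
      + "pool " + bpid + "; continuing with the remaining healthy volumes.",
      exceptions.get(0));
}
{code}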

  was:
DN may treat a disk failure exception as an NN handshake exception.

During NN handshake, DN initializes block pools. It will create a lock file 
per disk, and then scan the volumes. However, if the scanning throws exceptions 
due to disk failure, DN will think it's an exception because NN is inconsistent 
with the local storage. 

[jira] [Updated] (HDFS-9908) Datanode should tolerate disk failure during NN handshake

2016-03-04 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-9908:
--
Priority: Major  (was: Critical)

> Datanode should tolerate disk failure during NN handshake
> -
>
> Key: HDFS-9908
> URL: https://issues.apache.org/jira/browse/HDFS-9908
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.5.0
> Environment: CDH5.3.3
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>
> DN may treat a disk failure exception as an NN handshake exception, and this 
> can prevent a DN from joining a cluster even if most of its disks are healthy.
> During NN handshake, DN initializes block pools. It will create a lock file 
> per disk, and then scan the volumes. However, if the scanning throws 
> exceptions due to disk failure, DN will think it's an exception because NN is 
> inconsistent with the local storage (see {{DataNode#initBlockPool}}). As a 
> result, it will attempt to reconnect to NN again.
> However, at this point, DN has not deleted its lock files on the disks. If it 
> reconnects to NN again, it will think the same disks are already being used, 
> and it will fail the handshake again because none of the disks can be used 
> (due to locking), and so on repeatedly. This happens even if the DN has 
> multiple disks and only one of them fails: the DN will not be able to connect 
> to NN despite just one failing disk. Note that it is possible to successfully 
> create a lock file on a disk, and then hit an error scanning the disk.
> We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, 
> and we still see the same bug in 3.0.0 trunk branch). The root cause is that 
> DN treats an internal error (single disk failure) as an external one (NN 
> handshake failure) and we should fix it.
> {code:title=DataNode.java}
> /**
>* One of the Block Pools has successfully connected to its NN.
>* This initializes the local storage for that block pool,
>* checks consistency of the NN's cluster ID, etc.
>* 
>* If this is the first block pool to register, this also initializes
>* the datanode-scoped storage.
>* 
>* @param bpos Block pool offer service
>* @throws IOException if the NN is inconsistent with the local storage.
>*/
>   void initBlockPool(BPOfferService bpos) throws IOException {
> NamespaceInfo nsInfo = bpos.getNamespaceInfo();
> if (nsInfo == null) {
>   throw new IOException("NamespaceInfo not found: Block pool " + bpos
>   + " should have retrieved namespace info before initBlockPool.");
> }
> 
> setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());
> // Register the new block pool with the BP manager.
> blockPoolManager.addBlockPool(bpos);
> 
> // In the case that this is the first block pool to connect, initialize
> // the dataset, block scanners, etc.
> initStorage(nsInfo);
> // Exclude failed disks before initializing the block pools to avoid 
> startup
> // failures.
> checkDiskError();
> data.addBlockPool(nsInfo.getBlockPoolID(), conf);  // <- this line throws the disk error exception
> blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
> initDirectoryScanner(conf);
>   }
> {code}
> {{FsVolumeList#addBlockPool}} is the source of exception.
> {code:title=FsVolumeList.java}
>   void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
> long totalStartTime = Time.monotonicNow();
> 
> final List<IOException> exceptions = Collections.synchronizedList(
> new ArrayList<IOException>());
> List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
> for (final FsVolumeImpl v : volumes) {
>   Thread t = new Thread() {
> public void run() {
>   try (FsVolumeReference ref = v.obtainReference()) {
> FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
> " on volume " + v + "...");
> long startTime = Time.monotonicNow();
> v.addBlockPool(bpid, conf);
> long timeTaken = Time.monotonicNow() - startTime;
> FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
> " on " + v + ": " + timeTaken + "ms");
>   } catch (ClosedChannelException e) {
> // ignore.
>   } catch (IOException ioe) {
> FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
> ". Will throw later.", ioe);
> exceptions.add(ioe);
>   }
> }
>   };
>   blockPoolAddingThreads.add(t);
>   t.start();
> }
> for (Thread t : blockPoolAddingThreads) {
>   try {
> t.join();
>   } catch (InterruptedException ie) {
> throw new IOException(ie);
>   }
>

[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180528#comment-15180528
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9239:
---

+1 the new patch looks good.  Thanks for the update.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9908) Datanode should tolerate disk failure during NN handshake

2016-03-04 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180534#comment-15180534
 ] 

Wei-Chiu Chuang commented on HDFS-9908:
---

For completeness, here are the related DN logs:

*DN connects to NN:*
2016-02-18 02:20:37,949 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Opened IPC server at /10.107.162.126:50020
2016-02-18 02:20:38,034 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Refresh request received for nameservices: nameservice1
2016-02-18 02:20:38,067 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Starting BPOfferServices for nameservices: nameservice1
2016-02-18 02:20:38,076 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool  (Datanode Uuid unassigned) service to 
namenode1.weichiu.com/10.107.162.110:8022 starting to offer service
2016-02-18 02:20:38,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool  (Datanode Uuid unassigned) service to 
namenode2.weichiu.com/10.107.162.120:8022 starting to offer service
2016-02-18 02:20:38,085 INFO org.apache.hadoop.ipc.Server: IPC Server 
Responder: starting
2016-02-18 02:20:38,085 INFO org.apache.hadoop.ipc.Server: IPC Server listener 
on 50020: starting
2016-02-18 02:20:39,211 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: namenode1.weichiu.com/10.107.162.110:8022. Already tried 0 time(s); 
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1000 MILLISECONDS)

*Then DN does the handshake, gets the bpid from NN, and then analyzes storage:*

2016-02-18 02:20:53,512 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock 
on /data/1/dfs/dn/in_use.lock acquired by nodename 5...@namenode1.weichiu.com
2016-02-18 02:20:53,563 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Analyzing storage directories for bpid BP-1018136951-49.4.167.110-1403564146510
2016-02-18 02:20:53,563 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Locking is disabled
2016-02-18 02:20:53,606 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Restored 0 block files from trash.

All of the disks are analyzed successfully.

*But one of them failed to scan:*

2016-02-18 02:23:36,224 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught 
exception while scanning /data/8/dfs/dn/current. Will throw later.
ExitCodeException exitCode=1: du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir228/subdir11/blk_1088686909': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir228/subdir11/blk_1088686909_14954023.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir69/blk_1093551560': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir69/blk_1093551560_19818947.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir116/blk_1093563577': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir116/blk_1093563577_19830979.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir71/blk_1093552125': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir71/blk_1093551897': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir71/blk_1093551897_19819284.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir71/blk_1093552003_19819390.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir71/blk_1093552003': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir71/blk_1093552125_19819512.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir70/blk_1093551747': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir70/blk_1093551747_19819134.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir46/subdir68/blk_1093551249_19818632.meta': Input/output error
du: cannot access 
`/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/curren

[jira] [Commented] (HDFS-7648) Verify the datanode directory layout

2016-03-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180541#comment-15180541
 ] 

Colin Patrick McCabe commented on HDFS-7648:


Hi [~dwatzke], this question sounds like it should be asked on the hdfs-user 
list, not on JIRA, since it is about recovering from a specific admin mistake 
rather than a bug in the code.

> Verify the datanode directory layout
> 
>
> Key: HDFS-7648
> URL: https://issues.apache.org/jira/browse/HDFS-7648
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Rakesh R
> Attachments: HDFS-7648-3.patch, HDFS-7648-4.patch, HDFS-7648-5.patch, 
> HDFS-7648.patch, HDFS-7648.patch
>
>
> HDFS-6482 changed datanode layout to use block ID to determine the directory 
> to store the block.  We should have some mechanism to verify it.  Either 
> DirectoryScanner or block report generation could do the check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9478) Reason for failing ipc.FairCallQueue construction should be thrown

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180542#comment-15180542
 ] 

Arpit Agarwal commented on HDFS-9478:
-

bq. guess we better refactor 
org.apache.hadoop.ipc.CallQueueManager.createCallQueueInstance and check for 
InvocationTargetException
Thanks [~ajithshetty]. Yes that sounds right. That would be a more general 
solution to this problem. Also cc [~chrilisf].
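
A minimal sketch of that refactor (the constructor lookup below is 
illustrative; the real {{CallQueueManager}} code is more involved):

{code:title=Sketch of unwrapping InvocationTargetException}
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

final class CallQueueFactory {
  // Hypothetical helper: construct the configured queue class and surface
  // the real failure instead of a bare "could not be constructed".
  static <T> T create(Class<T> clazz, int priorityLevels) {
    try {
      Constructor<T> ctor = clazz.getDeclaredConstructor(int.class);
      return ctor.newInstance(priorityLevels);
    } catch (InvocationTargetException e) {
      // getCause() carries the reason, e.g. weights not matching the
      // number of configured queues.
      throw new RuntimeException(clazz.getName()
          + " could not be constructed: " + e.getCause(), e.getCause());
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(clazz.getName()
          + " could not be constructed.", e);
    }
  }
}
{code}

With that shape, the stack trace below would end with the underlying cause 
(e.g. an IllegalArgumentException about the weights) rather than the bare 
RuntimeException.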

> Reason for failing ipc.FairCallQueue construction should be thrown
> -
>
> Key: HDFS-9478
> URL: https://issues.apache.org/jira/browse/HDFS-9478
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Archana T
>Assignee: Ajith S
>Priority: Minor
> Attachments: HDFS-9478.patch
>
>
> When FairCallQueue construction fails, the NN fails to start, throwing a 
> RuntimeException without surfacing any reason why it failed.
> 2015-11-30 17:45:26,661 INFO org.apache.hadoop.ipc.FairCallQueue: 
> FairCallQueue is in use with 4 queues.
> 2015-11-30 17:45:26,665 DEBUG org.apache.hadoop.metrics2.util.MBeans: 
> Registered Hadoop:service=ipc.65110,name=DecayRpcScheduler
> 2015-11-30 17:45:26,666 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.lang.RuntimeException: org.apache.hadoop.ipc.FairCallQueue could not be 
> constructed.
> at 
> org.apache.hadoop.ipc.CallQueueManager.createCallQueueInstance(CallQueueManager.java:96)
> at org.apache.hadoop.ipc.CallQueueManager.<init>(CallQueueManager.java:55)
> at org.apache.hadoop.ipc.Server.<init>(Server.java:2241)
> at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:942)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:534)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509)
> at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:784)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.<init>(NameNodeRpcServer.java:346)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:750)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:687)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:889)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:872)
> Example: the reason for the above failure could have been:
> 1. the weights were not equal to the number of queues configured.
> 2. decay-scheduler.thresholds not in sync with the number of queues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9908) Datanode should tolerate disk scan failure during NN handshake

2016-03-04 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-9908:
--
Description: 
DN may treat a disk scan failure exception as an NN handshake exception, and 
this can prevent a DN from joining a cluster even if most of its disks are 
healthy.

During NN handshake, DN initializes block pools. It will create a lock file 
per disk, and then scan the volumes. However, if the scanning throws exceptions 
due to disk failure, DN will think it's an exception because NN is inconsistent 
with the local storage (see {{DataNode#initBlockPool}}). As a result, it will 
attempt to reconnect to NN again.

However, at this point, DN has not deleted its lock files on the disks. If it 
reconnects to NN again, it will think the same disks are already being used, 
and it will fail the handshake again because none of the disks can be used (due 
to locking), and so on repeatedly. This happens even if the DN has multiple 
disks and only one of them fails: the DN will not be able to connect to NN 
despite just one failing disk. Note that it is possible to successfully create 
a lock file on a disk, and then hit an error scanning the disk.

We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, and 
we still see the same bug in 3.0.0 trunk branch). The root cause is that DN 
treats an internal error (single disk failure) as an external one (NN handshake 
failure) and we should fix it.

{code:title=DataNode.java}
/**
   * One of the Block Pools has successfully connected to its NN.
   * This initializes the local storage for that block pool,
   * checks consistency of the NN's cluster ID, etc.
   * 
   * If this is the first block pool to register, this also initializes
   * the datanode-scoped storage.
   * 
   * @param bpos Block pool offer service
   * @throws IOException if the NN is inconsistent with the local storage.
   */
  void initBlockPool(BPOfferService bpos) throws IOException {
NamespaceInfo nsInfo = bpos.getNamespaceInfo();
if (nsInfo == null) {
  throw new IOException("NamespaceInfo not found: Block pool " + bpos
  + " should have retrieved namespace info before initBlockPool.");
}

setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

// Register the new block pool with the BP manager.
blockPoolManager.addBlockPool(bpos);

// In the case that this is the first block pool to connect, initialize
// the dataset, block scanners, etc.
initStorage(nsInfo);

// Exclude failed disks before initializing the block pools to avoid startup
// failures.
checkDiskError();

data.addBlockPool(nsInfo.getBlockPoolID(), conf);  // <- this line throws the disk error exception
blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
initDirectoryScanner(conf);
  }
{code}

{{FsVolumeList#addBlockPool}} is the source of exception.
{code:title=FsVolumeList.java}
  void addBlockPool(final String bpid, final Configuration conf) throws 
IOException {
long totalStartTime = Time.monotonicNow();

final List<IOException> exceptions = Collections.synchronizedList(
new ArrayList<IOException>());
List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
for (final FsVolumeImpl v : volumes) {
  Thread t = new Thread() {
public void run() {
  try (FsVolumeReference ref = v.obtainReference()) {
FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
" on volume " + v + "...");
long startTime = Time.monotonicNow();
v.addBlockPool(bpid, conf);
long timeTaken = Time.monotonicNow() - startTime;
FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
" on " + v + ": " + timeTaken + "ms");
  } catch (ClosedChannelException e) {
// ignore.
  } catch (IOException ioe) {
FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
". Will throw later.", ioe);
exceptions.add(ioe);
  }
}
  };
  blockPoolAddingThreads.add(t);
  t.start();
}
for (Thread t : blockPoolAddingThreads) {
  try {
t.join();
  } catch (InterruptedException ie) {
throw new IOException(ie);
  }
}
if (!exceptions.isEmpty()) {
  throw exceptions.get(0); // <- here's the origin of the exception
}

long totalTimeTaken = Time.monotonicNow() - totalStartTime;
FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
bpid + ": " + totalTimeTaken + "ms");
  }
{code}

  was:
DN may treat a disk failure exception as an NN handshake exception, and this 
can prevent a DN from joining a cluster even if most of its disks are healthy.

During NN handshake, DN initializes block pools. It will create a lock file 
per disk, and then scan the volumes. However, if the scanning throws exceptions 
due to disk f

[jira] [Updated] (HDFS-9908) Datanode should tolerate disk scan failure during NN handshake

2016-03-04 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-9908:
--
Summary: Datanode should tolerate disk scan failure during NN handshake  
(was: Datanode should tolerate disk failure during NN handshake)

> Datanode should tolerate disk scan failure during NN handshake
> --
>
> Key: HDFS-9908
> URL: https://issues.apache.org/jira/browse/HDFS-9908
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.5.0
> Environment: CDH5.3.3
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>
> DN may treat a disk failure exception as an NN handshake exception, and this 
> can prevent a DN from joining a cluster even if most of its disks are healthy.
> During NN handshake, DN initializes block pools. It will create a lock file 
> per disk, and then scan the volumes. However, if the scanning throws 
> exceptions due to disk failure, DN will think it's an exception because NN is 
> inconsistent with the local storage (see {{DataNode#initBlockPool}}). As a 
> result, it will attempt to reconnect to NN again.
> However, at this point, DN has not deleted its lock files on the disks. If it 
> reconnects to NN again, it will think the same disks are already being used, 
> and it will fail the handshake again because none of the disks can be used 
> (due to locking), and so on repeatedly. This happens even if the DN has 
> multiple disks and only one of them fails: the DN will not be able to connect 
> to NN despite just one failing disk. Note that it is possible to successfully 
> create a lock file on a disk, and then hit an error scanning the disk.
> We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, 
> and we still see the same bug in 3.0.0 trunk branch). The root cause is that 
> DN treats an internal error (single disk failure) as an external one (NN 
> handshake failure) and we should fix it.
> {code:title=DataNode.java}
> /**
>* One of the Block Pools has successfully connected to its NN.
>* This initializes the local storage for that block pool,
>* checks consistency of the NN's cluster ID, etc.
>* 
>* If this is the first block pool to register, this also initializes
>* the datanode-scoped storage.
>* 
>* @param bpos Block pool offer service
>* @throws IOException if the NN is inconsistent with the local storage.
>*/
>   void initBlockPool(BPOfferService bpos) throws IOException {
> NamespaceInfo nsInfo = bpos.getNamespaceInfo();
> if (nsInfo == null) {
>   throw new IOException("NamespaceInfo not found: Block pool " + bpos
>   + " should have retrieved namespace info before initBlockPool.");
> }
> 
> setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());
> // Register the new block pool with the BP manager.
> blockPoolManager.addBlockPool(bpos);
> 
> // In the case that this is the first block pool to connect, initialize
> // the dataset, block scanners, etc.
> initStorage(nsInfo);
> // Exclude failed disks before initializing the block pools to avoid 
> startup
> // failures.
> checkDiskError();
> data.addBlockPool(nsInfo.getBlockPoolID(), conf);  // <- this line throws the disk error exception
> blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
> initDirectoryScanner(conf);
>   }
> {code}
> {{FsVolumeList#addBlockPool}} is the source of exception.
> {code:title=FsVolumeList.java}
>   void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
> long totalStartTime = Time.monotonicNow();
> 
> final List<IOException> exceptions = Collections.synchronizedList(
> new ArrayList<IOException>());
> List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
> for (final FsVolumeImpl v : volumes) {
>   Thread t = new Thread() {
> public void run() {
>   try (FsVolumeReference ref = v.obtainReference()) {
> FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
> " on volume " + v + "...");
> long startTime = Time.monotonicNow();
> v.addBlockPool(bpid, conf);
> long timeTaken = Time.monotonicNow() - startTime;
> FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
> " on " + v + ": " + timeTaken + "ms");
>   } catch (ClosedChannelException e) {
> // ignore.
>   } catch (IOException ioe) {
> FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
> ". Will throw later.", ioe);
> exceptions.add(ioe);
>   }
> }
>   };
>   blockPoolAddingThreads.add(t);
>   t.start();
> }
> for (Thread t : blockPoolAddingThreads) {
>   tr

[jira] [Comment Edited] (HDFS-9889) Update balancer/mover document about HDFS-6133 feature

2016-03-04 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180569#comment-15180569
 ] 

Yongjun Zhang edited comment on HDFS-9889 at 3/4/16 9:33 PM:
-

Committed to trunk, branch-2, and branch-2.8.

Thanks [~andrew.wang]  for the review!



was (Author: yzhangal):
Committed to trunk, branch-2, and branch-2.8.

Thanks [~awang] for the review.


> Update balancer/mover document about HDFS-6133 feature
> --
>
> Key: HDFS-9889
> URL: https://issues.apache.org/jira/browse/HDFS-9889
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Fix For: 2.8.0
>
> Attachments: HDFS-9889.001.patch, HDFS-9889.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9889) Update balancer/mover document about HDFS-6133 feature

2016-03-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180571#comment-15180571
 ] 

Hudson commented on HDFS-9889:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9425 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9425/])
HDFS-9889. Update balancer/mover document about HDFS-6133 feature. (yzhang: rev 
8e08861a14cb5b6adce338543d7da08e9926ad46)
* hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md


> Update balancer/mover document about HDFS-6133 feature
> --
>
> Key: HDFS-9889
> URL: https://issues.apache.org/jira/browse/HDFS-9889
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Fix For: 2.8.0
>
> Attachments: HDFS-9889.001.patch, HDFS-9889.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9889) Update balancer/mover document about HDFS-6133 feature

2016-03-04 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-9889:

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk, branch-2, and branch-2.8.

Thanks [~awang] for the review.


> Update balancer/mover document about HDFS-6133 feature
> --
>
> Key: HDFS-9889
> URL: https://issues.apache.org/jira/browse/HDFS-9889
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Minor
>  Labels: supportability
> Fix For: 2.8.0
>
> Attachments: HDFS-9889.001.patch, HDFS-9889.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Add a feature for replica pinning so that a pinned replica will not be moved by Balancer/Mover.

2016-03-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180570#comment-15180570
 ] 

Hudson commented on HDFS-6133:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9425 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9425/])
HDFS-9889. Update balancer/mover document about HDFS-6133 feature. (yzhang: rev 
8e08861a14cb5b6adce338543d7da08e9926ad46)
* hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md


> Add a feature for replica pinning so that a pinned replica will not be moved 
> by Balancer/Mover.
> ---
>
> Key: HDFS-6133
> URL: https://issues.apache.org/jira/browse/HDFS-6133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, datanode
>Reporter: zhaoyunjiong
>Assignee: zhaoyunjiong
> Fix For: 2.7.0
>
> Attachments: HDFS-6133-1.patch, HDFS-6133-10.patch, 
> HDFS-6133-11.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133-4.patch, 
> HDFS-6133-5.patch, HDFS-6133-6.patch, HDFS-6133-7.patch, HDFS-6133-8.patch, 
> HDFS-6133-9.patch, HDFS-6133.patch
>
>
> Currently, running the Balancer will destroy the RegionServer's data locality.
> If getBlocks could exclude blocks belonging to files with a specific path 
> prefix, like "/hbase", then we could run the Balancer without destroying the 
> RegionServer's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180572#comment-15180572
 ] 

Haohui Mai commented on HDFS-9906:
--

The information is useful for debugging, but I agree that we don't need to 
print all of it. We could consider rate-limiting these logs, for example 
printing at most 5 log messages per minute.
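
A minimal sketch of that kind of limiter (the class below is illustrative, not 
an existing Hadoop utility):

{code:title=Sketch of a rate-limited logger}
import org.apache.commons.logging.Log;
import org.apache.hadoop.util.Time;

// Hypothetical wrapper: forward at most maxPerWindow messages per window,
// silently dropping the rest.
class RateLimitedLog {
  private final Log delegate;
  private final long windowMs;
  private final int maxPerWindow;
  private long windowStart;
  private int emitted;

  RateLimitedLog(Log delegate, long windowMs, int maxPerWindow) {
    this.delegate = delegate;
    this.windowMs = windowMs;
    this.maxPerWindow = maxPerWindow;
  }

  synchronized void warn(String msg) {
    long now = Time.monotonicNow();
    if (now - windowStart >= windowMs) {
      windowStart = now;  // start a new window
      emitted = 0;
    }
    if (emitted++ < maxPerWindow) {
      delegate.warn(msg);
    }
  }
}
{code}

Wrapping the BlockStateChange logger with something like 
{{new RateLimitedLog(blockLog, 60000L, 5)}} would keep the first few 
occurrences visible while dropping the flood.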

> Remove spammy log spew when a datanode is restarted
> ---
>
> Key: HDFS-9906
> URL: https://issues.apache.org/jira/browse/HDFS-9906
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: Elliott Clark
>
> {code}
> WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock 
> request received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 
> 268435456
> {code}
> This happens way too much to add any useful information. We should either 
> move this to a different log level or only warn once per machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9817) Use SLF4J in new classes

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180582#comment-15180582
 ] 

Arpit Agarwal commented on HDFS-9817:
-

Thanks for addressing this [~anu]. One minor comment.

When calling LOG.debug with SLF4J we prefer to use the curly brace {{\{\}}} 
notation to pass arguments. This avoids the cost of string formatting when 
debug logging is disabled, which is the normal configuration.

So this,
{code}
LOG.debug("Cluster URI : " + clusterURI);
{code}

Would be written as
{code}
LOG.debug("Cluster URI : {}", clusterURI);
{code}

+1 otherwise.

> Use SLF4J in new classes
> 
>
> Key: HDFS-9817
> URL: https://issues.apache.org/jira/browse/HDFS-9817
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: logging
>Affects Versions: HDFS-1312
>Reporter: Arpit Agarwal
>Assignee: Anu Engineer
> Attachments: HDFS-9817-HDFS-1312.001.patch
>
>
> We are trying to use SLF4J for new classes as far as possible so let's change 
> all the newly added classes to use SLF4J instead of depending on Log4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9909) Can't read file after hdfs restart

2016-03-04 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created HDFS-9909:
-

 Summary: Can't read file after hdfs restart
 Key: HDFS-9909
 URL: https://issues.apache.org/jira/browse/HDFS-9909
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 2.7.1
Reporter: Bogdan Raducanu


If HDFS is restarted while a file is open for writing then new clients can't 
read that file until the hard lease limit expires and block recovery starts.

Scenario:
1. write to file, call hflush
2. without closing the file, restart hdfs 
3. after hdfs is back up, try to open the file for reading from a new client

Repro attached.

Thoughts:
* As far as I can tell this happens because the last block is RWR and 
getReplicaVisibleLength returns -1 for it. The recovery starts after the hard 
lease limit expires (so the file is readable only after 1 hour).
* one can call recoverLease, which will start the lease recovery sooner, BUT 
how can one know when to call it? The exception thrown is an IOException, which 
can happen for other reasons.

I think a reasonable solution would be to return a specialized exception 
(similar to AlreadyBeingCreatedException when trying to write to an open file).
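
For reference, a sketch of the recoverLease workaround mentioned above (the 
polling interval is arbitrary, and this still doesn't answer the question of 
when to call it):

{code:title=recoverLease workaround sketch}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class LeaseRecoveryWorkaround {
  // Force lease recovery so a new reader doesn't wait for the hard limit.
  // DistributedFileSystem#recoverLease returns true once the file is closed.
  static void forceRecovery(DistributedFileSystem dfs, Path file)
      throws Exception {
    while (!dfs.recoverLease(file)) {
      Thread.sleep(1000); // poll until the NN finishes block recovery
    }
  }
}
{code}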



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9909) Can't read file after hdfs restart

2016-03-04 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated HDFS-9909:
--
Attachment: Main.java

> Can't read file after hdfs restart
> --
>
> Key: HDFS-9909
> URL: https://issues.apache.org/jira/browse/HDFS-9909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.1
>Reporter: Bogdan Raducanu
> Attachments: Main.java
>
>
> If HDFS is restarted while a file is open for writing then new clients can't 
> read that file until the hard lease limit expires and block recovery starts.
> Scenario:
> 1. write to file, call hflush
> 2. without closing the file, restart hdfs 
> 3. after hdfs is back up, try to open the file for reading from a new client
> Repro attached.
> Thoughts:
> * As far as I can tell this happens because the last block is RWR and 
> getReplicaVisibleLength returns -1 for it. The recovery starts after the hard 
> lease limit expires (so the file is readable only after 1 hour).
> * one can call recoverLease, which will start the lease recovery sooner, BUT 
> how can one know when to call it? The exception thrown is an IOException, 
> which can happen for other reasons.
> I think a reasonable solution would be to return a specialized exception 
> (similar to AlreadyBeingCreatedException when trying to write to an open file).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9909) Can't read file after hdfs restart

2016-03-04 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated HDFS-9909:
--
Description: 
If HDFS is restarted while a file is open for writing then new clients can't 
read that file until the hard lease limit expires and block recovery starts.

Scenario:
1. write to file, call hflush
2. without closing the file, restart hdfs 
3. after hdfs is back up, opening file for reading from a new client fails for 
1 hour

Repro attached.

Thoughts:
* As far as I can tell this happens because the last block is RWR and 
getReplicaVisibleLength returns -1 for it. The recovery starts after the hard 
lease limit expires (so the file is readable only after 1 hour).
* one can call recoverLease, which will start the lease recovery sooner, BUT 
how can one know when to call it? The exception thrown is an IOException, which 
can happen for other reasons.

I think a reasonable solution would be to return a specialized exception 
(similar to AlreadyBeingCreatedException when trying to write to an open file).

  was:
If HDFS is restarted while a file is open for writing then new clients can't 
read that file until the hard lease limit expires and block recovery starts.

Scenario:
1. write to file, call hflush
2. without closing the file, restart hdfs 
3. after hdfs is back up, try to open file from reading from a new client

Repro attached.

Thoughts:
* As far as I can tell this happens because the last block is RWR and 
getReplicaVisibleLength returns -1 for this. The recovery starts after hard 
lease limit expires (so file is readable only after 1 hour).
* one can call recoverLease which will start the lease recovery sooner, BUT, 
how can one know when to call this? The exception thrown is IOException which 
can happen for other reasons.

I think a reasonable solution would be to return a specialized exception 
(similar to AlreadyBeingCreatedException when trying to write to open file).


> Can't read file after hdfs restart
> --
>
> Key: HDFS-9909
> URL: https://issues.apache.org/jira/browse/HDFS-9909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.1
>Reporter: Bogdan Raducanu
> Attachments: Main.java
>
>
> If HDFS is restarted while a file is open for writing then new clients can't 
> read that file until the hard lease limit expires and block recovery starts.
> Scenario:
> 1. write to file, call hflush
> 2. without closing the file, restart hdfs 
> 3. after hdfs is back up, opening file for reading from a new client fails 
> for 1 hour
> Repro attached.
> Thoughts:
> * As far as I can tell this happens because the last block is RWR and 
> getReplicaVisibleLength returns -1 for it. The recovery starts after the hard 
> lease limit expires (so the file is readable only after 1 hour).
> * one can call recoverLease, which will start the lease recovery sooner, BUT 
> how can one know when to call it? The exception thrown is an IOException, 
> which can happen for other reasons.
> I think a reasonable solution would be to return a specialized exception 
> (similar to AlreadyBeingCreatedException when trying to write to an open file).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9909) Can't read file after hdfs restart

2016-03-04 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated HDFS-9909:
--
Description: 
If HDFS is restarted while a file is open for writing then new clients can't 
read that file until the hard lease limit expires and block recovery starts.

Scenario:
1. write to file, call hflush
2. without closing the file, restart hdfs 
3. after hdfs is back up, opening file for reading from a new client fails for 
1 hour

Repro attached.

Thoughts:
* possibly this also happens in other cases, not just when HDFS is restarted 
(e.g. when only the datanodes in the pipeline are restarted)
* As far as I can tell this happens because the last block is RWR and 
getReplicaVisibleLength returns -1 for it. The recovery starts after the hard 
lease limit expires (so the file is readable only after 1 hour).
* one can call recoverLease, which will start the lease recovery sooner, BUT 
how can one know when to call it? The exception thrown is an IOException, which 
can happen for other reasons.

I think a reasonable solution would be to return a specialized exception 
(similar to AlreadyBeingCreatedException when trying to write to an open file).

  was:
If HDFS is restarted while a file is open for writing then new clients can't 
read that file until the hard lease limit expires and block recovery starts.

Scenario:
1. write to file, call hflush
2. without closing the file, restart hdfs 
3. after hdfs is back up, opening file for reading from a new client fails for 
1 hour

Repro attached.

Thoughts:
* As far as I can tell this happens because the last block is RWR and 
getReplicaVisibleLength returns -1 for this. The recovery starts after hard 
lease limit expires (so file is readable only after 1 hour).
* one can call recoverLease which will start the lease recovery sooner, BUT, 
how can one know when to call this? The exception thrown is IOException which 
can happen for other reasons.

I think a reasonable solution would be to return a specialized exception 
(similar to AlreadyBeingCreatedException when trying to write to open file).


> Can't read file after hdfs restart
> --
>
> Key: HDFS-9909
> URL: https://issues.apache.org/jira/browse/HDFS-9909
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.1
>Reporter: Bogdan Raducanu
> Attachments: Main.java
>
>
> If HDFS is restarted while a file is open for writing then new clients can't 
> read that file until the hard lease limit expires and block recovery starts.
> Scenario:
> 1. write to file, call hflush
> 2. without closing the file, restart hdfs 
> 3. after hdfs is back up, opening file for reading from a new client fails 
> for 1 hour
> Repro attached.
> Thoughts:
> * possibly this also happens in other cases, not just when HDFS is restarted 
> (e.g. when only the datanodes in the pipeline are restarted)
> * As far as I can tell this happens because the last block is RWR and 
> getReplicaVisibleLength returns -1 for it. The recovery starts after the hard 
> lease limit expires (so the file is readable only after 1 hour).
> * one can call recoverLease, which will start the lease recovery sooner, BUT 
> how can one know when to call it? The exception thrown is an IOException, 
> which can happen for other reasons.
> I think a reasonable solution would be to return a specialized exception 
> (similar to AlreadyBeingCreatedException when trying to write to an open file).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180633#comment-15180633
 ] 

Arpit Agarwal commented on HDFS-9906:
-

Rate limiting is overkill. +1 on just making it debug.

> Remove spammy log spew when a datanode is restarted
> ---
>
> Key: HDFS-9906
> URL: https://issues.apache.org/jira/browse/HDFS-9906
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: Elliott Clark
>
> {code}
> WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock 
> request received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 
> 268435456
> {code}
> This happens way too much to add any useful information. We should either 
> move this to a different log level or only warn once per machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8789) Block Placement policy migrator

2016-03-04 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180649#comment-15180649
 ] 

Ming Ma commented on HDFS-8789:
---

Maybe we can use the migrator as the first scenario for doing block movement 
scheduling inside the namenode? Changing the balancer from client-side to 
server-side requires more work to make the command line backward compatible, 
given that it is used by admins and automation. The migrator is a brand new 
tool, so we can use it as the starting point for a proper design that 
accommodates the migrator, the balancer, and other scenarios done inside the 
namenode.

Other useful statistics provided by the migrator tool, such as block size 
distribution, block rack diversity distribution, and block replication 
distribution, are somewhat independent of the migrator. Maybe we can have 
another tool to generate those stats, or maybe fsck.

> Block Placement policy migrator
> ---
>
> Key: HDFS-8789
> URL: https://issues.apache.org/jira/browse/HDFS-8789
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: HDFS-8789-trunk-STRAWMAN-v1.patch
>
>
> As we start to add new block placement policies to HDFS, it will be necessary 
> to have a robust tool that can migrate HDFS blocks between placement 
> policies. This jira is for the design and implementation of that tool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8356) Document missing properties in hdfs-default.xml

2016-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180651#comment-15180651
 ] 

Hadoop QA commented on HDFS-8356:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
6s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
22s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
59s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 6s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 51s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 36s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 51s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 8s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 4s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 48s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 15s {color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 72m 25s 
{color} | {color:green} hadoop-hdfs in the patch passed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
26s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 168m 52s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791515/HDFS-8356.006.patch |
| JIRA Issue | HDFS-8356 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  xml  findbugs  checkstyle  |
| uname | Linux 328697275adc 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86

[jira] [Updated] (HDFS-9118) Add logging system for libhdfs++

2016-03-04 Thread James Clampffer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Clampffer updated HDFS-9118:
--
Status: Patch Available  (was: Open)

> Add logging system for libhdfs++
> 
>
> Key: HDFS-9118
> URL: https://issues.apache.org/jira/browse/HDFS-9118
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: HDFS-8707
>Reporter: Bob Hansen
>Assignee: James Clampffer
> Attachments: HDFS-9118.HDFS-8707.000.patch
>
>
> With HDFS-9505, we've started logging data from libhdfs++.  Consumers of the 
> library are going to have their own logging infrastructure that we're going 
> to want to provide data to.  
> libhdfs++ should have a logging library that:
> * Is overridable and can provide sufficient information to work well with 
> common C++ logging frameworks
> * Has a rational default implementation 
> * Is performant



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9118) Add logging system for libhdfs++

2016-03-04 Thread James Clampffer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Clampffer updated HDFS-9118:
--
Attachment: HDFS-9118.HDFS-8707.000.patch

Initial patch:
- extend logging to be able to filter by severity level and origin component of 
the log message
- add an optional timestamp to the output message
- add an optional thread id to the output message
- put a mutex around output so messages aren't interleaved
- add message generation to rpc, block reader, file handle, and file system

todo:
- make this accept a {{void (*)(const char*)}} callback through the C interface 
so it can hand messages directly to the client application
- maybe add a real logging library, or put that off until someone really 
wants/needs it
- there might be some more things worth logging
- some more testing to make sure the various masking functionality works


> Add logging system for libhdfs++
> 
>
> Key: HDFS-9118
> URL: https://issues.apache.org/jira/browse/HDFS-9118
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: HDFS-8707
>Reporter: Bob Hansen
>Assignee: James Clampffer
> Attachments: HDFS-9118.HDFS-8707.000.patch
>
>
> With HDFS-9505, we've started logging data from libhdfs++.  Consumers of the 
> library are going to have their own logging infrastructure that we're going 
> to want to provide data to.  
> libhdfs++ should have a logging library that:
> * Is overridable and can provide sufficient information to work well with 
> common C++ logging frameworks
> * Has a rational default implementation 
> * Is performant



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9005) Provide support for upgrade domain script

2016-03-04 Thread Lei (Eddy) Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180669#comment-15180669
 ] 

Lei (Eddy) Xu commented on HDFS-9005:
-

Hey [~mingma], sorry for the late reply and thanks for the explanation.

Looking forward to seeing the newly updated patch!

> Provide support for upgrade domain script
> -
>
> Key: HDFS-9005
> URL: https://issues.apache.org/jira/browse/HDFS-9005
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-9005.patch
>
>
> As part of the upgrade domain feature, we need to provide a mechanism to 
> specify the upgrade domain for each datanode. One way to accomplish that is 
> to allow admins to specify an upgrade domain script that takes a DN ip or 
> hostname as input and returns the upgrade domain. Then the namenode will use 
> it at run time to set {{DatanodeInfo}}'s upgrade domain string. The 
> configuration can be something like:
> {noformat}
> <property>
>   <name>dfs.namenode.upgrade.domain.script.file.name</name>
>   <value>/etc/hadoop/conf/upgrade-domain.sh</value>
> </property>
> {noformat}
> just like topology script, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-1477) Support reconfiguring dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval without NN restart

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180677#comment-15180677
 ] 

Arpit Agarwal commented on HDFS-1477:
-

Hi [~xiaobingo], the patch looks almost ready. A few stylistic comments:
# Multiple calls to {{namesystem.getBlockManager().getDatanodeManager()}} in 
{{NameNode#reconfigurePropertyImpl}}. Let's just make it a local variable to 
improve readability, e.g.
{code}
  protected String reconfigurePropertyImpl(String property, String newVal)
  throws ReconfigurationException {
final DatanodeManager datanodeManager =
namesystem.getBlockManager().getDatanodeManager();
{code}
# Similar change in testReconfigure for readability.
{code}
  public void testReconfigure() throws ReconfigurationException {
// change properties
final NameNode nameNode = cluster.getNameNode();
final DatanodeManager datanodeManager = nameNode.namesystem
.getBlockManager().getDatanodeManager();
{code}
# It is better to replace the sleep calls in 
{{TestDFSAdmin#testNameNodeGetReconfigurationStatus}} with 
{{GenericTestUtils#waitFor}}.
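
For item 3, the pattern would look something like this ({{isStatusReady()}} is 
a hypothetical stand-in for the test's actual check):

{code}
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

// Inside the test method: poll instead of sleeping.
// Checks every 100 ms and times out after 10 seconds.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return isStatusReady(); // hypothetical readiness check
  }
}, 100, 10000);
{code}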

Thanks for your patience with the patch revisions. 

> Support reconfiguring dfs.heartbeat.interval and 
> dfs.namenode.heartbeat.recheck-interval without NN restart
> ---
>
> Key: HDFS-1477
> URL: https://issues.apache.org/jira/browse/HDFS-1477
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Patrick Kling
>Assignee: Xiaobing Zhou
> Attachments: HDFS-1477-HDFS-9000.006.patch, 
> HDFS-1477-HDFS-9000.007.patch, HDFS-1477-HDFS-9000.008.patch, 
> HDFS-1477.005.patch, HDFS-1477.2.patch, HDFS-1477.3.patch, HDFS-1477.4.patch, 
> HDFS-1477.patch
>
>
> Modify NameNode to implement the interface Reconfigurable proposed in 
> HADOOP-7001. This would allow us to change certain configuration properties 
> without restarting the name node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8356) Document missing properties in hdfs-default.xml

2016-03-04 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180692#comment-15180692
 ] 

Ray Chiang commented on HDFS-8356:
--

RE: Failing JDK8 unit test

Unit test passes in my tree


> Document missing properties in hdfs-default.xml
> ---
>
> Key: HDFS-8356
> URL: https://issues.apache.org/jira/browse/HDFS-8356
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.7.0
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability, test
> Attachments: HDFS-8356.001.patch, HDFS-8356.002.patch, 
> HDFS-8356.003.patch, HDFS-8356.004.patch, HDFS-8356.005.patch, 
> HDFS-8356.006.patch
>
>
> The following properties are currently not defined in hdfs-default.xml. These 
> properties should either be
> A) documented in hdfs-default.xml OR
> B) listed as an exception (with comments, e.g. for internal use) in the 
> TestHdfsConfigFields unit test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9891) Ozone: Add container transport client

2016-03-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180707#comment-15180707
 ] 

Chris Nauroth commented on HDFS-9891:
-

Hi [~anu].  This looks good overall.  Here are a few comments.

# I tried applying both HDFS-9873 and this patch, but this one didn't apply 
cleanly.  I assume there is just some trivial rebase to be done for 
compatibility with the current revision of the HDFS-9873 patch.
# {{DatanodeID#getContainerPort}} has a typo in the JavaDocs: "Retruns".
# Typically, methods for translating between protobuf objects and our domain 
objects are placed into classes in the {{org.apache.hadoop.hdfs.protocolPB}} 
package.  Do you think it would be appropriate to move some of these methods 
over there, or is there a reason that they need to stay in {{DatanodeID}} and 
{{Pipeline}}?
# {{XceiverClient#close}}: Should {{group.shutdownGracefully()}} be called 
before {{channelFuture.channel().close()}}?  I believe that means any pending 
I/O events would be drained first before closing the socket that event 
processing depends on (see the sketch after this list).
# {{TestContainerServer}}: There is a risk of resource leaks in these tests.  I 
think this can be addressed by changing {{testPipeline}} to call 
{{EmbeddedChannel#close}} and changing {{testClientServer}} to call 
{{XceiverServer#stop}} and {{XceiverClient#close}}.  These calls should be 
guaranteed by using a {{finally}} block.
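
To illustrate item 4, a hedged sketch of the suggested ordering in 
{{XceiverClient#close}}; the field names ({{group}}, {{channelFuture}}) are 
assumptions based on the discussion, not the actual patch:
{code}
public void close() {
  // Shut down the Netty event loop group first so pending I/O events are
  // drained before the socket they depend on goes away.
  if (group != null) {
    group.shutdownGracefully();
  }
  if (channelFuture != null) {
    channelFuture.channel().close().awaitUninterruptibly();
  }
}
{code}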


> Ozone: Add container transport client
> -
>
> Key: HDFS-9891
> URL: https://issues.apache.org/jira/browse/HDFS-9891
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Attachments: HDFS-9891-HDFS-7240.001.patch
>
>
> Add ozone container transport client -- that makes it easy to talk to server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9427) HDFS should not default to ephemeral ports

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180733#comment-15180733
 ] 

Arpit Agarwal commented on HDFS-9427:
-

The v002 patch lgtm. I'll postpone committing for at least 3-4 days in case 
there are additional comments.

9070-9079 [are 
unassigned|https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt]
 so this looks like a good range to use.

I know nothing about KMS, so I'm not sure if it is safe to change its port 
number as [~jmhsieh] proposed. I'll file a separate Jira for that, and we can 
address the ephemeral ports here.

> HDFS should not default to ephemeral ports
> --
>
> Key: HDFS-9427
> URL: https://issues.apache.org/jira/browse/HDFS-9427
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs-client, namenode
>Affects Versions: 3.0.0
>Reporter: Arpit Agarwal
>Assignee: Xiaobing Zhou
>Priority: Critical
>  Labels: Incompatible
> Attachments: HDFS-9427.000.patch, HDFS-9427.001.patch, 
> HDFS-9427.002.patch
>
>
> HDFS defaults to ephemeral ports for the some HTTP/RPC endpoints. This can 
> cause bind exceptions on service startup if the port is in use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9239:

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

I have committed this to trunk, branch-2 and branch-2.8.  [~anu] and 
[~szetszwo], thank you for the code reviews.  Everyone, thank you for 
participating in the discussion.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9239:

Release Note: This release adds a new feature called the DataNode Lifeline 
Protocol.  If configured, DataNodes can report that they are still alive 
to the NameNode via a fallback protocol, separate from the existing heartbeat 
messages.  This can prevent the NameNode from incorrectly marking DataNodes as 
stale or dead in highly overloaded clusters where heartbeat processing is 
suffering delays.  For more information, please refer to the hdfs-default.xml 
documentation for several new configuration properties: 
dfs.namenode.lifeline.rpc-address, dfs.namenode.lifeline.rpc-bind-host, 
dfs.datanode.lifeline.interval.seconds, dfs.namenode.lifeline.handler.ratio and 
dfs.namenode.lifeline.handler.count.
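
For illustration, a minimal hdfs-site.xml snippet enabling the feature might 
look like the following; the host, port and interval values are placeholders, 
not recommendations:
{noformat}
<property>
  <name>dfs.namenode.lifeline.rpc-address</name>
  <value>nn.example.com:8050</value>
</property>
<property>
  <name>dfs.datanode.lifeline.interval.seconds</name>
  <value>9</value>
</property>
{noformat}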

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9702) DiskBalancer : getVolumeMap implementation

2016-03-04 Thread Lei (Eddy) Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181316#comment-15181316
 ] 

Lei (Eddy) Xu commented on HDFS-9702:
-

Hey, [~anu]

The patch looks good in general. I will +1 once the following concerns are 
addressed:

* Many tests contain the following code:
{code}
 thrown.expect(DiskBalancerException.class);
{code}

Could you check it directly against the {{Result}} enum in 
{{DiskBalancerException}}? Otherwise the test might just mask other potential 
errors.
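
A hedged sketch of what that could look like with JUnit's 
{{ExpectedException#expect(Matcher)}} and a Hamcrest {{TypeSafeMatcher}}; the 
{{getResult()}} accessor and the {{INVALID_PLAN}} constant are assumptions for 
illustration:
{code}
thrown.expect(DiskBalancerException.class);
// Also pin the expected result code so an unrelated DiskBalancerException
// does not satisfy the test by accident.
thrown.expect(new TypeSafeMatcher<DiskBalancerException>() {
  @Override
  protected boolean matchesSafely(DiskBalancerException e) {
    return e.getResult() == DiskBalancerException.Result.INVALID_PLAN;
  }

  @Override
  public void describeTo(Description description) {
    description.appendText("result code INVALID_PLAN");
  }
});
{code}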

Thanks.

> DiskBalancer : getVolumeMap implementation
> --
>
> Key: HDFS-9702
> URL: https://issues.apache.org/jira/browse/HDFS-9702
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: balancer & mover
>Affects Versions: HDFS-1312
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Fix For: HDFS-1312
>
> Attachments: HDFS-9702-HDFS-1312.001.patch
>
>
> Add get volume map 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9817) Use SLF4J in new classes

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180582#comment-15180582
 ] 

Arpit Agarwal edited comment on HDFS-9817 at 3/5/16 12:09 AM:
--

Thanks for addressing this [~anu]. One minor comment.

When calling LOG.debug with SLF4J, we prefer the curly brace {} notation for 
passing arguments. This avoids the cost of string formatting when debug 
logging is disabled, which is the normal configuration.

So this,
{code}
LOG.debug("Cluster URI : " + clusterURI);
{code}

Would be written as
{code}
LOG.debug("Cluster URI : {}", clusterURI);
{code}

+1 otherwise.



> Use SLF4J in new classes
> 
>
> Key: HDFS-9817
> URL: https://issues.apache.org/jira/browse/HDFS-9817
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: logging
>Affects Versions: HDFS-1312
>Reporter: Arpit Agarwal
>Assignee: Anu Engineer
> Attachments: HDFS-9817-HDFS-1312.001.patch
>
>
> We are trying to use SLF4J for new classes as far as possible so let's change 
> all the newly added classes to use SLF4J instead of depending on Log4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9118) Add logging system for libhdfs++

2016-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181341#comment-15181341
 ] 

Hadoop QA commented on HDFS-9118:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 16m 22s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
22s {color} | {color:green} HDFS-8707 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 20s 
{color} | {color:green} HDFS-8707 passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 22s 
{color} | {color:green} HDFS-8707 passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 17s 
{color} | {color:green} HDFS-8707 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} HDFS-8707 passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 22s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 4m 22s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 22s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 18s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 4m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 18s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 12s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
9s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 5m 21s 
{color} | {color:green} hadoop-hdfs-native-client in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 5m 15s 
{color} | {color:green} hadoop-hdfs-native-client in the patch passed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
20s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 56m 13s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0cf5e66 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12791547/HDFS-9118.HDFS-8707.000.patch
 |
| JIRA Issue | HDFS-9118 |
| Optional Tests |  asflicense  compile  cc  mvnsite  javac  unit  |
| uname | Linux 604840463e95 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | HDFS-8707 / 8da3bbd |
| Default Java | 1.7.0_95 |
| Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_74 
/usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 |
| JDK v1.7.0_95  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/14722/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-native-client U: 
hadoop-hdfs-project/hadoop-hdfs-native-client |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/14722/console |
| Powered by | Apache Yetus 0.3.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Add logging system for libhdfs++
> 
>
> Key: HDFS-9118
> URL: https://issues.apache.org/jira/browse/HDFS-9118
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>   

[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181353#comment-15181353
 ] 

Hudson commented on HDFS-9239:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9426 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9426/])
HDFS-9239. DataNode Lifeline Protocol: an alternative protocol for (cnauroth: 
rev 2759689d7d23001f007cb0dbe2521de90734dd5c)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolPB.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/metrics/DataNodeMetrics.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeRegister.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-common-project/hadoop-common/src/site/markdown/Metrics.md
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeLifelineProtocol.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBpServiceActorScheduler.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/pom.xml
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestBlockPoolManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeLifeline.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/NamenodeProtocols.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolServerSideTranslatorPB.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeLifelineProtocol.proto
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeLifelineProtocolClientSideTranslatorPB.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DNConf.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java


> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9817) Use SLF4J in new classes

2016-03-04 Thread Anu Engineer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer updated HDFS-9817:
---
Attachment: HDFS-9817-HDFS-1312.002.patch

[~arpitagarwal] Thanks for the comment. I have fixed the issue throughout the 
diskbalancer code in the new patch.

> Use SLF4J in new classes
> 
>
> Key: HDFS-9817
> URL: https://issues.apache.org/jira/browse/HDFS-9817
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: logging
>Affects Versions: HDFS-1312
>Reporter: Arpit Agarwal
>Assignee: Anu Engineer
> Attachments: HDFS-9817-HDFS-1312.001.patch, 
> HDFS-9817-HDFS-1312.002.patch
>
>
> We are trying to use SLF4J for new classes as far as possible so let's change 
> all the newly added classes to use SLF4J instead of depending on Log4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9817) Use SLF4J in new classes

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181365#comment-15181365
 ] 

Arpit Agarwal commented on HDFS-9817:
-

Thanks Anu. +1 pending Jenkins.

> Use SLF4J in new classes
> 
>
> Key: HDFS-9817
> URL: https://issues.apache.org/jira/browse/HDFS-9817
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: logging
>Affects Versions: HDFS-1312
>Reporter: Arpit Agarwal
>Assignee: Anu Engineer
> Attachments: HDFS-9817-HDFS-1312.001.patch, 
> HDFS-9817-HDFS-1312.002.patch
>
>
> We are trying to use SLF4J for new classes as far as possible so let's change 
> all the newly added classes to use SLF4J instead of depending on Log4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9579) Provide bytes-read-by-network-distance metrics at FileSystem.Statistics level

2016-03-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181388#comment-15181388
 ] 

Hadoop QA commented on HDFS-9579:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m 38s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 6 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 30s 
{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
54s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 50s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 40s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
9s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 23s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
40s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 4s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 18s 
{color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 13s 
{color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s 
{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
56s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 5m 35s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 5m 35s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 32s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 32s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 11s 
{color} | {color:red} root: patch generated 4 new + 546 unchanged - 10 fixed = 
550 total (was 556) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 19s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
40s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 
48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 17s 
{color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 12s 
{color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 51s 
{color} | {color:green} hadoop-common in the patch passed with JDK v1.8.0_74. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 49s 
{color} | {color:green} hadoop-hdfs-client in the patch passed with JDK 
v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 53m 31s {color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 39s 
{color} | {color:green} hadoop-common in the patch passed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 58s 
{color} | {color:green} hadoop-hdfs-client in the patch passed with JDK 
v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 51m 44s 
{color} | {color:gr

[jira] [Updated] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated HDFS-9906:
---
Status: Patch Available  (was: Open)

> Remove spammy log spew when a datanode is restarted
> ---
>
> Key: HDFS-9906
> URL: https://issues.apache.org/jira/browse/HDFS-9906
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: Elliott Clark
> Attachments: HDFS-9906.patch
>
>
> {code}
> WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock 
> request received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 
> 268435456
> {code}
> This happens way too much to add any useful information. We should either 
> move this to a different level or only warn once per machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated HDFS-9906:
---
Attachment: HDFS-9906.patch

> Remove spammy log spew when a datanode is restarted
> ---
>
> Key: HDFS-9906
> URL: https://issues.apache.org/jira/browse/HDFS-9906
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: Elliott Clark
> Attachments: HDFS-9906.patch
>
>
> {code}
> WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock 
> request received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 
> 268435456
> {code}
> This happens way too much to add any useful information. We should either 
> move this to a different level or only warn once per machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181443#comment-15181443
 ] 

Brahma Reddy Battula commented on HDFS-9906:


I'm also +1 on making it debug.
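
For reference, a hedged sketch of what the level change could look like in 
{{BlockManager#addStoredBlock}}; the variable names here are assumptions, not 
the attached patch:
{code}
// Same message, logged at debug instead of warn so it only shows up when
// the operator explicitly asks for it.
blockLog.debug("BLOCK* addStoredBlock: Redundant addStoredBlock request"
    + " received for {} on node {} size {}", storedBlock, node,
    storedBlock.getNumBytes());
{code}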

> Remove spammy log spew when a datanode is restarted
> ---
>
> Key: HDFS-9906
> URL: https://issues.apache.org/jira/browse/HDFS-9906
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: Elliott Clark
> Attachments: HDFS-9906.patch
>
>
> {code}
> WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock 
> request received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 
> 268435456
> {code}
> This happens way too much to add any useful information. We should either 
> move this to a different level or only warn once per machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9817) Use SLF4J in new classes

2016-03-04 Thread Mingliang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181445#comment-15181445
 ] 

Mingliang Liu commented on HDFS-9817:
-

Hi [~anu], I see you changed the log level in several places. Is this on 
purpose? Generally I'm in favor of the change. Thanks.

Nit:

{code:title=ConnectorFactory.java#getCluster()}
LOG.debug("Cluster URI : {}", clusterURI);
LOG.debug("scheme : {}", clusterURI.getScheme());
{code}
You can merge these into one line, along with a more meaningful message.
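
For example, something along these lines (the message wording is 
illustrative):
{code}
LOG.debug("Found cluster: uri={}, scheme={}",
    clusterURI, clusterURI.getScheme());
{code}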

> Use SLF4J in new classes
> 
>
> Key: HDFS-9817
> URL: https://issues.apache.org/jira/browse/HDFS-9817
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: logging
>Affects Versions: HDFS-1312
>Reporter: Arpit Agarwal
>Assignee: Anu Engineer
> Attachments: HDFS-9817-HDFS-1312.001.patch, 
> HDFS-9817-HDFS-1312.002.patch
>
>
> We are trying to use SLF4J for new classes as far as possible so let's change 
> all the newly added classes to use SLF4J instead of depending on Log4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated HDFS-9906:

Assignee: Brahma Reddy Battula

> Remove spammy log spew when a datanode is restarted
> ---
>
> Key: HDFS-9906
> URL: https://issues.apache.org/jira/browse/HDFS-9906
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: Elliott Clark
>Assignee: Brahma Reddy Battula
> Attachments: HDFS-9906.patch
>
>
> {code}
> WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock 
> request received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 
> 268435456
> {code}
> This happens way too much to add any useful information. We should either 
> move this to a different level or only warn once per machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9906) Remove spammy log spew when a datanode is restarted

2016-03-04 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181449#comment-15181449
 ] 

Arpit Agarwal commented on HDFS-9906:
-

+1 pending Jenkins. Thanks for the quick patch [~brahmareddy].

I'll hold off committing until next week in case there are any objections.

> Remove spammy log spew when a datanode is restarted
> ---
>
> Key: HDFS-9906
> URL: https://issues.apache.org/jira/browse/HDFS-9906
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: Elliott Clark
> Attachments: HDFS-9906.patch
>
>
> {code}
> WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock 
> request received for blk_1109897077_36157149 on node 192.168.1.1:50010 size 
> 268435456
> {code}
> This happens way too much to add any useful information. We should either 
> move this to a different level or only warn once per machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9910) Datanode heartbeats can get blocked by disk in {{FsDatasetImpl#checkBlock()}}

2016-03-04 Thread Inigo Goiri (JIRA)
Inigo Goiri created HDFS-9910:
-

 Summary: Datanode heartbeats can get blocked by disk in 
{{FsDatasetImpl#checkBlock()}}
 Key: HDFS-9910
 URL: https://issues.apache.org/jira/browse/HDFS-9910
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.7.2
Reporter: Inigo Goiri
Assignee: Hua Liu


When a datanode needs to transfer a block, it validates the block on the 
heartbeat thread by invoking the {{checkBlock()}} method of {{FsDatasetImpl}}, 
which checks whether the block exists and gets the block length. If the block 
is valid, it then spins off a thread to do the actual block transfer. We found 
that during heavy disk IO the heartbeat thread can hang on 
{{replicaInfo.getBlockFile().exists()}} for more than 10 minutes.
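
A hedged sketch of the call pattern being described, not the actual 
{{FsDatasetImpl}} code; {{getReplicaInfo}} below stands in for however the 
replica is actually looked up:
{code}
// The heartbeat thread runs this synchronously before spinning off the
// transfer thread, so the blocking disk access below stalls heartbeats.
private void checkBlock(ExtendedBlock block) throws IOException {
  ReplicaInfo replicaInfo = getReplicaInfo(block);
  // exists() touches the disk; under heavy I/O load this call was observed
  // to hang for more than 10 minutes.
  if (!replicaInfo.getBlockFile().exists()) {
    throw new FileNotFoundException("Block file not found: " + block);
  }
}
{code}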



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

