[jira] [Commented] (HDFS-7454) Reduce memory footprint for AclEntries in NameNode

2014-12-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235115#comment-14235115
 ] 

Chris Nauroth commented on HDFS-7454:
-

[~vinayrpet], thank you for the patch.  [~wheat9], thank you for taking care of 
code review and commit.  I see there was one specific question directed to me, 
and I apologize for not being able to reply sooner.

{quote}
I have one doubt here: do we really need to append all ACL entries along with 
the permission bits in the exception message?
By seeing these AclEntries, couldn't a caller gain access by impersonating one 
of the users in the entries?
{quote}

File system permissions and ACLs assume strong authentication is in place 
first.  In a cluster using Kerberos, I don't expect seeing ACL entries alone 
would compromise our security.  The user wouldn't be able to impersonate 
another user anyway, unless there was some other misconfiguration, such as 
allowing the user access to private keytab files.

I'd suggest we either restore the old exception message or just append the '+' 
indicator if an ACL is present, like the ls command.  This will let users know 
that they should consider ACLs if they are dealing with an unexpected access 
denied.  We can do it in a follow-up jira.

Thanks again, Vinay!  I'm aiming to review HDFS-7456 tomorrow, and of course 
finishing out HDFS-7384 too.

> Reduce memory footprint for AclEntries in NameNode
> --
>
> Key: HDFS-7454
> URL: https://issues.apache.org/jira/browse/HDFS-7454
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Fix For: 2.7.0
>
> Attachments: HDFS-7454-001.patch, HDFS-7454-002.patch, 
> HDFS-7454-003.patch, HDFS-7454-004.patch
>
>
> HDFS-5620 indicated that a GlobalAclSet containing unique {{AclFeature}} 
> instances could be de-duplicated to save memory in the NameNode. However, it 
> was not implemented at that time.
> This Jira re-proposes the same implementation, along with de-duplication of 
> unique {{AclEntry}} objects across all ACLs.
> One simple use case is:
> A MapReduce user's home directory has a set of default ACLs, under which many 
> other files/directories may be created when jobs run. Here, all the default 
> ACLs of the parent directory will be duplicated until those ACLs are 
> explicitly deleted. With de-duplication, only one object will be in memory 
> for the same entry across all ACLs of all files/directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6833) DirectoryScanner should not register a deleting block with memory of DataNode

2014-12-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234366#comment-14234366
 ] 

Chris Nauroth commented on HDFS-6833:
-

Thank you for the update, [~yamashitasni].

> DirectoryScanner should not register a deleting block with memory of DataNode
> -
>
> Key: HDFS-6833
> URL: https://issues.apache.org/jira/browse/HDFS-6833
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.5.0, 2.5.1
>Reporter: Shinichi Yamashita
>Assignee: Shinichi Yamashita
>Priority: Critical
> Attachments: HDFS-6833-6-2.patch, HDFS-6833-6-3.patch, 
> HDFS-6833-6.patch, HDFS-6833-7-2.patch, HDFS-6833-7.patch, HDFS-6833.8.patch, 
> HDFS-6833.9.patch, HDFS-6833.patch, HDFS-6833.patch, HDFS-6833.patch, 
> HDFS-6833.patch, HDFS-6833.patch
>
>
> When a block is deleted in DataNode, the following messages are usually 
> output.
> {code}
> 2014-08-07 17:53:11,606 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:11,617 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> However, in the current implementation, DirectoryScanner may run while 
> DataNode is deleting the block, and the following messages are output.
> {code}
> 2014-08-07 17:53:30,519 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:31,426 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1887080305-172.28.0.101-1407398838872 Total blocks: 1, missing metadata 
> files:0, missing block files:0, missing blocks in memory:1, mismatched 
> blocks:0
> 2014-08-07 17:53:31,426 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added 
> missing block to memory FinalizedReplica, blk_1073741825_1001, FINALIZED
>   getNumBytes() = 21230663
>   getBytesOnDisk()  = 21230663
>   getVisibleLength()= 21230663
>   getVolume()   = /hadoop/data1/dfs/data/current
>   getBlockFile()= 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>   unlinked  =false
> 2014-08-07 17:53:31,531 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> The deleting block's information is registered in DataNode's memory.
> So when DataNode sends a block report, NameNode receives wrong block 
> information.
> For example, when we recommission a node or change the replication factor, 
> NameNode may delete the correct block as an "ExcessReplicate" because of this 
> problem.
> Then "Under-Replicated Blocks" and "Missing Blocks" occur.
> When DataNode runs DirectoryScanner, it should not register a block that is 
> being deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows

2014-12-02 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232100#comment-14232100
 ] 

Chris Nauroth commented on HDFS-7188:
-

I agree with the direction of using separately compiled platform-specific files 
over lots of ifdefs (something that we're unfortunately still maintaining on 
the libhadoop.so/hadoop.dll side).

[~thanhdo], you might want to review the libhdfs patch I committed for 
HDFS-573.  I believe this was the first time we split Windows code into 
separate files.  I ran into similar issues to what you described: different 
parameter orders, different signatures on the printf family of functions, and 
missing functions.  The HDFS-573 patch almost entirely avoided ifdefs by using 
a set of common headers, platform-specific implementation files grouped under 
separate directories, and some indirection with macros and typedefs.  Maybe 
looking at this prior patch will demonstrate techniques that you can use here.

> support build libhdfs3 on windows
> -
>
> Key: HDFS-7188
> URL: https://issues.apache.org/jira/browse/HDFS-7188
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
> Environment: Windows System, Visual Studio 2010
>Reporter: Zhanwei Wang
>Assignee: Thanh Do
> Attachments: HDFS-7188-branch-HDFS-6994-0.patch
>
>
> libhdfs3 should work on windows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7431) log message for InvalidMagicNumberException may be incorrect

2014-12-02 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231916#comment-14231916
 ] 

Chris Nauroth commented on HDFS-7431:
-

Hi, Yi.  This looks good.

This is really minor, but I wonder if it's possible to simplify the patch by 
avoiding changing the signature of {{SaslDataTransferServer#doSaslHandshake}}.  
The {{handshake4Encryption}} is {{true}} if and only if 
{{dnConf.getEncryptDataTransfer()}} is also {{true}}.  We could check 
{{dnConf}} directly and avoid the need to pass around an additional flag.  Let 
me know if you agree.
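
To sketch it (surrounding method elided; this is just the shape of the 
simplification, not the final patch):

{code}
// Inside SaslDataTransferServer#doSaslHandshake, derive the flag locally
// instead of threading it through the method signature:
boolean handshake4Encryption = dnConf.getEncryptDataTransfer();
// ... the existing handshake logic can then branch on this local.
{code}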

> log message for InvalidMagicNumberException may be incorrect
> 
>
> Key: HDFS-7431
> URL: https://issues.apache.org/jira/browse/HDFS-7431
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: HDFS-7431.001.patch, HDFS-7431.002.patch
>
>
> In secure mode, HDFS now supports DataNodes that don't require root or jsvc 
> if {{dfs.data.transfer.protection}} is configured.
> In the log message for {{InvalidMagicNumberException}}, we miss one case: 
> when the DataNodes run on an unprivileged port, 
> {{dfs.data.transfer.protection}} is configured to {{authentication}}, but 
> {{dfs.encrypt.data.transfer}} is not configured. Then a SASL handshake is 
> required, and if an older version of the dfs client is used, 
> {{InvalidMagicNumberException}} is thrown and we log:
> {quote}
> Failed to read expected encryption handshake from client at  Perhaps the 
> client is running an older version of Hadoop which does not support encryption
> {quote}
> Recently I ran HDFS built from trunk with security enabled, but with a 
> 2.5.1 client. I got the above log message, even though I had not 
> configured encryption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-4552) For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as opposed to 1.x versions

2014-12-02 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-4552.
-
Resolution: Duplicate

> For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as 
> opposed to 1.x versions
> -
>
> Key: HDFS-4552
> URL: https://issues.apache.org/jira/browse/HDFS-4552
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs
>Affects Versions: 2.0.3-alpha
> Environment: Ubuntu 12.04 32 bit, java version "1.7.0_03"
> c++ application
>Reporter: Shubhangi Garg
>
> I am writing an application in C++ which uses the API provided by libhdfs to 
> manipulate Hadoop DFS.
> I could run the application with 1.0.4 and 1.1.1, setting the classpath to 
> $(hadoop classpath).
> For Hadoop 2.0.3, setting CLASSPATH=$(hadoop classpath) does not load the 
> necessary classes required for libhdfs, as opposed to 1.x versions, giving the 
> following error:
> loadFileSystems error:
> (unable to get stack trace for java.lang.NoClassDefFoundError exception: 
> ExceptionUtils::getStackTrace error.)
> hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, 
> kerbTicketCachePath=(NULL), userName=(NULL)) error:
> (unable to get stack trace for java.lang.NoClassDefFoundError exception: 
> ExceptionUtils::getStackTrace error.)
> I tried loading the jar files with their full path specified (as opposed to 
> wildcard characters used in the classpath); and the application runs, but 
> gives the following warning:
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> 13/03/04 11:17:23 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4552) For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as opposed to 1.x versions

2014-12-02 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231865#comment-14231865
 ] 

Chris Nauroth commented on HDFS-4552:
-

Hello, [~shubhangi].

The reason for this problem is that Hadoop 2 started using wildcard syntax in 
the output of {{hadoop classpath}}.  Hadoop 1 used the full path to every jar 
without wildcards.  Unfortunately, Java does not expand the wildcards 
automatically when launching an embedded JVM via JNI, so your existing script 
may have stopped working.

In Hadoop 2.6.0, I shipped an enhancement to the {{hadoop classpath}} command 
in HADOOP-10903.  It now accepts command line arguments that either expand all 
of the wildcards for you or bundle the whole classpath into a jar file's 
manifest (very helpful on Windows, where our classpath can blow past the 
maximum command line length of 8191 characters).  The new command line 
arguments are documented here:

http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/CommandsManual.html#classpath
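
For example (illustrative usage; see the manual above for the exact option 
syntax):

{code}
# Expand every wildcard entry eagerly, for consumers that won't expand them:
export CLASSPATH=$(hadoop classpath --glob)

# Or bundle the whole classpath into a jar manifest, dodging command line
# length limits:
hadoop classpath --jar /tmp/hadoop-classpath.jar
{code}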

I expect this gives you what you need to restore the functionality of your 
script, so I'm going to resolve this as a duplicate of HADOOP-10903.  Thank you!

> For Hadoop 2.0.3; setting CLASSPATH=$(hadoop classpath) does not work, as 
> opposed to 1.x versions
> -
>
> Key: HDFS-4552
> URL: https://issues.apache.org/jira/browse/HDFS-4552
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs
>Affects Versions: 2.0.3-alpha
> Environment: Ubuntu 12.04 32 bit, java version "1.7.0_03"
> c++ application
>Reporter: Shubhangi Garg
>
> I am writing an application in C++ which uses the API provided by libhdfs to 
> manipulate Hadoop DFS.
> I could run the application with 1.0.4 and 1.1.1, setting the classpath to 
> $(hadoop classpath).
> For Hadoop 2.0.3, setting CLASSPATH=$(hadoop classpath) does not load the 
> necessary classes required for libhdfs, as opposed to 1.x versions, giving the 
> following error:
> loadFileSystems error:
> (unable to get stack trace for java.lang.NoClassDefFoundError exception: 
> ExceptionUtils::getStackTrace error.)
> hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, 
> kerbTicketCachePath=(NULL), userName=(NULL)) error:
> (unable to get stack trace for java.lang.NoClassDefFoundError exception: 
> ExceptionUtils::getStackTrace error.)
> I tried loading the jar files with their full path specified (as opposed to 
> wildcard characters used in the classpath); and the application runs, but 
> gives the following warning:
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> 13/03/04 11:17:23 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-12-01 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230131#comment-14230131
 ] 

Chris Nauroth commented on HDFS-7384:
-

Thanks for the updated patch, Vinay.  Nice, the unit tests caught a legitimate 
problem!  :-)

I believe the default ACL case can be made to work in 
{{AclStatus#getEffectivePermission}} by checking for default scope, and using 
the second-to-last ACL entry in the list as the mask.  The sort order enforced 
on the NameNode side guarantees this.  This second-to-last logic currently in 
{{AclCommands}} would then be unnecessary:
{code}
  } else {
// ACL sort order guarantees mask is the second-to-last entry.
FsAction maskPerm = entries.get(entries.size() - 2).getPermission();
for (AclEntry entry: entries) {
  printExtendedAclEntry(entry, maskPerm);
}
  }
{code}
For the default ACL case, we'd never consider the {{permArg}} passed to 
{{AclStatus#getEffectivePermission}} as a candidate for the mask.  A default 
ACL always has the mask stored directly in the ACL entry list.
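
To make that concrete, here is a rough sketch of the mask selection I have in 
mind (a standalone illustration, not the final patch; the real logic would live 
inside {{AclStatus}}):

{code}
import java.util.List;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.FsAction;

class EffectivePermissionSketch {
  // Mask selection only; whether a particular entry type is subject to the
  // mask at all is handled by the existing printing logic.
  static FsAction effectivePermission(AclEntry entry, List<AclEntry> entries,
      FsAction accessMask) {
    FsAction mask = entry.getScope() == AclEntryScope.DEFAULT
        // Sort order guarantees the default mask is second-to-last.
        ? entries.get(entries.size() - 2).getPermission()
        // Access scope: the mask comes from the caller (permArg today).
        : accessMask;
    return entry.getPermission().and(mask);
  }
}
{code}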

What do you think?  I think this is better than publishing an API that works 
for access ACL entries but gives incorrect results for default ACL entries.

> 'getfacl' command and 'getAclStatus' output should be in sync
> -
>
> Key: HDFS-7384
> URL: https://issues.apache.org/jira/browse/HDFS-7384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-7384-001.patch, HDFS-7384-002.patch, 
> HDFS-7384-003.patch, HDFS-7384-004.patch, HDFS-7384-005.patch, 
> HDFS-7384-006.patch, HDFS-7384-007.patch
>
>
> The *getfacl* command prints all of the entries, including basic and extended 
> entries, mask entries, and effective permissions.
> But the *getAclStatus* FileSystem API returns only the extended ACL entries 
> set by the user; it does not include the mask entry or effective 
> permissions.
> To benefit clients using the API, it is better to include the 'mask' entry 
> and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6833) DirectoryScanner should not register a deleting block with memory of DataNode

2014-12-01 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230078#comment-14230078
 ] 

Chris Nauroth commented on HDFS-6833:
-

Hi [~sinchii].  Are you planning on posting an updated patch or otherwise 
addressing the last round of feedback?  I'm not sure of the timeline for the 
2.6.1 release, but if you want to include this, it would be good to get it in 
quickly.  Thanks!

> DirectoryScanner should not register a deleting block with memory of DataNode
> -
>
> Key: HDFS-6833
> URL: https://issues.apache.org/jira/browse/HDFS-6833
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.5.0, 2.5.1
>Reporter: Shinichi Yamashita
>Assignee: Shinichi Yamashita
>Priority: Critical
> Attachments: HDFS-6833-6-2.patch, HDFS-6833-6-3.patch, 
> HDFS-6833-6.patch, HDFS-6833-7-2.patch, HDFS-6833-7.patch, HDFS-6833.8.patch, 
> HDFS-6833.9.patch, HDFS-6833.patch, HDFS-6833.patch, HDFS-6833.patch, 
> HDFS-6833.patch, HDFS-6833.patch
>
>
> When a block is deleted in DataNode, the following messages are usually 
> output.
> {code}
> 2014-08-07 17:53:11,606 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:11,617 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> However, in the current implementation, DirectoryScanner may run while 
> DataNode is deleting the block, and the following messages are output.
> {code}
> 2014-08-07 17:53:30,519 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:31,426 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1887080305-172.28.0.101-1407398838872 Total blocks: 1, missing metadata 
> files:0, missing block files:0, missing blocks in memory:1, mismatched 
> blocks:0
> 2014-08-07 17:53:31,426 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added 
> missing block to memory FinalizedReplica, blk_1073741825_1001, FINALIZED
>   getNumBytes() = 21230663
>   getBytesOnDisk()  = 21230663
>   getVisibleLength()= 21230663
>   getVolume()   = /hadoop/data1/dfs/data/current
>   getBlockFile()= 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>   unlinked  =false
> 2014-08-07 17:53:31,531 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> The deleting block's information is registered in DataNode's memory.
> So when DataNode sends a block report, NameNode receives wrong block 
> information.
> For example, when we recommission a node or change the replication factor, 
> NameNode may delete the correct block as an "ExcessReplicate" because of this 
> problem.
> Then "Under-Replicated Blocks" and "Missing Blocks" occur.
> When DataNode runs DirectoryScanner, it should not register a block that is 
> being deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7454) Implement Global ACL Set for memory optimization in NameNode

2014-11-30 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229454#comment-14229454
 ] 

Chris Nauroth commented on HDFS-7454:
-

I actually think there is value in both optimizations.  For Vinay's use case of 
a default ACL on a popular directory, it's more realistic to think of an ACL 
entry list larger than 3.  A default ACL can never have fewer than 5 entries, 
so every directory will get a full copy of that plus the access ACL entries.  
The files will only get the access ACL entries, so a count of 2-3 makes sense 
there.  We could say 5 across both files and directories for a rough cut, which 
would put the earlier example at 6 GB.  De-duplication could effectively reduce 
this to just 2 distinct {{AclFeature}} instances: 1 for all directories and 1 
for all files, so the memory usage would be almost unnoticeable.

I would suggest making the move to an {{int}} representation though rather than 
de-duplicating the individual entries.  De-duplication is really valuable at 
the level of the whole {{AclFeature}} instance.  The 2 optimizations don't 
necessarily need to be coupled to one another.  They could be done in 2 
different patches.

Vinay, based on your observed usage pattern, what do you think is the best 
option for proceeding with these 2 possible optimization paths?  Ideally, we'd 
drive the choice from a real-world use case.

bq. With these numbers, the scheme seems a pretty good thing to have before we 
really think of getting into the mud of implementing an interner.

I fear I may have caused more confusion than help by posting an initial patch 
using the Guava interner.  It turns out that a full interning implementation 
really isn't necessary, because we can trust that all ACL modification 
operations are executing under the namesystem write lock.  All we really need 
is some logic over a set data structure to check for existence of an identical 
prior {{AclFeature}} and reuse it.  It's a much simpler code change than what 
my initial patch hinted at.
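
In other words, something about this simple would do (a sketch only: the class 
and method names are hypothetical, and it assumes {{AclFeature}} implements 
{{equals}}/{{hashCode}} over its entries):

{code}
import java.util.HashMap;
import java.util.Map;

final class GlobalAclSet {
  // All ACL mutations run under the namesystem write lock, so a plain
  // HashMap is safe here; no concurrent interner is required.
  private static final Map<AclFeature, AclFeature> UNIQUE = new HashMap<>();

  static AclFeature addOrGet(AclFeature candidate) {
    AclFeature existing = UNIQUE.get(candidate);
    if (existing != null) {
      return existing;  // reuse the identical prior instance
    }
    UNIQUE.put(candidate, candidate);
    return candidate;
  }
}
{code}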

> Implement Global ACL Set for memory optimization in NameNode
> 
>
> Key: HDFS-7454
> URL: https://issues.apache.org/jira/browse/HDFS-7454
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-7454-001.patch
>
>
> HDFS-5620 indicated that a GlobalAclSet containing unique {{AclFeature}} 
> instances could be de-duplicated to save memory in the NameNode. However, it 
> was not implemented at that time.
> This Jira re-proposes the same implementation, along with de-duplication of 
> unique {{AclEntry}} objects across all ACLs.
> One simple use case is:
> A MapReduce user's home directory has a set of default ACLs, under which many 
> other files/directories may be created when jobs run. Here, all the default 
> ACLs of the parent directory will be duplicated until those ACLs are 
> explicitly deleted. With de-duplication, only one object will be in memory 
> for the same entry across all ACLs of all files/directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7454) Implement Global ACL Set for memory optimization in NameNode

2014-11-30 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229266#comment-14229266
 ] 

Chris Nauroth commented on HDFS-7454:
-

Last time I checked on this, an {{AclEntry}} could theoretically be represented 
as:
* 1 bit for scope (access or default)
* 2 bits for type (user, group, mask or other)
* 3 bits for permission (none, execute, write, write-execute, read, 
read-execute, read-write, all)
* 25 bits for name, which is the larger of user and group as represented by 
{{INodeWithAdditionalFields#PermissionStatusFormat}}.
* Perhaps 1 additional bit would be necessary to represent whether or not name 
is defined at all (essentially to support null).
* That's a total of either 31 or 32 bits depending on if we need the null 
indicator.

With that scheme, an {{AclEntry}} fits in a 32-bit Java {{int}}.  Using 
{{INodeWithAdditionalFields#PermissionStatusFormat}} instead of 
{{SerialNumberManager}} keeps it inside an {{int}} instead of requiring a 
{{long}}.  I believe this would be a safe change for any practical existing 
data, but I'd want to think through the edge cases a bit more, and we could 
jump to {{long}} if necessary.
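
For concreteness, the packing could look something like this (bit positions are 
illustrative, not a committed layout):

{code}
final class PackedAclEntry {
  // scope: bit 0; type: bits 1-2; permission: bits 3-5;
  // name-present flag: bit 6; name ID: bits 7-31 (25 bits).
  static int pack(int scope, int type, int perm, boolean hasName, int nameId) {
    return (scope & 0x1)
        | ((type & 0x3) << 1)
        | ((perm & 0x7) << 3)
        | ((hasName ? 1 : 0) << 6)
        | ((nameId & 0x1FFFFFF) << 7);
  }

  static int permission(int packed) {
    return (packed >>> 3) & 0x7;  // the other fields unpack analogously
  }
}
{code}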

However, none of this would address the duplication problem.  My earlier RAM 
sizing estimates in the HDFS-4685 design document showed that duplication of 
the {{AclFeature}} instances and the arrays they contain is a more significant 
factor than the size of the array elements.  That would indicate there is still 
value in what Vinay's patch has done here.

> Implement Global ACL Set for memory optimization in NameNode
> 
>
> Key: HDFS-7454
> URL: https://issues.apache.org/jira/browse/HDFS-7454
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-7454-001.patch
>
>
> HDFS-5620 indicated that a GlobalAclSet containing unique {{AclFeature}} 
> instances could be de-duplicated to save memory in the NameNode. However, it 
> was not implemented at that time.
> This Jira re-proposes the same implementation, along with de-duplication of 
> unique {{AclEntry}} objects across all ACLs.
> One simple use case is:
> A MapReduce user's home directory has a set of default ACLs, under which many 
> other files/directories may be created when jobs run. Here, all the default 
> ACLs of the parent directory will be duplicated until those ACLs are 
> explicitly deleted. With de-duplication, only one object will be in memory 
> for the same entry across all ACLs of all files/directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226786#comment-14226786
 ] 

Chris Nauroth commented on HDFS-7384:
-

Hi [~vinayrpet].  Thank you for the follow-up.  For this version of the patch, 
I have one compatibility concern and a few minor nitpicks.

# {{AclStatus#getEffectivePermission}}: I think there is a compatibility 
problem in this method.  Let's assume this patch goes into 2.7.0, and then we 
run a 2.7.0 client connected to a 2.6.0 NameNode.  The old NameNode will not 
populate the new permissions field in the outbound {{AclStatus}}.  The 2.7.0 
client would go into the null check path and not apply any mask, resulting in 
{{hdfs dfs -getfacl}} reporting incorrect effective permissions.  For 
compatibility, I think the shell will need a way to detect that the NameNode 
didn't populate permissions, and fall back to the current logic of using 
permissions from {{FileStatus}} (see the sketch after this list).
# {{AclStatus#getPermission}}: I suggest adding JavaDocs.
# {{AclStatus#Builder#setPermission}}: I suggest removing the word "default" 
here, just to prevent any confusion that this is somehow related to default 
ACLs.  Same thing for the private {{AclStatus}} constructor.
# Let's update the documentation in WebHDFS.apt.vm to show the new fields in 
the GETACLSTATUS example JSON response.
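
A minimal sketch of the fallback described in item 1 (the helper is 
hypothetical; {{AclStatus#getPermission}} is the new accessor this patch adds):

{code}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.permission.AclStatus;
import org.apache.hadoop.fs.permission.FsPermission;

class GetfaclCompatSketch {
  // Old NameNodes (2.6.0 and earlier) never populate the new field, so a
  // null here means: fall back to the FileStatus permissions.
  static FsPermission permissionForMask(AclStatus acl, FileStatus stat) {
    FsPermission fromAcl = acl.getPermission();
    return fromAcl != null ? fromAcl : stat.getPermission();
  }
}
{code}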


> 'getfacl' command and 'getAclStatus' output should be in sync
> -
>
> Key: HDFS-7384
> URL: https://issues.apache.org/jira/browse/HDFS-7384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-7384-001.patch, HDFS-7384-002.patch, 
> HDFS-7384-003.patch, HDFS-7384-004.patch, HDFS-7384-005.patch
>
>
> The *getfacl* command prints all of the entries, including basic and extended 
> entries, mask entries, and effective permissions.
> But the *getAclStatus* FileSystem API returns only the extended ACL entries 
> set by the user; it does not include the mask entry or effective 
> permissions.
> To benefit clients using the API, it is better to include the 'mask' entry 
> and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7431) log message for InvalidMagicNumberException may be incorrect

2014-11-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226538#comment-14226538
 ] 

Chris Nauroth commented on HDFS-7431:
-

Hello, Yi.  Thank you for investigating this and posting a patch.  I have a 
possible idea for distinguishing the 2 cases.  We throw 
{{InvalidMagicNumberException}} from 
{{SaslDataTransferServer#doSaslHandshake}}.  Within this method, we have the 
information we need to distinguish between the 2 cases:
* {{if (dnConf.getEncryptDataTransfer())}}, then it's the encrypted case.
* {{if (dnConf.getSaslPropsResolver() != null)}}, then it's the data transfer 
protection case.

After checking that, we could throw exceptions with different messages 
depending on the case.  This could either be done with 2 distinct subclasses of 
{{InvalidMagicNumberException}} or adding some kind of type tag as a member.  
For the text of the messages, I suggest:

{code}
LOG.info("Failed to read expected encryption handshake from client " +
 "at " + peer.getRemoteAddressString() + ". Perhaps the client " +
 "is running an older version of Hadoop which does not support " +
"encryption");
{code}

{code}
LOG.info("Failed to read expected SASL data transfer protection 
handshake from client " +
 "at " + peer.getRemoteAddressString() + ". Perhaps the client " +
 "is running an older version of Hadoop which does not support " +
"encryption");
{code}

What are your thoughts on this?

> log message for InvalidMagicNumberException may be incorrect
> 
>
> Key: HDFS-7431
> URL: https://issues.apache.org/jira/browse/HDFS-7431
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Minor
> Attachments: HDFS-7431.001.patch
>
>
> In secure mode, HDFS now supports DataNodes that don't require root or jsvc 
> if {{dfs.data.transfer.protection}} is configured.
> In the log message for {{InvalidMagicNumberException}}, we miss one case: 
> when the DataNodes run on an unprivileged port, 
> {{dfs.data.transfer.protection}} is configured to {{authentication}}, but 
> {{dfs.encrypt.data.transfer}} is not configured. Then a SASL handshake is 
> required, and if an older version of the dfs client is used, 
> {{InvalidMagicNumberException}} is thrown and we log:
> {quote}
> Failed to read expected encryption handshake from client at  Perhaps the 
> client is running an older version of Hadoop which does not support encryption
> {quote}
> Recently I ran HDFS built from trunk with security enabled, but with a 
> 2.5.1 client. I got the above log message, even though I had not 
> configured encryption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7188) support build libhdfs3 on windows

2014-11-24 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223177#comment-14223177
 ] 

Chris Nauroth commented on HDFS-7188:
-

Hi, [~thanhdo].  That sounds like an acceptable plan.  This is consistent with 
the way other external dependencies are being handled, like Snappy and zlib.  
We can always do another enhancement later if necessary to improve the dev 
experience of getting these dependencies.

> support build libhdfs3 on windows
> -
>
> Key: HDFS-7188
> URL: https://issues.apache.org/jira/browse/HDFS-7188
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Reporter: Zhanwei Wang
>Assignee: Thanh Do
>
> libhdfs3 should work on windows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7425) NameNode block deletion logging uses incorrect appender.

2014-11-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-7425.
-
   Resolution: Fixed
Fix Version/s: 2.6.1
 Hadoop Flags: Reviewed

I committed this to branch-2 and branch-2.6.  Haohui, thank you for the code 
review.

> NameNode block deletion logging uses incorrect appender.
> 
>
> Key: HDFS-7425
> URL: https://issues.apache.org/jira/browse/HDFS-7425
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
> Fix For: 2.6.1
>
> Attachments: HDFS-7425-branch-2.1.patch
>
>
> The NameNode uses 2 separate Log4J appenders for tracking state changes.  The 
> appenders are named "org.apache.hadoop.hdfs.StateChange" and 
> "BlockStateChange".  The intention of BlockStateChange is to separate more 
> verbose block state change logging and allow it to be configured separately.  
> In branch-2, there is some block state change logging that incorrectly goes 
> to the "org.apache.hadoop.hdfs.StateChange" appender though.  The bug is not 
> present in trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7425) NameNode block deletion logging uses incorrect appender.

2014-11-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7425:

Attachment: HDFS-7425-branch-2.1.patch

Here is a branch-2 patch that corrects this logging.  This is in agreement with 
what we have currently in trunk.  (There will be no trunk patch for this.)

> NameNode block deletion logging uses incorrect appender.
> 
>
> Key: HDFS-7425
> URL: https://issues.apache.org/jira/browse/HDFS-7425
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
> Attachments: HDFS-7425-branch-2.1.patch
>
>
> The NameNode uses 2 separate Log4J appenders for tracking state changes.  The 
> appenders are named "org.apache.hadoop.hdfs.StateChange" and 
> "BlockStateChange".  The intention of BlockStateChange is to separate more 
> verbose block state change logging and allow it to be configured separately.  
> In branch-2, there is some block state change logging that incorrectly goes 
> to the "org.apache.hadoop.hdfs.StateChange" appender though.  The bug is not 
> present in trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7425) NameNode block deletion logging uses incorrect appender.

2014-11-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7425:

Priority: Minor  (was: Major)

> NameNode block deletion logging uses incorrect appender.
> 
>
> Key: HDFS-7425
> URL: https://issues.apache.org/jira/browse/HDFS-7425
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
>
> The NameNode uses 2 separate Log4J appenders for tracking state changes.  The 
> appenders are named "org.apache.hadoop.hdfs.StateChange" and 
> "BlockStateChange".  The intention of BlockStateChange is to separate more 
> verbose block state change logging and allow it to be configured separately.  
> In branch-2, there is some block state change logging that incorrectly goes 
> to the "org.apache.hadoop.hdfs.StateChange" appender though.  The bug is not 
> present in trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7425) NameNode block deletion logging uses incorrect appender.

2014-11-21 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7425:
---

 Summary: NameNode block deletion logging uses incorrect appender.
 Key: HDFS-7425
 URL: https://issues.apache.org/jira/browse/HDFS-7425
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Chris Nauroth
Assignee: Chris Nauroth


The NameNode uses 2 separate Log4J appenders for tracking state changes.  The 
appenders are named "org.apache.hadoop.hdfs.StateChange" and 
"BlockStateChange".  The intention of BlockStateChange is to separate more 
verbose block state change logging and allow it to be configured separately.  
In branch-2, there is some block state change logging that incorrectly goes to 
the "org.apache.hadoop.hdfs.StateChange" appender though.  The bug is not 
present in trunk.
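
For reference, the two can be tuned independently through logger configuration 
in log4j.properties, along these lines (levels are illustrative):

{code}
# Generic namesystem state change logging.
log4j.logger.org.apache.hadoop.hdfs.StateChange=INFO
# More verbose block state change logging, configured separately.
log4j.logger.BlockStateChange=WARN
{code}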



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit

2014-11-18 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7398:

   Resolution: Fixed
Fix Version/s: 2.7.0
   Status: Resolved  (was: Patch Available)

I have committed this to trunk and branch-2.  Gera, thank you for taking the 
time to improve this area of the code.  Colin and Vinay, thank you for 
participating on the code review.

> Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
> --
>
> Key: HDFS-7398
> URL: https://issues.apache.org/jira/browse/HDFS-7398
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Fix For: 2.7.0
>
> Attachments: HDFS-7398.v01.patch, HDFS-7398.v02.patch
>
>
> This is a follow-up on HDFS-7385.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6833) DirectoryScanner should not register a deleting block with memory of DataNode

2014-11-18 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216546#comment-14216546
 ] 

Chris Nauroth commented on HDFS-6833:
-

Also related to the last comment, are you seeing anything in your environment 
that indicates the async deletions take an unusually long time to complete?

> DirectoryScanner should not register a deleting block with memory of DataNode
> -
>
> Key: HDFS-6833
> URL: https://issues.apache.org/jira/browse/HDFS-6833
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.5.0, 2.5.1
>Reporter: Shinichi Yamashita
>Assignee: Shinichi Yamashita
>Priority: Critical
> Attachments: HDFS-6833-6-2.patch, HDFS-6833-6-3.patch, 
> HDFS-6833-6.patch, HDFS-6833-7-2.patch, HDFS-6833-7.patch, HDFS-6833.8.patch, 
> HDFS-6833.9.patch, HDFS-6833.patch, HDFS-6833.patch, HDFS-6833.patch, 
> HDFS-6833.patch, HDFS-6833.patch
>
>
> When a block is deleted in DataNode, the following messages are usually 
> output.
> {code}
> 2014-08-07 17:53:11,606 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:11,617 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> However, in the current implementation, DirectoryScanner may run while 
> DataNode is deleting the block, and the following messages are output.
> {code}
> 2014-08-07 17:53:30,519 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:31,426 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1887080305-172.28.0.101-1407398838872 Total blocks: 1, missing metadata 
> files:0, missing block files:0, missing blocks in memory:1, mismatched 
> blocks:0
> 2014-08-07 17:53:31,426 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added 
> missing block to memory FinalizedReplica, blk_1073741825_1001, FINALIZED
>   getNumBytes() = 21230663
>   getBytesOnDisk()  = 21230663
>   getVisibleLength()= 21230663
>   getVolume()   = /hadoop/data1/dfs/data/current
>   getBlockFile()= 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>   unlinked  =false
> 2014-08-07 17:53:31,531 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> The deleting block's information is registered in DataNode's memory.
> So when DataNode sends a block report, NameNode receives wrong block 
> information.
> For example, when we recommission a node or change the replication factor, 
> NameNode may delete the correct block as an "ExcessReplicate" because of this 
> problem.
> Then "Under-Replicated Blocks" and "Missing Blocks" occur.
> When DataNode runs DirectoryScanner, it should not register a block that is 
> being deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6833) DirectoryScanner should not register a deleting block with memory of DataNode

2014-11-18 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216528#comment-14216528
 ] 

Chris Nauroth commented on HDFS-6833:
-

Thank you for working on this, Shinichi.  Echoing earlier comments, I'm a bit 
confused about why {{DirectoryScanner}} has the responsibility to call 
{{FsDatasetSpi#removeDeletedBlocks}}.  This causes us to delete the block ID 
from the internal data structure tracking still-to-be-deleted-from-disk blocks. 
 This part of the code is logically disconnected from the code that actually 
does the delete syscall, so it has no way to guarantee that the delete has 
really finished.  It seems there would still be a race condition.  If the next 
scan triggered before the delete completed, then the scanner wouldn't know that 
the block is still waiting to be deleted.  (Of course, I'd expect this to be 
extremely rare given the fact that scan periods are usually quite long, 6 hours 
by default.)  Moving this logic closer to the actual delete in 
{{ReplicaFileDeleteTask}} would address this.

I'm curious if you can provide any more details about why this is so easy to 
reproduce in your environment.  There is no doubt there is a bug here, but from 
what I can tell, it has been there a long time, and I'd expect it to occur only 
very rarely.  The scan period is so long (again, 6 hours by default) that I 
can't see how this can happen very often.  Your comments seem to suggest that 
you can see this happen regularly, and on multiple DataNodes simultaneously, 
resulting in data loss.  That would require scanners on independent DataNodes 
landing in a lock-step schedule with each other.  For a typical 3-replica file, 
this should be very unlikely.  For a 1 or even a 2-replica file, there is 
already a much higher risk of data loss due to hardware failure despite this 
bug.  Is there anything specific to your configuration that could make this 
more likely?  Have you configured the scan period to something much more 
frequent?  Are you very rapidly decommissioning and recommissioning nodes?

> DirectoryScanner should not register a deleting block with memory of DataNode
> -
>
> Key: HDFS-6833
> URL: https://issues.apache.org/jira/browse/HDFS-6833
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.5.0, 2.5.1
>Reporter: Shinichi Yamashita
>Assignee: Shinichi Yamashita
>Priority: Critical
> Attachments: HDFS-6833-6-2.patch, HDFS-6833-6-3.patch, 
> HDFS-6833-6.patch, HDFS-6833-7-2.patch, HDFS-6833-7.patch, HDFS-6833.8.patch, 
> HDFS-6833.9.patch, HDFS-6833.patch, HDFS-6833.patch, HDFS-6833.patch, 
> HDFS-6833.patch, HDFS-6833.patch
>
>
> When a block is deleted in DataNode, the following messages are usually 
> output.
> {code}
> 2014-08-07 17:53:11,606 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:11,617 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Deleted BP-1887080305-172.28.0.101-1407398838872 blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
> {code}
> However, in the current implementation, DirectoryScanner may run while 
> DataNode is deleting the block, and the following messages are output.
> {code}
> 2014-08-07 17:53:30,519 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:
>  Scheduling blk_1073741825_1001 file 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>  for deletion
> 2014-08-07 17:53:31,426 INFO 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
> BP-1887080305-172.28.0.101-1407398838872 Total blocks: 1, missing metadata 
> files:0, missing block files:0, missing blocks in memory:1, mismatched 
> blocks:0
> 2014-08-07 17:53:31,426 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added 
> missing block to memory FinalizedReplica, blk_1073741825_1001, FINALIZED
>   getNumBytes() = 21230663
>   getBytesOnDisk()  = 21230663
>   getVisibleLength()= 21230663
>   getVolume()   = /hadoop/data1/dfs/data/current
>   getBlockFile()= 
> /hadoop/data1/dfs/data/current/BP-1887080305-172.28.0.101-1407398838872/current/finalized/subdir0/subdir0/blk_1073741825
>   unlinked  =false
> 2014-08-07 17:53:31,531 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService:

[jira] [Updated] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit

2014-11-17 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7398:

Hadoop Flags: Reviewed

bq. Access to protected is the superset of the package scope according to the 
JLS Section 6.6.1.

Oops, you're right.

I'm +1 for patch v2.  The test failure is unrelated.  It passed locally for me.

I see a lot of other watchers here, so I'll hold off 24 hours before 
committing.  Thanks, Gera.

> Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
> --
>
> Key: HDFS-7398
> URL: https://issues.apache.org/jira/browse/HDFS-7398
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: HDFS-7398.v01.patch, HDFS-7398.v02.patch
>
>
> This is a follow-up on HDFS-7385.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit

2014-11-17 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215272#comment-14215272
 ] 

Chris Nauroth commented on HDFS-7398:
-

This version looks great to me.  Just one minor nitpick: let's mark 
{{FSEditLogOp#resetSubFields}} as {{protected}}.  That will enforce that 
visibility is open only for subclasses to implement, and not for other classes 
within the same package to call.  I'll be +1 after that change and a fresh 
Jenkins run.  Thanks!

> Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
> --
>
> Key: HDFS-7398
> URL: https://issues.apache.org/jira/browse/HDFS-7398
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: HDFS-7398.v01.patch, HDFS-7398.v02.patch
>
>
> This is a follow-up on HDFS-7385.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit

2014-11-17 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214830#comment-14214830
 ] 

Chris Nauroth commented on HDFS-7398:
-

We could have {{FSEditLog#baseReset}} call the abstract {{reset}} after it 
resets {{txid}}, {{rpcClientId}} and {{rpcCallId}}.  That would still preserve 
the compile-time constraint that new ops must implement {{reset}}.

If we do that, then we might also change up the naming a little bit, so that 
{{reset}} is a final method called externally from {{FSEditLog}}, and we use 
{{doReset}} or {{resetSubfields}} or some similar name for a protected abstract 
method.  That's just a nitpick though.
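
To make the naming suggestion concrete, this is the shape I have in mind (a 
sketch; the real op classes carry more state than shown):

{code}
abstract class FSEditLogOp {
  long txid;
  byte[] rpcClientId;
  int rpcCallId;

  // Called from FSEditLog; final, so subclasses can't skip the base reset.
  final void reset() {
    txid = 0;
    rpcClientId = null;
    rpcCallId = 0;
    resetSubFields();
  }

  // Each op subclass clears only its own cached fields here.
  protected abstract void resetSubFields();
}
{code}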

Thanks for putting this patch together, Gera!

> Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
> --
>
> Key: HDFS-7398
> URL: https://issues.apache.org/jira/browse/HDFS-7398
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
> Attachments: HDFS-7398.v01.patch
>
>
> This is a follow-up on HDFS-7385.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213439#comment-14213439
 ] 

Chris Nauroth commented on HDFS-7384:
-

I haven't reviewed the whole patch yet, but I wanted to state again quickly 
that I'd prefer to keep effective permissions out of {{AclEntry}}.

One problem is that the {{AclEntry}} class is also used in the setter APIs, 
like {{setAcl}}.  In that context, the effective permissions would be ignored.  
This could cause confusion for users of those APIs.

Another problem is that we use the same class for both the public API on the 
client side and the internal in-memory representation in the NameNode.  
Therefore, adding a new member to {{AclEntry}} would have a side effect of 
increasing memory footprint in the NameNode.  Even if we don't populate the 
field when used within the NameNode, there is still the overhead of the 
additional pointer multiplied by every ACL entry.  We could potentially change 
the NameNode to use a different class for its internal implementation, but then 
we'd have a dual-maintenance problem and a need for extra code to translate 
between the two representations.

If {{AclStatus}} could have a new method that does the calculation for an 
entry's effective permissions on demand, instead of requiring a new member in 
{{AclEntry}}, then we wouldn't impact the setter APIs or increase memory 
footprint in the NameNode.

> 'getfacl' command and 'getAclStatus' output should be in sync
> -
>
> Key: HDFS-7384
> URL: https://issues.apache.org/jira/browse/HDFS-7384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-7384-001.patch
>
>
> The *getfacl* command prints all of the entries, including basic and extended 
> entries, mask entries, and effective permissions.
> But the *getAclStatus* FileSystem API returns only the extended ACL entries 
> set by the user; it does not include the mask entry or effective 
> permissions.
> To benefit clients using the API, it is better to include the 'mask' entry 
> and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-3749) Disable check for jsvc on windows

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-3749.
-
Resolution: Won't Fix

This is no longer required, because HDFS-2856 has been implemented, providing 
SASL as a means to authenticate the DataNode instead of jsvc/privileged ports.  
I'm resolving this as Won't Fix.

> Disable check for jsvc on windows
> -
>
> Key: HDFS-3749
> URL: https://issues.apache.org/jira/browse/HDFS-3749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hdfs-3749-trunk.patch, hdfs-3749.patch, hdfs-3749.patch
>
>
> Jsvc doesn't make sense on windows and thus we should not require the 
> datanode to start up under it on that platform.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7386:

   Resolution: Fixed
Fix Version/s: 2.7.0
   Status: Resolved  (was: Patch Available)

I committed this to trunk and branch-2.  Yongjun, thank you for improving this 
part of the code.

> Replace check "port number < 1024" with shared isPrivilegedPort method 
> ---
>
> Key: HDFS-7386
> URL: https://issues.apache.org/jira/browse/HDFS-7386
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, security
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Trivial
> Fix For: 2.7.0
>
> Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch
>
>
> Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
> check "port number < 1024" with shared isPrivilegedPort method.
> Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7386:

 Component/s: security
  datanode
Target Version/s: 2.7.0
Hadoop Flags: Reviewed

+1 for the patch.  I agree that the test failures are unrelated.  I saw the 
same thing that you saw when I reran locally.  I'll commit this.

> Replace check "port number < 1024" with shared isPrivilegedPort method 
> ---
>
> Key: HDFS-7386
> URL: https://issues.apache.org/jira/browse/HDFS-7386
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, security
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Trivial
> Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch
>
>
> Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
> check "port number < 1024" with shared isPrivilegedPort method.
> Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-3806) Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-3806.
-
Resolution: Duplicate

I'm resolving this as duplicate of HDFS-3519.

> Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState
> --
>
> Key: HDFS-3806
> URL: https://issues.apache.org/jira/browse/HDFS-3806
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
> Environment: Jenkins
>Reporter: Trevor Robinson
>Priority: Minor
>
> Failed in Jenkins build for unrelated issue (HDFS-3804): 
> https://builds.apache.org/job/PreCommit-HDFS-Build/3011/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestStandbyCheckpoints/testBothNodesInStandbyState/
> {noformat}
> java.lang.AssertionError: Expected non-empty 
> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/fsimage_012
>   at org.junit.Assert.fail(Assert.java:91)
>   at org.junit.Assert.assertTrue(Assert.java:43)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageTestUtil.assertNNHasCheckpoints(FSImageTestUtil.java:467)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.HATestUtil.waitForCheckpoint(HATestUtil.java:213)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testBothNodesInStandbyState(TestStandbyCheckpoints.java:133)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212736#comment-14212736
 ] 

Chris Nauroth commented on HDFS-6711:
-

This was fixed in HDFS-7218, so I'm resolving this as duplicate.

> FSNamesystem#getAclStatus does not write to the audit log.
> --
>
> Key: HDFS-6711
> URL: https://issues.apache.org/jira/browse/HDFS-6711
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.0.0, 2.4.0
>Reporter: Chris Nauroth
>Priority: Minor
>
> Consider writing an event to the audit log for the {{getAclStatus}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-6711.
-
Resolution: Duplicate

> FSNamesystem#getAclStatus does not write to the audit log.
> --
>
> Key: HDFS-6711
> URL: https://issues.apache.org/jira/browse/HDFS-6711
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.0.0, 2.4.0
>Reporter: Chris Nauroth
>Priority: Minor
>
> Consider writing an event to the audit log for the {{getAclStatus}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212734#comment-14212734
 ] 

Chris Nauroth commented on HDFS-7384:
-

Yes, what you described makes sense.  An older client simply wouldn't consume 
the new protobuf field.

I'd prefer not to add the effective action directly to {{AclEntry}}, since the 
effective action only makes sense when the entry is 
considered against some other object (the mask).
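
For illustration, a minimal sketch (not part of the proposal itself) of how a 
caller could compute an entry's effective action on demand from the mask 
entry's {{FsAction}}, rather than storing it as a new member in {{AclEntry}}:

{code}
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.FsAction;

public final class EffectiveAction {
  // POSIX semantics: for masked entries (named user, group, named group),
  // the effective permission is the intersection of the entry's permission
  // with the MASK entry's permission from the same access ACL.
  public static FsAction of(AclEntry entry, FsAction mask) {
    return entry.getPermission().and(mask);
  }
}
{code}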

Overall, it sounds good.  Thanks for thinking this through and putting out the 
proposal!

> 'getfacl' command and 'getAclStatus' output should be in sync
> -
>
> Key: HDFS-7384
> URL: https://issues.apache.org/jira/browse/HDFS-7384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
>
> The *getfacl* command will print all the entries, including the basic and 
> extended entries, the mask entry, and effective permissions.
> But the *getAclStatus* FileSystem API will return only the extended ACL 
> entries set by the user; it will include neither the mask entry nor the 
> effective permissions.
> To benefit clients using the API, it would be better to include the 'mask' 
> entry and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6962) ACLs inheritance conflict with umaskmode

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-6962:

Target Version/s: 2.7.0  (was: 2.4.1)

Hello, [~Alexandre LINTE].  Thank you for filing this issue.  I tested the same 
scenario against a Linux local file system, and I confirmed that HDFS is 
showing different behavior, just like you described.

I also confirmed that this is a divergence from the POSIX ACL specs.  Here is a 
quote of the relevant section:

{quote}
The permissions of inherited access ACLs are further modified by the mode 
parameter that each system call creating file system objects has. The mode 
parameter contains nine permission bits that stand for the permissions of the 
owner, group, and other class permissions. The effective permissions of each 
class are set to the intersection of the permissions defined for this class in 
the ACL and specified in the mode parameter.

If the parent directory has no default ACL, the permissions of the new file are 
determined as defined in POSIX.1. The effective permissions are set to the 
permissions defined in the mode parameter, minus the permissions set in the 
current umask.

The umask has no effect if a default ACL exists.
{quote}
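
To make the quoted rule concrete, here is a tiny worked example using Hadoop's 
{{FsAction}}; the specific values are illustrative assumptions:

{code}
import org.apache.hadoop.fs.permission.FsAction;

// Parent default ACL grants group::rwx; create() passes mode 0750; a umask of
// 027 is configured but is ignored because the parent has a default ACL. The
// group-class effective permission is the intersection of rwx (from the ACL)
// and r-x (the group bits of the mode).
FsAction fromDefaultAcl = FsAction.ALL;             // group::rwx
FsAction fromMode = FsAction.READ_EXECUTE;          // group bits of 0750
FsAction effective = fromDefaultAcl.and(fromMode);  // => READ_EXECUTE (r-x)
{code}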

Changing this behavior is going to be somewhat challenging.  Note the 
distinction made in the spec between mode and umask.  When creating a new child 
(file or directory) of a directory with a default ACL, the mode influences the 
inherited access ACL entries, but the umask has no effect.  Unfortunately, our 
current implementation intersects mode and umask on the client side before 
passing them to the NameNode in the RPC.  This happens in {{DFSClient#mkdirs}} 
and {{DFSClient#create}}:

{code}
  public boolean mkdirs(String src, FsPermission permission,
      boolean createParent) throws IOException {
    if (permission == null) {
      permission = FsPermission.getDefault();
    }
    FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

{code}
  public DFSOutputStream create(String src,
                                FsPermission permission,
                                EnumSet<CreateFlag> flag,
                                boolean createParent,
                                short replication,
                                long blockSize,
                                Progressable progress,
                                int buffersize,
                                ChecksumOpt checksumOpt,
                                InetSocketAddress[] favoredNodes)
      throws IOException {
    checkOpen();
    if (permission == null) {
      permission = FsPermission.getFileDefault();
    }
    FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

On the NameNode side, when it copies the default ACL from parent to child, the 
information has already been lost.  We have just a single piece of permissions 
data, with no knowledge of which part was the mode and which was the umask on 
the client side.

A potential solution is to push both mode and umask explicitly to the NameNode 
in the RPC requests for {{MkdirsRequestProto}} and {{CreateRequestProto}}.  
Those messages already contain an instance of {{FsPermissionProto}}.  We could 
add a second optional instance.  If both instances are defined, then the 
NameNode would interpret one as being mode and the other as being umask.  There 
would still be a possibility of an older client still passing just one 
instance, and in that case, we'd have to fall back to the current behavior.  
It's a bit messy, but it could work.
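
To make the idea concrete, here is a rough sketch of the NameNode-side 
resolution; the method and parameter names are hypothetical, not the actual 
protocol:

{code}
// Hypothetical sketch: with two optional FsPermission instances in the RPC,
// the NameNode can distinguish mode from umask. A single instance means an
// older client that already sent mode & ~umask, so we keep today's behavior.
FsPermission resolveEffectivePermission(FsPermission mode, FsPermission umask,
    boolean parentHasDefaultAcl) {
  if (umask == null) {
    return mode;  // older client: value was already masked on the client side
  }
  if (parentHasDefaultAcl) {
    return mode;  // POSIX: umask has no effect; mode filters the inherited ACL
  }
  return mode.applyUMask(umask);
}
{code}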

We also have one additional problem specific to the shell for files (not 
directories).  The implementation of copyFromLocal breaks down into 2 separate 
RPCs: creating the file, followed by a separate chmod call.  The NameNode has 
no way of knowing if that chmod call is part of a copyFromLocal or not though.  
It's too late to enforce the mode vs. umask distinction.

I'm tentatively targeting this to 2.7.0.  I think this will need more 
investigation to make sure there are no compatibility issues with the solution. 
 If there is an unavoidable compatibility problem, then it might require 
pushing out to 3.x.  We won't know for sure until someone starts coding.

Thank you again for the very detailed bug report.

> ACLs inheritance conflict with umaskmode
> 
>
> Key: HDFS-6962
> URL: https://issues.apache.org/jira/browse/HDFS-6962
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.4.1
> Environment: CentOS release 6.5 (Final)
>Reporter: LINTE
>  Labels: hadoop, security
>
> In hdfs-site.xml:
> <property>
>   <name>dfs.umaskmode</name>
>   <value>027</value>
> </property>
> 1/ Create a directory as superuser
> bash# hdfs dfs -mkdir  /tmp/ACLS
> 2/ Set default ACLs on this directory: rwx access for group readwrite and 
> user toto
> bash# hdfs dfs -setfac

[jira] [Resolved] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-7177.
-
Resolution: Duplicate

> Add an option to include minimal ACL in getAclStatus return
> ---
>
> Key: HDFS-7177
> URL: https://issues.apache.org/jira/browse/HDFS-7177
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>Priority: Minor
>
> Currently the 3 minimal ACL entries are not included in the returned value of 
> getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = 
> item.stat.getPermission();}}). It'd be useful to make it optional to include 
> them, so that external programs can get a complete view of the permissions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212555#comment-14212555
 ] 

Chris Nauroth commented on HDFS-7177:
-

Hi, [~zhz].  I just realized too late that HDFS-7384 is reporting basically the 
same thing as this.  I just entered a huge comment on HDFS-7384 about it, so 
I'd prefer to resolve this one as duplicate, even though it really came first.  
I'll add all of the watchers over to HDFS-7384 so that they can still be 
involved in the conversation.  Thanks!

> Add an option to include minimal ACL in getAclStatus return
> ---
>
> Key: HDFS-7177
> URL: https://issues.apache.org/jira/browse/HDFS-7177
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>Priority: Minor
>
> Currently the 3 minimal ACL entries are not included in the returned value of 
> getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = 
> item.stat.getPermission();}}). It'd be useful to make it optional to include 
> them, so that external programs can get a complete view of the permissions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212548#comment-14212548
 ] 

Chris Nauroth commented on HDFS-7384:
-

Hi, [~vinayrpet].  The current behavior of {{getAclStatus}} is an intentional 
design choice, but the history behind that choice is a bit convoluted.  Let me 
see if I can reconstruct it here.

It starts with HADOOP-10220, which added an ACL indicator bit to 
{{FsPermission}}.  This was provided as an optimization so that clients could 
quickly identify if a file has an ACL, without needing an additional RPC.
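
For context, the optimization works roughly like this on the client side 
(a sketch, assuming the 2.x {{FsPermission#getAclBit}} accessor):

{code}
// Sketch: use the ACL indicator bit to skip the getAclStatus RPC entirely
// when a file has no ACL.
FileStatus status = fs.getFileStatus(path);
if (status.getPermission().getAclBit()) {
  // Only pay the extra round trip when an ACL is actually present.
  AclStatus aclStatus = fs.getAclStatus(path);
}
{code}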

Later, objections were raised against the ACL bit in HDFS-5923 and HDFS-5932.  
We made a decision to roll back the HADOOP-10220 changes, and instead require 
callers to use {{getAclStatus}} to identify the presence of an ACL.  Prior to 
this, early implementations of {{getAclStatus}} would always return a non-empty 
list.  For an inode with no ACL, it would return the "minimal ACL" containing 
the 3 entries that correspond to basic POSIX permissions.  However, at this 
point, it became helpful to change {{getAclStatus}} so that it would return an 
empty list if there is no ACL.  This was seen as easier for clients than trying 
to check the entries for no ACL/minimal ACL.  It was also seen as a cleaner 
logical separation, since the client likely already has the {{FsPermission}} 
prior to calling {{getAclStatus}}, and therefore it would not be helpful to 
return redundant ACL entries.

Finally, HDFS-6326 identified that our implementation choice was 
backwards-incompatible for webhdfs, and generally a performance bottleneck for 
shell users.  To solve this, we reinstated the ACL bit, in a slightly different 
implementation, but the behavior of {{getAclStatus}} remained the same.

You've definitely identified a weakness in the current API design, and I raised 
similar objections at the time.  It's a trade-off.  I think there is good 
logical separation right now, but as a side effect, it does mean that callers 
may need some extra client-side logic to piece all of the information together, 
such as if someone wanted to write a custom GUI consuming WebHDFS to display 
ACL information.

At this point, we can't change the behavior of {{getAclStatus}} on the 2.x line 
for compatibility reasons.  Suppose a 2.6.0 deployment of the shell called 
{{getAclStatus}} on a 2.7.0 NameNode, and it had been changed to return the 
complete ACL.  This would cause {{getfacl}} to display duplicate entries, 
because the 2.6.0 logic of {{GetfaclCommand}} and 
{{AclUtil#getAclFromPermAndEntries}} will combine the output of 
{{getAclStatus}} with the {{FsPermission}}, resulting in 3 duplicate entries.
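
Sketched out, that client-side recombination looks roughly like this (assuming 
the current shape of the {{AclUtil}} helper):

{code}
// Sketch of what the 2.6.0 shell effectively does: rebuild the full ACL by
// combining the permission bits with the extended entries from getAclStatus.
// If a newer server returned the complete ACL, the three base entries would
// appear twice in the combined list.
FileStatus stat = fs.getFileStatus(path);
AclStatus aclStatus = fs.getAclStatus(path);
List<AclEntry> fullAcl = AclUtil.getAclFromPermAndEntries(
    stat.getPermission(), aclStatus.getEntries());
{code}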

Where does that leave us for this jira?  I can see the following options:
# Resolve as won't fix, based on the above rationale.
# Target 3.0 for a backwards-incompatible change.
# Add a new RPC, named {{getFullAcl}} or similar, with the behavior that you 
proposed.  However, I'd prefer not to increase the API footprint unless there 
is a really strong use case.

Hope this helps.  Let me know your thoughts.  Thanks!

> 'getfacl' command and 'getAclStatus' output should be in sync
> -
>
> Key: HDFS-7384
> URL: https://issues.apache.org/jira/browse/HDFS-7384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
>
> The *getfacl* command will print all the entries, including the basic and 
> extended entries, the mask entry, and effective permissions.
> But the *getAclStatus* FileSystem API will return only the extended ACL 
> entries set by the user; it will include neither the mask entry nor the 
> effective permissions.
> To benefit clients using the API, it would be better to include the 'mask' 
> entry and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7396) Revisit synchronization in Namenode

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212469#comment-14212469
 ] 

Chris Nauroth commented on HDFS-7396:
-

bq. Whenever we experimented with improving concurrency, the limiting factor 
was the garbage collection overhead.

I also would be interested in seeing more information on this.  We've been 
updating our recommendations for garbage collection tuning recently.  It would 
be interesting for us to compare notes.

I'm also curious if you've tried any experiments running with the G1 collector. 
 I haven't tried it in several years.  When I tried it, it was still very 
experimental, so I ended up hitting too many bugs to run it in production.  
Perhaps it has stabilized by now.

> Revisit synchronization in Namenode
> ---
>
> Key: HDFS-7396
> URL: https://issues.apache.org/jira/browse/HDFS-7396
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>
> HDFS-2106 separated block management to a new package from namenode.  As part 
> of it, some code was refactored to new classes such as DatanodeManager, 
> HeartbeatManager, etc.  There are opportunities for improve locking in 
> namenode while currently the synchronization in namenode is mainly done by a 
> single global FSNamesystem lock. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211525#comment-14211525
 ] 

Chris Nauroth commented on HDFS-7385:
-

On a side note, I have to wonder if the thread-local storage here is an 
unnecessary optimization at this point.  It might be interesting to tear it 
down and just let the edit logging code paths create new short-lived instances 
that likely never leave eden.  We could compare JIT assembly output before and 
after to see if it really makes a difference.
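
For anyone following along, a generic illustration of why the reuse is fragile 
(simplified types, not FSEditLog's actual classes); removing the cache and 
allocating a fresh op per call would eliminate the hazard:

{code}
// Any optional field set on a reused per-thread op instance survives into the
// next transaction unless every use resets it -- exactly the HDFS-7385 bug.
final class Op {
  List<AclEntry> aclEntries;  // optional field
  Op reset() { aclEntries = null; return this; }
}

private static final ThreadLocal<Op> CACHE = new ThreadLocal<Op>() {
  @Override
  protected Op initialValue() {
    return new Op();
  }
};

Op nextOp() {
  // Without reset(), a mkdir with ACLs followed by a mkdir without ACLs on
  // the same handler thread silently reuses the stale aclEntries.
  return CACHE.get().reset();
}
{code}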

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> -
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211505#comment-14211505
 ] 

Chris Nauroth commented on HDFS-7385:
-

Hi, [~jira.shegalov].  Yes, I had the same thought, but I wanted to keep the 
patch here small and focused on the known problem, since we were hoping to 
include it in the 2.6.0 release candidate quickly.  Please do feel free to file 
a subsequent jira for further refactoring targeting 2.7.0.  Thanks!

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> -
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211286#comment-14211286
 ] 

Chris Nauroth commented on HDFS-7385:
-

I agree.  The workaround makes sense.  Thanks again!

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> -
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7385:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I committed this to trunk, branch-2, branch-2.6 and branch-2.6.0.  
[~jiangyu1211], thank you again for reporting the issue and providing a patch.

bq. I also wonder if we should advise some procedure to help users who have 
already applied ACLs to repair their metadata.

Yes, we can document a workaround here in the jira.  If I understand correctly, 
your suggested workaround is running {{hdfs dfsadmin -saveNamespace}} at the 
active to force correct persistence of all inodes to the fsimage, thus 
bypassing the buggy edits.  Do I have it correct?

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> -
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7385:

Fix Version/s: 2.6.0

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> -
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7385:

Summary: ThreadLocal used in FSEditLog class causes FSImage permission mess 
up  (was: ThreadLocal used in FSEditLog class  lead FSImage permission mess up)

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> -
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7385:

Hadoop Flags: Reviewed

For {{TestCacheDirectives}}, it looks like we hit one of the build collision 
issues we've seen on Jenkins lately, resulting in a {{NoClassDefFoundError}}.  
The other failures are unrelated issues that are tracked in other jiras.  The 
failures did not repro locally for me.  I'm going to commit this.

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7385:

Due Date: 13/Nov/14  (was: 10/Nov/14)
Priority: Blocker  (was: Critical)

General consensus is that this is severe enough to hold the 2.6.0 release 
candidate, so I'm marking it a blocker.  At this point, we're just waiting on 
Jenkins.

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Blocker
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7385:

Status: Patch Available  (was: Open)

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.0, 2.4.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Critical
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-13 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7385:

Attachment: HDFS-7385.2.patch

Hello, [~jiangyu1211].  Like others have already said, this is a great find, 
and thank you for reporting it.  Sorry for jumping in, but I want to expedite 
getting this fixed, so I'm attaching a v2 patch.  I still intend to credit the 
patch to you.  You did the hard part.  :-)

I agree with the suggestions to add a test and add a {{reset}} method to the 
ops.  I'm uploading a patch that does that.  The test works by configuring a 
single RPC handler thread, which guarantees that all transactions get processed 
on the same thread, and therefore hit the same thread-local storage.

[~hitliuyi] or [~vinayrpet], how does this look?
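
A simplified sketch of that single-handler technique (not the attached patch; 
it assumes static imports from Hadoop's {{AclTestHelpers}}, 
{{AclEntryScope}}, {{AclEntryType}}, and {{FsAction}}, plus Guava's 
{{Lists}}):

{code}
// Force a single RPC handler so consecutive operations share one thread-local
// op cache in FSEditLog, then check that an ACL from one mkdir does not leak
// into the edit log record of the next.
Configuration conf = new Configuration();
conf.setInt(DFSConfigKeys.DFS_NAMENODE_HANDLER_COUNT_KEY, 1);
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).build();
try {
  DistributedFileSystem fs = cluster.getFileSystem();
  Path parent = new Path("/withDefaultAcl");
  fs.mkdirs(parent);
  fs.setAcl(parent, Lists.newArrayList(
      aclEntry(DEFAULT, USER, "foo", ALL)));
  fs.mkdirs(new Path(parent, "child"));  // MkdirOp carries the inherited ACL
  Path plain = new Path("/plainDir");
  fs.mkdirs(plain);                      // same handler thread, cached op
  cluster.restartNameNode();             // replay the edit log from disk
  // Before the fix, /plainDir would come back with /withDefaultAcl's entries.
  assertTrue(fs.getAclStatus(plain).getEntries().isEmpty());
} finally {
  cluster.shutdown();
}
{code}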

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>Priority: Critical
> Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; we then changed the 
> configuration of all DataNodes and restarted them, so they sent block 
> reports to the new NameNodes at once, followed by heartbeats.
>   Everything seemed perfect, but after we restarted the ResourceManager, 
> most users complained that their jobs could not be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that previously had no ACLs set now carried 
> ACL entries. That is why users could not execute their jobs, so we had to 
> change the permissions of most files to a+r and of directories to a+rx to 
> make sure the jobs could run.
>   After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog does not set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the AclEntries from the 
> cache are still the ones from the previous call, and because the newNode 
> has no AclFeature, there is no chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby NameNode loads the 
> editlogs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the active NameNode, all the 
> fsimages become wrong. The only remedy is to save the namespace from the 
> active NameNode, which produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method

2014-11-12 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209227#comment-14209227
 ] 

Chris Nauroth commented on HDFS-7386:
-

Thanks for the patch, Yongjun.  This looks good.  Here are just a few comments:
# Let's JavaDoc the new {{SecurityUtil#isPrivilegedPort}} method.
# There is one more place that we can use this new method, in 
{{SecureDataNodeStarter#getSecureResources}}.  In this case, you'll want to 
negate the return value of {{SecurityUtil#isPrivilegedPort}}.
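
Sketched, the shared helper could look like this (a sketch of the shape under 
review, not the committed code):

{code}
/**
 * Checks whether the given port is in the privileged range, i.e. binding to
 * it requires elevated privileges on POSIX systems.
 *
 * @param port port number to check
 * @return true if the port is privileged
 */
public static boolean isPrivilegedPort(final int port) {
  return port < 1024;
}
{code}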


> Replace check "port number < 1024" with shared isPrivilegedPort method 
> ---
>
> Key: HDFS-7386
> URL: https://issues.apache.org/jira/browse/HDFS-7386
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Trivial
> Attachments: HDFS-7386.001.patch
>
>
> Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
> check "port number < 1024" with shared isPrivilegedPort method.
> Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-12 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7387:

Fix Version/s: (was: 2.7.0)
   2.6.0

I've merged this to branch-2.6 and branch-2.6.0 for inclusion in the 2.6.0 
release candidate.

> NFS may only do partial commit due to a race between COMMIT and write
> -
>
> Key: HDFS-7387
> URL: https://issues.apache.org/jira/browse/HDFS-7387
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Critical
> Fix For: 2.6.0
>
> Attachments: HDFS-7387.001.patch, HDFS-7387.002.patch
>
>
> The requested range may not be committed when the following happens:
> 1. The last pending write is removed from the queue to be written to HDFS.
> 2. A commit request arrives; NFS sees there is no pending write, so it will 
> do a sync.
> 3. This sync request could flush only part of the last write to HDFS.
> 4. If a file read happens immediately after the above steps, the user may 
> not see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method

2014-11-12 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208354#comment-14208354
 ] 

Chris Nauroth commented on HDFS-7386:
-

Hi Yongjun.

On further reflection, I think we should not incorporate a Windows check here.  
Sometimes the check for < 1024 is used on the client side to detect the 
behavior of the server side.  If we consider the possibility of a Windows 
client connecting to a Linux server, then the client on Windows could assume 
incorrectly that there are no privileged ports, even though the server on Linux 
does have privileged ports.  As a practical matter, I think this means that 
when secure mode is fully implemented for Windows, there is going to be a 
limitation that the DataNode can't use a port < 1024.  Otherwise, it would 
throw off some of this detection logic.  It's not a bad limitation, just 
something we'll need to be aware of.

> Replace check "port number < 1024" with shared isPrivilegedPort method 
> ---
>
> Key: HDFS-7386
> URL: https://issues.apache.org/jira/browse/HDFS-7386
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>
> Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
> check "port number < 1024" with shared isPrivilegedPort method.
> Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7389) Named user ACL cannot stop the user from accessing the FS entity.

2014-11-11 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7389:

   Resolution: Fixed
Fix Version/s: 2.7.0
   Status: Resolved  (was: Patch Available)

I have committed this to trunk and branch-2.  Chunjun, thank you for reporting 
the bug.  Vinay, thank you for providing the patch.

> Named user ACL cannot stop the user from accessing the FS entity.
> -
>
> Key: HDFS-7389
> URL: https://issues.apache.org/jira/browse/HDFS-7389
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.1
>Reporter: Chunjun Xiao
>Assignee: Vinayakumar B
> Fix For: 2.7.0
>
> Attachments: HDFS-7389-001.patch, HDFS-7389-002.patch
>
>
> In 
> http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/:
> {quote}
> It’s important to keep in mind the order of evaluation for ACL entries when a 
> user attempts to access a file system object:
> 1. If the user is the file owner, then the owner permission bits are enforced.
> 2. Else if the user has a named user ACL entry, then those permissions are 
> enforced.
> 3. Else if the user is a member of the file’s group or any named group in an 
> ACL entry, then the union of permissions for all matching entries are 
> enforced.  (The user may be a member of multiple groups.)
> 4. If none of the above were applicable, then the other permission bits are 
> enforced.
> {quote}
> Assume we have a user UserA from group GroupA, if we config a directory as 
> following ACL entries:
> group:GroupA:rwx
> user:UserA:---
> According to the design spec above, userA should have no access permission to 
> the file object, while actually userA still has rwx access to the dir.
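
A compact sketch of the evaluation order quoted above, in hypothetical simplified types (the real enforcement lives in {{FSPermissionChecker}}). For the reported configuration, rule 2 must match {{user:UserA:---}} before rule 3 ever consults {{group:GroupA:rwx}}:

{code}
import java.util.List;
import java.util.Set;

class AclCheckSketch {
  static class Entry {
    final String type;   // "user" or "group"
    final String name;
    final int perms;     // rwx bits, e.g. 7 for rwx, 0 for ---
    Entry(String type, String name, int perms) {
      this.type = type; this.name = name; this.perms = perms;
    }
  }

  static int effectivePerms(String owner, int ownerPerms, int otherPerms,
      List<Entry> entries, String user, Set<String> groups) {
    if (user.equals(owner)) {
      return ownerPerms;                      // rule 1: owner bits
    }
    for (Entry e : entries) {                 // rule 2: named user entry wins next
      if ("user".equals(e.type) && e.name.equals(user)) {
        return e.perms;                       // user:UserA:--- => 0, access denied
      }
    }
    int union = 0;
    boolean matched = false;
    for (Entry e : entries) {                 // rule 3: union of matching group entries
      if ("group".equals(e.type) && groups.contains(e.name)) {
        union |= e.perms;
        matched = true;
      }
    }
    return matched ? union : otherPerms;      // rule 4: fall through to other bits
  }
}
{code}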



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7389) Named user ACL cannot stop the user from accessing the FS entity.

2014-11-11 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7389:

Hadoop Flags: Reviewed

Thank you for reporting this, [~chunjun.xiao].

[~vinayrpet], the patch looks good.  It looks like 2 Jenkins runs interfered 
with each other, something that I've seen causing trouble lately.  I've 
triggered a fresh Jenkins run here:

https://builds.apache.org/job/PreCommit-HDFS-Build/8714/

+1 pending the new Jenkins run.  Thank you!

> Named user ACL cannot stop the user from accessing the FS entity.
> -
>
> Key: HDFS-7389
> URL: https://issues.apache.org/jira/browse/HDFS-7389
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.1
>Reporter: Chunjun Xiao
>Assignee: Vinayakumar B
> Attachments: HDFS-7389-001.patch, HDFS-7389-002.patch
>
>
> In 
> http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/:
> {quote}
> It’s important to keep in mind the order of evaluation for ACL entries when a 
> user attempts to access a file system object:
> 1. If the user is the file owner, then the owner permission bits are enforced.
> 2. Else if the user has a named user ACL entry, then those permissions are 
> enforced.
> 3. Else if the user is a member of the file’s group or any named group in an 
> ACL entry, then the union of permissions for all matching entries are 
> enforced.  (The user may be a member of multiple groups.)
> 4. If none of the above were applicable, then the other permission bits are 
> enforced.
> {quote}
> Assume we have a user UserA from group GroupA, if we config a directory as 
> following ACL entries:
> group:GroupA:rwx
> user:UserA:---
> According to the design spec above, userA should have no access permission to 
> the file object, while actually userA still has rwx access to the dir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7328) TestTraceAdmin assumes Unix line endings.

2014-11-10 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205128#comment-14205128
 ] 

Chris Nauroth commented on HDFS-7328:
-

No problem, [~cmccabe].

bq. It seems like Java chose to do things differently and as a consequence we 
probably have a lot of these cases.

Don't you just love consistency?  :-)

There have been a lot of these cases throughout the past 2 years, but I think 
all of them have been cleaned up at this point.  If we had Windows Jenkins, 
then it would help catch introduction of new occurrences.  I hope to revisit 
this topic on the Apache infra side in the next several months.

> TestTraceAdmin assumes Unix line endings.
> -
>
> Key: HDFS-7328
> URL: https://issues.apache.org/jira/browse/HDFS-7328
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Fix For: 2.6.0
>
> Attachments: HDFS-7328.1.patch
>
>
> {{TestTraceAdmin}} contains some string assertions that assume Unix line 
> endings.  The test fails on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7383) DataNode.requestShortCircuitFdsForRead may throw NullPointerException

2014-11-08 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7383:

Hadoop Flags: Reviewed

+1 for the patch.  The test failures look unrelated, and I verified that the 
tests pass in my local environment.  Thank you for the patch, Nicholas.

> DataNode.requestShortCircuitFdsForRead may throw NullPointerException
> -
>
> Key: HDFS-7383
> URL: https://issues.apache.org/jira/browse/HDFS-7383
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7383_20141108.patch
>
>
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.requestShortCircuitFdsForRead(DataNode.java:1525)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitFds(DataXceiver.java:286)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitFds(Receiver.java:185)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:89)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7382) DataNode in secure mode may throw NullPointerException if client connects before DataNode registers itself with NameNode.

2014-11-08 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7382:

   Resolution: Fixed
Fix Version/s: 2.6.0
   Status: Resolved  (was: Patch Available)

I committed this to trunk, branch-2 and branch-2.6.  Nicholas and Yongjun, 
thank you again for the code reviews.

> DataNode in secure mode may throw NullPointerException if client connects 
> before DataNode registers itself with NameNode.
> -
>
> Key: HDFS-7382
> URL: https://issues.apache.org/jira/browse/HDFS-7382
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
> Fix For: 2.6.0
>
> Attachments: HDFS-7382.1.patch, HDFS-7382.2.patch
>
>
> {{SaslDataTransferServer#receive}} needs to check if the DataNode is 
> listening on a privileged port.  It does this by checking the address from 
> the {{DatanodeID}}.  However, there is a window of time when this will be 
> {{null}}.  If a client is still holding a {{LocatedBlock}} that references 
> that DataNode and chooses to connect, then there is a risk of getting a 
> {{NullPointerException}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7382) DataNode in secure mode may throw NullPointerException if client connects before DataNode registers itself with NameNode.

2014-11-08 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7382:

Attachment: HDFS-7382.2.patch

Thank you for the code reviews, Nicholas and Yongjun.

Yongjun, I'm uploading patch v2 with the log message fixed like you suggested.  
Nice catch!  I'm going to commit this version based on the prior +1 from 
Nicholas, and this new version is only changing a string literal in a log 
message.

You're right that there are a few different places checking for a port number < 
1024.  We might benefit from a shared {{isPrivilegedPort}} method somewhere.  
One interesting aspect is that Windows has no concept of privileged ports, so 
perhaps this method would return false always on Windows.  I won't address it 
here, but I do encourage you to file a jira for that change.

bq. -1 tests included. The patch doesn't appear to include any new or modified 
tests.

This turned up in internal system testing that involved restarting DataNodes while 
running jobs.  It would be very difficult to put this into a predictable test.

The failure in {{TestWebHDFSForHA}} looks unrelated.  The test passed locally.

> DataNode in secure mode may throw NullPointerException if client connects 
> before DataNode registers itself with NameNode.
> -
>
> Key: HDFS-7382
> URL: https://issues.apache.org/jira/browse/HDFS-7382
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
> Attachments: HDFS-7382.1.patch, HDFS-7382.2.patch
>
>
> {{SaslDataTransferServer#receive}} needs to check if the DataNode is 
> listening on a privileged port.  It does this by checking the address from 
> the {{DatanodeID}}.  However, there is a window of time when this will be 
> {{null}}.  If a client is still holding a {{LocatedBlock}} that references 
> that DataNode and chooses to connect, then there is a risk of getting a 
> {{NullPointerException}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7382) DataNode in secure mode may throw NullPointerException if client connects before DataNode registers itself with NameNode.

2014-11-08 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7382:

Status: Patch Available  (was: Open)

> DataNode in secure mode may throw NullPointerException if client connects 
> before DataNode registers itself with NameNode.
> -
>
> Key: HDFS-7382
> URL: https://issues.apache.org/jira/browse/HDFS-7382
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
> Attachments: HDFS-7382.1.patch
>
>
> {{SaslDataTransferServer#receive}} needs to check if the DataNode is 
> listening on a privileged port.  It does this by checking the address from 
> the {{DatanodeID}}.  However, there is a window of time when this will be 
> {{null}}.  If a client is still holding a {{LocatedBlock}} that references 
> that DataNode and chooses to connect, then there is a risk of getting a 
> {{NullPointerException}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7382) DataNode in secure mode may throw NullPointerException if client connects before DataNode registers itself with NameNode.

2014-11-08 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7382:

Attachment: HDFS-7382.1.patch

We don't really need to rely on NameNode registration.  A DataNode always knows 
which port it's listening on for DataTransferProtocol, based solely on its own 
local configuration.  The attached patch implements the fix.  I'm still passing 
through the {{DatanodeID}}, because that's still useful for debug logging.
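
In sketch form (hypothetical class and field names; the real change is in {{SaslDataTransferServer}}), the idea is:

{code}
// Sketch: decide "privileged port?" from the DataNode's own configuration,
// which is always available, instead of the DatanodeID's transfer address,
// which can still be null before registration with the NameNode completes.
class DataNodeSketch {
  private final int xferPort;  // read from local configuration at startup

  DataNodeSketch(int xferPort) {
    this.xferPort = xferPort;
  }

  boolean listeningOnPrivilegedPort() {
    return xferPort < 1024;  // no DatanodeID#getXferAddr() => no NPE window
  }
}
{code}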

> DataNode in secure mode may throw NullPointerException if client connects 
> before DataNode registers itself with NameNode.
> -
>
> Key: HDFS-7382
> URL: https://issues.apache.org/jira/browse/HDFS-7382
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
> Attachments: HDFS-7382.1.patch
>
>
> {{SaslDataTransferServer#receive}} needs to check if the DataNode is 
> listening on a privileged port.  It does this by checking the address from 
> the {{DatanodeID}}.  However, there is a window of time when this will be 
> {{null}}.  If a client is still holding a {{LocatedBlock}} that references 
> that DataNode and chooses to connect, then there is a risk of getting a 
> {{NullPointerException}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7382) DataNode in secure mode may throw NullPointerException if client connects before DataNode registers itself with NameNode.

2014-11-08 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7382:

Component/s: security

> DataNode in secure mode may throw NullPointerException if client connects 
> before DataNode registers itself with NameNode.
> -
>
> Key: HDFS-7382
> URL: https://issues.apache.org/jira/browse/HDFS-7382
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
>
> {{SaslDataTransferServer#receive}} needs to check if the DataNode is 
> listening on a privileged port.  It does this by checking the address from 
> the {{DatanodeID}}.  However, there is a window of time when this will be 
> {{null}}.  If a client is still holding a {{LocatedBlock}} that references 
> that DataNode and chooses to connect, then there is a risk of getting a 
> {{NullPointerException}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7382) DataNode in secure mode may throw NullPointerException if client connects before DataNode registers itself with NameNode.

2014-11-08 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7382:
---

 Summary: DataNode in secure mode may throw NullPointerException if 
client connects before DataNode registers itself with NameNode.
 Key: HDFS-7382
 URL: https://issues.apache.org/jira/browse/HDFS-7382
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor


{{SaslDataTransferServer#receive}} needs to check if the DataNode is listening 
on a privileged port.  It does this by checking the address from the 
{{DatanodeID}}.  However, there is a window of time when this will be {{null}}. 
 If a client is still holding a {{LocatedBlock}} that references that DataNode 
and chooses to connect, then there is a risk of getting a 
{{NullPointerException}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test

2014-11-06 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7226:

Fix Version/s: (was: 2.7.0)
   2.6.0

Thank you again for fixing this, Yongjun.  I just merged it to branch-2.6, 
since the test was still failing there.

> TestDNFencing.testQueueingWithAppend failed often in latest test
> 
>
> Key: HDFS-7226
> URL: https://issues.apache.org/jira/browse/HDFS-7226
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Fix For: 2.6.0
>
> Attachments: HDFS-7226.001.patch, HDFS-7226.002.patch, 
> HDFS-7226.003.patch
>
>
> Using tool from HADOOP-11045, got the following report:
> {code}
> [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j 
> PreCommit-HDFS-Build -n 1 
> Recently FAILED builds in url: 
> https://builds.apache.org//job/PreCommit-HDFS-Build
> THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, 
> as listed below:
> ..
> Among 9 runs examined, all failed tests <#failedRuns: testName>:
> 7: 
> org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
> 6: 
> org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
> 3: 
> org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots
> 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen
> 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching
> ..
> {code}
> TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. 
> Creating this jira for TestDNFencing.testQueueingWithAppend.
> Symptom:
> {code}
> Failed
> org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
> Failing for the past 1 build (Since Failed#8390 )
> Took 2.9 sec.
> Error Message
> expected:<18> but was:<12>
> Stacktrace
> java.lang.AssertionError: expected:<18> but was:<12>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7221) TestDNFencingWithReplication fails consistently

2014-11-06 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7221:

Target Version/s: 2.6.0  (was: 2.7.0)
   Fix Version/s: (was: 2.7.0)
  2.6.0

When I code reviewed HDFS-7128, Jenkins did report a failure on 
{{TestDNFencingWithReplication}}.  At the time, I misdiagnosed it as unrelated 
to the patch and went ahead with the commit.  Sorry for the confusion.

Charles, thank you for fixing it.  I merged it down to branch-2.6, since the 
test was still failing there.

> TestDNFencingWithReplication fails consistently
> ---
>
> Key: HDFS-7221
> URL: https://issues.apache.org/jira/browse/HDFS-7221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Fix For: 2.6.0
>
> Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch, 
> HDFS-7221.003.patch, HDFS-7221.004.patch, HDFS-7221.005.patch
>
>
> TestDNFencingWithReplication consistently fails with a timeout, both in 
> jenkins runs and on my local machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7367) HDFS short-circuit read cannot negotiate shared memory slot and file descriptors when SASL is enabled on DataTransferProtocol.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7367:

Status: Patch Available  (was: Open)

> HDFS short-circuit read cannot negotiate shared memory slot and file 
> descriptors when SASL is enabled on DataTransferProtocol.
> --
>
> Key: HDFS-7367
> URL: https://issues.apache.org/jira/browse/HDFS-7367
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7367.1.patch
>
>
> When both short-circuit read and SASL on DataTransferProtocol are enabled, 
> the server side tries to negotiate SASL on the operation to allocate a new 
> shared memory slot.  However, the transport for this operation is the Unix 
> domain socket (not TCP), and the client always assumes that Unix domain 
> socket traffic is trustworthy.  The end result is that the server side still 
> attempts SASL negotiation, and it fails with an exception while erroneously 
> trying to parse the domain socket address as if it were a network address.  
> The read succeeds, but only because we fall back to a read through the 
> DataNode TCP server.  It's not a short-circuit read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7367) HDFS short-circuit read cannot negotiate shared memory slot and file descriptors when SASL is enabled on DataTransferProtocol.

2014-11-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199485#comment-14199485
 ] 

Chris Nauroth commented on HDFS-7367:
-

I also meant to mention that this is exactly the same way the code already 
handles it for the encrypted case.  See 
{{SaslDataTransferServer#getEncryptedStreams}}.

> HDFS short-circuit read cannot negotiate shared memory slot and file 
> descriptors when SASL is enabled on DataTransferProtocol.
> --
>
> Key: HDFS-7367
> URL: https://issues.apache.org/jira/browse/HDFS-7367
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7367.1.patch
>
>
> When both short-circuit read and SASL on DataTransferProtocol are enabled, 
> the server side tries to negotiate SASL on the operation to allocate a new 
> shared memory slot.  However, the transport for this operation is the Unix 
> domain socket (not TCP), and the client always assumes that Unix domain 
> socket traffic is trustworthy.  The end result is that the server side still 
> attempts SASL negotiation, and it fails with an exception while erroneously 
> trying to parse the domain socket address as if it were a network address.  
> The read succeeds, but only because we fall back to a read through the 
> DataNode TCP server.  It's not a short-circuit read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7367) HDFS short-circuit read cannot negotiate shared memory slot and file descriptors when SASL is enabled on DataTransferProtocol.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7367:

Attachment: HDFS-7367.1.patch

I'm attaching a patch with the simple fix.  We just need to check if the 
channel is secure/trusted before entering SASL negotiation.  As per the 
definition of {{DomainPeer#hasSecureChannel}}, traffic over a domain socket is 
assumed to be trustworthy, because it doesn't cross a network, and we restrict 
the permissions on the domain socket.

I don't have a test yet.  I'll look into adding one.
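
In outline, the fix looks like the following. This is a sketch with stand-in types, not the actual patch; only {{Peer#hasSecureChannel}} mirrors a real method:

{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Stand-ins so the sketch is self-contained.
interface Peer { boolean hasSecureChannel(); }

class IOStreamPair {
  final InputStream in;
  final OutputStream out;
  IOStreamPair(InputStream in, OutputStream out) { this.in = in; this.out = out; }
}

class SaslReceiveSketch {
  IOStreamPair receive(Peer peer, InputStream in, OutputStream out)
      throws IOException {
    if (peer.hasSecureChannel()) {
      // Trusted transport (e.g., a Unix domain socket): skip SASL, the
      // same way getEncryptedStreams already does for the encrypted case.
      return new IOStreamPair(in, out);
    }
    return doSaslHandshake(in, out);  // network peers still negotiate SASL
  }

  private IOStreamPair doSaslHandshake(InputStream in, OutputStream out) {
    // Placeholder for the real negotiation.
    return new IOStreamPair(in, out);
  }
}
{code}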

> HDFS short-circuit read cannot negotiate shared memory slot and file 
> descriptors when SASL is enabled on DataTransferProtocol.
> --
>
> Key: HDFS-7367
> URL: https://issues.apache.org/jira/browse/HDFS-7367
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7367.1.patch
>
>
> When both short-circuit read and SASL on DataTransferProtocol are enabled, 
> the server side tries to negotiate SASL on the operation to allocate a new 
> shared memory slot.  However, the transport for this operation is the Unix 
> domain socket (not TCP), and the client always assumes that Unix domain 
> socket traffic is trustworthy.  The end result is that the server side still 
> attempts SASL negotiation, and it fails with an exception while erroneously 
> trying to parse the domain socket address as if it were a network address.  
> The read succeeds, but only because we fall back to a read through the 
> DataNode TCP server.  It's not a short-circuit read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7367) HDFS short-circuit read cannot negotiate shared memory slot and file descriptors when SASL is enabled on DataTransferProtocol.

2014-11-05 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7367:
---

 Summary: HDFS short-circuit read cannot negotiate shared memory 
slot and file descriptors when SASL is enabled on DataTransferProtocol.
 Key: HDFS-7367
 URL: https://issues.apache.org/jira/browse/HDFS-7367
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Chris Nauroth
Assignee: Chris Nauroth


When both short-circuit read and SASL on DataTransferProtocol are enabled, the 
server side tries to negotiate SASL on the operation to allocate a new shared 
memory slot.  However, the transport for this operation is the Unix domain 
socket (not TCP), and the client always assumes that Unix domain socket traffic 
is trustworthy.  The end result is that the server side still attempts SASL 
negotiation, and it fails with an exception while erroneously trying to parse 
the domain socket address as if it were a network address.  The read succeeds, 
but only because we fall back to a read through the DataNode TCP server.  It's 
not a short-circuit read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7361) TestCheckpoint#testStorageAlreadyLockedErrorMessage fails after change of log message related to locking violation.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7361:

Hadoop Flags: Reviewed

+1 for the patch, pending Jenkins run.  Thanks for the quick fix, [~shv].

bq. Wonder why Jenkins didn't fail for HDFS-7333.

I tried reviewing the console output for that run, but I couldn't find an 
explanation.  It shows the test passing.  It's mysterious.

> TestCheckpoint#testStorageAlreadyLockedErrorMessage fails after change of log 
> message related to locking violation.
> ---
>
> Key: HDFS-7361
> URL: https://issues.apache.org/jira/browse/HDFS-7361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode, test
>Reporter: Chris Nauroth
>Assignee: Konstantin Shvachko
>Priority: Minor
> Attachments: HDFS-7361.patch
>
>
> HDFS-7333 changed the log message related to locking violation on a storage 
> directory.  There is an assertion in 
> {{TestCheckpoint#testStorageAlreadyLockedErrorMessage}} that has been failing 
> since that change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7359:

Component/s: namenode

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.6.0
>
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch, HDFS-7359.3.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7359:

   Resolution: Fixed
Fix Version/s: 2.6.0
   Status: Resolved  (was: Patch Available)

I committed this to trunk, branch-2 and branch-2.6.  Thank you to Haohui, Jing 
and Jitendra for code reviews.

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.6.0
>
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch, HDFS-7359.3.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7359:

Hadoop Flags: Reviewed

I think something confused the string parsing Jenkins does to search for timed 
out tests.  I reviewed the console output, and I didn't see any evidence that 
these tests had timed out.  I reran locally, and they were all fine.

I'll commit this later today.

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch, HDFS-7359.3.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199176#comment-14199176
 ] 

Chris Nauroth commented on HDFS-7359:
-

The {{TestCheckpoint}} failure was introduced in HDFS-7333.  I filed HDFS-7361 
to track fixing it.

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch, HDFS-7359.3.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7333) Improve log message in Storage.tryLock()

2014-11-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199174#comment-14199174
 ] 

Chris Nauroth commented on HDFS-7333:
-

This patch introduced a test failure in 
{{TestCheckpoint#testStorageAlreadyLockedErrorMessage}}.  I filed HDFS-7361 to 
track it.  [~shv], would you please take a look?  Thank you.

> Improve log message in Storage.tryLock()
> 
>
> Key: HDFS-7333
> URL: https://issues.apache.org/jira/browse/HDFS-7333
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 2.5.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 2.7.0
>
> Attachments: logging.patch
>
>
> Confusing log message in Storage.tryLock(). It talks about namenode, while 
> this is a common part of NameNode and DataNode storage.
> The log message should include the directory path and the exception.
> Also fix the long line in tryLock().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7361) TestCheckpoint#testStorageAlreadyLockedErrorMessage fails after change of log message related to locking violation.

2014-11-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199171#comment-14199171
 ] 

Chris Nauroth commented on HDFS-7361:
-

Here is the output from a failed test run.

{code}
Running org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
Tests run: 38, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 40.681 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
testStorageAlreadyLockedErrorMessage(org.apache.hadoop.hdfs.server.namenode.TestCheckpoint)  Time elapsed: 0.079 sec  <<< FAILURE!
java.lang.AssertionError: Log output does not contain expected log message: It appears that another namenode 28733@Chriss-MacBook-Pro.local has already locked the storage directory
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testStorageAlreadyLockedErrorMessage(TestCheckpoint.java:867)
{code}


> TestCheckpoint#testStorageAlreadyLockedErrorMessage fails after change of log 
> message related to locking violation.
> ---
>
> Key: HDFS-7361
> URL: https://issues.apache.org/jira/browse/HDFS-7361
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode, test
>Reporter: Chris Nauroth
>Priority: Minor
>
> HDFS-7333 changed the log message related to locking violation on a storage 
> directory.  There is an assertion in 
> {{TestCheckpoint#testStorageAlreadyLockedErrorMessage}} that has been failing 
> since that change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7361) TestCheckpoint#testStorageAlreadyLockedErrorMessage fails after change of log message related to locking violation.

2014-11-05 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7361:
---

 Summary: TestCheckpoint#testStorageAlreadyLockedErrorMessage fails 
after change of log message related to locking violation.
 Key: HDFS-7361
 URL: https://issues.apache.org/jira/browse/HDFS-7361
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, namenode, test
Reporter: Chris Nauroth
Priority: Minor


HDFS-7333 changed the log message related to locking violation on a storage 
directory.  There is an assertion in 
{{TestCheckpoint#testStorageAlreadyLockedErrorMessage}} that has been failing 
since that change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199124#comment-14199124
 ] 

Chris Nauroth commented on HDFS-7359:
-

The test failures are unrelated.  {{TestBalancer}} has been flaky.  It's 
passing for me locally.  The {{TestCheckpoint}} failure repros on current trunk 
even without this patch.  We're still waiting on the Jenkins run for patch v3, 
which is currently in progress.

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch, HDFS-7359.3.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7359:

Attachment: HDFS-7359.3.patch

Here is patch v3 with the improved logging.  I still retained logging of the 
full stack trace at debug level in case we ever need to find that.  Thanks 
again, Jing.

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch, HDFS-7359.3.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198839#comment-14198839
 ] 

Chris Nauroth commented on HDFS-7359:
-

That's a good question.  I believe we'll still have debugging information in 
that case thanks to this code in {{ImageServlet}}:

{code}
LOG.info("ImageServlet rejecting: " + remoteUser);
{code}

{code}
if (UserGroupInformation.isSecurityEnabled()
    && !isValidRequestor(context, request.getUserPrincipal().getName(), conf)) {
  String errorMsg = "Only Namenode, Secondary Namenode, and administrators may access "
      + "this servlet";
  response.sendError(HttpServletResponse.SC_FORBIDDEN, errorMsg);
  LOG.warn("Received non-NN/SNN/administrator request for image or edits from "
      + request.getUserPrincipal().getName()
      + " at "
      + request.getRemoteHost());
  throw new IOException(errorMsg);
}
{code}

I guess another possibility would be to change the new debug log message in the 
catch block to warn level and include the values of 
{{DFS_SECONDARY_NAMENODE_KERBEROS_PRINCIPAL_KEY}} and 
{{DFS_NAMENODE_SECONDARY_HTTP_ADDRESS_KEY}}.

Let me know your thoughts, and if necessary, I can upload a v3.  Thanks again!

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7359:

Attachment: HDFS-7359.2.patch

Here is patch v2.  We need one more change in {{ImageServlet}} to prevent the 
problem from happening during bootstrapStandby.

bq. It looks to me that simply removing the checks is equivalent to the current 
proposed patch, correct?

bq. Removing that several lines means we no longer recognize SNN as a valid 
requestor. I guess in some scenario (maybe even in the future) we can still 
allow SNN to download journals from JN.

Thanks for reviewing, Haohui and Jing.  Right, doing it this way preserves 
existing behavior if anyone out there is trying to use the SNN as requestor.  
It would be a little odd to do this, and I haven't seen it in practice, but I 
think it would be a backwards-incompatible change if we dropped it.

Jing, are you still +1 for the v2 patch (pending fresh Jenkins run)?

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch, HDFS-7359.2.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7359:

Status: Patch Available  (was: Open)

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-05 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7359:

Attachment: HDFS-7359.1.patch

Here is a patch that fixes the bug by catching the error in 
{{GetJournalEditServlet}}.  I considered just removing the addition of the 
SecondaryNameNode principal, since I've never heard of this usage in practice.  
However, I suppose it would be considered a backwards-incompatible change if 
someone out there was running a non-HA cluster and just had chosen to offload 
edits to the JournalNodes for consumption by the SecondaryNameNode.  Catching 
it is probably the safer change.  {{TestSecureNNWithQJM}} is a new test suite 
that covers usage of QJM in a secured cluster.  While I was working on this, I 
also spotted a typo in {{TestNNWithQJM}}, which I'm correcting in this patch.
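
As a sketch of the catch-based approach (hypothetical helper and parameter names; the actual patch touches {{GetJournalEditServlet}}, and {{NetUtils.createSocketAddr}} is the usual parser that throws on values like "null"):

{code}
// Sketch only: tolerate an unparseable dfs.namenode.secondary.http-address
// instead of letting the exception abort the JournalNode request (and, in
// turn, NameNode startup).
import java.net.InetSocketAddress;
import java.util.List;

import org.apache.hadoop.net.NetUtils;

class ValidRequestorSketch {
  static void addSecondaryNameNodePrincipal(String secondaryHttpAddress,
      String principalPattern, List<String> validPrincipals) {
    try {
      InetSocketAddress addr = NetUtils.createSocketAddr(secondaryHttpAddress);
      // Illustrative _HOST substitution; the real code uses SecurityUtil.
      validPrincipals.add(principalPattern.replace("_HOST", addr.getHostName()));
    } catch (IllegalArgumentException e) {
      // A value such as "null" lands here; skip the SNN principal and
      // keep evaluating the remaining valid requestors.
    }
  }
}
{code}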

> NameNode in secured HA cluster fails to start if 
> dfs.namenode.secondary.http-address cannot be interpreted as a network 
> address.
> 
>
> Key: HDFS-7359
> URL: https://issues.apache.org/jira/browse/HDFS-7359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7359.1.patch
>
>
> In a secured cluster, the JournalNode validates that the caller is one of a 
> valid set of principals.  One of the principals considered is that of the 
> SecondaryNameNode.  This involves checking 
> {{dfs.namenode.secondary.http-address}} and trying to interpret it as a 
> network address.  If a user has specified a value for this property that 
> cannot be interpreted as a network address, such as "null", then this causes 
> the JournalNode operation to fail, and ultimately the NameNode cannot start.  
> The JournalNode should not have a hard dependency on 
> {{dfs.namenode.secondary.http-address}} like this.  It is not typical to run 
> a SecondaryNameNode in combination with JournalNodes.  There is even a check 
> in SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7359) NameNode in secured HA cluster fails to start if dfs.namenode.secondary.http-address cannot be interpreted as a network address.

2014-11-04 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7359:
---

 Summary: NameNode in secured HA cluster fails to start if 
dfs.namenode.secondary.http-address cannot be interpreted as a network address.
 Key: HDFS-7359
 URL: https://issues.apache.org/jira/browse/HDFS-7359
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: journal-node
Reporter: Chris Nauroth
Assignee: Chris Nauroth


In a secured cluster, the JournalNode validates that the caller is one of a 
valid set of principals.  One of the principals considered is that of the 
SecondaryNameNode.  This involves checking 
{{dfs.namenode.secondary.http-address}} and trying to interpret it as a network 
address.  If a user has specified a value for this property that cannot be 
interpreted as a network address, such as "null", then this causes the 
JournalNode operation to fail, and ultimately the NameNode cannot start.  The 
JournalNode should not have a hard dependency on 
{{dfs.namenode.secondary.http-address}} like this.  It is not typical to run a 
SecondaryNameNode in combination with JournalNodes.  There is even a check in 
SecondaryNameNode that aborts if HA is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7355) TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on Windows, because we cannot deny access to the file owner.

2014-11-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197127#comment-14197127
 ] 

Chris Nauroth commented on HDFS-7355:
-

Thank you for the reviews and the commit, Ming and Haohui.

bq. Perhaps this came up before, if we want to make unit tests pass on other 
non linux OS, should we set up Jenkins builds for that?

Yes, absolutely.  This is something I've been pursuing in the background for a 
while, but it's still a work in progress.

> TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on 
> Windows, because we cannot deny access to the file owner.
> 
>
> Key: HDFS-7355
> URL: https://issues.apache.org/jira/browse/HDFS-7355
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Fix For: 2.6.0
>
> Attachments: HDFS-7355.1.patch
>
>
> {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}} fails on 
> Windows.  The test attempts to simulate volume failure by denying permissions 
> to data volume directories.  This doesn't work on Windows, because Windows 
> allows the file owner access regardless of the permission settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7355) TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on Windows, because we cannot deny access to the file owner.

2014-11-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196734#comment-14196734
 ] 

Chris Nauroth commented on HDFS-7355:
-

http://technet.microsoft.com/en-us/library/cc783530(v=ws.10).aspx

Quoting the relevant section:

{quote}
Permissions enable the owner of each secured object, such as a file, Active 
Directory object, or registry key, to control who can perform an operation or a 
set of operations on the object or object property. Because access to an object 
is at the owner’s discretion, the type of access control that is used in 
Windows Server 2003 is called discretionary access control. An owner of an 
object always has the ability to read and change permissions on the 
object.
{quote}

We'll need to skip this test on Windows.
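
The skip itself is essentially a one-liner, assuming the usual Hadoop test pattern (a sketch; the committed patch may phrase it differently):

{code}
// Sketch: bail out early on Windows, where the owner keeps access no
// matter what the permission bits say, so the simulated volume failure
// can never take effect.
import static org.junit.Assume.assumeTrue;

import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class TestDataNodeVolumeFailureSketch {
  @Test
  public void testUnderReplicationAfterVolFailure() throws Exception {
    assumeTrue(!Path.WINDOWS);  // skip on Windows; see explanation above
    // ... the original test body runs unchanged on other platforms ...
  }
}
{code}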

> TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on 
> Windows, because we cannot deny access to the file owner.
> 
>
> Key: HDFS-7355
> URL: https://issues.apache.org/jira/browse/HDFS-7355
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-7355.1.patch
>
>
> {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}} fails on 
> Windows.  The test attempts to simulate volume failure by denying permissions 
> to data volume directories.  This doesn't work on Windows, because Windows 
> allows the file owner access regardless of the permission settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

2014-11-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196752#comment-14196752
 ] 

Chris Nauroth commented on HDFS-7208:
-

The new test cannot work correctly on Windows.  See HDFS-7355 for a full 
explanation and a trivial patch to skip the test on Windows.

> NN doesn't schedule replication when a DN storage fails
> ---
>
> Key: HDFS-7208
> URL: https://issues.apache.org/jira/browse/HDFS-7208
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.6.0
>
> Attachments: HDFS-7208-2.patch, HDFS-7208-3.patch, HDFS-7208.patch
>
>
> We found the following problem. When a storage device on a DN fails, NN 
> continues to believe replicas of those blocks on that storage are valid and 
> doesn't schedule replication.
> A DN has 12 storage disks. So there is one blockReport for each storage. When 
> a disk fails, # of blockReport from that DN is reduced from 12 to 11. Given 
> dfs.datanode.failed.volumes.tolerated is configured to be > 0, NN still 
> considers that DN healthy.
> 1. A disk failed. All blocks of that disk are removed from DN dataset.
>  
> {noformat}
> 2014-10-04 02:11:12,626 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing 
> replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume 
> /data/disk6/dfs/current
> {noformat}
> 2. NN receives DatanodeProtocol.DISK_ERROR. But that isn't enough to have NN 
> remove the DN and the replicas from the BlocksMap. In addition, blockReport 
> doesn't provide the diff given that is done per storage.
> {noformat}
> 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Disk error on DatanodeRegistration(xx.xx.xx.xxx, 
> datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, 
> ipcPort=50020, 
> storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939):
>  DataNode failed volumes:/data/disk6/dfs/current
> {noformat}
> 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7355) TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on Windows, because we cannot deny access to the file owner.

2014-11-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7355:

Attachment: HDFS-7355.1.patch

The attached patch skips the test.

> TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on 
> Windows, because we cannot deny access to the file owner.
> 
>
> Key: HDFS-7355
> URL: https://issues.apache.org/jira/browse/HDFS-7355
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-7355.1.patch
>
>
> {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}} fails on 
> Windows.  The test attempts to simulate volume failure by denying permissions 
> to data volume directories.  This doesn't work on Windows, because Windows 
> allows the file owner access regardless of the permission settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7355) TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on Windows, because we cannot deny access to the file owner.

2014-11-04 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7355:
---

 Summary: 
TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on Windows, 
because we cannot deny access to the file owner.
 Key: HDFS-7355
 URL: https://issues.apache.org/jira/browse/HDFS-7355
 Project: Hadoop HDFS
  Issue Type: Test
  Components: test
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial


{{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}} fails on 
Windows.  The test attempts to simulate volume failure by denying permissions 
to data volume directories.  This doesn't work on Windows, because Windows 
allows the file owner access regardless of the permission settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7355) TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on Windows, because we cannot deny access to the file owner.

2014-11-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7355:

Status: Patch Available  (was: Open)

> TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on 
> Windows, because we cannot deny access to the file owner.
> 
>
> Key: HDFS-7355
> URL: https://issues.apache.org/jira/browse/HDFS-7355
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-7355.1.patch
>
>
> {{TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure}} fails on 
> Windows.  The test attempts to simulate volume failure by denying permissions 
> to data volume directories.  This doesn't work on Windows, because Windows 
> allows the file owner access regardless of the permission settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7218) FSNamesystem ACL operations should write to audit log on failure

2014-11-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7218:

Hadoop Flags: Reviewed

+1 from me too.  Thank you, Charles.
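
For readers following along, the general shape of the fix is to record an 
audit event on the failure path as well, not only on success.  A rough sketch 
of the pattern (method and parameter names are assumed here, not quoted from 
the patch):

{noformat}
import java.io.IOException;
import java.util.List;

// Sketch only: audit both outcomes of an ACL operation.
abstract class AuditedAclOps {
  abstract void setAclInternal(String src, List<String> aclSpec)
      throws IOException;
  abstract void logAuditEvent(boolean succeeded, String cmd, String src);

  void setAcl(String src, List<String> aclSpec) throws IOException {
    try {
      setAclInternal(src, aclSpec);
      logAuditEvent(true, "setAcl", src);
    } catch (IOException e) {
      // Previously missing: record the failed attempt before rethrowing.
      logAuditEvent(false, "setAcl", src);
      throw e;
    }
  }
}
{noformat}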

> FSNamesystem ACL operations should write to audit log on failure
> 
>
> Key: HDFS-7218
> URL: https://issues.apache.org/jira/browse/HDFS-7218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Charles Lamb
>Assignee: Charles Lamb
>Priority: Minor
> Attachments: HDFS-7218.001.patch, HDFS-7218.002.patch, 
> HDFS-7218.003.patch, HDFS-7218.004.patch, HDFS-7218.005.patch
>
>
> Various Acl methods in FSNamesystem do not write to the audit log when the 
> operation is not successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7328) TestTraceAdmin assumes Unix line endings.

2014-11-03 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7328:

   Resolution: Fixed
Fix Version/s: 2.6.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

The test failure is unrelated.  I have committed this to trunk, branch-2 and 
branch-2.6.  Yi, thank you for the code review.

> TestTraceAdmin assumes Unix line endings.
> -
>
> Key: HDFS-7328
> URL: https://issues.apache.org/jira/browse/HDFS-7328
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Fix For: 2.6.0
>
> Attachments: HDFS-7328.1.patch
>
>
> {{TestTraceAdmin}} contains some string assertions that assume Unix line 
> endings.  The test fails on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7328) TestTraceAdmin assumes Unix line endings.

2014-11-03 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7328:

Status: Patch Available  (was: Open)

> TestTraceAdmin assumes Unix line endings.
> -
>
> Key: HDFS-7328
> URL: https://issues.apache.org/jira/browse/HDFS-7328
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-7328.1.patch
>
>
> {{TestTraceAdmin}} contains some string assertions that assume Unix line 
> endings.  The test fails on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7328) TestTraceAdmin assumes Unix line endings.

2014-11-03 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7328:

Attachment: HDFS-7328.1.patch

Here is a test fix to use the correct platform line separator.
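
The technique, in a minimal sketch (not the patch itself): derive the expected 
string from {{System.lineSeparator()}} rather than hard-coding "\n", so the 
assertion holds on both Unix and Windows:

{noformat}
import static org.junit.Assert.assertEquals;

import java.io.PrintWriter;
import java.io.StringWriter;

import org.junit.Test;

public class LineSeparatorSketch {
  @Test
  public void expectedOutputUsesPlatformSeparator() {
    // println emits the platform line separator, as command output does.
    StringWriter out = new StringWriter();
    new PrintWriter(out, true).println("added trace span receiver");
    // Build the expectation the same way instead of hard-coding "\n".
    assertEquals("added trace span receiver" + System.lineSeparator(),
        out.toString());
  }
}
{noformat}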

> TestTraceAdmin assumes Unix line endings.
> -
>
> Key: HDFS-7328
> URL: https://issues.apache.org/jira/browse/HDFS-7328
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Attachments: HDFS-7328.1.patch
>
>
> {{TestTraceAdmin}} contains some string assertions that assume Unix line 
> endings.  The test fails on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7328) TestTraceAdmin assumes Unix line endings.

2014-11-03 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7328:
---

 Summary: TestTraceAdmin assumes Unix line endings.
 Key: HDFS-7328
 URL: https://issues.apache.org/jira/browse/HDFS-7328
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial
 Attachments: HDFS-7328.1.patch

{{TestTraceAdmin}} contains some string assertions that assume Unix line 
endings.  The test fails on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7313) Support optional configuration of AES cipher suite on DataTransferProtocol.

2014-10-31 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192332#comment-14192332
 ] 

Chris Nauroth commented on HDFS-7313:
-

No problem!  One more note: we actually cover the combination of encryption + 
balancer in a separate test suite: {{TestBalancerWithEncryptedTransfer}}.  That 
one did pass in the Jenkins run.

> Support optional configuration of AES cipher suite on DataTransferProtocol.
> ---
>
> Key: HDFS-7313
> URL: https://issues.apache.org/jira/browse/HDFS-7313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs-client, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.6.0
>
> Attachments: HDFS-7313.1.patch
>
>
> HDFS-6606 introduced the use of AES for encryption of DataTransferProtocol.  
> This issue proposes a configuration property that lets administrators control 
> whether AES or the existing support for 3DES and RC4 is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7313) Support optional configuration of AES cipher suite on DataTransferProtocol.

2014-10-31 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192311#comment-14192311
 ] 

Chris Nauroth commented on HDFS-7313:
-

bq. I haven't looked into this at all yet, but are we completely positive that 
the TestBalancer failure is unrelated?

Sorry, I forgot to address this in my commit comment.  I investigated last 
night before committing.  It's unrelated.  We've seen this test flake out 
elsewhere.  The failure would not repro for me locally.

> Support optional configuration of AES cipher suite on DataTransferProtocol.
> ---
>
> Key: HDFS-7313
> URL: https://issues.apache.org/jira/browse/HDFS-7313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs-client, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.6.0
>
> Attachments: HDFS-7313.1.patch
>
>
> HDFS-6606 introduced the use of AES for encryption of DataTransferProtocol.  
> This issue proposes a configuration property that lets administrators control 
> whether AES or the existing support for 3DES and RC4 is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6606) Optimize HDFS Encrypted Transport performance

2014-10-30 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-6606:

Release Note: HDFS now supports the option to configure AES encryption for 
block data transfer.  AES offers improved cryptographic strength and 
performance over the prior options of 3DES and RC4.

> Optimize HDFS Encrypted Transport performance
> -
>
> Key: HDFS-6606
> URL: https://issues.apache.org/jira/browse/HDFS-6606
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs-client, security
>Reporter: Yi Liu
>Assignee: Yi Liu
> Fix For: 2.6.0
>
> Attachments: HDFS-6606.001.patch, HDFS-6606.002.patch, 
> HDFS-6606.003.patch, HDFS-6606.004.patch, HDFS-6606.005.patch, 
> HDFS-6606.006.patch, HDFS-6606.007.patch, HDFS-6606.008.patch, 
> HDFS-6606.009.patch, OptimizeHdfsEncryptedTransportperformance.pdf
>
>
> In HDFS-3637, [~atm] added support for encrypting DataTransferProtocol; it 
> was great work.
> It uses the SASL {{Digest-MD5}} mechanism (with QOP auth-conf) and supports 
> three security strengths:
> * high: 3des or rc4 (128 bits)
> * medium: des or rc4 (56 bits)
> * low: rc4 (40 bits)
> 3des and rc4 are slow, only *tens of MB/s*:
> http://www.javamex.com/tutorials/cryptography/ciphers.shtml
> http://www.cs.wustl.edu/~jain/cse567-06/ftp/encryption_perf/
> I will provide more detailed performance data in the future. This is clearly 
> a bottleneck and will vastly affect end-to-end performance.
> AES (Advanced Encryption Standard) is the recommended replacement for DES and 
> is more secure. With AES-NI support, throughput can reach nearly *2 GB/s*, so 
> encryption would no longer be the bottleneck. AES and CryptoCodec work is 
> covered in HADOOP-10150, HADOOP-10603, and HADOOP-10693 (we may need to add 
> support for a new AES mode).
> This JIRA will use AES with AES-NI support as the encryption algorithm for 
> DataTransferProtocol.
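
As background, encrypted transport is switched on through configuration; a 
sketch of the pre-existing hdfs-site.xml properties (values shown are 
illustrative):

{noformat}
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <!-- one of the existing options this JIRA aims to improve upon -->
  <value>rc4</value>
</property>
{noformat}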



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7313) Support optional configuration of AES cipher suite on DataTransferProtocol.

2014-10-30 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7313:

   Resolution: Fixed
Fix Version/s: 2.6.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Thank you for the review, Yi.  I committed this to trunk, branch-2 and 
branch-2.6.

> Support optional configuration of AES cipher suite on DataTransferProtocol.
> ---
>
> Key: HDFS-7313
> URL: https://issues.apache.org/jira/browse/HDFS-7313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs-client, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.6.0
>
> Attachments: HDFS-7313.1.patch
>
>
> HDFS-6606 introduced the use of AES for encryption of DataTransferProtocol.  
> This issue proposes a configuration property that lets administrators control 
> whether AES or the existing support for 3DES and RC4 is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6385) Show when block deletion will start after NameNode startup in WebUI

2014-10-30 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-6385:

   Resolution: Fixed
Fix Version/s: 2.6.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I have committed this to trunk, branch-2 and branch-2.6.  Jing and Haohui, 
thank you for the code reviews.

> Show when block deletion will start after NameNode startup in WebUI
> ---
>
> Key: HDFS-6385
> URL: https://issues.apache.org/jira/browse/HDFS-6385
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jing Zhao
>Assignee: Chris Nauroth
> Fix For: 2.6.0
>
> Attachments: HDFS-6385.1.patch, HDFS-6385.2.patch, HDFS-6385.3.patch, 
> HDFS-6385.4.patch, HDFS-6385.png
>
>
> HDFS-6186 provides functionality to delay block deletion for a period of time 
> after NameNode startup. Currently we only show the number of pending block 
> deletions in the WebUI. We should also show when block deletion will start in 
> the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7313) Support optional configuration of AES cipher suite on DataTransferProtocol.

2014-10-30 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7313:

Status: Patch Available  (was: Open)

> Support optional configuration of AES cipher suite on DataTransferProtocol.
> ---
>
> Key: HDFS-7313
> URL: https://issues.apache.org/jira/browse/HDFS-7313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs-client, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7313.1.patch
>
>
> HDFS-6606 introduced the use of AES for encryption of DataTransferProtocol.  
> This issue proposes a configuration property that lets administrators control 
> whether AES or the existing support for 3DES and RC4 is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7313) Support optional configuration of AES cipher suite on DataTransferProtocol.

2014-10-30 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7313:

Attachment: HDFS-7313.1.patch

Full disclosure: Support for configuration was discussed once before in 
HDFS-6606, and the decision at that time was to not make it configurable.  See 
several comments starting here:

https://issues.apache.org/jira/browse/HDFS-6606?focusedCommentId=14145832&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14145832

However, since then, I've heard feedback that it would be preferable to 
maintain the existing behavior of 3DES or RC4 in existing clusters by default, 
and activate AES only if configured explicitly.  This is seen as a safety net 
for managing rollout of the new feature in 2.6.0.  There is no argument against 
AES as the superior choice, so we can revisit switching it to the default in a 
future release after some stabilization time.

[~hitliuyi] and [~atm], you participated in the HDFS-6606 discussion.  What do 
you think of this patch?

Here is a summary of the changes.
* {{SecureMode.apt.vm}}: While I was working on this, I decided to update the 
docs to cover our new encryption capabilities.
* {{GenericTestUtils}}: Added new test helper method.
* {{DFSConfigKeys}}/{{hdfs-default.xml}}: Defined a new configuration property 
for specifying the cipher suites used during DataTransferProtocol encryption 
(see the config sketch after this list).
* {{DataTransferSaslUtil}}: On the server side, only negotiate AES if it's 
configured.
* {{SaslDataTransferClient}}: On the client side, only negotiate AES if it's 
configured.
* {{DataXceiver}}: This fixes a regression that I had introduced in my 
HDFS-2856 patch.  The fix restores some special case handling from Aaron's 
earlier HDFS-3637 patch.  We didn't catch the regression earlier because of a 
bug in the tests.  (See below.)
* {{TestEncryptedTransfer}}: I changed the assertions in existing tests to 
assert that AES is not used, and I added a new test with AES configured.  I 
also discovered that some assertions in this suite had never been running, due 
to an erroneous check on {{resolverClazz}}: whenever it's non-null, it's 
{{TestTrustedChannelResolver}}, so the guarded assertions never executed.  I 
fixed this and confirmed that the assertions run now.
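
As a rough illustration of the administrator-facing change (assuming the 
property name defined in this patch; leaving it unset keeps the existing 
3DES/RC4 negotiation), hdfs-site.xml would gain something like:

{noformat}
<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>
{noformat}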


> Support optional configuration of AES cipher suite on DataTransferProtocol.
> ---
>
> Key: HDFS-7313
> URL: https://issues.apache.org/jira/browse/HDFS-7313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs-client, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7313.1.patch
>
>
> HDFS-6606 introduced the use of AES for encryption of DataTransferProtocol.  
> This issue proposes a configuration property that lets administrators control 
> whether AES or the existing support for 3DES and RC4 is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7313) Support optional configuration of AES cipher suite on DataTransferProtocol.

2014-10-30 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-7313:
---

 Summary: Support optional configuration of AES cipher suite on 
DataTransferProtocol.
 Key: HDFS-7313
 URL: https://issues.apache.org/jira/browse/HDFS-7313
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, hdfs-client, security
Reporter: Chris Nauroth
Assignee: Chris Nauroth


HDFS-6606 introduced the use of AES for encryption of DataTransferProtocol.  
This issue proposes a configuration property that lets administrators control 
whether AES or the existing support for 3DES and RC4 is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6385) Show when block deletion will start after NameNode startup in WebUI

2014-10-30 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191037#comment-14191037
 ] 

Chris Nauroth commented on HDFS-6385:
-

Disregard my last comment; our replies crossed.  Thanks, Haohui.

> Show when block deletion will start after NameNode startup in WebUI
> ---
>
> Key: HDFS-6385
> URL: https://issues.apache.org/jira/browse/HDFS-6385
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Jing Zhao
>Assignee: Chris Nauroth
> Attachments: HDFS-6385.1.patch, HDFS-6385.2.patch, HDFS-6385.3.patch, 
> HDFS-6385.4.patch, HDFS-6385.png
>
>
> HDFS-6186 provides functionality to delay block deletion for a period of time 
> after NameNode startup. Currently we only show the number of pending block 
> deletions in the WebUI. We should also show when block deletion will start in 
> the WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

