[jira] Updated: (HDFS-1081) Performance regression in DistributedFileSystem::getFileBlockLocations in secure systems

Jakob Homan (JIRA) Fri, 16 Apr 2010 14:37:52 -0700

     [ https://issues.apache.org/jira/browse/HDFS-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jakob Homan updated HDFS-1081:
------------------------------

    Attachment: HADOOP-1081-Y20-1.patch

Patch for review.

In our Y20S benchmarking, we saw a dramatic slowdown in getBlockLocations, 
due to two operations introduced by block access tokens:

*TokenIdentifier.getBytes() is expensive and is called twice*
In getFileBlockLocations, TokenIdentifier.getBytes() is called twice in rapid 
succession: 
{code}
  // Token.java:50
  public Token(T id, SecretManager<T> mgr) {
    password = mgr.createPassword(id); // Calls id.getBytes()
    identifier = id.getBytes();        // and again here
{code}
This call is relatively expensive, as the BlockTokenIdentifier is serialized to 
a new DataOutputBuffer and copied to a new array each time.  This patch caches 
the result of the getBytes() call and returns it, assuming no mutation to 
the token state.  
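
A minimal sketch of that caching idea follows.  It is not the code in the 
attached patch: CachingIdentifier, its fields and the serializations counter 
are hypothetical names used only for illustration, and plain java.io streams 
stand in for DataOutputBuffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a token identifier; not the real BlockTokenIdentifier.
class CachingIdentifier {
  private long expiryDate;
  private final long blockId;
  private byte[] cache;      // null means the bytes must be (re)serialized
  int serializations = 0;    // instrumentation for this example only

  CachingIdentifier(long expiryDate, long blockId) {
    this.expiryDate = expiryDate;
    this.blockId = blockId;
  }

  void setExpiryDate(long expiryDate) {
    this.expiryDate = expiryDate;
    cache = null;            // any mutation invalidates the cached bytes
  }

  // Serializes on the first call, then returns the same array until a mutation.
  byte[] getBytes() {
    if (cache == null) {
      cache = serialize();
    }
    return cache;
  }

  private byte[] serialize() {
    serializations++;
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeLong(expiryDate);
      out.writeLong(blockId);
    } catch (IOException e) {
      // Not expected for an in-memory stream.
      throw new IllegalStateException(e);
    }
    return buf.toByteArray();
  }
}
```

With this shape, the two back-to-back getBytes() calls in the Token constructor 
cost one serialization instead of two.  The sketch hands out the cached array 
itself, so callers must treat it as read-only.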

*For n blocks in a getBlockLocations() call, n block access tokens are created 
and each is relatively expensive*
In a call to getBlockLocations(), for every block that is returned, a new  
Token<BlockTokenIdentifier> is created and attached to the block.  Each new 
Token<BlockTokenIdentifier> means a call to hmac.doFinal on the BTI's bytes.  
This call to the hmac calculation, which generates the token's password, turns 
out to be relatively expensive and was dramatically slowing down the function, 
particularly for files with large numbers of blocks.

This patch updates BlockTokenIdentifiers to be valid for a collection of 
blockIds rather than a single blockId.  This allows us to generate a single 
Token<BlockTokenIdentifier> for every call to getBlockLocations, calling the 
hmac function only once.  A quick benchmark of hmac.doFinal shows that its 
processing time is pretty much constant even for large byte arrays (by our 
standards for these tokens), meaning with this optimization, our time in hmac 
for n blocks should be constant.  This is a pretty surgical change and does not 
require much change to other parts of the Token authentication and 
authorization code.  For files with a small number of blocks there should be no 
penalty in performance.
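
To illustrate the approach, here is a rough sketch of one identifier covering 
several block IDs and being signed with a single hmac call.  It is not the 
patch itself: MultiBlockTokenSketch, the field layout and the choice of 
HmacSHA1 are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper; not part of the HDFS API.
class MultiBlockTokenSketch {

  // Serializes one identifier that covers every block ID in the array.
  static byte[] serialize(long expiryDate, long[] blockIds) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeLong(expiryDate);
      out.writeInt(blockIds.length);
      for (long id : blockIds) {
        out.writeLong(id);
      }
    } catch (IOException e) {
      // Not expected for an in-memory stream.
      throw new IllegalStateException(e);
    }
    return buf.toByteArray();
  }

  // One doFinal call per token, regardless of how many blocks it covers.
  static byte[] createPassword(byte[] key, byte[] identifier) {
    try {
      Mac hmac = Mac.getInstance("HmacSHA1"); // algorithm is an assumption
      hmac.init(new SecretKeySpec(key, "HmacSHA1"));
      return hmac.doFinal(identifier);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

The identifier grows by eight bytes per block in this layout, while the number 
of hmac calls per getBlockLocations() stays at one instead of n.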

> Performance regression in DistributedFileSystem::getFileBlockLocations in 
> secure systems
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-1081
>                 URL: https://issues.apache.org/jira/browse/HDFS-1081
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: security
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>         Attachments: HADOOP-1081-Y20-1.patch
>
>
> We've seen a significant decrease in the performance of 
> DistributedFileSystem::getFileBlockLocations() with security turned on in Y20. 
> This JIRA is for tracking and correcting it on both Y20 and trunk.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
