[jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop

Vishwajeet Dusane (JIRA) Tue, 09 Feb 2016 03:24:48 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138804#comment-15138804
 ]


Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------

Thanks [~eddyxu] for the comments.

{quote}
* You mentioned in the above comments. But PrivateAzureDataLakeFileSystem does 
not call it within synchronized calls (e.g., 
PrivateAzureDataLakeFileSystem#create). Although syncMap is a synchronizedMap, 
putFileStatus has multiple operations on syncMap, which cannot guarantee 
atomicity.

* It might be a better idea to provide atomicity in 
PrivateAzureDataLakeFileSystem. A couple of places have multiple cache calls 
within the same function (e.g., rename()).
{quote}

putFileStatus has only one operation on syncMap. Could you please elaborate on 
the scenario that could be affected? To be certain, are you reviewing 
HADOOP-12666-005.patch?
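
To make the atomicity question concrete, here is a minimal sketch of the 
difference between a single call and a compound call on a synchronizedMap. 
It is not the code from the patch: the class StatusCacheSketch, its String 
values, and its rename method are hypothetical stand-ins used only for 
illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a FileStatus cache; NOT the code from the patch.
class StatusCacheSketch {
  private final Map<String, String> syncMap =
      Collections.synchronizedMap(new HashMap<String, String>());

  // A single map call is atomic: synchronizedMap locks around each call.
  void put(String path, String status) {
    syncMap.put(path, status);
  }

  String get(String path) {
    return syncMap.get(path);
  }

  // A compound update (remove, then put) is two separate map calls.
  // Without the synchronized block another thread could see the moment
  // where neither key is present. Locking on the wrapper map itself uses
  // the same monitor that synchronizedMap uses internally.
  void rename(String src, String dst) {
    synchronized (syncMap) {
      String status = syncMap.remove(src);
      if (status != null) {
        syncMap.put(dst, status);
      }
    }
  }
}
```

If putFileStatus really is a single put, as stated above, only the callers 
that combine several cache calls (such as rename) would need the extra lock.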

{quote}
* It might be a good idea to rename FileStatusCacheManager#getFileStatus, 
putFileStatus, removeFileStatus to get/put/remove, because the class name 
already clearly indicates the context.
{quote}

Agreed. Renamed to get/put/remove.

{quote}
* FileStatusCacheObject can only store an absolute expiration time. And its 
methods can be package-level methods.
{quote}

You are right, this is an alternate approach to handling cache expiration time. 
I think we can live with the current implementation, which uses a time-to-live 
check. Please let me know if you find any issue with that approach.
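
For comparison, the two expiration styles discussed here can be sketched as 
follows. Both classes are hypothetical illustrations, not the 
FileStatusCacheObject from the patch; the field names and the choice of 
milliseconds are assumptions. The methods are package-level, as suggested in 
the review.

```java
// Hypothetical entry that keeps creation time plus a time to live (TTL).
class TtlEntrySketch {
  private final long createdAtMs;
  private final long ttlMs;

  TtlEntrySketch(long createdAtMs, long ttlMs) {
    this.createdAtMs = createdAtMs;
    this.ttlMs = ttlMs;
  }

  // The expiry check does the arithmetic on every lookup.
  boolean isExpired(long nowMs) {
    return nowMs - createdAtMs >= ttlMs;
  }
}

// Hypothetical entry that stores only the absolute expiration time.
class AbsoluteExpiryEntrySketch {
  private final long expiresAtMs;

  AbsoluteExpiryEntrySketch(long createdAtMs, long ttlMs) {
    // The arithmetic happens once, when the entry is created.
    this.expiresAtMs = createdAtMs + ttlMs;
  }

  boolean isExpired(long nowMs) {
    return nowMs >= expiresAtMs;
  }
}
```

Under these assumptions both variants give the same answer for any lookup 
time; the absolute form only saves one field and one subtraction per check.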

{quote}
* I saw a few places, e.g., PrivateAzureDataLakeFileSystem#rename/delete, that 
clear the cache if the param is a directory. Could you justify the reason 
behind this? Would it cause noticeable performance degradation? Or as an 
alternative, using LinkedList + TreeMap for FileStatusCacheManager?
{quote}

Yes. This avoids performance and correctness issues when a directory is renamed 
or deleted. In such cases the cache holds stale entries, which must be removed 
so that a getFileStatus call following the delete/rename (for a file or folder 
inside that directory) does not return stale data. At the point of folder 
deletion, the cache might hold multiple FileStatus instances from within the 
directory. It is more efficient to clear the cache and rebuild it than to 
iterate over the entries.
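
The trade-off can be sketched as below. This is an illustration only, not the 
FileStatusCacheManager from the patch: the class DirInvalidationSketch, its 
String values, and both method names are hypothetical. It contrasts clearing 
everything with the sorted-map alternative raised in the review, where the 
entries under one directory form a contiguous key range.

```java
import java.util.TreeMap;

// Hypothetical path-keyed cache; NOT the FileStatusCacheManager in the patch.
class DirInvalidationSketch {
  final TreeMap<String, String> cache = new TreeMap<String, String>();

  // Simple approach described above: drop everything and let later
  // listStatus/getFileStatus calls repopulate the cache.
  void invalidateAll() {
    cache.clear();
  }

  // Alternative: remove only the directory and the entries beneath it.
  void invalidateDirectory(String dir) {
    // Normalize "/d/" to "/d" so the range bounds below are well formed.
    String base = dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
    cache.remove(base);
    // Every key that starts with base + "/" sorts at or after base + "/"
    // and strictly before base + "0", because '0' is the character that
    // immediately follows '/' in String ordering.
    cache.subMap(base + "/", true, base + "0", false).clear();
  }
}
```

The range removal keeps unrelated entries warm at the cost of a sorted map; 
clearing the whole cache is simpler and needs no ordering on the keys.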

The current cache is a basic implementation that holds FileStatus instances as 
a starting point; we will continue to enhance it in upcoming changes.

{quote}
* One general question, is this FileStatusCacheManager in HdfsClient? If it is 
the case, how do you make them consistent across clients on multiple nodes?
{quote}

FileStatusCacheManager need not be consistent across clients. Each 
FileStatusCacheManager is built from the ListStatus and GetFileStatus calls 
made by its own client.

{quote}
* Can we use Preconditions here? It will be more descriptive.
{quote}

Are you referring to com.google.common.base.Preconditions? 


> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, 
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft 
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing 
> Hadoop applications such as MR, Hive, HBase, etc., to use the ADL store as 
> input or output.
>  
> ADL is an ultra-high capacity store, optimized for massive throughput, with 
> rich management and security features. More details are available at 
> https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
