[ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138804#comment-15138804 ]
Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------

Thanks [~eddyxu] for the comments.

{quote}
* You mentioned in the above comments. But PrivateAzureDataLakeFileSystem does not call it within synchronized calls (e.g., PrivateAzureDataLakeFileSystem#create). Although syncMap is a synchronizedMap, putFileStatus has multiple operations on syncMap, which can not guarantee atomicity.
* It might be a better idea to provide atomicity in PrivateAzureDataLakeFileSystem. A couple of places have multiple cache calls within the same function (e.g., rename()).
{quote}
putFileStatus has only one operation on syncMap. Could you please elaborate on the scenario that could be affected? To be certain, are you reviewing HADOOP-12666-005.patch?

{quote}
* It might be a good idea to rename FileStatusCacheManager#getFileStatus, putFileStatus, removeFileStatus to get/put/remove, because the class name already clearly indicates the context.
{quote}
Agreed. Renamed to get/put/remove.

{quote}
* FileStatusCacheObject can only store an absolute expiration time. And its methods can be package-level methods.
{quote}
You are right, that is an alternative approach to handling cache expiration. I think we can live with the current implementation, which uses a time-to-live check. Please let me know if you find any issue with that approach.

{quote}
* I saw a few places, e.g., PrivateAzureDataLakeFileSystem#rename/delete, that clear the cache if the param is a directory. Could you justify the reason behind this? Would it cause noticeable performance degradation? Or as an alternative, using LinkedList + TreeMap for FileStatusCacheManager?
{quote}
Yes, this is to avoid performance and correctness issues when a directory is renamed or deleted. In such cases the cache holds stale entries, which need to be removed so that a getFileStatus call following the delete/rename (for a file or folder inside the directory) does not return stale results.
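
To make the stale-entry problem concrete, here is a minimal sketch, not the code in the patch. {{SimpleStatusCache}}, its {{String}} status values, and {{onDirectoryChanged}} are hypothetical names invented for illustration; only the synchronizedMap and the time-to-live check mirror what is discussed in this thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a FileStatus cache; names and structure are
// illustrative assumptions, not the classes in HADOOP-12666-005.patch.
class SimpleStatusCache {
    // Cached value plus its insertion time, used for the time-to-live check.
    private static final class Entry {
        final String status;
        final long createdAtMillis;

        Entry(String status, long createdAtMillis) {
            this.status = status;
            this.createdAtMillis = createdAtMillis;
        }
    }

    private final Map<String, Entry> syncMap =
        Collections.synchronizedMap(new HashMap<String, Entry>());
    private final long ttlMillis;

    SimpleStatusCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // A single map operation, so the synchronizedMap wrapper makes it atomic.
    void put(String path, String status, long nowMillis) {
        syncMap.put(path, new Entry(status, nowMillis));
    }

    // Returns the cached status, or null if it is absent or older than the TTL.
    String get(String path, long nowMillis) {
        Entry e = syncMap.get(path);
        if (e == null) {
            return null;
        }
        if (nowMillis - e.createdAtMillis >= ttlMillis) {
            // Remove only if the mapping is still this exact entry, so a
            // fresher value written by another thread is not discarded.
            syncMap.remove(path, e);
            return null;
        }
        return e.status;
    }

    // Called after a directory rename or delete: any entry may now be
    // stale, so drop everything and let later lookups repopulate the cache.
    void onDirectoryChanged() {
        syncMap.clear();
    }

    int size() {
        return syncMap.size();
    }
}
```

With this sketch, a lookup for a child path after {{onDirectoryChanged()}} misses the cache and would fall through to the store instead of returning the pre-rename status.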
At the point of folder deletion, the cache might be holding multiple FileStatus instances from within the directory. It is more efficient to nuke the cache and rebuild it than to iterate over it. The current cache is a basic implementation for holding FileStatus instances to start with, and we will continue to enhance it in upcoming changes.

{quote}
* One general question, is this FileStatusCacheManager in HdfsClient? If it is the case, how do you make them consistent across clients on multiple nodes?
{quote}
FileStatusCacheManager need not be consistent across clients. Each client's FileStatusCacheManager is built from that client's own ListStatus and GetFileStatus calls.

{quote}
* Can we use Preconditions here? It will be more descriptive.
{quote}
Are you referring to com.google.common.base.Preconditions?

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch,
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc., to use the ADL store as
> input or output.
>
> ADL is an ultra-high-capacity store, optimized for massive throughput, with rich
> management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)