[ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189587#comment-15189587 ]
Chris Nauroth commented on HADOOP-12666:
----------------------------------------

The create/append/flush sequence is hugely different behavior. At the protocol layer, there is the addition of the flush parameter, which is a deviation from stock WebHDFS. Basically any of the custom *Param classes represent deviations from WebHDFS protocol: leaseId, ADLFeatureSet, etc.

At the client layer, the aggressive client-side caching and buffering in the name of performance creates different behavior from stock WebHDFS. I and others have called out that while perhaps you don't observe anything to be broken right now, that's no guarantee that cache consistency won't become a problem for certain applications. This is not a wire protocol difference, but it is a significant deviation in behavior from stock WebHDFS.

At this point, it appears that the ADL protocol, while heavily inspired by the WebHDFS protocol, is not really a compatible match. It is its own protocol with its own unique requirements for clients to use it correctly and use it well.

Accidentally connecting the ADL client to an HDFS cluster would be disastrous. The create/append/flush sequence would cause massive unsustainable load to the NameNode in terms of RPC calls and edit logging. Client write latency would be unacceptable. Likewise, accidentally connecting the stock WebHDFS client to ADL seems to yield unacceptable performance for ADL.

It is these large deviations that lead me to conclude the best choice is a dedicated client distinct from the WebHDFS client code. Having full control of that client gives us the opportunity to provide the best possible user experience with ADL. As I've stated before though, I can accept a short-term plan of some code reuse with the WebHDFS client.

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch,
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch,
> HADOOP-12666-007.patch, HADOOP-12666-008.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc., to use ADL store as
> input or output.
>
> ADL is ultra-high capacity, optimized for massive throughput, with rich
> management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/