[ 
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189587#comment-15189587
 ] 

Chris Nauroth commented on HADOOP-12666:
----------------------------------------

The create/append/flush sequence behaves very differently from stock WebHDFS.  
At the protocol layer, there is the addition of the flush parameter, which is a 
deviation from stock WebHDFS.  Each of the custom *Param classes represents a 
deviation from the WebHDFS protocol: leaseId, ADLFeatureSet, etc.

At the client layer, the aggressive client-side caching and buffering in the 
name of performance creates different behavior from stock WebHDFS.  I and 
others have called out that while perhaps you don't observe anything to be 
broken right now, that's no guarantee that cache consistency won't become a 
problem for certain applications.  This is not a wire protocol difference, but 
it is a significant deviation in behavior from stock WebHDFS.

At this point, it appears that the ADL protocol, while heavily inspired by the 
WebHDFS protocol, is not really a compatible match.  It is its own protocol 
with its own unique requirements for clients to use it correctly and use it 
well.  Accidentally connecting the ADL client to an HDFS cluster would be 
disastrous.  The create/append/flush sequence would put a massive, unsustainable 
load on the NameNode in terms of RPC calls and edit logging.  Client write 
latency would be unacceptable.  Likewise, accidentally connecting the stock 
WebHDFS client to ADL seems to yield unacceptable performance for ADL.

It is these large deviations that lead me to conclude the best choice is a 
dedicated client distinct from the WebHDFS client code.  Having full control of 
that client gives us the opportunity to provide the best possible user 
experience with ADL.  As I've stated before though, I can accept a short-term 
plan of some code reuse with the WebHDFS client.

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, 
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch, 
> HADOOP-12666-007.patch, HADOOP-12666-008.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft 
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing 
> Hadoop applications such as MR, Hive, HBase, etc., to use the ADL store as 
> input or output.
>  
> ADL is ultra-high capacity and optimized for massive throughput, with rich 
> management and security features. More details are available at 
> https://azure.microsoft.com/en-us/services/data-lake-store/


