[ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210903#comment-15210903 ]
Tony Wu commented on HADOOP-12666:
----------------------------------

Hi [~vishwajeet.dusane],

Thank you for posting the semantics document, the meeting minutes, and a new patch. They were really helpful.

*General comments:*
# Regarding the semantics document: it would be great if you could also include more detail on how the ADL backend can "lock" a file for write. From the doc and our meeting discussion, there seem to be two ways (please confirm):
*# A file lease (which you included in the doc), used by {{createNonRecursive()}}.
*# Maintaining a connection to the backend (which you mentioned during the meeting), used by {{append()}}.
*** In this case, do we assume the ADL backend tracks which file is open for write by keeping track of the HTTP connections to the file?
# In {{hadoop-tools/hadoop-azure-datalake/src/site/markdown/index.md}} you mention:
{quote}User and group information returned as ListStatus and GetFileStatus is in form of GUID associated in Azure Active Directory.{quote}
There are applications that verify file ownership, and they will fail because of this. I also believe this means commands like {{hdfs dfs -ls <some path>}} will return GUIDs as user and group, which is not very readable. Can you comment on how users should handle this?
** I see there is an {{adl.debug.override.localuserasfileowner}} config which overrides the user with the local client user and the group with "hdfs", but this workaround is probably not intended for production use.

*Specific comments:*
1. Can you explain why {{flushAsync()}} is used in {{BatchAppendOutputStream}}? It seems {{flushAsync()}} is only used in the particular case where data left in the buffer from previous writes, combined with the current write, would cross the buffer boundary. I'm not sure why this particular flush has to be asynchronous. Consider the following case:
# {{BatchAppendOutputStream#flushAsync()}} returns.
{{flushAsync()}} submits the flush job to a worker thread and resets {{offset}} to 0:
{code:java}
private void flushAsync() throws IOException {
  if (offset > 0) {
    waitForOutstandingFlush();

    // Submit the new flush task to the executor
    flushTask = EXECUTOR.submit(new CommitTask(data, offset, eof));

    // Get a new internal buffer for the user to write
    data = getBuffer();
    offset = 0;
  }
}
{code}
# {{BatchAppendOutputStream#write()}} returns.
# The client closes the output stream.
# {{BatchAppendOutputStream#close()}} checks whether there is anything left to flush by inspecting {{offset}}. In this case there is nothing to flush, because {{offset}} was reset to 0 earlier:
{code:java}
boolean flushedSomething = false;
if (hadError) {
  // No point proceeding further since the error has occurred and
  // stream would be required to upload again.
  return;
} else {
  flushedSomething = offset > 0;
  flush();
}
{code}
# {{BatchAppendOutputStream#close()}} therefore does not wait for the async flush job to complete. From this point on, if the async flush hits any error, that error is lost and the client is never made aware of it.
# If the client then starts to write (append) to the same file with a new stream, that write will also not wait for the previous async job to complete, because {{flushTask}} is internal to {{BatchAppendOutputStream}}. It may therefore be possible for the two writes to reach the backend in reverse order.

IMHO, if this async flush is necessary, then {{EXECUTOR}} should be created inside {{BatchAppendOutputStream}} and shut down when the stream is closed. Currently {{EXECUTOR}} lives in {{PrivateAzureDataLakeFileSystem}}.
2. {{EXECUTOR}} in {{PrivateAzureDataLakeFileSystem}} is not shut down properly.
3. A stream-is-closed check is missing in {{BatchAppendOutputStream}}. This check is present in {{BatchByteArrayInputStream}}.
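To make points 1–3 concrete, here is a rough sketch of what I have in mind. This is only an illustration of the idea, not the actual {{BatchAppendOutputStream}} code; the class name, {{commit()}}, and {{BUFFER_SIZE}} are all hypothetical. The stream owns its own single-thread executor, {{close()}} waits for the outstanding async flush before inspecting {{offset}} (so async errors surface to the caller), and every operation guards against use after close:
{code:java}
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class AppendStreamSketch {
  private static final int BUFFER_SIZE = 4 * 1024 * 1024; // illustrative boundary

  // Executor owned by the stream, not by the file system object.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private Future<?> flushTask;   // last submitted async flush, if any
  private boolean closed;
  private int offset;            // bytes currently buffered locally
  final AtomicInteger committed = new AtomicInteger(); // bytes sent (for illustration)

  public synchronized void write(byte[] b) throws IOException {
    ensureOpen();
    offset += b.length;          // real code would also copy b into the buffer
    if (offset >= BUFFER_SIZE) {
      waitForOutstandingFlush(); // at most one async flush in flight
      final int toCommit = offset;
      flushTask = executor.submit(() -> { commit(toCommit); return null; });
      offset = 0;
    }
  }

  public synchronized void close() throws IOException {
    if (closed) {
      return;                    // close() is idempotent
    }
    try {
      // Wait for the in-flight async flush BEFORE checking offset,
      // so its errors surface here instead of being silently lost.
      waitForOutstandingFlush();
      if (offset > 0) {
        commit(offset);
        offset = 0;
      }
    } finally {
      closed = true;
      executor.shutdown();       // stream-scoped executor shut down here
    }
  }

  private void ensureOpen() throws IOException {
    if (closed) {
      throw new IOException("Stream is closed");
    }
  }

  private void waitForOutstandingFlush() throws IOException {
    if (flushTask == null) {
      return;
    }
    try {
      flushTask.get();           // re-throws any async flush failure
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted waiting for flush", e);
    } catch (ExecutionException e) {
      throw new IOException("Async flush failed", e);
    } finally {
      flushTask = null;
    }
  }

  // Placeholder for the HTTP append call to the ADL backend.
  private void commit(int len) throws IOException {
    committed.addAndGet(len);
  }
}
{code}
With this shape, the error-loss and write-reordering windows in the sequence above are closed: a second stream cannot race the first one's flush, because the first stream's {{close()}} does not return until the flush has either completed or thrown.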
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: Create_Read_Hadoop_Adl_Store_Semantics.pdf,
> HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch,
> HADOOP-12666-005.patch, HADOOP-12666-006.patch, HADOOP-12666-007.patch,
> HADOOP-12666-008.patch, HADOOP-12666-009.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>   Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc. to use the ADL store as
> input or output.
>
> ADL is an ultra-high-capacity store optimized for massive throughput, with
> rich management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)