[ https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210903#comment-15210903 ]
Tony Wu commented on HADOOP-12666:
----------------------------------

Hi [~vishwajeet.dusane],

Thank you for posting the semantics document, the meeting minutes, and a new patch. They were really helpful.

*General comments:*
# Regarding the semantics document: it would be great if you could also include more detail on how the ADL backend can "lock" a file for write. From the doc and our meeting discussion, there seem to be two ways (please confirm):
*# A file lease (which you included in the doc), used by {{createNonRecursive()}}.
*# Maintaining a connection to the backend (which you mentioned during the meeting), used by {{append()}}.
*** In this case, do we assume the ADL backend tracks which file is open for write by keeping track of the HTTP connections to the file?
# In {{hadoop-tools/hadoop-azure-datalake/src/site/markdown/index.md}} you mention:
{quote}User and group information returned as ListStatus and GetFileStatus is in form of GUID associated in Azure Active Directory.{quote}
There are applications that verify file ownership, and they will fail because of this. I also believe this means commands like {{hdfs dfs -ls <some path>}} will return GUIDs as user and group, which is not very readable. Can you comment on how users should handle this?
** I see there is an {{adl.debug.override.localuserasfileowner}} config which overrides the user with the local client user and the group with "hdfs", but this workaround is probably not intended for production use.

*Specific comments:*
1. Can you explain why {{flushAsync()}} is used in {{BatchAppendOutputStream}}? It seems {{flushAsync()}} is only used in the particular case where data left in the buffer from previous writes, combined with the current write, would cross the buffer boundary. I'm not sure why this particular flush has to be asynchronous. Consider the following case:
# {{BatchAppendOutputStream#flushAsync()}} returns.
{{flushAsync()}} submits the flush job to a worker thread and resets {{offset}} to 0:
{code:java}
private void flushAsync() throws IOException {
  if (offset > 0) {
    waitForOutstandingFlush();

    // Submit the new flush task to the executor
    flushTask = EXECUTOR.submit(new CommitTask(data, offset, eof));

    // Get a new internal buffer for the user to write
    data = getBuffer();
    offset = 0;
  }
}
{code}
# {{BatchAppendOutputStream#write()}} returns.
# The client closes the output stream.
# {{BatchAppendOutputStream#close()}} checks whether there is anything left to flush by inspecting {{offset}}. In this case there is nothing to flush, because {{offset}} was reset to 0 earlier:
{code:java}
boolean flushedSomething = false;
if (hadError) {
  // No point proceeding further since the error has occurred and
  // stream would be required to upload again.
  return;
} else {
  flushedSomething = offset > 0;
  flush();
}
{code}
# {{BatchAppendOutputStream#close()}} therefore does not wait for the async flush job to complete. From this point on, if the async flush hits any error, that error is lost and the client is never made aware of it.
# If the client then starts to write (append) to the same file with a new stream, that write will also not wait for the previous async job to complete, because {{flushTask}} is internal to {{BatchAppendOutputStream}}. It may therefore be possible for the two writes to reach the backend in reverse order.

IMHO, if this async flush is necessary, then {{EXECUTOR}} should be created inside {{BatchAppendOutputStream}} and shut down when the stream is closed. Currently {{EXECUTOR}} lives in {{PrivateAzureDataLakeFileSystem}}.
2. {{EXECUTOR}} in {{PrivateAzureDataLakeFileSystem}} is not shut down properly.
3. A stream-is-closed check is missing in {{BatchAppendOutputStream}}. This check is present in {{BatchByteArrayInputStream}}.
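To make points 1–3 concrete, here is a rough sketch of what I have in mind. This is only an illustration of the idea, not the actual {{BatchAppendOutputStream}} code; the class name, {{commit()}}, and {{BUFFER_SIZE}} are all hypothetical. The stream owns its own single-thread executor, {{close()}} waits for the outstanding async flush before inspecting {{offset}} (so async errors surface to the caller), and every operation guards against use after close:
{code:java}
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class AppendStreamSketch {
  private static final int BUFFER_SIZE = 4 * 1024 * 1024; // illustrative boundary

  // Executor owned by the stream, not by the file system object.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private Future<?> flushTask;   // last submitted async flush, if any
  private boolean closed;
  private int offset;            // bytes currently buffered locally
  final AtomicInteger committed = new AtomicInteger(); // bytes sent (for illustration)

  public synchronized void write(byte[] b) throws IOException {
    ensureOpen();
    offset += b.length;          // real code would also copy b into the buffer
    if (offset >= BUFFER_SIZE) {
      waitForOutstandingFlush(); // at most one async flush in flight
      final int toCommit = offset;
      flushTask = executor.submit(() -> { commit(toCommit); return null; });
      offset = 0;
    }
  }

  public synchronized void close() throws IOException {
    if (closed) {
      return;                    // close() is idempotent
    }
    try {
      // Wait for the in-flight async flush BEFORE checking offset,
      // so its errors surface here instead of being silently lost.
      waitForOutstandingFlush();
      if (offset > 0) {
        commit(offset);
        offset = 0;
      }
    } finally {
      closed = true;
      executor.shutdown();       // stream-scoped executor shut down here
    }
  }

  private void ensureOpen() throws IOException {
    if (closed) {
      throw new IOException("Stream is closed");
    }
  }

  private void waitForOutstandingFlush() throws IOException {
    if (flushTask == null) {
      return;
    }
    try {
      flushTask.get();           // re-throws any async flush failure
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted waiting for flush", e);
    } catch (ExecutionException e) {
      throw new IOException("Async flush failed", e);
    } finally {
      flushTask = null;
    }
  }

  // Placeholder for the HTTP append call to the ADL backend.
  private void commit(int len) throws IOException {
    committed.addAndGet(len);
  }
}
{code}
With this shape, the error-loss and write-reordering windows in the sequence above are closed: a second stream cannot race the first one's flush, because the first stream's {{close()}} does not return until the flush has either completed or thrown.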
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: Create_Read_Hadoop_Adl_Store_Semantics.pdf,
> HADOOP-12666-002.patch, HADOOP-12666-003.patch, HADOOP-12666-004.patch,
> HADOOP-12666-005.patch, HADOOP-12666-006.patch, HADOOP-12666-007.patch,
> HADOOP-12666-008.patch, HADOOP-12666-009.patch, HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>   Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc. to use the ADL store as
> input or output.
>
> ADL is an ultra-high-capacity store optimized for massive throughput, with
> rich management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)