[ 
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162960#comment-15162960
 ] 

Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------

Extracted comments from the review:

* AF>This class seems to have serious issues that need addressing:
        1. Local race conditions in caller PrivateAzureDataLakeFileSystem
        [Vishwajeet] - Already explained in the previous replies to comments 
from [~cnauroth] and [~eddyxu]. In the worst case, there would be a cache miss 
and this would not break any functionality.
        
        2. No mechanism for cache invalidation across nodes in the cluster.
        [Vishwajeet] - The scope of the cache is per process. Invalidation across 
nodes would be handled as part of a distributed cache mechanism in the future.
        
 * AF>Update comment? This uses "adl" scheme, right?
 [Vishwajeet] - Yes
 
 * AF>What is this class used for? I didn't see any uses.
 [Vishwajeet] - No longer required; this has been removed.
 
 * AF>Care to comment why this is in the ..hdfs.web package instead of fs.adl?
 [Vishwajeet] Already explained in the previous comments. In short, we need 
protected functionality within the `WebHdfsFileSystem` implementation to disable 
the redirect operation, access the Runner class for configuration, etc.
 
 * AF>?
 [Vishwajeet] The comment was trimmed; updated it again.
  
 * AF>Due to the bug or due to the fix? The fix was merged in 2.8.0, right?
 * AF>I'm not understanding this last sentence, can you explain?
 
 [Vishwajeet] Not needed anymore, since the home directory would be constructed 
locally and no back-end call would be necessary. Updated the comment accordingly.
 
 * AF>Is this a race condition?
        thread 1> getFileStatus(), cache miss
        super.getStatus -> s1
        cache.get() -> null
        thread 2> delete()
        cache.clear()
        thread 1> cache.put(s1)
        Maybe provide an atomic putIfAbsent() for FileStatusCacheManager. You 
can
        synchronize on the underlying map object I believe (see
        Collections.synchronizedMap()).
        
[Vishwajeet] - Already using `syncMap = Collections.synchronizedMap(map);`. We 
are aware of the limitation of the current implementation. The cache is 
short-lived, so the effect would not persist. Is there a real user scenario 
where such an issue can surface frequently? We have recommended turning off the 
`FileStatus` cache feature in case of misbehavior.
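The atomic `putIfAbsent()` the reviewer suggests can be built on top of the existing `Collections.synchronizedMap(map)` by synchronizing on the map object itself, so the check and the put happen under one lock. A minimal, hypothetical sketch (class shape and `String` value type are illustrative, not from the patch):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an atomic putIfAbsent for a cache backed by
// Collections.synchronizedMap(), as suggested in the review above.
public class FileStatusCacheSketch {
  private final Map<String, String> syncMap =
      Collections.synchronizedMap(new HashMap<String, String>());

  // Synchronizing on the synchronized map itself makes the
  // check-then-put a single atomic step, closing the window where a
  // concurrent clear() could be silently undone by a stale put().
  public String putIfAbsent(String path, String status) {
    synchronized (syncMap) {
      String existing = syncMap.get(path);
      if (existing == null) {
        syncMap.put(path, status);
      }
      return existing;
    }
  }

  public String get(String path) {
    return syncMap.get(path);
  }

  public void clear() {
    syncMap.clear();
  }
}
```

With this, the get-then-put in the trace above collapses into one atomic operation instead of two separately locked calls.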

 * AF> Seems like there is a less-likely race condition here. (f is replaced by 
a directory after checking fs.isFile())
[Vishwajeet] Why would there be a race condition?

 * AF>Similar pattern of get/mutate non-atomically repeats here and below.
 [Vishwajeet] Could you please elaborate on the issue?
 
 * AF>typo
  [Vishwajeet] Thanks, corrected.
  
 * AF>Did you guys consider handling this as transparently reconnecting, instead
of doing separate connections for each op? Seems like performance would be a lot
better?
  [Vishwajeet] It would be exactly the same: transparent re-connection would 
require passing new offset and length values again, which is a separate op 
anyway. HTTP persistent connections ensure the same socket is reused, so there 
is no impact on performance either. We have done extensive perf testing to 
confirm this.
  
 * AF>I'd expect you to want a connection per-thread, instead of per-op.
 [Vishwajeet] Connection per-op
 
 * AF> This case could use some perf optimization. e.g. Three calls to get 
system time. + return fin; + }
 [Vishwajeet] Agree. Corrected.

 * AF> How about adding ADLLogger.logWithTimestamp(). That way, if the logger 
is disabled, you don't keep getting system time.
 [Vishwajeet] Removed `System.currentTimeMillis()` calls when the debug and perf 
flags are OFF.
 
 * AF> Redundant check of isLogEnabled() + ADLLogger.log("getFileBlockLocations 
}
[Vishwajeet] Corrected

 * AF>Just use "name" twice instead of defining host?
 [Vishwajeet] Mainly for readability of the block location computation, so we 
would not change that.
 
 * AF> Why the runtime check of a compile-time constant? How about just add a 
comment near the definition "must be non-zero" + throw new 
IllegalArgumentException( + "The block size for the given file is not a 
positive number}
[Vishwajeet] Removed.

 * AF> Redundant check of isLogEnabled() + ADLLogger.log("getFileBlockLocations 
}
[Vishwajeet] Removed 

 * AF>Formatting. Missing newline above.
 [Vishwajeet] Using Apache.xml in IntelliJ for formatting, per the Apache 
guideline. Will look at the ruleset defined in Apache.xml later and raise a 
JIRA accordingly.
 
 * AF>Why volatile here? Needs comments.
        I have a feeling this is wrong and you need some synchronized blocks 
below
        instead.
 [Vishwajeet] - We removed the synchronized blocks for performance reasons and 
used `volatile` for the specific variables that require synchronization. The 
problem could arise when the same `FSInputStream` instance is used across 
threads to read from the same file. A couple of the variables do not need to be 
volatile, though; updated the code.
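The trade-off being discussed — `volatile` guarantees visibility of a single read or write, while compound read-modify-write steps still need `synchronized` — can be illustrated with a minimal sketch (field and method names are hypothetical, not from the patch):

```java
// Hypothetical sketch: volatile vs. synchronized for stream state
// shared across threads.
public class StreamStateSketch {
  // volatile is sufficient here: position is only ever set or read as a
  // single atomic long write/read, and volatile makes it visible to
  // other threads without a lock.
  private volatile long position;

  // bytesRead += n is a read-modify-write; volatile alone would lose
  // updates under contention, so it is guarded by synchronized methods.
  private long bytesRead;

  public long getPosition() { return position; }
  public void seek(long newPos) { position = newPos; }

  public synchronized void addBytesRead(long n) { bytesRead += n; }
  public synchronized long getBytesRead() { return bytesRead; }
}
```

The reviewer's concern applies to exactly the second kind of field: any counter or cursor updated with `+=` needs a lock (or an atomic class), not just `volatile`.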
 
 * AF>Probably unique. What is impact if this collides? Why not use a random 
java UUID?
 [Vishwajeet] - For performance reasons we have removed UUID usage. The value is 
for telemetry purposes; it is used with other parameters to identify the 
request. If values collide, the impact is none.
 
 * AF>this is not thread safe, but looking at FSInputStream, it appears read() 
is
 supposed to be.
Seems this class is racy in general?

 [Vishwajeet] - The assumption is that an `FSInputStream` instance's API would 
not be invoked from multiple threads. Are you aware of such a condition that we 
might encounter?
 
 * AF>I didn't have time to review this part thoroughly.. Can take
better look next round.
[Vishwajeet] Looking forward to your comments.

 * AF>Is this shutdown() needed?
 [Vishwajeet] - Do you see any issue here?
 
* AF> Curious, why don't you care about timeout expiring? Can't you just remove 
this whole try-catch block? The futures are what are ensuring tasks finish 
below, no? + }
[Vishwajeet] - The decision is to have a time-bound operation: if it does not 
complete within the duration, fail fast so that the application can retry.
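The fail-fast pattern described here can be sketched with `Future.get(timeout)`; this is a hypothetical illustration of the idea, not the patch's actual code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: bound an operation by a deadline and fail fast
// on expiry so the caller can retry.
public class TimeBoundOp {
  public static String runWithTimeout(Callable<String> task,
                                      long timeoutMs) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<String> future = pool.submit(task);
      try {
        // Block at most timeoutMs; TimeoutException means fail fast.
        return future.get(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        future.cancel(true); // interrupt the straggler, then surface it
        throw e;
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```

Under this pattern the try/catch around the timeout is not redundant: the futures ensure completion only if you are willing to wait indefinitely, whereas the timed `get()` is what enforces the deadline.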

 * AF>Add function comment w/ param and @return explanation?
 [Vishwajeet] - Done
 
 * AF>Can you explain what this openid does?
 - [Vishwajeet] To be used by ADL back-end for telemetry.
 
 * AF> humm.. we expect in.read() to return 0, and then return -1? Also, you 
can remove the string concatenation in the exception message. + }
 [Vishwajeet] We encountered this condition when the requested n bytes match the 
n bytes available, but the stream actually holds (n+1) bytes, where the +1 is 
the end-of-network-stream marker. This is mainly for reusing the socket 
connection: if the end-of-stream marker is not read, the next `URLConnection` 
instance re-initiates a new socket connection to the back-end server.
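The drain-to-EOF behavior described above can be sketched as follows; this is a hypothetical illustration over a plain `InputStream` (the patch's actual stream handling may differ):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: after consuming the requested bytes, read the
// stream to EOF (-1) so the HTTP layer can return the underlying
// socket to its keep-alive pool instead of opening a new connection.
public class DrainSketch {
  public static int readFully(InputStream in, byte[] buf) throws IOException {
    int total = 0;
    while (total < buf.length) {
      int n = in.read(buf, total, buf.length - total);
      if (n < 0) break; // stream ended early
      total += n;
    }
    // Consume through the end-of-stream marker; without this the
    // connection may not be reused for the next request.
    while (in.read() != -1) { /* discard trailing bytes */ }
    return total;
  }
}
```

This matches the explanation above: the extra read past the requested n bytes exists purely so the socket is left in a reusable state.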
 
 * AF> As noted before, using filesystem types to achieve logging tags is 
awkward.
 [Vishwajeet] Additional instrumentation is done in this class to capture perf 
data, hence the separate filesystem exposed. We need to rework that; I will 
file a separate JIRA for it.
 
 * AF>Again, need to put these in core-default.xml with documentation and
 [Vishwajeet] Yes, will add it as part of the patch. All the configuration keys 
defined as part of this patch are already lower-cased in patch 06; do let me 
know if you have found any discrepancy.
 

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, 
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch, 
> HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft 
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing 
> Hadoop applications such as MR, Hive, HBase, etc., to use the ADL store as 
> input or output.
>  
> ADL is an ultra-high-capacity store, optimized for massive throughput, with 
> rich management and security features. More details are available at 
> https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
