
Aravindan Vijayan (Jira) Thu, 03 Oct 2019 13:40:12 -0700


    [ https://issues.apache.org/jira/browse/HDDS-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943983#comment-16943983 ]


Aravindan Vijayan edited comment on HDDS-2188 at 10/3/19 8:39 PM:
------------------------------------------------------------------

After discussing with [~msingh], we identified the following work items needed 
to make this work correctly. I will create separate JIRAs for the work items 
not covered here.

*This JIRA will cover the following*
* The getFileStatus call on the Ozone file system will compute and return an 
instance of LocatedFileStatus, so the status will carry the block locations 
for the file (if present). Map-Reduce applications will pick these up 
automatically. 
* For the Ozone Manager (OM) to supplement getFileStatus with block locations, 
we need to use a flag like the "refreshPipeline" flag to obtain the 
information from the SCM. Whenever the flag is set, OM will get the block 
locations from SCM and include them in the returned file status.
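
A minimal sketch of the flow in the two items above, using plain Java stand-ins 
rather than the real Hadoop or Ozone classes. ScmClient, Status and this 
getFileStatus signature are hypothetical names introduced only for 
illustration; the point is that locations are attached to the status only when 
the refreshPipeline-style flag is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative model only; these are not the real Hadoop or Ozone classes. */
final class LocatedStatusSketch {

    /** Hypothetical stand-in for the SCM lookup: returns the hosts for one block. */
    interface ScmClient {
        List<String> getHosts(long blockId);
    }

    /** Hypothetical stand-in for a file status that may carry block locations. */
    static final class Status {
        final String path;
        /** One host list per block; empty when locations were not requested. */
        final List<List<String>> blockHosts;

        Status(String path, List<List<String>> blockHosts) {
            this.path = path;
            this.blockHosts = Collections.unmodifiableList(blockHosts);
        }

        boolean isLocated() {
            return !blockHosts.isEmpty();
        }
    }

    /**
     * Models the OM side: block locations are fetched from SCM only when the
     * refreshPipeline-style flag is set.
     */
    static Status getFileStatus(String path, List<Long> blockIds,
                                boolean refreshPipeline, ScmClient scm) {
        List<List<String>> hosts = new ArrayList<>();
        if (refreshPipeline) {
            for (long id : blockIds) {
                hosts.add(scm.getHosts(id)); // one SCM call per block
            }
        }
        return new Status(path, hosts);
    }

    public static void main(String[] args) {
        // Fake SCM that reports two datanodes per block.
        ScmClient scm = id -> Arrays.asList("dn" + id + "a", "dn" + id + "b");
        List<Long> blocks = Arrays.asList(1L, 2L);

        Status located = getFileStatus("/vol/bucket/key", blocks, true, scm);
        Status plain = getFileStatus("/vol/bucket/key", blocks, false, scm);

        System.out.println(located.isLocated() + " " + located.blockHosts);
        System.out.println(plain.isLocated());
    }
}
```

Note that the loop makes one SCM call per block, which is the cost the batch 
API proposed in this comment is meant to remove.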

*New JIRAs will be created for the following.* 
* Currently, we get block location info from SCM one block at a time, which 
leads to multiple SCM RPC calls to locate the blocks of a single file. We can 
implement a batch GET API in SCM that returns the locations of all the blocks 
of a file in one call. (HDDS-2241)
* As an optimization, we can cache the block info in the FileSystem client 
layer and reuse it instead of making an RPC call. An expiry-based Guava cache 
is one candidate. 
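
The two follow-up items could fit together roughly as sketched below. This is 
only an illustration under stated assumptions: BatchScmClient and 
getHostsBatch are hypothetical names, not an existing SCM API, and a 
hand-rolled map with timestamps stands in for the Guava cache so that the 
example needs nothing beyond the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

/** Illustrative model only; names here are hypothetical, not Ozone APIs. */
final class BlockLocationCacheSketch {

    /** Hypothetical batch lookup: one call returns hosts for many blocks. */
    interface BatchScmClient {
        Map<Long, List<String>> getHostsBatch(List<Long> blockIds);
    }

    /** Cached hosts for one block plus the time at which they expire. */
    private static final class CachedHosts {
        final List<String> hosts;
        final long expiresAtMillis;

        CachedHosts(List<String> hosts, long expiresAtMillis) {
            this.hosts = hosts;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<Long, CachedHosts> cache = new HashMap<>();
    private final BatchScmClient scm;
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested

    BlockLocationCacheSketch(BatchScmClient scm, long ttlMillis, LongSupplier clock) {
        this.scm = scm;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns hosts per block, issuing at most one batch call for all misses. */
    synchronized Map<Long, List<String>> getLocations(List<Long> blockIds) {
        long now = clock.getAsLong();
        Map<Long, List<String>> result = new HashMap<>();
        List<Long> misses = new ArrayList<>();
        for (Long id : blockIds) {
            CachedHosts c = cache.get(id);
            if (c != null && c.expiresAtMillis > now) {
                result.put(id, c.hosts);   // fresh cache hit
            } else {
                misses.add(id);            // absent or expired
            }
        }
        if (!misses.isEmpty()) {
            Map<Long, List<String>> fetched = scm.getHostsBatch(misses);
            for (Map.Entry<Long, List<String>> f : fetched.entrySet()) {
                cache.put(f.getKey(), new CachedHosts(f.getValue(), now + ttlMillis));
                result.put(f.getKey(), f.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] rpcCalls = {0};
        long[] now = {0L};
        BatchScmClient scm = ids -> {
            rpcCalls[0]++;
            Map<Long, List<String>> out = new HashMap<>();
            for (Long id : ids) {
                out.put(id, Arrays.asList("dn" + id));
            }
            return out;
        };
        BlockLocationCacheSketch cache =
            new BlockLocationCacheSketch(scm, 1000L, () -> now[0]);

        cache.getLocations(Arrays.asList(1L, 2L, 3L)); // one batch call, three blocks
        cache.getLocations(Arrays.asList(1L, 2L, 3L)); // served from the cache
        now[0] = 2000L;                                 // move past the expiry
        cache.getLocations(Arrays.asList(1L));          // expired, fetched again
        System.out.println("batch calls: " + rpcCalls[0]); // prints "batch calls: 2"
    }
}
```

The sketch never evicts entries for blocks that are not requested again; a 
size-bounded, expiry-based cache such as the Guava one mentioned above would 
take care of that.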


> Implement LocatedFileStatus & getFileBlockLocations to provide 
> node/localization information to Yarn/Mapreduce
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-2188
>                 URL: https://issues.apache.org/jira/browse/HDDS-2188
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Filesystem
>    Affects Versions: 0.5.0
>            Reporter: Mukul Kumar Singh
>            Assignee: Aravindan Vijayan
>            Priority: Major
>
> For applications like Hive/MapReduce to take advantage of the data locality 
> in Ozone, Ozone should return the location of the Ozone blocks. This is 
> needed for better read performance for Hadoop Applications.
> {code}
>         if (file instanceof LocatedFileStatus) {
>           blkLocations = ((LocatedFileStatus) file).getBlockLocations();
>         } else {
>           blkLocations = fs.getFileBlockLocations(file, 0, length);
>         }
> {code}




[jira] [Comment Edited] (HDDS-2188) Implement LocatedFileStatus & getFileBlockLocations to provide node/localization information to Yarn/Mapreduce
