[ https://issues.apache.org/jira/browse/HDDS-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943983#comment-16943983 ]

Aravindan Vijayan edited comment on HDDS-2188 at 10/3/19 8:39 PM:
------------------------------------------------------------------

After discussing with [~msingh], we identified the following items that need to be implemented to make this work correctly. I will create separate JIRAs to handle the individual work items.

*This JIRA will cover the following*
* The getFileStatus call on the Ozone file system will compute and return an instance of LocatedFileStatus. This means that the status will carry the block locations for the file (if present). Map-Reduce applications will use this automatically.
* For the Ozone Manager to supplement getFileStatus with block location info, we need to use a flag like the "refreshPipeline" flag to obtain the information from the SCM. Whenever the flag is set, OM will get the block locations from SCM and include them in the returned file status.

*New JIRAs will be created for the following.*
* Currently, we get block location info from SCM for every block. This leads to multiple SCM RPC calls to get the blocks for a single file. We can implement a batch GET API for SCM that returns the block location info for all the blocks of a file. (HDDS-2241)
* As an optimization, we can cache the block info in the FileSystem client layer so that we can reuse it instead of making an RPC call. An expiry-based Guava cache is one candidate.


was (Author: avijayan):
After discussing with [~msingh], we identified the following items that need to be implemented to make this work correctly. I will create separate JIRAs to handle the individual work items.

*This JIRA will cover the following*
* The getFileStatus call on the Ozone file system will compute and return an instance of LocatedFileStatus. This means that the status will carry the block locations for the file (if present). Map-Reduce applications will use this automatically.
* For the Ozone Manager to supplement getFileStatus with block location info, we need to use a flag like the "refreshPipeline" flag to obtain the information from the SCM. Whenever the flag is set, OM will get the block locations from SCM and include them in the returned file status.

*New JIRAs will be created for the following.*
* Currently, we get block location info from SCM for every block. This leads to multiple SCM RPC calls to get the blocks for a single file. We can implement a batch GET API for SCM that returns the block location info for all the blocks of a file.
* As an optimization, we can cache the block info in the FileSystem client layer so that we can reuse it instead of making an RPC call. An expiry-based Guava cache is one candidate.

> Implement LocatedFileStatus & getFileBlockLocations to provide
> node/localization information to Yarn/Mapreduce
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-2188
>                 URL: https://issues.apache.org/jira/browse/HDDS-2188
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Filesystem
>    Affects Versions: 0.5.0
>            Reporter: Mukul Kumar Singh
>            Assignee: Aravindan Vijayan
>            Priority: Major
>
> For applications like Hive/MapReduce to take advantage of the data locality
> in Ozone, Ozone should return the location of the Ozone blocks. This is
> needed for better read performance for Hadoop Applications.
> {code}
> if (file instanceof LocatedFileStatus) {
>   blkLocations = ((LocatedFileStatus) file).getBlockLocations();
> } else {
>   blkLocations = fs.getFileBlockLocations(file, 0, length);
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
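
The client-side caching idea from the comment above can be illustrated with a minimal sketch. This is not Ozone code and it does not use Guava: it is a plain-Java stand-in for an expiry-based cache, written only to show the intended behavior. The class name BlockLocationCache, the String key, the List<String> of host names, and the loader function (standing in for the RPC that fetches block locations) are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/**
 * Hypothetical expiry-based cache of block locations, keyed by file name.
 * A plain-Java stand-in for the Guava cache suggested in the comment.
 */
class BlockLocationCache {

  /** A cached value together with the time at which it was loaded. */
  private static final class Entry {
    final List<String> hosts;
    final long loadedAtMillis;

    Entry(List<String> hosts, long loadedAtMillis) {
      this.hosts = hosts;
      this.loadedAtMillis = loadedAtMillis;
    }
  }

  private final Map<String, Entry> entries = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock;                     // injectable for testing
  private final Function<String, List<String>> loader;  // assumed stand-in for the RPC

  BlockLocationCache(long ttlMillis, LongSupplier clock,
      Function<String, List<String>> loader) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
    this.loader = loader;
  }

  /**
   * Returns the cached locations for the key, reloading them through the
   * loader when the entry is missing or at least ttlMillis old. Concurrent
   * callers may each trigger a reload; this sketch does not deduplicate them.
   */
  List<String> get(String key) {
    long now = clock.getAsLong();
    Entry entry = entries.get(key);
    if (entry == null || now - entry.loadedAtMillis >= ttlMillis) {
      entry = new Entry(loader.apply(key), now);
      entries.put(key, entry);
    }
    return entry.hosts;
  }
}
```

With a cache like this in the FileSystem client layer, repeated lookups for the same file within the expiry window would reuse the cached locations instead of issuing another RPC. The trade-off is that locations can be stale for up to the configured expiry time.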