[ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Kapoor resolved CARBONDATA-4050.
--------------------------------------
    Resolution: Fixed

> TPC-DS queries performance degraded when compared to older versions due to 
> redundant getFileStatus() invocations
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CARBONDATA-4050
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4050
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.0.0
>            Reporter: Venugopal Reddy K
>            Priority: Major
>             Fix For: 2.1.1
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> *Issue:*
> In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of 
> carbondata files in the segment, loop through all the carbon files, and build 
> a map fileNameToMetaInfoMapping<path-string, BlockMetaInfo>.
>       Inside that loop, if the file is of AbstractDFSCarbonFile type, we 
> fetch the org.apache.hadoop.fs.FileStatus three times for each file. Each 
> fetch is an RPC call (fileSystem.getFileStatus(path)) that takes ~2 ms in the 
> cluster, so each file incurs an overhead of ~6 ms. With many carbon files, 
> driver-side query processing time therefore increases significantly, which 
> caused the TPC-DS query performance degradation.
> The methods/calls that fetch the file status inside the loop are shown below:
> {code:java}
> public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
>     String segmentFilePath, Configuration configuration) throws IOException {
>   Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
>   CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
>   if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
>     PathFilter pathFilter = new PathFilter() {
>       @Override
>       public boolean accept(Path path) {
>         return CarbonTablePath.isCarbonDataFile(path.getName());
>       }
>     };
>     CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
>     for (CarbonFile file : carbonFiles) {
>       String[] location = file.getLocations(); // RPC call - 1
>       long len = file.getSize(); // RPC call - 2
>       BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
>       // RPC call - 3, inside file.getPath()
>       fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
>     }
>   }
>   return fileNameToMetaInfoMapping;
> }
> {code}
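As a back-of-envelope check of the figures above, the following sketch multiplies out the reported numbers (~2 ms per getFileStatus() RPC, 3 redundant calls per file); the 10,000-file count is an illustrative assumption, not a measurement from the report:

```java
public class RpcOverheadEstimate {

  // Driver-side overhead from redundant getFileStatus() RPCs.
  // Per-RPC latency (~2 ms) and calls-per-file (3) are the figures quoted
  // in this report; the file count is purely illustrative.
  static long overheadMillis(long fileCount, long rpcCallsPerFile, long msPerRpc) {
    return fileCount * rpcCallsPerFile * msPerRpc;
  }

  public static void main(String[] args) {
    // 10,000 carbondata files -> ~60 s of extra driver-side planning time
    System.out.println(overheadMillis(10_000, 3, 2) + " ms"); // prints "60000 ms"
  }
}
```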
>  
> *Suggestion:*
> Currently, we make an RPC call to fetch the file status on each invocation 
> because the file status may change over time, so in general we shouldn't 
> cache it in AbstractDFSCarbonFile.
>      In this particular case, however, just before the loop over the carbon 
> files we already fetch the file status of every carbon file in the segment 
> with the RPC call shown below. LocatedFileStatus is a subclass of FileStatus 
> that carries the BlockLocation along with the file status.
> {code:java}
> RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);
> {code}
>         The intention of fetching all the file statuses here is to create 
> BlockMetaInfo instances and populate fileNameToMetaInfoMapping. So it is safe 
> to avoid the redundant RPC calls that fetch the file status again in the 
> getLocations(), getSize() and getPath() methods.
>  
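The suggestion could look roughly like the following self-contained sketch. The class names here (LocatedStatusStub, CachedStatusSketch) are hypothetical stand-ins for the Hadoop/CarbonData types, used only to illustrate building the map directly from statuses already returned by the single listing RPC, with no per-file getFileStatus() round trips:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for org.apache.hadoop.fs.LocatedFileStatus:
// everything the loop needs is already in it, fetched by one listing RPC.
class LocatedStatusStub {
  final String path;
  final long length;
  final String[] hosts;

  LocatedStatusStub(String path, long length, String[] hosts) {
    this.path = path;
    this.length = length;
    this.hosts = hosts;
  }
}

// Mirrors the shape of CarbonData's BlockMetaInfo(locations, size).
class BlockMetaInfo {
  final String[] locations;
  final long size;

  BlockMetaInfo(String[] locations, long size) {
    this.locations = locations;
    this.size = size;
  }
}

public class CachedStatusSketch {

  // Builds fileNameToMetaInfoMapping straight from the listed statuses;
  // no additional RPC is issued inside the loop.
  static Map<String, BlockMetaInfo> build(LocatedStatusStub[] statuses) {
    Map<String, BlockMetaInfo> map = new TreeMap<>();
    for (LocatedStatusStub s : statuses) {
      map.put(s.path, new BlockMetaInfo(s.hosts, s.length));
    }
    return map;
  }

  public static void main(String[] args) {
    LocatedStatusStub[] statuses = {
        new LocatedStatusStub("/seg0/part-0.carbondata", 1024L, new String[] {"dn1"}),
        new LocatedStatusStub("/seg0/part-1.carbondata", 2048L, new String[] {"dn2"})
    };
    Map<String, BlockMetaInfo> map = build(statuses);
    System.out.println(map.size()); // prints "2"
  }
}
```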



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
