[ https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kunal Kapoor resolved CARBONDATA-4050.
--------------------------------------
    Resolution: Fixed

> TPC-DS queries performance degraded when compared to older versions due to
> redundant getFileStatus() invocations
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CARBONDATA-4050
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4050
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.0.0
>            Reporter: Venugopal Reddy K
>            Priority: Major
>             Fix For: 2.1.1
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> *Issue:*
> In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata files in the segment, loop through all the carbon files, and build a map fileNameToMetaInfoMapping<path-string, BlockMetaInfo>.
> Inside that loop, if the file is of AbstractDFSCarbonFile type, we fetch the org.apache.hadoop.fs.FileStatus three times for each file. The method that gets the file status is an RPC call (fileSystem.getFileStatus(path)) and takes ~2ms per call in the cluster, so each file incurs an overhead of ~6ms. For example, a segment with 10,000 carbondata files would spend roughly 10,000 × 6 ms ≈ 60 seconds on these calls alone. As a result, driver-side query processing time increases significantly when there are many carbon files, which caused the TPC-DS query performance degradation.
> The methods/calls that fetch the file status for each carbon file inside the loop are shown below:
> {code:java}
> public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
>     String segmentFilePath, Configuration configuration) throws IOException {
>   Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
>   CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
>   if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
>     PathFilter pathFilter = new PathFilter() {
>       @Override
>       public boolean accept(Path path) {
>         return CarbonTablePath.isCarbonDataFile(path.getName());
>       }
>     };
>     CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
>     for (CarbonFile file : carbonFiles) {
>       String[] location = file.getLocations();  // RPC call - 1
>       long len = file.getSize();                // RPC call - 2
>       BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
>       // RPC call - 3 inside file.getPath()
>       fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
>     }
>   }
>   return fileNameToMetaInfoMapping;
> }
> {code}
>
> *Suggestion:*
> Currently we make an RPC call to fetch the file status on each invocation because the file status may change over time, so in general we should not cache the file status in AbstractDFSCarbonFile.
> In this case, however, just before the loop over the carbon files we already fetch the file status of every carbon file in the segment with the RPC call shown below. LocatedFileStatus is a subclass of FileStatus that carries the BlockLocations along with the file status.
> {code:java}
> RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);
> {code}
> The intention of fetching all the file statuses here is to create the BlockMetaInfo instances and populate the fileNameToMetaInfoMapping map.
> So it is safe to avoid the unnecessary RPC calls that fetch the file status again in the getLocations(), getSize() and getPath() methods.
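> A minimal sketch of what this could look like is shown below: the map is built directly from the LocatedFileStatus entries returned by listLocatedStatus(), so no further per-file getFileStatus() RPCs are issued. It reuses the BlockMetaInfo(String[], long) constructor and CarbonTablePath.isCarbonDataFile() from the snippet above, but deliberately omits the AbstractDFSCarbonFile/S3CarbonFile handling of the original method; the actual fix in the code base may be implemented differently.
> {code:java}
> public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
>     String segmentFilePath, Configuration configuration) throws IOException {
>   Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
>   Path segmentPath = new Path(segmentFilePath);
>   FileSystem fileSystem = segmentPath.getFileSystem(configuration);
>   // Single listing RPC: returns the FileStatus and BlockLocations of every file in the segment.
>   RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(segmentPath);
>   while (iter.hasNext()) {
>     LocatedFileStatus status = iter.next();
>     if (!CarbonTablePath.isCarbonDataFile(status.getPath().getName())) {
>       continue;
>     }
>     // Reuse the hosts and length already present in the listed status instead of
>     // calling getFileStatus()/getFileBlockLocations() again for each file.
>     Set<String> hosts = new LinkedHashSet<>();
>     for (BlockLocation blockLocation : status.getBlockLocations()) {
>       Collections.addAll(hosts, blockLocation.getHosts());
>     }
>     BlockMetaInfo blockMetaInfo =
>         new BlockMetaInfo(hosts.toArray(new String[0]), status.getLen());
>     fileNameToMetaInfoMapping.put(status.getPath().toString(), blockMetaInfo);
>   }
>   return fileNameToMetaInfoMapping;
> }
> {code}
> Whether the change is done this way or by handing the already-fetched FileStatus to the CarbonFile instances created by locationAwareListFiles(), the key point is the same: the statuses obtained from the single listing call are reused instead of being re-fetched once per file in getLocations(), getSize() and getPath().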