[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
shanyu zhao updated HADOOP-15320:
---------------------------------
    Attachment: HADOOP-15320.patch

> Remove customized getFileBlockLocations for hadoop-azure and
> hadoop-azure-datalake
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-15320
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15320
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/adl, fs/azure
>    Affects Versions: 2.7.3, 2.9.0, 3.0.0
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>         Attachments: HADOOP-15320.patch
>
>
> hadoop-azure and hadoop-azure-datalake each have their own implementation of
> getFileBlockLocations(), which fakes a list of artificial blocks based on a
> hard-coded block size, with each block reporting a single host named
> "localhost". Take a look at this code:
> [https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485]
> This is an unnecessary mock-up for a "remote" file system to mimic HDFS, and
> the problem with this mock is that for large (~TB) files it generates lots of
> artificial blocks, and FileInputFormat.getSplits() is slow in calculating
> splits from these blocks.
> We can safely remove this customized getFileBlockLocations() implementation
> and fall back to the default FileSystem.getFileBlockLocations()
> implementation, which returns 1 block for any file, with 1 host "localhost".
> Note that this doesn't mean we will create far fewer splits, because the
> number of splits is still limited by the blockSize in
> FileInputFormat.computeSplitSize():
> {code:java}
> return Math.max(minSize, Math.min(goalSize, blockSize));{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
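
A minimal standalone sketch of the reporter's argument, using only the
computeSplitSize() formula quoted in the issue (this is not the Hadoop
source; the class name, file size, and block size below are hypothetical
values chosen for illustration):

```java
// SplitSizeSketch.java -- illustrative only, not Hadoop code.
// Mirrors the formula quoted from FileInputFormat.computeSplitSize():
//   return Math.max(minSize, Math.min(goalSize, blockSize));
public class SplitSizeSketch {

    // Same shape as the quoted one-liner.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 512L * 1024 * 1024;          // hypothetical 512 MB block size
        long fileSize  = 1024L * 1024 * 1024 * 1024;  // hypothetical 1 TB file
        long goalSize  = fileSize / 100;              // ~10 GB goal per split
        long minSize   = 1;

        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division

        // goalSize (~10 GB) exceeds blockSize (512 MB), so the split size
        // is capped at blockSize. The split count therefore stays about
        // fileSize / blockSize even if getFileBlockLocations() reports the
        // whole file as a single block -- which is the point of the issue:
        // removing the fake per-block list does not change how the input
        // is split, only how fast getSplits() can compute the splits.
        System.out.println(splitSize + " " + numSplits);
    }
}
```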