[ https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411827#comment-16411827 ]

Chris Douglas commented on HADOOP-15320:
----------------------------------------

bq. I know s3 "appears" to work, but I'm not actually confident that everything 
is getting the splits right there.
Me neither, but 1.5h to generate synthetic splits is definitely wrong. If we 
develop a new best practice for object stores, then we can apply it across 
the stores we support. The {{BlockLocation[]}} return type is pretty 
restrictive, but we could probably do better.
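
A minimal sketch of that default single-block contract (not the actual 
{{FileSystem}} source; the name:port string is illustrative):
{code:java}
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

// One synthetic block spanning the whole file, "hosted" on localhost.
// Falling back to the base FileSystem behavior amounts to this.
static BlockLocation[] singleBlockLocations(FileStatus file) {
  String[] names = { "localhost:50010" }; // illustrative name:port
  String[] hosts = { "localhost" };       // locality hint only, not real
  return new BlockLocation[] {
      new BlockLocation(names, hosts, 0, file.getLen())
  };
}
{code}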

bq. The one I want you to look at is: Spark, CSV, multiGB: SPARK-22240. That's 
what's been niggling at me for a while.
Maybe I'm missing the bug. Block locations are hints for locality, not format 
partitioning. In that JIRA: gzip is not splittable, so a single reader is 
correct, absent some other preparation (saving the dictionary at offsets, 
writing zero-length gzip files as split markers, etc.). In general, framework 
parallelism should not rely exclusively on block locations...
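
To make the distinction concrete, here is roughly how the split decision is 
made from the codec rather than from block locations, along the lines of 
{{TextInputFormat#isSplitable}} (a sketch, not the exact source):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

// Uncompressed files can be split anywhere; compressed files only if the
// codec can seek to split points (bzip2 can, gzip cannot).
static boolean isSplittable(Configuration conf, Path file) {
  CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
  return codec == null || codec instanceof SplittableCompressionCodec;
}
{code}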

> Remove customized getFileBlockLocations for hadoop-azure and 
> hadoop-azure-datalake
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-15320
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15320
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/adl, fs/azure
>    Affects Versions: 2.7.3, 2.9.0, 3.0.0
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>         Attachments: HADOOP-15320.patch
>
>
> hadoop-azure and hadoop-azure-datalake each have their own implementation of 
> getFileBlockLocations(), which fakes a list of artificial blocks based on a 
> hard-coded block size, with every block reporting a single host named 
> "localhost". Take a look at this code:
> [https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485]
> This is an unnecessary mock-up for a "remote" file system to mimic HDFS. The 
> problem with this mock is that for large (~TB) files we generate lots of 
> artificial blocks, and FileInputFormat.getSplits() is slow in calculating 
> splits based on these blocks.
> We can safely remove this customized getFileBlockLocations() implementation 
> and fall back to the default FileSystem.getFileBlockLocations() 
> implementation, which returns one block for any file, with the single host 
> "localhost". Note that this doesn't mean we will create many fewer splits, 
> because the number of splits is still limited by the blockSize in 
> FileInputFormat.computeSplitSize():
> {code:java}
> return Math.max(minSize, Math.min(goalSize, blockSize));{code}
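> A rough worked example (assuming a 512MB blockSize for WASB; the exact 
> default is an assumption here, and minSize takes its default of 1):
> {code:java}
> // 1TB input, single-wave goalSize; numbers are illustrative only
> long blockSize = 512L << 20;                // assumed fs.azure.block.size
> long totalSize = 1L << 40;                  // 1TB file
> long goalSize  = totalSize;                 // goalSize >= blockSize
> long splitSize = Math.max(1L, Math.min(goalSize, blockSize)); // 512MB
> long numSplits = totalSize / splitSize;     // 2048 splits either way
> {code}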



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
