[ https://issues.apache.org/jira/browse/HADOOP-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365275#comment-15365275 ]
Ryan Blue edited comment on HADOOP-12878 at 7/6/16 10:55 PM:
-------------------------------------------------------------

FileInputFormat works slightly differently. First, the [split size|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L445] is calculated from the file's reported block size and the current min and max split sizes. Then, [the file is broken into N splits|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L410-416] of that size, where {{N = Math.ceil(fileLength / splitSize)}}. The block locations are then used to determine [where each split is located|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L448], based on the split's starting offset.

The result is that {{getFileBlockLocations}} can return a single location for the entire file and you'll still end up with N roughly block-sized splits. This is what enables you to get more parallelism by setting smaller split sizes, even if the resulting splits don't correspond to different blocks. In our environment, we use a 64MB S3 block size and don't see a bottleneck from one input split per file.

> Impersonate hosts in s3a for better data locality handling
> -----------------------------------------------------------
>
>                 Key: HADOOP-12878
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12878
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Thomas Demoor
>            Assignee: Thomas Demoor
>
> Currently, {{localhost}} is passed as the locality for each block, causing all blocks involved in a job to initially target the same node (the RM) before being moved by the scheduler (to a rack-local node). This reduces parallelism for jobs (with short-lived mappers).
> We should mimic Azure's implementation: a config setting {{fs.s3a.block.location.impersonatedhost}} where the user can enter the list of hostnames in the cluster to return from {{getFileBlockLocations}}.
> Possible optimization: for larger systems, it might be better to return N (5?) random hostnames to prevent passing a huge array (the downstream code assumes size = O(3)).
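A minimal, self-contained sketch of the split computation described in the comment above. The class and method names ({{SplitSketch}}, {{getSplits}}) are illustrative, not Hadoop's actual code; the clamp in {{computeSplitSize}} mirrors the formula in FileInputFormat, but the real {{getSplits}} also merges a small tail into the last split (the SPLIT_SLOP factor), which this sketch omits:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

  static class Split {
    final long start;
    final long length;
    Split(long start, long length) {
      this.start = start;
      this.length = length;
    }
  }

  // Clamp the file's reported block size between the configured min and
  // max split sizes (same formula as FileInputFormat.computeSplitSize).
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  // Break a file into N = Math.ceil(fileLength / splitSize) splits.
  // Block locations only influence where each split is scheduled,
  // not how many splits are produced.
  static List<Split> getSplits(long fileLength, long splitSize) {
    List<Split> splits = new ArrayList<>();
    for (long offset = 0; offset < fileLength; offset += splitSize) {
      splits.add(new Split(offset, Math.min(splitSize, fileLength - offset)));
    }
    return splits;
  }

  public static void main(String[] args) {
    // Example: a 256MB file with a 64MB reported block size and wide-open
    // min/max split sizes -> 4 splits, no matter how many block locations
    // getFileBlockLocations reported for the file.
    long splitSize = computeSplitSize(64L << 20, 1L, Long.MAX_VALUE);
    List<Split> splits = getSplits(256L << 20, splitSize);
    System.out.println(splits.size() + " splits of " + splitSize + " bytes");
  }
}
{code}

Lowering the max split size in the example above yields proportionally more splits from the same single-location file, which is the extra parallelism the comment describes.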
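And a hedged sketch of the proposal in the issue description: return a small random subset of user-configured hostnames instead of {{localhost}}. The config key {{fs.s3a.block.location.impersonatedhost}} and the random-subset optimization come from the description itself; the class and helper names here are hypothetical, and this is not an existing s3a feature:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ImpersonatedHosts {

  // configuredHosts is assumed to be the parsed value of the proposed
  // fs.s3a.block.location.impersonatedhost setting (a user-supplied
  // list of cluster hostnames).
  // Return at most n random hosts so the array handed back through
  // getFileBlockLocations stays small; per the description, downstream
  // code assumes a size of roughly 3.
  static String[] pickHosts(String[] configuredHosts, int n) {
    List<String> hosts = new ArrayList<>();
    Collections.addAll(hosts, configuredHosts);
    Collections.shuffle(hosts);
    return hosts.subList(0, Math.min(n, hosts.size())).toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] cluster = {"node1", "node2", "node3", "node4", "node5", "node6"};
    // Each block would report a different random trio of hosts, spreading
    // the initial scheduling targets across the cluster instead of
    // funneling every block to the RM node.
    System.out.println(String.join(",", pickHosts(cluster, 3)));
  }
}
{code}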