[ 
https://issues.apache.org/jira/browse/HADOOP-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770584#comment-13770584
 ] 

Steve Loughran commented on HADOOP-9978:
----------------------------------------

There is already a seek(), so more than one mapper can read different parts of 
the same S3 file using offsets, after the initial GET that reads in the file 
header. That file header is still needed to
# determine the length of the blob
# meet the standard expectation "open() fails if the file isn't there"

Were it not for #2, we could delay the open until the first read and so save 
one round trip (more relevant long-haul than in-EC2), but people don't expect 
that. 
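
To make the "using offsets" part concrete, here is a minimal sketch of how a seek to an offset can be expressed as an HTTP byte-range request. The class and method names are hypothetical, and the mapping from seek() to a ranged GET is an assumption about how such a client could work, not a description of the s3n code itself. The only firm fact used is the HTTP {{Range: bytes=first-last}} syntax, whose end position is inclusive.

```java
public class RangeHeaderSketch {

    /** Builds a Range header value for `length` bytes starting at `offset`. */
    public static String rangeFor(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        // HTTP byte ranges are inclusive at both ends, hence the -1.
        long lastByte = offset + length - 1;
        return "bytes=" + offset + "-" + lastByte;
    }

    /** Builds an open-ended Range header value: everything from `offset` on. */
    public static String rangeFrom(long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be >= 0");
        }
        return "bytes=" + offset + "-";
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // 64 MB, illustrative value
        // Second 64 MB block of a blob:
        System.out.println("Range: " + rangeFor(block, block));
        // Everything after a seek to the start of the second block:
        System.out.println("Range: " + rangeFrom(block));
    }
}
```

Each mapper would issue one such request for its own part of the blob, after the initial request that establishes the blob's length and existence.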

What S3n does do is pretend that there is a block size for the data, so that 
the splitter can split up a file by blocks, handing each block off to a 
different mapper. You can configure this with {{"fs.s3n.block.size"}}; it 
defaults to 64 MB, but you are free to make it smaller or larger.
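
As a rough illustration of what splitting by a pretend block size means, the sketch below cuts a file length into (offset, length) pairs, one per block. It is a deliberate simplification: the class and method names are hypothetical, and the real MapReduce splitter applies extra rules (minimum and maximum split sizes, for example) that are left out here.

```java
public class S3nSplitSketch {

    /**
     * Returns {offset, length} pairs covering a file of fileLength bytes,
     * one pair per block of at most blockSize bytes.
     */
    public static long[][] computeSplits(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        // Round up so a trailing partial block gets its own split.
        int count = (int) ((fileLength + blockSize - 1) / blockSize);
        long[][] splits = new long[count][2];
        for (int i = 0; i < count; i++) {
            long offset = i * blockSize;
            splits[i][0] = offset;
            splits[i][1] = Math.min(blockSize, fileLength - offset);
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 4L * 1024 * 1024 * 1024; // 4 GB
        long blockSize = 64L * 1024 * 1024;        // the 64 MB default
        long[][] splits = computeSplits(fileLength, blockSize);
        System.out.println(splits.length + " splits");
        System.out.println("last split: offset=" + splits[splits.length - 1][0]
                + " length=" + splits[splits.length - 1][1]);
    }
}
```

With the 64 MB default, a 4 GB file yields 64 splits; a smaller block size gives more, smaller splits and so more mappers.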

Even if you run 60 mappers against a 4GB file, the bandwidth you will get off 
an S3 blob won't be 60x that of a single mapper. S3 doesn't do replication the 
way HDFS does, where the bandwidth is O(blocks*3). For S3 it is O(1). That 
means you won't get any speedup in the map phase, though a different number of 
mappers may make things better or worse at reduce time.
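
The O(1) argument can be checked with a toy calculation. The sketch below is a simplified model, not a measurement: it assumes the blob delivers one fixed aggregate bandwidth shared evenly by all mappers, and compares that with a source whose bandwidth grows with the number of readers. All names and the 100 MB/s figure are hypothetical.

```java
public class MapPhaseBandwidthSketch {

    /** Read time when all mappers share one fixed aggregate bandwidth. */
    public static double sharedSeconds(long fileBytes, double aggregateBytesPerSec, int mappers) {
        double perMapperBandwidth = aggregateBytesPerSec / mappers;
        double perMapperBytes = (double) fileBytes / mappers;
        return perMapperBytes / perMapperBandwidth;
    }

    /** Read time when every mapper gets its own full-rate source. */
    public static double scalingSeconds(long fileBytes, double perSourceBytesPerSec, int mappers) {
        double perMapperBytes = (double) fileBytes / mappers;
        return perMapperBytes / perSourceBytesPerSec;
    }

    public static void main(String[] args) {
        long fileBytes = 4L * 1024 * 1024 * 1024; // 4 GB
        double bandwidth = 100e6;                 // hypothetical 100 MB/s
        for (int mappers : new int[] {1, 60}) {
            System.out.printf("%d mappers: shared=%.1fs scaling=%.1fs%n",
                    mappers,
                    sharedSeconds(fileBytes, bandwidth, mappers),
                    scalingSeconds(fileBytes, bandwidth, mappers));
        }
    }
}
```

In the shared model the mapper count cancels out, so the read time is the same for 1 and for 60 mappers; only the scaling model gets faster as mappers are added.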


                
> Support range reads in s3n interface to split objects for mappers to read
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-9978
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9978
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Amandeep Khurana
>

