https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780896#comment-16780896

Justin Uang commented on HADOOP-16132:
--------------------------------------

[~ste...@apache.org]

The billing differences are good to know. I'll have to check against our usage, 
but I'm pretty sure the billing difference is small for us, since GET requests 
cost only $0.0004 per 1,000 ([https://aws.amazon.com/s3/pricing/]); I think our 
main costs are in storage. Regarding throttling, assuming this is for 
sequential reads, we would only issue one request per part, and at an 8MB part 
size I imagine that's far less frequent than heavy random IO.
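
To sanity-check the claim that request costs stay negligible, here's a 
back-of-the-envelope calculation (my own illustrative numbers, assuming the GET 
price above and the 8MB part size):

{code:java}
// Illustrative only: request cost of one sequential read at 8MB parts,
// priced at $0.0004 per 1,000 GETs as quoted above. The file size is made up.
public class RequestCostEstimate {
    public static void main(String[] args) {
        long fileBytes = 100L * 1024 * 1024 * 1024;              // a 100GB dataset
        long partBytes = 8L * 1024 * 1024;                       // 8MB part size
        long requests = (fileBytes + partBytes - 1) / partBytes; // 12,800 ranged GETs
        double dollars = requests / 1000.0 * 0.0004;             // ~$0.005
        System.out.printf("%d requests, ~$%.5f%n", requests, dollars);
    }
}
{code}

So even a 100GB sequential read is roughly half a cent in request charges, 
which is consistent with storage dominating our bill.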

That's interesting about random IO. I do think it would be hard to implement 
this for random IO, since a wrong readahead guess can be quite expensive when 
the blocks are that large. It's a lot easier to guess what needs to be read 
with sequential IO.

I do want to make sure I'm on the same page as you regarding what constitutes 
sequential IO. I view Parquet as mostly sequential IO because, going by 
[^seek-logs-parquet.txt], we seek a few times for the footer (hundreds of 
bytes), but afterwards we do a straight read of several hundred MBs. Is my 
understanding the same as yours?
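
To make that concrete, here's roughly what that access pattern looks like 
against the Hadoop FileSystem API (a sketch only; the path and buffer sizes 
are invented, not taken from the logs):

{code:java}
// A sketch of the pattern in seek-logs-parquet.txt: a few small positioned
// reads for the footer at the end of the file, then one long contiguous scan.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MostlySequentialRead {
    public static void main(String[] args) throws Exception {
        Path path = new Path("s3a://bucket/data.parquet"); // hypothetical file
        FileSystem fs = path.getFileSystem(new Configuration());
        long len = fs.getFileStatus(path).getLen();
        try (FSDataInputStream in = fs.open(path)) {
            // Random part: a few hundred bytes of footer near the end.
            byte[] footer = new byte[1024];
            in.readFully(len - footer.length, footer);
            // Sequential part: a straight read over the row groups,
            // several hundred MBs in one pass.
            byte[] buf = new byte[8 * 1024 * 1024]; // 8MB, one part per read
            in.seek(0);
            while (in.read(buf) > 0) {
                // hand the bytes to the record reader...
            }
        }
    }
}
{code}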

I also posted a patch! I'm still getting familiar with the process, so any 
feedback on how to push this forward would be great!
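
For anyone who wants the shape of the change without reading the patch, below 
is a minimal sketch of the approach described in this issue: fetch the next 
few parts in parallel with range requests, then reassemble them in order into 
one contiguous stream. It is not the actual patch; fetchRange() is a 
hypothetical stand-in for an S3 ranged GET:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayDeque;
import java.util.Enumeration;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultipartDownloadSketch {
    static final long PART_SIZE = 8L * 1024 * 1024; // 8MB parts
    static final int PARALLELISM = 4;               // parts in flight at once

    /** Hypothetical stand-in: ranged GET for [start, end), returning the bytes. */
    static byte[] fetchRange(String key, long start, long end) {
        throw new UnsupportedOperationException("replace with a real S3 ranged GET");
    }

    static InputStream openContiguous(String key, long fileLength) {
        ExecutorService pool = Executors.newFixedThreadPool(PARALLELISM);
        Queue<Future<byte[]>> inFlight = new ArrayDeque<>();
        // Submit one ranged GET per part; the bounded pool keeps only a few
        // requests running at once, and the queue preserves part order.
        for (long off = 0; off < fileLength; off += PART_SIZE) {
            final long start = off;
            final long end = Math.min(off + PART_SIZE, fileLength);
            inFlight.add(pool.submit(() -> fetchRange(key, start, end)));
        }
        pool.shutdown();
        // Consume futures in submission order: get() blocks until that part
        // has arrived, so the caller sees a single contiguous stream.
        return new SequenceInputStream(new Enumeration<InputStream>() {
            public boolean hasMoreElements() { return !inFlight.isEmpty(); }
            public InputStream nextElement() {
                try {
                    return new ByteArrayInputStream(inFlight.remove().get());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
    }
}
{code}

One thing the sketch elides: if the consumer is slow, completed parts pile up 
in memory, so the real implementation also needs to bound how many buffered 
parts it holds at once.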

> Support multipart download in S3AFileSystem
> -------------------------------------------
>
>                 Key: HADOOP-16132
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16132
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Justin Uang
>            Priority: Major
>         Attachments: HADOOP-16132.001.patch, seek-logs-parquet.txt
>
>
> I noticed that I get 150MB/s when I use the AWS CLI
> {code:bash}
> aws s3 cp s3://<bucket>/<key> - > /dev/null
> {code}
> vs 50MB/s when I use the S3AFileSystem
> {code:bash}
> hadoop fs -cat s3://<bucket>/<key> > /dev/null
> {code}
> Looking into the AWS CLI code, it looks like the 
> [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
>  logic is quite clever. It downloads the next couple parts in parallel using 
> range requests, and then buffers them in memory in order to reorder them and 
> expose a single contiguous stream. I translated the logic to Java and 
> modified the S3AFileSystem to do similar things, and am able to achieve 
> 150MB/s download speeds as well. It is mostly done but I have some things to 
> clean up first. The PR is here: 
> https://github.com/palantir/hadoop/pull/47/files
> It would be great to get some other eyes on it to see what we need to do to 
> get it merged.


