[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-03-26 Thread Justin Uang (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802119#comment-16802119
 ] 

Justin Uang commented on HADOOP-16132:
--

[~gabor.bota], I just rebased and pushed the new change here: 
[https://github.com/apache/hadoop/pull/645]. I would really appreciate your 
comments!

> Support multipart download in S3AFileSystem
> ---
>
> Key: HADOOP-16132
> URL: https://issues.apache.org/jira/browse/HADOOP-16132
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3
>Reporter: Justin Uang
>Priority: Major
> Attachments: HADOOP-16132.001.patch, HADOOP-16132.002.patch, 
> HADOOP-16132.003.patch, HADOOP-16132.004.patch, HADOOP-16132.005.patch, 
> seek-logs-parquet.txt
>
>
> I noticed that I get 150MB/s when I use the AWS CLI
> {code:java}
> aws s3 cp s3://<bucket>/<key> - > /dev/null{code}
> vs 50MB/s when I use the S3AFileSystem
> {code:java}
> hadoop fs -cat s3://<bucket>/<key> > /dev/null{code}
> Looking into the AWS CLI code, it looks like the 
> [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
> logic is quite clever. It downloads the next couple of parts in parallel 
> using range requests, then buffers them in memory to reorder them and expose 
> a single contiguous stream. I translated the logic to Java and modified the 
> S3AFileSystem to do something similar, and am able to achieve
> 150MB/s download speeds as well. It is mostly done but I have some things to 
> clean up first. The PR is here: 
> https://github.com/palantir/hadoop/pull/47/files
> It would be great to get some other eyes on it to see what we need to do to 
> get it merged.
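
For readers skimming the thread, here is a minimal, self-contained sketch of the 
strategy the description outlines: fetch fixed-size parts in parallel with range 
requests, then surface them strictly in order as one contiguous stream. This is 
not the patch itself; RangeReader, ParallelPartInputStream, and all parameters 
are hypothetical names for illustration (the real code is in the PRs linked in 
this thread).
{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.*;

public class ParallelPartInputStream extends InputStream {
  /** Hypothetical stand-in for an S3 "GET with Range: bytes=off..off+len-1" call. */
  public interface RangeReader {
    byte[] read(long offset, int length) throws IOException;
  }

  // Parts are consumed in submission order, which restores the original byte order.
  private final BlockingQueue<Future<byte[]>> pending = new LinkedBlockingQueue<>();
  private ByteArrayInputStream current = new ByteArrayInputStream(new byte[0]);

  public ParallelPartInputStream(RangeReader reader, long fileLength,
                                 int partSize, int parallelism) {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    // The bounded pool caps how many range GETs are in flight at once; a real
    // implementation would also bound how many completed parts it buffers.
    for (long off = 0; off < fileLength; off += partSize) {
      final long o = off;
      final int len = (int) Math.min(partSize, fileLength - off);
      pending.add(pool.submit(() -> reader.read(o, len)));
    }
    pool.shutdown();
  }

  @Override
  public int read() throws IOException {
    int b = current.read();
    if (b >= 0) {
      return b;
    }
    Future<byte[]> next = pending.poll();
    if (next == null) {
      return -1; // all parts consumed: EOF
    }
    try {
      current = new ByteArrayInputStream(next.get()); // waits for this part to land
    } catch (InterruptedException | ExecutionException e) {
      throw new IOException("part download failed", e);
    }
    return read();
  }
}
{code}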






[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-03-26 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Status: Open  (was: Patch Available)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-03-01 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Attachment: HADOOP-16132.005.patch
Status: Patch Available  (was: Open)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-03-01 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Status: Open  (was: Patch Available)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-03-01 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Attachment: HADOOP-16132.004.patch
Status: Patch Available  (was: Open)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-03-01 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Status: Open  (was: Patch Available)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Attachment: HADOOP-16132.003.patch
Status: Patch Available  (was: Open)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Status: Open  (was: Patch Available)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Status: Open  (was: Patch Available)




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Attachment: HADOOP-16132.002.patch
Status: Patch Available  (was: Open)




[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780896#comment-16780896
 ] 

Justin Uang commented on HADOOP-16132:
--

[~ste...@apache.org]

The billing differences are good to know. I'll have to check our usage, but I'm 
pretty sure the billing difference is small for us, since it costs only $0.0004 
per 1,000 requests ([https://aws.amazon.com/s3/pricing/]). I think our main 
costs are in storage. Regarding the throttling: assuming this is for sequential 
reads, we would only issue one request per 8MB part, which I imagine is less 
frequent than heavy random IO.
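
As a back-of-the-envelope check (my own numbers, using the list prices at the 
time): scanning 1TB sequentially in 8MB parts is 1,048,576MB / 8MB = 131,072 GET 
requests, i.e. roughly 131 x $0.0004 ≈ $0.05 of request charges per terabyte 
read, whereas just storing that terabyte for a month is around $23 at $0.023/GB. 
So for sequential reads the extra GETs look negligible next to storage.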

That's interesting about random IO. I do think it would be hard to implement 
this for random IO, given that guessing the wrong readahead can be quite 
expensive when the blocks are that large. It's a lot easier to predict what 
needs to be read with sequential IO.

I also want to make sure we're on the same page about what constitutes 
sequential IO. I view Parquet as mostly sequential IO: from the perspective of 
[^seek-logs-parquet.txt], we seek a few times for the footer (hundreds of 
bytes), but afterwards we do a straight read of several hundred MB. Does that 
match your understanding?

I also posted a patch! I'm still getting familiar with the process, but any 
feedback on how to push this forward would be great!




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Attachment: seek-logs-parquet.txt




[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780886#comment-16780886
 ] 

Justin Uang commented on HADOOP-16132:
--

Copying over the last comment from the GitHub ticket, since we will be 
continuing the conversation here:

[~ste...@apache.org]
{quote}BTW, one little side effect of breaking up the reads: every GET is its 
own HTTP request, so gets billed differently, and for SSE-KMS, possibly a 
separate call to AWS:KMS. Nobody quite knows about the latter, we do know that 
heavy random seek IO on a single tree in a bucket can trigger more throttling 
than you'd expect

Anyway, maybe for random IO the strategy would be to have a notion of aligned 
blocks, say 8 MB, the current block is cached as it is read in, so a backward 
seek can often work from memory; the stream could be doing a readahead of, 
say, the next 2+ blocks in parallel & then store them in a ring of cached 
blocks ready for when they are used.

you've got me thinking now...
{quote}
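
To make the quoted idea concrete, here is a rough sketch of such an 
aligned-block ring cache (all names hypothetical, lifecycle and error handling 
elided; an illustration of the shape, not a proposed implementation):
{code:java}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockRingCache {
  /** Hypothetical range GET for one aligned block. */
  public interface BlockFetcher {
    byte[] fetch(long blockIndex) throws Exception;
  }

  private final BlockFetcher fetcher;
  private final int blockSize;   // e.g. 8 MB aligned blocks
  private final int readahead;   // how many blocks ahead to prefetch
  private final ExecutorService pool = Executors.newFixedThreadPool(2);
  // Insertion-ordered ring of fetched/fetching blocks, keyed by block index.
  private final Map<Long, Future<byte[]>> ring = new LinkedHashMap<>();

  public BlockRingCache(BlockFetcher fetcher, int blockSize, int readahead) {
    this.fetcher = fetcher;
    this.blockSize = blockSize;
    this.readahead = readahead;
  }

  /** Byte at an absolute offset; backward seeks into cached blocks stay in memory. */
  public synchronized int byteAt(long offset) throws Exception {
    long block = offset / blockSize;
    // Kick off the block we need plus the next few, if not already in the ring.
    for (long b = block; b <= block + readahead; b++) {
      ring.computeIfAbsent(b, idx -> pool.submit(() -> fetcher.fetch(idx)));
    }
    // Evict the oldest entries once the ring grows past its budget.
    Iterator<Long> oldest = ring.keySet().iterator();
    while (ring.size() > 2 * (readahead + 1) && oldest.hasNext()) {
      oldest.next();
      oldest.remove();
    }
    byte[] data = ring.get(block).get(); // waits only if the GET is still in flight
    return data[(int) (offset % blockSize)] & 0xff;
  }
}
{code}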
 




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-28 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Attachment: HADOOP-16132.001.patch
Status: Patch Available  (was: Open)




[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-21 Thread Justin Uang (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774558#comment-16774558
 ] 

Justin Uang commented on HADOOP-16132:
--

Great! I'll make the changes!




[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-21 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Description: 
I noticed that I get 150MB/s when I use the AWS CLI
{code:java}
aws s3 cp s3://<bucket>/<key> - > /dev/null{code}
vs 50MB/s when I use the S3AFileSystem
{code:java}
hadoop fs -cat s3://<bucket>/<key> > /dev/null{code}
Looking into the AWS CLI code, it looks like the 
[download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
 logic is quite clever. It downloads the next couple parts in parallel using 
range requests, and then buffers them in memory in order to reorder them and 
expose a single contiguous stream. I translated the logic to Java and modified 
the S3AFileSystem to do similar things, and am able to achieve 150MB/s download 
speeds as well. It is mostly done but I have some things to clean up first. The 
PR is here: https://github.com/palantir/hadoop/pull/47/files

It would be great to get some other eyes on it to see what we need to do to get 
it merged.

  was:
I noticed that I get 150MB/s when I use the AWS CLI
{code:java}
aws s3 cp s3://<bucket>/<key> - > /dev/null{code}
vs 50MB/s when I use the S3AFileSystem
{code:java}
hadoop fs -cat s3://<bucket>/<key> > /dev/null{code}
Looking into the AWS CLI code, it looks like the 
[download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
 logic is quite clever. It downloads the next couple parts in parallel using 
range requests, and then buffers them in memory in order to reorder them and 
expose a single contiguous stream. I translated the logic to Java and modified 
the S3AFileSystem to do similar things, and am able to achieve 150MB/s download 
speeds as well. It is mostly done but I have some things to clean up first.

It would be great to get some other eyes on it to see what we need to do to get 
it merged.





[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-21 Thread Justin Uang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Uang updated HADOOP-16132:
-
Description: 
I noticed that I get 150MB/s when I use the AWS CLI
{code:java}
aws s3 cp s3://<bucket>/<key> - > /dev/null{code}
vs 50MB/s when I use the S3AFileSystem
{code:java}
hadoop fs -cat s3://<bucket>/<key> > /dev/null{code}
Looking into the AWS CLI code, it looks like the 
[download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
 logic is quite clever. It downloads the next couple parts in parallel using 
range requests, and then buffers them in memory in order to reorder them and 
expose a single contiguous stream. I translated the logic to Java and modified 
the S3AFileSystem to do similar things, and am able to achieve 150MB/s download 
speeds as well. It is mostly done but I have some things to clean up first.

It would be great to get some other eyes on it to see what we need to do to get 
it merged.

  was:
I noticed that I get 150MB/s when I use the aws CLI
{code:java}
aws s3 cp s3://<bucket>/<key> - > /dev/null{code}
 

vs 50MB/s when I use the S3AFileSystem
{code:java}
hadoop fs -cat s3://<bucket>/<key> > /dev/null{code}
Looking into the AWS CLI code, it looks like the 
[download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
 logic is quite clever. It downloads the next couple parts in parallel using 
range requests, and then buffers them in memory in order to reorder them and 
expose a single contiguous stream. I translated the logic to Java and modified 
the S3AFileSystem to do similar things, and am able to achieve 150MB/s download 
speeds as well. It is mostly done but I have some things to clean up first.

It would be great to get some other eyes on it to see what we need to do to get 
it merged.





[jira] [Created] (HADOOP-16132) Support multipart download in S3AFileSystem

2019-02-21 Thread Justin Uang (JIRA)
Justin Uang created HADOOP-16132:


 Summary: Support multipart download in S3AFileSystem
 Key: HADOOP-16132
 URL: https://issues.apache.org/jira/browse/HADOOP-16132
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Justin Uang


I noticed that I get 150MB/s when I use the aws CLI
{code:java}
aws s3 cp s3://<bucket>/<key> - > /dev/null{code}
 

vs 50MB/s when I use the S3AFileSystem
{code:java}
hadoop fs -cat s3://<bucket>/<key> > /dev/null{code}
Looking into the AWS CLI code, it looks like the 
[download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
 logic is quite clever. It downloads the next couple parts in parallel using 
range requests, and then buffers them in memory in order to reorder them and 
expose a single contiguous stream. I translated the logic to Java and modified 
the S3AFileSystem to do similar things, and am able to achieve 150MB/s download 
speeds as well. It is mostly done but I have some things to clean up first.

It would be great to get some other eyes on it to see what we need to do to get 
it merged.






[jira] [Created] (HADOOP-16050) Support setting cipher suites for s3a file system

2019-01-17 Thread Justin Uang (JIRA)
Justin Uang created HADOOP-16050:


 Summary: Support setting cipher suites for s3a file system
 Key: HADOOP-16050
 URL: https://issues.apache.org/jira/browse/HADOOP-16050
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 2.9.1
Reporter: Justin Uang
 Attachments: Screen Shot 2019-01-17 at 2.57.06 PM.png

We have found that when running the S3AFileSystem, it picks GCM as the SSL 
cipher suite. Unfortunately, GCM is well known to be slow on Java 8: 
[https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20]

In practice we have seen it take well over 50% of our CPU time in Spark 
workflows. We should add an option to set the list of cipher suites we 
would like to use. !Screen Shot 2019-01-17 at 2.57.06 PM.png!
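
A sketch of what such an option might look like when building the S3 client 
(the wiring and class name here are hypothetical, not a committed API; the 
AWS SDK's Apache HTTP client hook and httpclient's SSLConnectionSocketFactory 
do the actual work):
{code:java}
import javax.net.ssl.SSLContext;

import com.amazonaws.ClientConfiguration;
import org.apache.http.conn.ssl.DefaultHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;

public class CipherSuiteOption {
  /**
   * Build a ClientConfiguration restricted to the given cipher suites, e.g. a
   * CBC suite to sidestep the slow Java 8 GCM implementation:
   *   withCipherSuites(new String[] {"TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256"})
   */
  public static ClientConfiguration withCipherSuites(String[] suites) throws Exception {
    SSLConnectionSocketFactory factory = new SSLConnectionSocketFactory(
        SSLContext.getDefault(),
        null,          // keep the default enabled protocols
        suites,        // only these cipher suites will be offered
        new DefaultHostnameVerifier());
    ClientConfiguration conf = new ClientConfiguration();
    conf.getApacheHttpClientConfig().setSslSocketFactory(factory);
    return conf; // pass to AmazonS3ClientBuilder.withClientConfiguration(...)
  }
}
{code}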



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org