[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802119#comment-16802119 ]

Justin Uang commented on HADOOP-16132:
--------------------------------------

[~gabor.bota], I just rebased it and pushed the new change here: https://github.com/apache/hadoop/pull/645 . I would really appreciate your comments!

> Support multipart download in S3AFileSystem
> -------------------------------------------
>
>                 Key: HADOOP-16132
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16132
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Justin Uang
>            Priority: Major
>         Attachments: HADOOP-16132.001.patch, HADOOP-16132.002.patch,
>                      HADOOP-16132.003.patch, HADOOP-16132.004.patch,
>                      HADOOP-16132.005.patch, seek-logs-parquet.txt
>
> I noticed that I get 150MB/s when I use the AWS CLI
> {code:java}
> aws s3 cp s3:/// - > /dev/null{code}
> vs 50MB/s when I use the S3AFileSystem
> {code:java}
> hadoop fs -cat s3:/// > /dev/null{code}
> Looking into the AWS CLI code, it looks like the
> [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py]
> logic is quite clever. It downloads the next couple of parts in parallel using
> range requests, and then buffers them in memory in order to reorder them and
> expose a single contiguous stream. I translated the logic to Java and
> modified the S3AFileSystem to do similar things, and am able to achieve
> 150MB/s download speeds as well. It is mostly done, but I have some things to
> clean up first. The PR is here:
> https://github.com/palantir/hadoop/pull/47/files
> It would be great to get some other eyes on it to see what we need to do to
> get it merged.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
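The download strategy described in the issue (fetch the next few parts in parallel with ranged GETs, buffer them, and reassemble a single contiguous stream) can be sketched roughly as follows. This is an illustrative Python sketch against an in-memory blob in the spirit of the boto s3transfer logic, not the actual Java patch; `fetch_range` stands in for an S3 ranged GET.

```python
import concurrent.futures

PART_SIZE = 8 * 1024 * 1024  # 8 MB, the part size discussed on this ticket

def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    # Stand-in for a ranged GET ("Range: bytes=start-(end-1)").
    return blob[start:end]

def multipart_read(blob: bytes, part_size: int = PART_SIZE, workers: int = 4):
    """Fetch parts in parallel, but yield them strictly in order."""
    size = len(blob)
    ranges = [(off, min(off + part_size, size))
              for off in range(0, size, part_size)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # Submitting in offset order and iterating the futures in the same
        # order yields a single contiguous stream even though the fetches
        # overlap in time; completed-but-not-yet-consumed parts sit buffered
        # in memory inside their futures.
        futures = [pool.submit(fetch_range, blob, s, e) for s, e in ranges]
        for f in futures:
            yield f.result()

data = bytes(range(256)) * 1000
assert b"".join(multipart_read(data, part_size=4096)) == data
```

The key design point mirrored here is that parallelism is bounded by the worker pool, so at most a few parts ahead of the consumer are in flight or buffered at once.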
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Attachment: HADOOP-16132.005.patch
        Status: Patch Available  (was: Open)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Attachment: HADOOP-16132.004.patch
        Status: Patch Available  (was: Open)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Attachment: HADOOP-16132.003.patch
        Status: Patch Available  (was: Open)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Attachment: HADOOP-16132.002.patch
        Status: Patch Available  (was: Open)
[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780896#comment-16780896 ]

Justin Uang commented on HADOOP-16132:
--------------------------------------

[~ste...@apache.org] The billing differences are good to know. I'll have to check our usage, but I'm pretty sure the billing difference is small for us, since it costs only $0.0004 per 1,000 requests (https://aws.amazon.com/s3/pricing/). I think our main costs are in storage.

Regarding the throttling, assuming this is for sequential reads, we would only be issuing one request per part size (8MB), which I imagine is less frequent than heavy random IO.

That's interesting about random IO. I do think it would be hard to implement this for random IO, given that the cost of guessing the wrong readahead can be quite expensive when the blocks are that large. It's a lot easier to guess what needs to be read in sequential IO.

I do want to make sure I'm on the same page as you regarding what constitutes sequential IO. I view Parquet as mostly sequential IO because, from the perspective of [^seek-logs-parquet.txt], we seek a few times for the footer (hundreds of bytes), but afterwards we do a straight read of several hundred MBs. Is my understanding the same as yours?

I also posted a patch! I'm still getting familiar with the process, but any feedback on how to push this forward would be great!
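As a rough back-of-envelope check on the claim above that the billing difference is small, assuming the quoted $0.0004-per-1,000-GETs figure and the 8 MB part size (illustrative arithmetic only, not official pricing):

```python
# Estimate the GET-request cost of a sequential multipart download,
# using the $0.0004 per 1,000 requests figure quoted in the comment.
PART_SIZE_MB = 8
COST_PER_1000_GETS = 0.0004

def get_cost(file_size_mb: float) -> float:
    requests = -(-file_size_mb // PART_SIZE_MB)  # ceiling division
    return requests / 1000 * COST_PER_1000_GETS

# A 1 GB file takes 128 ranged GETs, on the order of $0.00005,
# so per-request cost is negligible next to storage.
print(get_cost(1024))
```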
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Attachment: seek-logs-parquet.txt
[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780886#comment-16780886 ]

Justin Uang commented on HADOOP-16132:
--------------------------------------

Copying over the last comment from the GitHub ticket, since we will be continuing the conversation here:

[~ste...@apache.org]
{quote}BTW, one little side effect of breaking up the reads: every GET is its own HTTP request, so gets billed differently, and for SSE-KMS, possibly a separate call to AWS:KMS. Nobody quite knows about the latter; we do know that heavy random-seek IO on a single tree in a bucket can trigger more throttling than you'd expect.

Anyway, maybe for random IO the strategy would be to have a notion of aligned blocks, say 8 MB. The current block is cached as it is read in, so a backward seek can often work from memory; the stream could be doing a readahead of, say, the next 2+ blocks in parallel & then store them in a ring of cached blocks ready for when they are used.

you've got me thinking now...
{quote}
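The "ring of cached blocks" idea quoted above could be sketched along these lines. This is a hypothetical Python illustration, not the HADOOP-16132 patch: `fetch` stands in for a ranged GET, and `BlockReadaheadStream` is an invented name; the point is aligned blocks, a bounded readahead window, and eviction of blocks behind the read position.

```python
import concurrent.futures

class BlockReadaheadStream:
    """Reads are served from aligned blocks; the current block plus the next
    few are fetched ahead of time and kept in a small cache (the "ring")."""

    def __init__(self, fetch, size, block_size=8 * 1024 * 1024, readahead=2):
        self.fetch = fetch          # fetch(start, end) -> bytes (a ranged GET)
        self.size = size
        self.block_size = block_size
        self.readahead = readahead
        self.pos = 0
        self.cache = {}             # block index -> Future[bytes]
        self.pool = concurrent.futures.ThreadPoolExecutor(
            max_workers=readahead + 1)

    def _ensure(self, block):
        # Start fetching a block if it is not already in flight or cached.
        if block not in self.cache:
            start = block * self.block_size
            end = min(start + self.block_size, self.size)
            self.cache[block] = self.pool.submit(self.fetch, start, end)

    def seek(self, pos):
        # A backward seek into a still-cached block is served from memory.
        self.pos = pos

    def read(self, n):
        out = bytearray()
        while n > 0 and self.pos < self.size:
            block, off = divmod(self.pos, self.block_size)
            # Kick off the current block plus the readahead window...
            for b in range(block, block + 1 + self.readahead):
                if b * self.block_size < self.size:
                    self._ensure(b)
            # ...and evict blocks that fell behind the read position.
            for b in list(self.cache):
                if b < block:
                    del self.cache[b]
            data = self.cache[block].result()
            chunk = data[off:off + n]
            out += chunk
            self.pos += len(chunk)
            n -= len(chunk)
        return bytes(out)
```

A usage sketch: wrapping an in-memory blob, `BlockReadaheadStream(lambda a, b: blob[a:b], len(blob), block_size=4096)` streams sequential reads while overlapping the next two block fetches, and short backward seeks within the cached window avoid a new GET.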
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Attachment: HADOOP-16132.001.patch
        Status: Patch Available  (was: Open)
[jira] [Commented] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774558#comment-16774558 ]

Justin Uang commented on HADOOP-16132:
--------------------------------------

Great! I'll make the changes!
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Uang updated HADOOP-16132:
---------------------------------
    Description: (edited to add the PR link: https://github.com/palantir/hadoop/pull/47/files; otherwise unchanged)
[jira] [Updated] (HADOOP-16132) Support multipart download in S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justin Uang updated HADOOP-16132: - Description: I noticed that I get 150MB/s when I use the AWS CLI {code:java} aws s3 cp s3:/// - > /dev/null{code} vs 50MB/s when I use the S3AFileSystem {code:java} hadoop fs -cat s3:/// > /dev/null{code} Looking into the AWS CLI code, it looks like the [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py] logic is quite clever. It downloads the next couple parts in parallel using range requests, and then buffers them in memory in order to reorder them and expose a single contiguous stream. I translated the logic to Java and modified the S3AFileSystem to do similar things, and am able to achieve 150MB/s download speeds as well. It is mostly done but I have some things to clean up first. It would be great to get some other eyes on it to see what we need to do to get it merged. was: I noticed that I get 150MB/s when I use the aws CLI {code:java} aws s3 cp s3:/// - > /dev/null{code} vs 50MB/s when I use the S3AFileSystem {code:java} hadoop fs -cat s3:/// > /dev/null{code} Looking into the AWS CLI code, it looks like the [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py] logic is quite clever. It downloads the next couple parts in parallel using range requests, and then buffers them in memory in order to reorder them and expose a single contiguous stream. I translated the logic to Java and modified the S3AFileSystem to do similar things, and am able to achieve 150MB/s download speeds as well. It is mostly done but I have some things to clean up first. It would be great to get some other eyes on it to see what we need to do to get it merged. 
> Support multipart download in S3AFileSystem > --- > > Key: HADOOP-16132 > URL: https://issues.apache.org/jira/browse/HADOOP-16132 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Justin Uang >Priority: Major > > I noticed that I get 150MB/s when I use the AWS CLI > {code:java} > aws s3 cp s3:/// - > /dev/null{code} > vs 50MB/s when I use the S3AFileSystem > {code:java} > hadoop fs -cat s3:/// > /dev/null{code} > Looking into the AWS CLI code, it looks like the > [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py] > logic is quite clever. It downloads the next couple parts in parallel using > range requests, and then buffers them in memory in order to reorder them and > expose a single contiguous stream. I translated the logic to Java and > modified the S3AFileSystem to do similar things, and am able to achieve > 150MB/s download speeds as well. It is mostly done but I have some things to > clean up first. > It would be great to get some other eyes on it to see what we need to do to > get it merged.
[jira] [Created] (HADOOP-16132) Support multipart download in S3AFileSystem
Justin Uang created HADOOP-16132: Summary: Support multipart download in S3AFileSystem Key: HADOOP-16132 URL: https://issues.apache.org/jira/browse/HADOOP-16132 Project: Hadoop Common Issue Type: Improvement Reporter: Justin Uang I noticed that I get 150MB/s when I use the aws CLI {code:java} aws s3 cp s3:/// - > /dev/null{code} vs 50MB/s when I use the S3AFileSystem {code:java} hadoop fs -cat s3:/// > /dev/null{code} Looking into the AWS CLI code, it looks like the [download|https://github.com/boto/s3transfer/blob/ca0b708ea8a6a1213c6e21ca5a856e184f824334/s3transfer/download.py] logic is quite clever. It downloads the next couple parts in parallel using range requests, and then buffers them in memory in order to reorder them and expose a single contiguous stream. I translated the logic to Java and modified the S3AFileSystem to do similar things, and am able to achieve 150MB/s download speeds as well. It is mostly done but I have some things to clean up first. It would be great to get some other eyes on it to see what we need to do to get it merged.
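The download strategy described above (fetch the next few parts in parallel with range requests, then reassemble them in memory into one contiguous stream) can be sketched in plain Java. This is a minimal illustration of the technique, not the actual S3AFileSystem or s3transfer code; `rangeGet`, `download`, and `PART_SIZE` are hypothetical names, and a byte array stands in for the remote object.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultipartDownloadSketch {

    static final int PART_SIZE = 4; // tiny part size, just for demonstration

    // Stands in for a ranged GET: returns bytes [start, end) of the object.
    static byte[] rangeGet(byte[] object, int start, int end) {
        return Arrays.copyOfRange(object, start, Math.min(end, object.length));
    }

    // Submits all part requests to the pool at once, then drains the futures
    // in submission order, so the output is contiguous even though the parts
    // may complete out of order.
    static byte[] download(byte[] object, ExecutorService pool) throws Exception {
        int parts = (object.length + PART_SIZE - 1) / PART_SIZE;
        List<Future<byte[]>> futures = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            final int start = i * PART_SIZE;
            futures.add(pool.submit(() -> rangeGet(object, start, start + PART_SIZE)));
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Future<byte[]> f : futures) {
            out.write(f.get()); // blocks until that specific part has arrived
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        byte[] object = "0123456789abcdef".getBytes();
        byte[] copy = download(object, pool);
        pool.shutdown();
        System.out.println(Arrays.equals(object, copy));
    }
}
```

A real implementation would also bound how many parts are buffered ahead of the reader so memory use stays fixed; this sketch omits that back-pressure for brevity.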
[jira] [Created] (HADOOP-16050) Support setting cipher suites for s3a file system
Justin Uang created HADOOP-16050: Summary: Support setting cipher suites for s3a file system Key: HADOOP-16050 URL: https://issues.apache.org/jira/browse/HADOOP-16050 Project: Hadoop Common Issue Type: Bug Affects Versions: 2.9.1 Reporter: Justin Uang Attachments: Screen Shot 2019-01-17 at 2.57.06 PM.png We have found that when running the S3AFileSystem, it picks GCM as the SSL cipher suite. Unfortunately this is well known to be slow on Java 8: [https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20]. In practice we have seen that it can take well over 50% of our CPU time in Spark workflows. We should add an option to set the list of cipher suites we would like to use. !Screen Shot 2019-01-17 at 2.57.06 PM.png!
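The underlying mechanism the proposed option would drive is standard JSSE: restrict the enabled cipher suites on the SSL parameters used for connections. The sketch below shows that mechanism in isolation; it is not the S3A configuration option (which this issue is proposing to add), and the `preferNonGcmSuites` helper and the substring filter on `_GCM_` are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class CipherSuiteConfig {

    // Illustrative helper: drop GCM-based suites from the default enabled set,
    // leaving e.g. CBC suites, which avoid the slow Java 8 GCM implementation.
    static SSLParameters preferNonGcmSuites(SSLContext ctx) {
        SSLParameters params = ctx.getDefaultSSLParameters();
        List<String> preferred = new ArrayList<>();
        for (String suite : params.getCipherSuites()) {
            if (!suite.contains("_GCM_")) {
                preferred.add(suite);
            }
        }
        params.setCipherSuites(preferred.toArray(new String[0]));
        return params;
    }

    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        SSLParameters params = preferNonGcmSuites(ctx);
        System.out.println(params.getCipherSuites().length + " non-GCM suites enabled");
    }
}
```

In the S3A case the filtered suite list would have to reach the HTTP client inside the AWS SDK, which is why a configuration option is needed rather than per-connection code like this.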