[jira] [Created] (HADOOP-16900) Very large files can be truncated when written through S3AFileSystem
Andrew Olson created HADOOP-16900:
-------------------------------------

             Summary: Very large files can be truncated when written through S3AFileSystem
                 Key: HADOOP-16900
                 URL: https://issues.apache.org/jira/browse/HADOOP-16900
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs/s3
            Reporter: Andrew Olson

If a written file's size exceeds 10,000 * {{fs.s3a.multipart.size}}, the S3 object is silently truncated and corrupted: the S3 API allows at most 10,000 parts in a multipart upload, and there is an apparent bug where exceeding that limit is not treated as a fatal error.
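For illustration, here is a minimal Java sketch of the limit described above and the kind of fatal check that would prevent silent truncation. The class and method names are hypothetical, not taken from the Hadoop codebase:

{noformat}
import java.io.IOException;

/** Hypothetical guard illustrating the 10,000-part S3 multipart upload limit. */
public final class MultipartLimitCheck {

  private static final int MAX_PARTS_PER_UPLOAD = 10_000;

  /** Fails fast instead of silently truncating once the part limit is hit. */
  static void verifyPartNumber(int nextPartNumber) throws IOException {
    if (nextPartNumber > MAX_PARTS_PER_UPLOAD) {
      throw new IOException("Upload exceeds the S3 limit of "
          + MAX_PARTS_PER_UPLOAD + " parts; increase fs.s3a.multipart.size");
    }
  }

  public static void main(String[] args) throws IOException {
    long partSize = 64L * 1024 * 1024; // e.g. fs.s3a.multipart.size = 64M
    // Any file larger than this overflows the 10,000-part limit.
    System.out.println("Largest safe object: "
        + (MAX_PARTS_PER_UPLOAD * partSize) + " bytes");
    verifyPartNumber(10_001); // throws rather than truncating
  }
}
{noformat}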
[jira] [Created] (HADOOP-16294) Enable access to input options by DistCp subclasses
Andrew Olson created HADOOP-16294:
-------------------------------------

             Summary: Enable access to input options by DistCp subclasses
                 Key: HADOOP-16294
                 URL: https://issues.apache.org/jira/browse/HADOOP-16294
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
            Reporter: Andrew Olson
            Assignee: Andrew Olson

In the DistCp class, the DistCpOptions field is private with no getter method allowing retrieval by subclasses. A subclass therefore needs to save its own copy of the inputOptions supplied to its constructor if it wishes to override the createInputFileListing method with logic similar to the original implementation, i.e. calling CopyListing#buildListing with a path and input options.

I propose adding this method to DistCp:

{noformat}
protected DistCpOptions getInputOptions() {
  return inputOptions;
}
{noformat}
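For context, a sketch of how a subclass could use the proposed getter to override createInputFileListing without keeping its own copy of the options. The subclass name is hypothetical, and the body mirrors the buildListing call described above, assuming DistCp's protected getFileListingPath() helper:

{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.CopyListing;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

/** Hypothetical subclass relying on the proposed accessor. */
public class CustomDistCp extends DistCp {

  public CustomDistCp(Configuration conf, DistCpOptions options) throws Exception {
    super(conf, options);
  }

  @Override
  protected Path createInputFileListing(Job job) throws IOException {
    // With the proposed getter, the subclass no longer needs to stash
    // the options passed to its constructor.
    DistCpOptions options = getInputOptions();
    Path listingPath = getFileListingPath();
    CopyListing listing = CopyListing.getCopyListing(
        job.getConfiguration(), job.getCredentials(), options);
    listing.buildListing(listingPath, options);
    return listingPath;
  }
}
{noformat}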
[jira] [Resolved] (HADOOP-12046) Avoid creating "._COPYING_" temporary file when copying file to Swift file system
[ https://issues.apache.org/jira/browse/HADOOP-12046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Olson resolved HADOOP-12046.
-----------------------------------
    Resolution: Duplicate

> Avoid creating "._COPYING_" temporary file when copying file to Swift file
> system
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-12046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12046
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/swift
>    Affects Versions: 2.7.0
>            Reporter: Chen He
>            Assignee: Chen He
>            Priority: Major
>         Attachments: Copy Large file to Swift using Hadoop Client.png
>
> When copying a file from HDFS or a local file system to another file system
> implementation, CommandWithDestination.java creates a temporary file by
> appending the suffix "._COPYING_". Once the file is successfully copied, the
> suffix is removed via rename():
>
>   try {
>     PathData tempTarget = target.suffix("._COPYING_");
>     targetFs.setWriteChecksum(writeChecksum);
>     targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
>     targetFs.rename(tempTarget, target);
>   } finally {
>     targetFs.close(); // last ditch effort to ensure temp file is removed
>   }
>
> This is not costly in HDFS. However, when copying to the Swift file system,
> the rename is implemented as creating a new file, which is inefficient when
> users copy many files to Swift. In testing, copying a 1 GB file to Swift took
> about 10% longer because of the extra rename. We should perform the copy only
> once for the Swift file system, with changes limited to the Swift driver level.
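To make the cost difference concrete, here is a small sketch contrasting the two strategies; the helper class is illustrative, not part of FsShell, and uses only public FileSystem APIs:

{noformat}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Hypothetical helper contrasting the two copy strategies discussed above. */
public class CopyStrategies {

  /** Current behavior: write to a "._COPYING_" temp file, then rename. */
  static void copyViaTempFile(FileSystem fs, InputStream in, Path target)
      throws IOException {
    Path temp = target.suffix("._COPYING_");
    try (FSDataOutputStream out = fs.create(temp)) {
      IOUtils.copyBytes(in, out, 4096);
    }
    // On Swift, this rename is itself a full server-side re-write of the object.
    fs.rename(temp, target);
  }

  /** Proposed behavior for Swift: write the object exactly once. */
  static void copyDirect(FileSystem fs, InputStream in, Path target)
      throws IOException {
    try (FSDataOutputStream out = fs.create(target)) {
      IOUtils.copyBytes(in, out, 4096);
    }
  }
}
{noformat}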
[jira] [Created] (HADOOP-16147) Allow CopyListing sequence file keys and values to be more easily customized
Andrew Olson created HADOOP-16147:
-------------------------------------

             Summary: Allow CopyListing sequence file keys and values to be more easily customized
                 Key: HADOOP-16147
                 URL: https://issues.apache.org/jira/browse/HADOOP-16147
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
            Reporter: Andrew Olson

We have encountered a scenario where, when using the Crunch library to run a distributed copy (CRUNCH-660, CRUNCH-675), at the conclusion of a job we need to dynamically rename target paths to the preferred destination output part file names, rather than retaining the original source path names.

A custom CopyListing implementation appears to be the proper solution for this. However, the place where the current SimpleCopyListing logic needs to be adjusted is a private method (writeToFileListing), so a relatively large portion of the class would need to be cloned.

To minimize the amount of code duplication required for such a custom implementation, we propose adding two new protected methods to the CopyListing class that can be used to change the actual keys and/or values written to the copy listing sequence file:

{noformat}
protected Text getFileListingKey(Path sourcePathRoot, CopyListingFileStatus fileStatus);

protected CopyListingFileStatus getFileListingValue(CopyListingFileStatus fileStatus);
{noformat}

The SimpleCopyListing class would then be modified to consume these methods as follows:

{noformat}
fileListWriter.append(
    getFileListingKey(sourcePathRoot, fileStatus),
    getFileListingValue(fileStatus));
{noformat}

The default implementations would simply preserve the present behavior of the SimpleCopyListing class, and could reside in either CopyListing or SimpleCopyListing, whichever is preferable:

{noformat}
protected Text getFileListingKey(Path sourcePathRoot, CopyListingFileStatus fileStatus) {
  return new Text(DistCpUtils.getRelativePath(sourcePathRoot, fileStatus.getPath()));
}

protected CopyListingFileStatus getFileListingValue(CopyListingFileStatus fileStatus) {
  return fileStatus;
}
{noformat}

Please let me know if this proposal seems to be on the right track. If so, I can provide a patch.
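As an illustration of what the proposal would enable, here is a hypothetical subclass that renames listing keys. The class name and renaming rule are invented for the example, and getFileListingKey is the proposed (not yet existing) hook:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.tools.CopyListingFileStatus;
import org.apache.hadoop.tools.SimpleCopyListing;
import org.apache.hadoop.tools.util.DistCpUtils;

/** Hypothetical listing that rewrites target names via the proposed hook. */
public class RenamingCopyListing extends SimpleCopyListing {

  public RenamingCopyListing(Configuration conf, Credentials credentials) {
    super(conf, credentials);
  }

  // Proposed hook; the @Override only compiles once the hook exists.
  @Override
  protected Text getFileListingKey(Path sourcePathRoot,
                                   CopyListingFileStatus fileStatus) {
    String relative =
        DistCpUtils.getRelativePath(sourcePathRoot, fileStatus.getPath());
    // Illustrative rule: map source names onto part-file style names.
    return new Text(relative.replaceAll("out(\\d+)$", "part-$1"));
  }
}
{noformat}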
[jira] [Created] (HADOOP-16047) Avoid expensive rename when DistCp is writing to S3
Andrew Olson created HADOOP-16047:
-------------------------------------

             Summary: Avoid expensive rename when DistCp is writing to S3
                 Key: HADOOP-16047
                 URL: https://issues.apache.org/jira/browse/HADOOP-16047
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs/s3, tools/distcp
            Reporter: Andrew Olson

When writing to an S3-based target, the temp file and rename logic in RetriableFileCopyCommand adds unnecessary cost to the job, as the rename operation is a server-side copy + delete in S3 [1]. The renames are parallelized across all of the DistCp map tasks, so the severity is mitigated to some extent. However, a configuration property to conditionally allow distributed copies to skip that expense and write directly to the target path would improve performance considerably.

[1] https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md#object-stores-vs-filesystems
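A minimal sketch of what such a conditional could look like; the property name "distcp.direct.write", the helper class, and the temp-path naming are assumptions for this illustration, not confirmed Hadoop constants:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch of the proposed direct-write toggle. */
public class DirectWriteSketch {

  static Path chooseWritePath(Configuration conf, Path finalTarget) {
    boolean directWrite = conf.getBoolean("distcp.direct.write", false);
    if (directWrite) {
      // Write straight to the destination: no server-side copy + delete.
      return finalTarget;
    }
    // Default behavior: write to a temp path, then rename into place.
    return new Path(finalTarget.getParent(),
        ".distcp.tmp." + finalTarget.getName());
  }
}
{noformat}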
[jira] [Created] (HADOOP-13075) Add support for SSE-KMS and SSE-C in s3a filesystem
Andrew Olson created HADOOP-13075:
-------------------------------------

             Summary: Add support for SSE-KMS and SSE-C in s3a filesystem
                 Key: HADOOP-13075
                 URL: https://issues.apache.org/jira/browse/HADOOP-13075
             Project: Hadoop Common
          Issue Type: New Feature
          Components: fs/s3
            Reporter: Andrew Olson

S3 provides 3 types of server-side encryption [1]:

* SSE-S3 (Amazon S3-Managed Keys) [2]
* SSE-KMS (AWS KMS-Managed Keys) [3]
* SSE-C (Customer-Provided Keys) [4]

Of these, the S3AFileSystem in hadoop-aws only supports opting into SSE-S3, which the underlying aws-java-sdk makes very simple [5]. With native support in the aws-java-sdk it should be fairly straightforward [6],[7] to support the other two flavors of SSE with some additional fs.s3a configuration properties.

[1] http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html
[2] http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html
[3] http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html
[4] http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html
[5] http://docs.aws.amazon.com/AmazonS3/latest/dev/SSEUsingJavaSDK.html
[6] http://docs.aws.amazon.com/AmazonS3/latest/dev/kms-using-sdks.html#kms-using-sdks-java
[7] http://docs.aws.amazon.com/AmazonS3/latest/dev/sse-c-using-java-sdk.html
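For reference, a short sketch of how the aws-java-sdk exposes the two additional flavors on a PutObjectRequest; the bucket, key, and key material are placeholders:

{noformat}
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.SSEAwsKeyManagementParams;
import com.amazonaws.services.s3.model.SSECustomerKey;

/** Illustrative uses of the SDK's SSE-KMS and SSE-C request options. */
public class SseExamples {

  static void putWithSseKms(AmazonS3 s3, File file, String kmsKeyId) {
    PutObjectRequest request = new PutObjectRequest("bucket", "key", file)
        // SSE-KMS: S3 encrypts with the given AWS KMS-managed key.
        .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKeyId));
    s3.putObject(request);
  }

  static void putWithSseC(AmazonS3 s3, File file, String base64Key) {
    PutObjectRequest request = new PutObjectRequest("bucket", "key", file)
        // SSE-C: the caller supplies the encryption key with each request.
        .withSSECustomerKey(new SSECustomerKey(base64Key));
    s3.putObject(request);
  }
}
{noformat}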
[jira] [Created] (HADOOP-12891) S3AFileSystem should configure Multipart Copy threshold and chunk size
Andrew Olson created HADOOP-12891:
-------------------------------------

             Summary: S3AFileSystem should configure Multipart Copy threshold and chunk size
                 Key: HADOOP-12891
                 URL: https://issues.apache.org/jira/browse/HADOOP-12891
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs/s3
            Reporter: Andrew Olson

In the AWS S3 Java SDK the defaults for the Multipart Copy threshold and chunk size are very high [1]:

{noformat}
/** Default size threshold for Amazon S3 object after which multi-part copy is initiated. */
private static final long DEFAULT_MULTIPART_COPY_THRESHOLD = 5 * GB;

/** Default minimum size of each part for multi-part copy. */
private static final long DEFAULT_MINIMUM_COPY_PART_SIZE = 100 * MB;
{noformat}

In internal testing we have found that a lower but still reasonable threshold and chunk size can be extremely beneficial. In our case we set both the threshold and size to 25 MB with good results. Amazon enforces a minimum of 5 MB [2].

For the S3A filesystem, file renames are actually implemented via a remote copy request, which is already quite slow compared to a rename on HDFS. This very high threshold for utilizing the multipart functionality can make the performance considerably worse, particularly for files in the 100 MB to 5 GB range, which is fairly typical for mapreduce job outputs.

Two apparent options are:

1) Use the same configuration (fs.s3a.multipart.threshold, fs.s3a.multipart.size) for both. This seems preferable, as the accompanying documentation [3] for these configuration properties already says that they are applicable to either "uploads or copies". We just need to add the missing TransferManagerConfiguration#setMultipartCopyThreshold [4] and TransferManagerConfiguration#setMultipartCopyPartSize [5] calls at [6], like:

{noformat}
/* Handle copies in the same way as uploads. */
transferConfiguration.setMultipartCopyPartSize(partSize);
transferConfiguration.setMultipartCopyThreshold(multiPartThreshold);
{noformat}

2) Add two new configuration properties so that the copy threshold and part size can be independently configured, perhaps with defaults lower than Amazon's, set into TransferManagerConfiguration in the same way.

[1] https://github.com/aws/aws-sdk-java/blob/1.10.58/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.java#L36-L40
[2] http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html
[3] https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A
[4] http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.html#setMultipartCopyThreshold(long)
[5] http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.html#setMultipartCopyPartSize(long)
[6] https://github.com/apache/hadoop/blob/release-2.7.2-RC2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L286
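A sketch of option 1, configuring copies the same way as uploads. The 25 MB values reflect the internal testing mentioned above, and the surrounding helper class is illustrative of what S3AFileSystem does during initialization:

{noformat}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerConfiguration;

/** Illustrative helper mirroring the TransferManager setup in S3AFileSystem. */
public class CopyTuning {

  static TransferManager newTransferManager(AmazonS3 s3) {
    // Values mirroring fs.s3a.multipart.size / fs.s3a.multipart.threshold,
    // set to the 25 MB found beneficial in internal testing.
    long partSize = 25L * 1024 * 1024;
    long multiPartThreshold = 25L * 1024 * 1024;

    TransferManagerConfiguration transferConfiguration =
        new TransferManagerConfiguration();
    transferConfiguration.setMinimumUploadPartSize(partSize);
    transferConfiguration.setMultipartUploadThreshold(multiPartThreshold);
    // The proposed addition: handle copies in the same way as uploads.
    transferConfiguration.setMultipartCopyPartSize(partSize);
    transferConfiguration.setMultipartCopyThreshold(multiPartThreshold);

    TransferManager transfers = new TransferManager(s3);
    transfers.setConfiguration(transferConfiguration);
    return transfers;
  }
}
{noformat}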