[jira] [Created] (HADOOP-16900) Very large files can be truncated when written through S3AFileSystem

2020-03-02 Thread Andrew Olson (Jira)
Andrew Olson created HADOOP-16900:
-

 Summary: Very large files can be truncated when written through 
S3AFileSystem
 Key: HADOOP-16900
 URL: https://issues.apache.org/jira/browse/HADOOP-16900
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs/s3
Reporter: Andrew Olson


If a written file's size exceeds 10,000 * {{fs.s3a.multipart.size}}, a corrupt 
truncation of the S3 object will occur: the maximum number of parts in a 
multipart upload is 10,000, as specified by the S3 API, and there is an 
apparent bug where this failure is not treated as fatal.
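
To put the limit in concrete terms, here is a back-of-the-envelope sketch 
(illustrative arithmetic only, not actual S3A code) of the largest object that 
can be written before the truncation occurs:

{noformat}
// Illustrative arithmetic only, not actual S3A code.
// S3 caps a multipart upload at 10,000 parts, so the largest object
// that can be completely written is 10,000 * fs.s3a.multipart.size.
long partSize = 64L * 1024 * 1024;        // e.g. fs.s3a.multipart.size = 64M
long maxParts = 10000;                    // fixed S3 API limit
long maxObjectSize = maxParts * partSize; // 640,000 MB (~625 GB); any write
                                          // beyond this is silently truncated
{noformat}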






[jira] [Created] (HADOOP-16294) Enable access to input options by DistCp subclasses

2019-05-06 Thread Andrew Olson (JIRA)
Andrew Olson created HADOOP-16294:
-

 Summary: Enable access to input options by DistCp subclasses
 Key: HADOOP-16294
 URL: https://issues.apache.org/jira/browse/HADOOP-16294
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Reporter: Andrew Olson
Assignee: Andrew Olson


In the DistCp class, the DistCpOptions field is private, with no getter method 
allowing retrieval by subclasses. A subclass therefore has to save its own copy 
of the inputOptions supplied to its constructor if it wishes to override the 
createInputFileListing method with logic similar to the original 
implementation, i.e. calling CopyListing#buildListing with a path and input 
options.

I propose adding the following method to DistCp:

{noformat}
protected DistCpOptions getInputOptions() {
  return inputOptions;
}
{noformat}
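
A subclass could then mirror the original implementation roughly as follows 
(a hypothetical sketch: MyDistCp is an illustrative name, and the exact 
signatures vary across Hadoop versions):

{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.CopyListing;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

/* Hypothetical subclass sketch; MyDistCp is an illustrative name. */
public class MyDistCp extends DistCp {
  public MyDistCp(Configuration conf, DistCpOptions options) throws Exception {
    super(conf, options);
  }

  @Override
  protected Path createInputFileListing(Job job) throws IOException {
    Path fileListingPath = getFileListingPath();
    // Custom listing logic would go here; the proposed getter makes the
    // options available without keeping a separate copy in the subclass.
    CopyListing copyListing = CopyListing.getCopyListing(
        job.getConfiguration(), job.getCredentials(), getInputOptions());
    copyListing.buildListing(fileListingPath, getInputOptions());
    return fileListingPath;
  }
}
{noformat}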






[jira] [Resolved] (HADOOP-12046) Avoid creating "._COPYING_" temporary file when copying file to Swift file system

2019-03-14 Thread Andrew Olson (JIRA)


 [ https://issues.apache.org/jira/browse/HADOOP-12046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Olson resolved HADOOP-12046.
---
Resolution: Duplicate

> Avoid creating "._COPYING_" temporary file when copying file to Swift file 
> system
> -
>
> Key: HADOOP-12046
> URL: https://issues.apache.org/jira/browse/HADOOP-12046
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs/swift
>Affects Versions: 2.7.0
>Reporter: Chen He
>Assignee: Chen He
>Priority: Major
> Attachments: Copy Large file to Swift using Hadoop Client.png
>
>
> When copying a file from HDFS or a local file system to another file system 
> implementation, CommandWithDestination.java creates a temp file by adding the 
> suffix "._COPYING_". Once the file is successfully copied, it removes the 
> suffix via rename(). 
> try {
>   PathData tempTarget = target.suffix("._COPYING_");
>   targetFs.setWriteChecksum(writeChecksum);
>   targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
>   targetFs.rename(tempTarget, target);
> } finally {
>   targetFs.close(); // last ditch effort to ensure temp file is removed
> }
> A rename is not costly in HDFS. However, when copying to the Swift file 
> system, the rename is implemented by creating a new object, which is 
> inefficient when users copy many files to Swift. In my tests, copying a 1 GB 
> file to Swift took about 10% more time. We should perform the copy only once 
> for the Swift file system. Changes should be limited to the Swift driver 
> level.






[jira] [Created] (HADOOP-16147) Allow CopyListing sequence file keys and values to be more easily customized

2019-02-25 Thread Andrew Olson (JIRA)
Andrew Olson created HADOOP-16147:
-

 Summary: Allow CopyListing sequence file keys and values to be 
more easily customized
 Key: HADOOP-16147
 URL: https://issues.apache.org/jira/browse/HADOOP-16147
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Reporter: Andrew Olson


We have encountered a scenario where, when using the Crunch library to run a 
distributed copy (CRUNCH-660, CRUNCH-675), at the conclusion of a job we need 
to dynamically rename the target paths to the preferred destination output 
part file names, rather than retaining the original source path names.

A custom CopyListing implementation appears to be the proper solution for 
this. However, the place where the current SimpleCopyListing logic would need 
to be adjusted is a private method (writeToFileListing), so a relatively large 
portion of the class would need to be cloned.

To minimize the amount of code duplication required for such a custom 
implementation, we propose adding two new protected methods to the CopyListing 
class that can be used to change the actual keys and/or values written to the 
copy listing sequence file:

{noformat}
protected Text getFileListingKey(Path sourcePathRoot, CopyListingFileStatus 
fileStatus);

protected CopyListingFileStatus getFileListingValue(CopyListingFileStatus 
fileStatus);
{noformat}

The SimpleCopyListing class would then be modified to consume these methods as 
follows:
{noformat}
fileListWriter.append(
   getFileListingKey(sourcePathRoot, fileStatus),
   getFileListingValue(fileStatus));
{noformat}

The default implementations would simply preserve the present behavior of the 
SimpleCopyListing class, and could reside in either CopyListing or 
SimpleCopyListing, whichever is preferable.

{noformat}
protected Text getFileListingKey(Path sourcePathRoot, CopyListingFileStatus 
fileStatus) {
   return new Text(DistCpUtils.getRelativePath(sourcePathRoot, 
fileStatus.getPath()));
}

protected CopyListingFileStatus getFileListingValue(CopyListingFileStatus 
fileStatus) {
   return fileStatus;
}
{noformat}
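
Assuming the proposed methods are in place, a custom listing for the rename 
scenario described above could then be as small as this (a hypothetical 
sketch; RenamingCopyListing and the part-file naming scheme are illustrative):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.tools.CopyListingFileStatus;
import org.apache.hadoop.tools.SimpleCopyListing;

/* Hypothetical subclass; the class name and naming scheme are illustrative. */
public class RenamingCopyListing extends SimpleCopyListing {
  private long fileIndex = 0;

  public RenamingCopyListing(Configuration conf, Credentials credentials) {
    super(conf, credentials);
  }

  @Override
  protected Text getFileListingKey(Path sourcePathRoot,
      CopyListingFileStatus fileStatus) {
    // Emit part-file style target-relative paths instead of the
    // source-relative paths produced by the default implementation.
    return new Text(String.format("/part-%05d", fileIndex++));
  }
}
{noformat}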

Please let me know if this proposal seems to be on the right track. If so I can 
provide a patch.






[jira] [Created] (HADOOP-16047) Avoid expensive rename when DistCp is writing to S3

2019-01-14 Thread Andrew Olson (JIRA)
Andrew Olson created HADOOP-16047:
-

 Summary: Avoid expensive rename when DistCp is writing to S3
 Key: HADOOP-16047
 URL: https://issues.apache.org/jira/browse/HADOOP-16047
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3, tools/distcp
Reporter: Andrew Olson


When writing to an S3-based target, the temp-file-and-rename logic in 
RetriableFileCopyCommand adds unnecessary cost to the job, as the rename 
operation performs a server-side copy plus delete in S3 [1]. The renames are 
parallelized across all of the DistCp map tasks, so the severity is mitigated 
to some extent. However, a configuration property to conditionally allow 
distributed copies to skip that expense and write directly to the target path 
would improve performance considerably.

[1] 
https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md#object-stores-vs-filesystems
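
One possible shape for the switch, sketched against the existing temp-file 
logic (the property and variable names here are purely illustrative, not 
committed ones):

{noformat}
/* Hypothetical sketch inside RetriableFileCopyCommand; the property and
 * variable names are illustrative only. */
boolean directWrite =
    configuration.getBoolean("distcp.copy.direct.write", false);
// Write straight to the final target when direct writes are enabled,
// otherwise keep the existing temp-file-then-rename behavior.
Path writeTarget = directWrite ? target : tmpTargetPath;
{noformat}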






[jira] [Created] (HADOOP-13075) Add support for SSE-KMS and SSE-C in s3a filesystem

2016-04-29 Thread Andrew Olson (JIRA)
Andrew Olson created HADOOP-13075:
-

 Summary: Add support for SSE-KMS and SSE-C in s3a filesystem
 Key: HADOOP-13075
 URL: https://issues.apache.org/jira/browse/HADOOP-13075
 Project: Hadoop Common
  Issue Type: New Feature
  Components: fs/s3
Reporter: Andrew Olson


S3 provides three types of server-side encryption [1]:

* SSE-S3 (Amazon S3-Managed Keys) [2]
* SSE-KMS (AWS KMS-Managed Keys) [3]
* SSE-C (Customer-Provided Keys) [4]

Of these, the S3AFileSystem in hadoop-aws supports opting into only SSE-S3; 
the underlying aws-java-sdk makes that very simple [5]. With native support in 
the aws-java-sdk, it should be fairly straightforward [6],[7] to support the 
other two flavors of SSE with some additional fs.s3a configuration properties.

[1] http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html
[2] 
http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html
[3] http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html
[4] 
http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html
[5] http://docs.aws.amazon.com/AmazonS3/latest/dev/SSEUsingJavaSDK.html
[6]
http://docs.aws.amazon.com/AmazonS3/latest/dev/kms-using-sdks.html#kms-using-sdks-java
[7] http://docs.aws.amazon.com/AmazonS3/latest/dev/sse-c-using-java-sdk.html
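
As a sketch of what the additional configuration might look like (the key 
property name is illustrative, not a committed one; today only the algorithm 
property exists, accepting "AES256" for SSE-S3):

{noformat}
/* Hypothetical configuration sketch; property names beyond the existing
 * algorithm property are illustrative only. */
Configuration conf = new Configuration();
// Currently only "AES256" (SSE-S3) is accepted here.
conf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS");
// An SSE-KMS key id (or, for SSE-C, a customer-provided key) could be
// supplied through a new property along these lines.
conf.set("fs.s3a.server-side-encryption.key",
    "arn:aws:kms:us-east-1:123456789012:key/example");
{noformat}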






[jira] [Created] (HADOOP-12891) S3AFileSystem should configure Multipart Copy threshold and chunk size

2016-03-04 Thread Andrew Olson (JIRA)
Andrew Olson created HADOOP-12891:
-

 Summary: S3AFileSystem should configure Multipart Copy threshold 
and chunk size
 Key: HADOOP-12891
 URL: https://issues.apache.org/jira/browse/HADOOP-12891
 Project: Hadoop Common
  Issue Type: Improvement
  Components: fs/s3
Reporter: Andrew Olson


In the AWS S3 Java SDK, the defaults for the multipart copy threshold and 
chunk size are very high [1]:

{noformat}
/** Default size threshold for Amazon S3 object after which multi-part copy 
is initiated. */
private static final long DEFAULT_MULTIPART_COPY_THRESHOLD = 5 * GB;

/** Default minimum size of each part for multi-part copy. */
private static final long DEFAULT_MINIMUM_COPY_PART_SIZE = 100 * MB;
{noformat}

In internal testing we have found that a lower but still reasonable threshold 
and chunk size can be extremely beneficial. In our case we set both the 
threshold and size to 25 MB with good results.

Amazon enforces a minimum part size of 5 MB [2].

For the S3A filesystem, file renames are actually implemented via a remote copy 
request, which is already quite slow compared to a rename on HDFS. This very 
high threshold for utilizing the multipart functionality can make the 
performance considerably worse, particularly for files in the 100 MB to 5 GB 
range, which is fairly typical for MapReduce job outputs.

Two apparent options are:

1) Use the same configuration properties (fs.s3a.multipart.threshold, 
fs.s3a.multipart.size) for both. This seems preferable, as the accompanying 
documentation [3] for these properties already says that they apply to either 
"uploads or copies". We just need to add the missing 
TransferManagerConfiguration#setMultipartCopyThreshold [4] and 
TransferManagerConfiguration#setMultipartCopyPartSize [5] calls at [6], like:

{noformat}
/* Handle copies in the same way as uploads. */
transferConfiguration.setMultipartCopyPartSize(partSize);
transferConfiguration.setMultipartCopyThreshold(multiPartThreshold);
{noformat}

2) Add two new configuration properties so that the copy threshold and part 
size can be independently configured, perhaps with defaults lower than 
Amazon's, set into TransferManagerConfiguration in the same way.
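
For option 2, the wiring could look roughly like the following (the property 
names are illustrative only; copy settings would fall back to the upload 
settings when unset):

{noformat}
/* Hypothetical sketch of option 2; the property names are illustrative. */
long copyPartSize = conf.getLong("fs.s3a.multipart.copy.size", partSize);
long copyThreshold = conf.getLong("fs.s3a.multipart.copy.threshold",
    multiPartThreshold);
transferConfiguration.setMultipartCopyPartSize(copyPartSize);
transferConfiguration.setMultipartCopyThreshold(copyThreshold);
{noformat}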

[1] 
https://github.com/aws/aws-sdk-java/blob/1.10.58/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.java#L36-L40
[2] http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html
[3] 
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#S3A
[4] 
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.html#setMultipartCopyThreshold(long)
[5] 
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManagerConfiguration.html#setMultipartCopyPartSize(long)
[6] 
https://github.com/apache/hadoop/blob/release-2.7.2-RC2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L286


