[jira] [Comment Edited] (HADOOP-14999) AliyunOSS: provide one asynchronous multi-part based uploading mechanism

2018-03-29 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418505#comment-16418505
 ] 

Genmao Yu edited comment on HADOOP-14999 at 3/29/18 6:40 AM:
-

[~Sammi]
 - #2: *"The object size limit differs by upload method: multipart upload supports objects up to 48.8 TB, while the other upload methods support at most 5 GB."* (translated from the Chinese OSS documentation)
[https://help.aliyun.com/document_detail/31827.html]
 - #3: fixed
 - #4: fixed
 - #5: It is not an async operation: after submitting the tasks (store.uploadPart), we 
wait until all part uploads finish, so it is safe to do resource cleanup.
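The wait-then-clean behavior described in #5 can be sketched minimally. This is an illustrative stand-in, not the actual AliyunOSSBlockOutputStream code: {{uploadPart}} and the class name are hypothetical, and the real patch uses {{SemaphoredDelegatingExecutor}} rather than a plain thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MultipartUploadSketch {
  // Stand-in for store.uploadPart(...); returns the part's ETag.
  static String uploadPart(int partNumber) {
    return "etag-" + partNumber;
  }

  static List<String> uploadAllParts(int parts) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (int i = 1; i <= parts; i++) {
        final int partNumber = i;
        futures.add(executor.submit(() -> uploadPart(partNumber)));
      }
      List<String> etags = new ArrayList<>();
      // Block until every part upload has finished; only after this point
      // is it safe to clean up the local part files.
      for (Future<String> f : futures) {
        try {
          etags.add(f.get());
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
      return etags;
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(uploadAllParts(3));
  }
}
```

Because {{Future.get()}} is called for every submitted task before the method returns, no upload can still be in flight when cleanup runs.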




> AliyunOSS: provide one asynchronous multi-part based uploading mechanism
> 
>
> Key: HADOOP-14999
> URL: https://issues.apache.org/jira/browse/HADOOP-14999
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/oss
>Affects Versions: 3.0.0-beta1
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Major
> Attachments: HADOOP-14999.001.patch, HADOOP-14999.002.patch, 
> HADOOP-14999.003.patch, HADOOP-14999.004.patch, HADOOP-14999.005.patch, 
> HADOOP-14999.006.patch, HADOOP-14999.007.patch, HADOOP-14999.008.patch, 
> HADOOP-14999.009.patch, asynchronous_file_uploading.pdf, 
> diff-between-patch7-and-patch8.txt
>
>
> This mechanism is designed to upload files in parallel and asynchronously:
>  - improve the performance of uploading files to the OSS server. First, this 
> mechanism splits the result into multiple small blocks and uploads them in 
> parallel. Then, producing the result and uploading the blocks happen 
> asynchronously.
>  - avoid buffering too large a result on local disk. To cite an extreme 
> example, a task may output 100GB or even more; we would need to write this 
> 100GB to local disk and then upload it. This is inefficient and limited by 
> disk space.
> This patch reuses {{SemaphoredDelegatingExecutor}} as the executor service and 
> depends on HADOOP-15039.
> The attached {{asynchronous_file_uploading.pdf}} illustrates the difference 
> between the previous {{AliyunOSSOutputStream}} and 
> {{AliyunOSSBlockOutputStream}}, i.e. this asynchronous multi-part based 
> uploading mechanism.
> 1. {{AliyunOSSOutputStream}}: we need to write the whole result to local 
> disk before we can upload it to OSS. This poses two problems:
>  - if the output file is too large, it will run out of local disk space.
>  - if the output file is too large, the task will wait a long time to upload 
> the result to OSS before finishing, wasting compute resources.
> 2. {{AliyunOSSBlockOutputStream}}: we cut the task output into small blocks, 
> i.e. small local files, and each block is packaged into an upload task. These 
> tasks are submitted to {{SemaphoredDelegatingExecutor}}, which uploads the 
> blocks in parallel, improving performance greatly.
> 3. Each task will retry 3 times to upload its block to Aliyun OSS. If any 
> task fails, the whole file upload fails and we abort the current upload.
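The block-cutting idea behind {{AliyunOSSBlockOutputStream}} can be sketched as follows. This is a minimal illustration of the buffering scheme only: the class, field, and method names are hypothetical, and the real code writes blocks to local files and hands them to an executor rather than keeping byte arrays in memory.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

class BlockBufferSketch {
  static final int BLOCK_SIZE = 8; // tiny for illustration; the real part size is multi-MB

  private ByteArrayOutputStream current = new ByteArrayOutputStream();
  final List<byte[]> submittedBlocks = new ArrayList<>();

  void write(byte[] data) {
    for (byte b : data) {
      current.write(b);
      // Once a block is full, hand it off as an upload task and start a new one.
      if (current.size() >= BLOCK_SIZE) {
        submitBlock();
      }
    }
  }

  void close() {
    if (current.size() > 0) {
      submitBlock(); // flush the final, possibly partial, block
    }
  }

  private void submitBlock() {
    // In the real patch this packages the local block file into an upload
    // task submitted to SemaphoredDelegatingExecutor.
    submittedBlocks.add(current.toByteArray());
    current = new ByteArrayOutputStream();
  }
}
```

Writing 20 bytes with an 8-byte block size yields two full blocks plus one 4-byte block at close, which is exactly why the stream never needs the whole result on disk at once.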



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HADOOP-14999) AliyunOSS: provide one asynchronous multi-part based uploading mechanism

2018-03-08 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390910#comment-16390910
 ] 

Genmao Yu edited comment on HADOOP-14999 at 3/8/18 8:42 AM:


Thanks for [~Sammi]'s review. 
 1. comment-1: remove unused config and refine some config
 2. comment-2: fixed
 3. comment-3: Sorry, what problems? All style checks passed.
{code:java}
Preconditions.checkArgument(v >= min,
    String.format("Value of %s: %d is below the minimum value %d",
        key, v, min));
{code}
4. comment-4: update unit test
 5. comment-5: IMHO, a 5GB file is too large to test in an integration test. And 
{{MULTIPART_UPLOAD_SIZE}} may cover this case, as you mentioned.
 6. "But they are not cleaned when exception happens during the write() 
process.": all temp files are registered with {{deleteOnExit}}, but I also added 
resource cleanup logic in a {{try-finally}}
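On comment 3: the snippet above pre-formats the message with {{String.format}} before passing it to {{checkArgument}}, which sidesteps Guava's own message template (Guava's lenient formatter substitutes {{%s}} placeholders only, so a bare "%d" template would not be expanded). A minimal stand-in showing the pattern; {{checkAtLeast}} and the class name are illustrative, not the patch's actual method names:

```java
class PreconditionSketch {
  // Minimal stand-in for Guava's Preconditions.checkArgument(boolean, Object).
  static void checkArgument(boolean expression, Object errorMessage) {
    if (!expression) {
      throw new IllegalArgumentException(String.valueOf(errorMessage));
    }
  }

  // Validates a long config value against a minimum, with a fully
  // pre-formatted error message (so %d is handled by String.format).
  static long checkAtLeast(String key, long v, long min) {
    checkArgument(v >= min,
        String.format("Value of %s: %d is below the minimum value %d",
            key, v, min));
    return v;
  }

  public static void main(String[] args) {
    System.out.println(checkAtLeast("fs.oss.multipart.upload.size", 1048576, 102400));
  }
}
```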

Performance test: file upload
|file size|before patch|after patch (with 4 parallelism)|
|10MB|1.03s|1.1s|
|100MB|6.5s|2.3s|
|1GB|56.5s|13.5s|
|10GB|574s|173s|









[jira] [Comment Edited] (HADOOP-14999) AliyunOSS: provide one asynchronous multi-part based uploading mechanism

2018-02-08 Thread SammiChen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356629#comment-16356629
 ] 

SammiChen edited comment on HADOOP-14999 at 2/8/18 8:15 AM:


Hi [~uncleGen], thanks for refining the patch. Here are a few comments.

1. AliyunOSSFileSystemStore.

{code:java}
uploadPartSize = conf.getLong(MULTIPART_UPLOAD_SIZE_KEY,
    MULTIPART_UPLOAD_SIZE_DEFAULT);
multipartThreshold = conf.getLong(MIN_MULTIPART_UPLOAD_THRESHOLD_KEY,
    MIN_MULTIPART_UPLOAD_THRESHOLD_DEFAULT);
partSize = conf.getLong(MULTIPART_UPLOAD_SIZE_KEY,
    MULTIPART_UPLOAD_SIZE_DEFAULT);
if (partSize < MIN_MULTIPART_UPLOAD_PART_SIZE) {
  partSize = MIN_MULTIPART_UPLOAD_PART_SIZE;
}
{code}

What is the intended difference between "uploadPartSize" and "partSize", which 
are initialized with the same value? It seems "partSize" is not used anywhere else.

Also please refine the multipart-upload related constants, and put related 
properties adjacent to each other. It seems "MULTIPART_UPLOAD_SIZE_DEFAULT" 
should be called "MULTIPART_UPLOAD_PART_SIZE_DEFAULT", and 
"MULTIPART_UPLOAD_SIZE = 104857600" is the temp file size. Try to make each 
property name carry its accurate meaning.

{code:java}
// Size of each of or multipart pieces in bytes
public static final String MULTIPART_UPLOAD_SIZE_KEY =
    "fs.oss.multipart.upload.size";
public static final long MULTIPART_UPLOAD_SIZE = 104857600; // 100 MB

public static final long MULTIPART_UPLOAD_SIZE_DEFAULT = 10 * 1024 * 1024;
public static final int MULTIPART_UPLOAD_PART_NUM_LIMIT = 1;

// Minimum size in bytes before we start a multipart uploads or copy
public static final String MIN_MULTIPART_UPLOAD_THRESHOLD_KEY =
    "fs.oss.multipart.upload.threshold";
public static final long MIN_MULTIPART_UPLOAD_THRESHOLD_DEFAULT =
    20 * 1024 * 1024;

public static final long MIN_MULTIPART_UPLOAD_PART_SIZE = 100 * 1024L;
{code}

 

2. AliyunOSSUtils#createTmpFileForWrite

Change the order of the following statements:

{code:java}
if (directoryAllocator == null) {
  directoryAllocator = new LocalDirAllocator(BUFFER_DIR_KEY);
}
if (conf.get(BUFFER_DIR_KEY) == null) {
  conf.set(BUFFER_DIR_KEY, conf.get("hadoop.tmp.dir") + "/oss");
}
{code}

Also, should "directoryAllocator" be final?
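The suggested reordering can be sketched without Hadoop dependencies. Here a plain map stands in for Hadoop's Configuration, and the method and class names are illustrative only: the point is that the key must be defaulted before the allocator that reads it is constructed.

```java
import java.util.HashMap;
import java.util.Map;

class TmpFileOrderSketch {
  static final String BUFFER_DIR_KEY = "fs.oss.buffer.dir";

  static String resolveBufferDir(Map<String, String> conf) {
    // 1) Default the buffer-dir key first...
    if (conf.get(BUFFER_DIR_KEY) == null) {
      conf.put(BUFFER_DIR_KEY, conf.get("hadoop.tmp.dir") + "/oss");
    }
    // 2) ...then construct the component that reads it (stands in for
    //    new LocalDirAllocator(BUFFER_DIR_KEY)).
    return conf.get(BUFFER_DIR_KEY);
  }

  static String demo() {
    Map<String, String> conf = new HashMap<>();
    conf.put("hadoop.tmp.dir", "/tmp/hadoop");
    return resolveBufferDir(conf);
  }
}
```

In the original order, the allocator would be created while the key could still be unset, so it would never see the "hadoop.tmp.dir"-based default.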

3. AliyunOSSUtils#intOption, longOption

Precondition doesn't support "%d". Add test cases to cover the logic. 
Suggest changing the names to more meaningful ones like getXOption. Pay 
attention to the code style and indentation.

4. TestAliyunOSSBlockOutputStream. Add random-length file tests here. Only 
1024-aligned file lengths are not enough.

5. AliyunOSSBlockOutputStream

The class comment says: "Asynchronous multi-part based uploading mechanism to 
support huge files which are larger than 5GB." Where is this 5GB threshold 
checked in the code?

The resources are well cleaned after close() is called. But they are not 
cleaned when an exception happens during the write() process.
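The cleanup-on-exception concern can be sketched with a try/finally, which is the shape of the fix later described for comment 6. This is a hypothetical illustration: the class, the in-memory temp-file list, and the failure flag are stand-ins, not the actual patch code (which deletes real local files).

```java
import java.util.ArrayList;
import java.util.List;

class CleanupSketch {
  // Stand-in for the local buffer files created during write().
  final List<String> tempFiles = new ArrayList<>();

  private void uploadBlock(boolean fail) {
    if (fail) {
      throw new RuntimeException("simulated upload failure");
    }
  }

  void writeBlock(boolean failUpload) {
    tempFiles.add("block-0.tmp");
    try {
      uploadBlock(failUpload);
    } finally {
      // Clean up whether or not the upload threw; File.deleteOnExit alone
      // only helps on normal JVM shutdown, not on a failed write().
      tempFiles.remove("block-0.tmp");
    }
  }
}
```

Even when the simulated upload throws, the finally branch removes the temp file, so nothing leaks across a failed write.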

 


[jira] [Comment Edited] (HADOOP-14999) AliyunOSS: provide one asynchronous multi-part based uploading mechanism

2017-11-14 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250906#comment-16250906
 ] 

Genmao Yu edited comment on HADOOP-14999 at 11/14/17 11:09 AM:
---

pending on refactoring: use {{SemaphoredDelegatingExecutor}} instead of 
{{TaskEngine}}.

Just as discussed in HADOOP-15027, I think 
{{org.apache.hadoop.fs.s3a.SemaphoredDelegatingExecutor}} is a good common 
class, and we may move it to hadoop-common. [~ste...@apache.org] Do you mind if 
I open a JIRA to do this work?





