[jira] [Commented] (HADOOP-16756) Inconsistent Behavior on distcp -update over S3

2020-01-02 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006843#comment-17006843
 ] 

Steve Loughran commented on HADOOP-16756:
-

Reviewing this: it turns out that the default attribute to preserve is the block 
size (OptionsParser L201), and there's no obvious way to turn that off.

Try doing a distcp with: -direct -pr

This says "preserve replication", which will be ignored on S3.

If this makes it go away, then we need to think about what to do.

One option is for S3A to actually use the block size parameter passed into 
createFile. We would have to do the same for ABFS too. I wonder what would 
break.
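
To illustrate the interaction described above, here is a minimal, self-contained sketch in plain Java. It is not the actual CopyMapper or OptionsParser code: the class, enum, and method names are hypothetical, and the 128MB source block size is an assumed example value. It models an -update check that compares block size only when block size is among the preserved attributes, which is why preserving only replication (-pr) would be expected to avoid the recopy.

```java
import java.util.EnumSet;

public class UpdateSkipSketch {

    /** Hypothetical stand-in for the attributes distcp can preserve. */
    enum Attr { REPLICATION, BLOCKSIZE }

    /** Minimal file metadata: length and reported block size, in bytes. */
    static final class Meta {
        final long length;
        final long blockSize;

        Meta(long length, long blockSize) {
            this.length = length;
            this.blockSize = blockSize;
        }
    }

    /**
     * Returns true if the copy can be skipped. Block size is compared
     * only when it is one of the attributes being preserved.
     */
    static boolean canSkip(Meta source, Meta target, EnumSet<Attr> preserve) {
        boolean sameLength = source.length == target.length;
        boolean sameBlockSize = source.blockSize == target.blockSize
                || !preserve.contains(Attr.BLOCKSIZE);
        return sameLength && sameBlockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        Meta onHdfs = new Meta(10 * mb, 128 * mb); // assumed source block size
        Meta onS3a = new Meta(10 * mb, 32 * mb);   // 32MB, as reported in this issue

        // Block size preserved (the default described above): sizes differ, so recopy.
        System.out.println(canSkip(onHdfs, onS3a, EnumSet.of(Attr.BLOCKSIZE)));   // false
        // Only replication preserved (-pr): block size is not compared, so skip.
        System.out.println(canSkip(onHdfs, onS3a, EnumSet.of(Attr.REPLICATION))); // true
    }
}
```

In this model the file is recopied on every run whenever the two stores report different block sizes, even though the lengths match.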

> Inconsistent Behavior on distcp -update over S3
> ---
>
> Key: HADOOP-16756
> URL: https://issues.apache.org/jira/browse/HADOOP-16756
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: fs/s3, tools/distcp
>Affects Versions: 3.3.0
>Reporter: Daisuke Kobayashi
>Priority: Major
>
> Distcp over S3A always copies all source files, whether or not the files have 
> changed. This contradicts the statement in the documentation below.
> [http://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]
> {noformat}
> And to use -update to only copy changed files.
> {noformat}
> CopyMapper compares file length as well as block size before copying. While 
> the file length should match, the block size does not. This is apparently 
> because the returned block size from S3A is always 32MB.
> [https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java#L348]
> I suppose we should either update the documentation or make a code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16756) Inconsistent Behavior on distcp -update over S3

2019-12-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995628#comment-16995628
 ] 

Steve Loughran commented on HADOOP-16756:
-

It's not as simple as looking at the FS scheme: webhdfs and hdfs interoperate 
on checksums, and we don't want to break that.
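
A small sketch of why a scheme-only test is too coarse. This is plain Java using java.net.URI, not Hadoop code; the class and method names, host names, ports, and bucket name are all hypothetical. A rule of the form "skip the check whenever the source and target schemes differ" would treat an hdfs-to-webhdfs copy exactly like an hdfs-to-s3a copy, although the former pair can still compare checksums.

```java
import java.net.URI;

public class SchemeCheckSketch {

    /** Returns true if the two fully qualified URIs use different schemes. */
    static boolean schemesDiffer(URI source, URI target) {
        String s = source.getScheme();
        String t = target.getScheme();
        if (s == null || t == null) {
            throw new IllegalArgumentException("both URIs must be fully qualified");
        }
        // URI schemes are case-insensitive.
        return !s.equalsIgnoreCase(t);
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://namenode:8020/data");       // hypothetical host
        URI webhdfs = URI.create("webhdfs://namenode:9870/data"); // hypothetical host
        URI s3a = URI.create("s3a://example-bucket/data");        // hypothetical bucket

        // Both pairs look "different" to a scheme-only test.
        System.out.println(schemesDiffer(hdfs, s3a));     // true
        System.out.println(schemesDiffer(hdfs, webhdfs)); // true
    }
}
```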







[jira] [Commented] (HADOOP-16756) Inconsistent Behavior on distcp -update over S3

2019-12-13 Thread Srinivasu Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995514#comment-16995514
 ] 

Srinivasu Majeti commented on HADOOP-16756:
---

Hi [~ste...@apache.org], [~daisuke.kobayashi], do we need another option, such 
as -skipblocklengthcheck, for copies from on-prem to the cloud? Or should the 
check always be skipped if the target file system is different from the source?







[jira] [Commented] (HADOOP-16756) Inconsistent Behavior on distcp -update over S3

2019-12-10 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992614#comment-16992614
 ] 

Steve Loughran commented on HADOOP-16756:
-

Ignoring the little detail that you shouldn't put your secrets on the command 
line:

* try with a hadoop 3.2 release
* and use -direct to avoid the renames.

It will always copy files of different length. For files of the same length, 
because we can't compare checksums, we assume that same length == unchanged. At 
least AFAIK.
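
The rule described above can be sketched as follows. This is a minimal illustration in plain Java, not the distcp implementation: the class and method names are hypothetical, and modelling an unavailable checksum as an empty Optional is an assumption made for the example.

```java
import java.util.Optional;

public class LengthOnlyCheckSketch {

    /**
     * Returns true if the file should be copied again.
     * Different lengths always trigger a copy. With equal lengths,
     * checksums decide when both are available; otherwise the file
     * is assumed to be unchanged.
     */
    static boolean needsCopy(long sourceLength, long targetLength,
                             Optional<String> sourceChecksum,
                             Optional<String> targetChecksum) {
        if (sourceLength != targetLength) {
            return true;
        }
        if (sourceChecksum.isPresent() && targetChecksum.isPresent()) {
            return !sourceChecksum.get().equals(targetChecksum.get());
        }
        // No comparable checksums: same length is treated as unchanged.
        return false;
    }

    public static void main(String[] args) {
        Optional<String> none = Optional.empty();
        System.out.println(needsCopy(100L, 200L, none, none)); // true: lengths differ
        System.out.println(needsCopy(100L, 100L, none, none)); // false: assumed unchanged
    }
}
```

The trade-off in this model is that a file rewritten in place with the same length would not be detected as changed.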









[jira] [Commented] (HADOOP-16756) Inconsistent Behavior on distcp -update over S3

2019-12-10 Thread Daisuke Kobayashi (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992581#comment-16992581
 ] 

Daisuke Kobayashi commented on HADOOP-16756:


[~ste...@apache.org], hmm, really? Here's my command, which is pretty simple:
 
{noformat}
hadoop distcp -Dfs.s3a.access.key=xxx -Dfs.s3a.secret.key=xxx -update -skipcrccheck /user/root/tmp/ s3a:///tmp/
{noformat}

 







[jira] [Commented] (HADOOP-16756) Inconsistent Behavior on distcp -update over S3

2019-12-10 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992550#comment-16992550
 ] 

Steve Loughran commented on HADOOP-16756:
-

That should only happen if you are trying to preserve the status. What's your 
full command line?



