[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-11 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318836#comment-17318836
 ] 

Viraj Jasani commented on HADOOP-17611:
---

Although it seems interesting that we could retain the modificationTime of the target 
file by updating it after the FileSystem.concat() operation, I am not sure 
whether HDFS really provides (or should provide) an API to update a file's 
modificationTime at the INode level.

[~ayushtkn] [~liuml07] [~tasanuma] [~weichiu] thoughts?

> Distcp parallel file copy breaks the modification time
> --
>
> Key: HADOOP-17611
> URL: https://issues.apache.org/jira/browse/HADOOP-17611
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Adam Maroti
>Priority: Major
>
> The commit HADOOP-11794 ("Enable distcp to copy blocks in parallel", 
> bf3fb585aaf2b179836e139c041fc87920a3c886) broke the modification time of 
> large files.
>  
> In CopyCommitter.java, inside concatFileChunks, FileSystem.concat is called, 
> which changes the modification time; therefore the modification times of files 
> copied by distcp will not match the source files. However, this only occurs 
> for files that are large enough to be split up by distcp during the copy.
> In concatFileChunks, before calling concat, extract the modification time and 
> apply it to the concatenated result file after the concat (probably best 
> before the rename()).




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-11 Thread Adam Maroti (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318843#comment-17318843
 ] 

Adam Maroti commented on HADOOP-17611:
--

It already has an API for that: FileSystem.setTimes(Path, long, long).
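
For illustration only, a minimal sketch (hypothetical helper, not the actual DistCp 
code) of how that API could be used to put a file's times back after a concat:
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical helper: concat chunks into the target, then restore its times. */
public final class ConcatTimesHelper {
  static void concatPreservingTimes(FileSystem fs, Path target, Path[] chunks)
      throws IOException {
    // Capture the times the target had before concat bumps them.
    FileStatus before = fs.getFileStatus(target);
    fs.concat(target, chunks);
    // FileSystem.setTimes(Path, long mtime, long atime) restores both.
    fs.setTimes(target, before.getModificationTime(), before.getAccessTime());
  }
}
{code}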




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-11 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318848#comment-17318848
 ] 

Viraj Jasani commented on HADOOP-17611:
---

Ah, my bad. Thanks, it updates both access time and modification time.




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-11 Thread Adam Maroti (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318851#comment-17318851
 ] 

Adam Maroti commented on HADOOP-17611:
--

Yes, and it is also possible to use it to change just one or the other (or, 
obviously, both the access time and the modification time simultaneously).




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Adam Maroti (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319129#comment-17319129
 ] 

Adam Maroti commented on HADOOP-17611:
--

[~vjasani] Concat creates a new file, right? Does that change the directory's 
modification time? Is that supposed to be preserved by distcp?




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319191#comment-17319191
 ] 

Viraj Jasani commented on HADOOP-17611:
---

[~amaroti] Yes, concat() does seem to update the mtime of both the destination file and 
its parent dir. concat() appends all the source blocks to the target file's 
blocks. So although internally it creates a new INode that copies all blocks 
from the target file plus the array of source blocks, the high-level 
responsibility of the operation is only to append all src blocks to the target file.

 
{code:java}
/**
 * Concat all the blocks from srcs to trg and delete the srcs files
 * @param fsd FSDirectory
 */
static void unprotectedConcat(FSDirectory fsd, INodesInPath targetIIP,
INodeFile[] srcList, long timestamp) throws IOException {
  assert fsd.hasWriteLock();
  NameNode.stateChangeLog.debug("DIR* NameSystem.concat to {}",
  targetIIP.getPath());

  final INodeFile trgInode = targetIIP.getLastINode().asFile();
  QuotaCounts deltas = computeQuotaDeltas(fsd, trgInode, srcList);
  verifyQuota(fsd, targetIIP, deltas);

  // the target file can be included in a snapshot
  trgInode.recordModification(targetIIP.getLatestSnapshotId());
  INodeDirectory trgParent = targetIIP.getINode(-2).asDirectory();
  trgInode.concatBlocks(srcList, fsd.getBlockManager());

  // since we are in the same dir - we can use same parent to remove files
  int count = 0;
  for (INodeFile nodeToRemove : srcList) {
if(nodeToRemove != null) {
  nodeToRemove.clearBlocks();
  // Ensure the nodeToRemove is cleared from snapshot diff list
  nodeToRemove.getParent().removeChild(nodeToRemove,
  targetIIP.getLatestSnapshotId());
  fsd.getINodeMap().remove(nodeToRemove);
  count++;
}
  }

  trgInode.setModificationTime(timestamp, targetIIP.getLatestSnapshotId());
  trgParent.updateModificationTime(timestamp, targetIIP.getLatestSnapshotId());
  // update quota on the parent directory with deltas
  FSDirectory.unprotectedUpdateCount(targetIIP, targetIIP.length() - 1, deltas);
}

{code}
In the above code, the mtime of the target file's INode, as well as that of 
its parent dir, is updated here:
{code:java}
trgInode.setModificationTime(timestamp, targetIIP.getLatestSnapshotId()); 
trgParent.updateModificationTime(timestamp, targetIIP.getLatestSnapshotId()); 
{code}
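The same behaviour can be observed from the client side. A small standalone 
check (illustrative only: the paths are made up, and it assumes a running HDFS 
where the files satisfy concat's preconditions, e.g. same directory and block size):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatMtimeCheck {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at an HDFS cluster and these paths exist.
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/tmp/concat-demo/part-0");
    Path[] chunks = { new Path("/tmp/concat-demo/part-1") };
    Path parent = target.getParent();

    long fileMtimeBefore = fs.getFileStatus(target).getModificationTime();
    long dirMtimeBefore = fs.getFileStatus(parent).getModificationTime();

    fs.concat(target, chunks);

    // Per FSDirConcatOp.unprotectedConcat above, both are expected to change.
    System.out.println("target mtime changed: "
        + (fs.getFileStatus(target).getModificationTime() != fileMtimeBefore));
    System.out.println("parent mtime changed: "
        + (fs.getFileStatus(parent).getModificationTime() != dirMtimeBefore));
  }
}
{code}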
 
{quote}Is that supposed to be preserved by distcp?
{quote}
Good question, perhaps [~ayushtkn] [~weichiu] can help with this.




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319355#comment-17319355
 ] 

Ayush Saxena commented on HADOOP-17611:
---

{quote}Is that supposed to be preserved by distcp?
{quote}
If I remember correctly, there is an option in distcp as part of preserve; I guess 
it is {{TIMES}} (check {{DistCpOptions.java}} and the FileAttribute there). If 
that is specified, it does a setTimes as part of DistCpUtils#preserve. Which 
directories/files it covers will depend on the scope of the copy, i.e. what is in 
the sequence file generated for copying: if the parent is in scope it will be 
preserved, else it won't, AFAIK.

Give that code a check; it should clarify your doubts. I only had a quick look at 
the PR, but I guess you are preserving irrespective of whether the {{TIMES}} option 
is provided; double-check whether that is so, I didn't go through it in detail.

Let me know if you face any issues understanding the flow or need some help; I 
will also try to explore this and help. :)




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319400#comment-17319400
 ] 

Viraj Jasani commented on HADOOP-17611:
---

{quote}Guess you are preserving irrespective of whether the {{TIMES}} option is 
provided; double-check whether that is so, I didn't go through it in detail.
{quote}
Yes, this change is specifically for distcp with the parallel block copy option 
only. For a very big file we copy multiple blocks in parallel, and in concat() the 
blocks are appended to the target file, so the target file's mtime changes due to 
the concat; that is what we are trying to retain as part of this change. Hence, 
this still won't retain mtime and atime exactly the same as the source files.
{quote}If I remember correctly, there is an option in distcp as part of preserve; I 
guess it is {{TIMES}} (check {{DistCpOptions.java}} and the FileAttribute there).
{quote}
Yeah, just explored it. This option does preserve the source files' mtime and atime 
attributes.

DistCpUtils#preserve() has this condition to retain times:
{code:java}
if (attributes.contains(FileAttribute.TIMES)) {
  targetFS.setTimes(path, 
  srcFileStatus.getModificationTime(), 
  srcFileStatus.getAccessTime());
}

{code}




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319410#comment-17319410
 ] 

Ayush Saxena commented on HADOOP-17611:
---

bq. Hence, this still won't retain mtime and atime exactly the same as the source files.

Wasn't this exactly the problem?
From the description:
bq. FileSystem.concat is called, which changes the modification time; therefore 
the modification times of files copied by distcp will not match the source files.




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319417#comment-17319417
 ] 

Viraj Jasani commented on HADOOP-17611:
---

Oh yes, as per the description that seems right; I was mostly looking at it from the 
viewpoint of the mtime diff between the actual target file and the mtime updated by 
the concat operation.




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Adam Maroti (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319541#comment-17319541
 ] 

Adam Maroti commented on HADOOP-17611:
--

[~vjasani] This is my take on this: 
[https://github.com/apache/hadoop/pull/2897]. It also restores the parent 
directory's modification time and access time.
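
Not the PR itself, but roughly the shape of that idea (hypothetical helper, 
assuming a FileSystem that supports concat and setTimes): capture the parent 
directory's FileStatus along with the target's before the concat, and restore both 
afterwards.
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch: restore both the file's and its parent dir's times. */
public final class ConcatDirTimesHelper {
  static void concatPreservingFileAndDirTimes(FileSystem fs, Path target,
      Path[] chunks) throws IOException {
    Path parent = target.getParent();
    FileStatus fileBefore = fs.getFileStatus(target);
    FileStatus dirBefore = fs.getFileStatus(parent);

    fs.concat(target, chunks);

    // Put back the times concat just overwrote, on the file and on its parent.
    fs.setTimes(target, fileBefore.getModificationTime(),
        fileBefore.getAccessTime());
    fs.setTimes(parent, dirBefore.getModificationTime(),
        dirBefore.getAccessTime());
  }
}
{code}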




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319605#comment-17319605
 ] 

Viraj Jasani commented on HADOOP-17611:
---

Thanks [~amaroti]. Have you also tested with the TIMES option in DistCp? It seems 
to already retain the mtime of the target file.




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-12 Thread Adam Maroti (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319615#comment-17319615
 ] 

Adam Maroti commented on HADOOP-17611:
--

When TIMES is set, the preserve() function is called from the copy mapper
after the file/file chunk creation. The CopyCommitter, which runs after that
and does the concat, doesn't call preserve because it no longer has the
source file statuses. So the concat happens inside the CopyCommitter, which
runs after the copy mapper, causing the concat to run after the preserve.




[jira] [Commented] (HADOOP-17611) Distcp parallel file copy breaks the modification time

2021-04-13 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320445#comment-17320445
 ] 

Ayush Saxena commented on HADOOP-17611:
---

It seems there are two PRs, more or less doing the same thing I guess; I just had 
a glance at the second one.

So, whoever plans to chase this, a couple of points to keep in mind:
 * We need a test in {{AbstractContractDistCpTest}} which all the 
{{FileSystems}} can also use.
 * It should cover *two scenarios*: first, when preserve time is specified it 
should preserve the time, and when it is not specified it shouldn't, in the case 
of parallel copy. The latter case is working OK as of now; we want to make sure we 
don't change that behaviour.
 * The parent modification time is to be preserved only when the parent is in the 
scope of the copy, not always. Say you are copying /dir/file1 to /dir1/file2 using 
parallel copy; then we don't touch /dir1, AFAIK.

The above are the basic requirements. The stuff below we should do if possible:
 * For parent directories, preserve only once: say you have 10K files under 
that parent, we shouldn't do that setTimes 10K times (see the sketch after this list).
 * And if parallel copy is enabled, there is no point in preserving before the 
concat operation; we can save that call.
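
One possible shape for the "preserve each parent only once" point (illustrative 
only; the Set-based bookkeeping and the helper name are assumptions, not existing 
DistCp code):
{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical bookkeeping: call setTimes on each parent directory only once. */
class ParentTimesPreserver {
  private final Set<Path> preservedParents = new HashSet<>();

  void preserveParentOnce(FileSystem targetFS, Path targetFile,
      FileStatus sourceParentStatus) throws IOException {
    Path parent = targetFile.getParent();
    // Skip parents already handled: 10K files under one directory then cost
    // a single setTimes call instead of 10K.
    if (parent != null && preservedParents.add(parent)) {
      targetFS.setTimes(parent, sourceParentStatus.getModificationTime(),
          sourceParentStatus.getAccessTime());
    }
  }
}
{code}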

 

This isn't a one-liner and throws some challenges, so please decide who wants 
to chase this and work together on one PR only.

 
