[ https://issues.apache.org/jira/browse/MAPREDUCE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Romianowski updated MAPREDUCE-1305:
-----------------------------------------

    Attachment: MAPREDUCE-1305.patch

We do not even need the absolute path serialized. Using NullWritable now.

Patch is against trunk, rev 891812

> Massive performance problem with DistCp and -delete
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-1305
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1305
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 0.20.1
>            Reporter: Peter Romianowski
>            Assignee: Peter Romianowski
>         Attachments: MAPREDUCE-1305.patch
>
>
> *First problem*
> In org.apache.hadoop.tools.DistCp#deleteNonexisting we serialize FileStatus
> objects when the path is all we need.
> The performance problem comes from
> org.apache.hadoop.fs.RawLocalFileSystem.RawLocalFileStatus#write, which tries
> to retrieve file permissions by issuing "ls -ld <path>", and that is
> painfully slow.
> Changed that to serialize just the Path and not the FileStatus.
> *Second problem*
> To delete the files we invoke the "hadoop" command line tool with option
> "-rmr <path>", again once for each file.
> Changed that to dstfs.delete(path, true)
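The idea behind the second fix can be sketched in isolation: delete a tree recursively inside the running JVM instead of launching an external command for every path. The code below is not the attached patch. It is a minimal analogue that assumes only the JDK's java.nio.file API, so that it runs without Hadoop on the classpath. The class name InProcessDelete and both helper methods are hypothetical. In DistCp itself the corresponding call is the dstfs.delete(path, true) mentioned in the description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical illustration class; not part of Hadoop or of the patch.
public class InProcessDelete {

    /**
     * Recursively deletes root within the current JVM (no child process).
     * Returns the number of files and directories removed, 0 if root is absent.
     */
    public static int deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return 0;
        }
        final int[] removed = {0};
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    Files.delete(file);
                    removed[0]++;
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                        throws IOException {
                    if (exc != null) {
                        throw exc;
                    }
                    // A directory can only be removed after its children are gone.
                    Files.delete(dir);
                    removed[0]++;
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return removed[0];
    }

    /** Builds a small throwaway tree: 3 directories (including the root) and 2 files. */
    public static Path makeSampleTree() {
        try {
            Path root = Files.createTempDirectory("distcp-delete-sketch");
            Files.createDirectories(root.resolve("a").resolve("b"));
            Files.createFile(root.resolve("a").resolve("b").resolve("part-00000"));
            Files.createFile(root.resolve("a").resolve("part-00001"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = makeSampleTree();
        int removed = deleteRecursively(root);
        System.out.println("removed " + removed + " entries, root still exists: "
                + Files.exists(root));
    }
}
```

The whole tree goes away in a single call with no per-file process startup. That startup cost is what the original "-rmr" invocation paid once for every file.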