[jira] [Created] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
Zoran Dimitrijevic created HADOOP-13587:
---

Summary: distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
Key: HADOOP-13587
URL: https://issues.apache.org/jira/browse/HADOOP-13587
Project: Hadoop Common
Issue Type: Bug
Components: tools/distcp
Affects Versions: 3.0.0-alpha1
Reporter: Zoran Dimitrijevic
Priority: Minor

distcp.map.bandwidth.mb exists in the distcp-defaults.xml config file, but it is not honored even when it is set. The current code always overwrites it with either the default value (a Java constant) or the value of the -bandwidth command line option.

The expected behavior (at least how I would expect it) is to honor the value set in distcp-defaults.xml unless the user explicitly specifies the -bandwidth command line flag. If there is no value set in the .xml file or as a command line flag, then the constant from the Java code should be used.

Additionally, I would expect that we also try to get values from distcp-site.xml, similar to other Hadoop systems.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
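The precedence requested in HADOOP-13587 (command line flag first, then the configured value, then the Java constant) can be sketched as follows. This is only an illustration of the expected behavior, not DistCp's actual code: the class name `BandwidthResolver`, the `resolve` method, the use of a plain `Map` in place of a Hadoop `Configuration`, and the value of `DEFAULT_BANDWIDTH_MB` are all assumptions made for the example.

```java
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical sketch of the requested precedence; not DistCp's real implementation. */
final class BandwidthResolver {
    // Stand-in for the constant defined in the Java code; the value is illustrative.
    static final int DEFAULT_BANDWIDTH_MB = 100;
    static final String CONF_KEY = "distcp.map.bandwidth.mb";

    /**
     * @param conf         key/value pairs as loaded from the XML config files
     *                     (a plain Map standing in for a Hadoop Configuration)
     * @param cliBandwidth value of -bandwidth, or empty if the flag was not given
     */
    static int resolve(Map<String, String> conf, OptionalInt cliBandwidth) {
        // 1. An explicit -bandwidth flag always wins.
        if (cliBandwidth.isPresent()) {
            return cliBandwidth.getAsInt();
        }
        // 2. Otherwise honor the configured value instead of overwriting it.
        String configured = conf.get(CONF_KEY);
        if (configured != null) {
            return Integer.parseInt(configured.trim());
        }
        // 3. Fall back to the constant only when nothing else is set.
        return DEFAULT_BANDWIDTH_MB;
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(CONF_KEY, "20");
        System.out.println(resolve(conf, OptionalInt.empty()));
        System.out.println(resolve(conf, OptionalInt.of(5)));
        System.out.println(resolve(Map.of(), OptionalInt.empty()));
    }
}
```

The point of the sketch is that the configured value is only read, never overwritten, when the flag is absent.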
[jira] [Created] (HADOOP-11876) Refactor code to make it more readable, minor maybePrintStats bug
Zoran Dimitrijevic created HADOOP-11876:
---

Summary: Refactor code to make it more readable, minor maybePrintStats bug
Key: HADOOP-11876
URL: https://issues.apache.org/jira/browse/HADOOP-11876
Project: Hadoop Common
Issue Type: Bug
Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Trivial

This is related to the HADOOP-11827 patch from a few days ago. I've noticed a minor bug in the maybePrintStats logic: it is called only when a new directory is processed, but it prints only at every 100K objects, so effectively there is a very low probability it will ever print stats. The reason for this bug is that I was previously printing stats for every new directory, and later decided it's nicer to print stats for every large batch of new objects (files or directories) instead.

This is a minor issue - and since I'm refactoring this, I've also changed the minor retry logic to make the code nicer and more readable.
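A minimal sketch of the intended counting behavior: the counter is advanced once per object (file or directory), so every 100K boundary is actually observed. The class `ListingStats` and its method names are hypothetical and do not come from the patch; only the 100K interval is taken from the report.

```java
/** Hypothetical progress counter; names are illustrative, not from the DistCp patch. */
final class ListingStats {
    // "Every 100K objects", as described in the report.
    static final long PRINT_INTERVAL = 100_000;

    private long objects;
    private long printed;

    /**
     * Call once for every listed object, not only when a new directory starts.
     * Returns true when a stats line was printed for this object.
     */
    boolean recordObject() {
        objects++;
        if (objects % PRINT_INTERVAL == 0) {
            System.out.println("Listed " + objects + " objects so far");
            printed++;
            return true;
        }
        return false;
    }

    long printedCount() {
        return printed;
    }

    public static void main(String[] args) {
        ListingStats stats = new ListingStats();
        for (int i = 0; i < 250_000; i++) {
            stats.recordObject();
        }
        System.out.println("Stats lines printed: " + stats.printedCount());
    }
}
```

If the check ran only on directory boundaries, the modulo test would succeed only when a directory happened to start exactly at a multiple of 100K objects, which matches the low print probability described above.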
[jira] [Created] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
Zoran Dimitrijevic created HADOOP-11827:
---

Summary: Speed-up distcp buildListing() using threadpool
Key: HADOOP-11827
URL: https://issues.apache.org/jira/browse/HADOOP-11827
Project: Hadoop Common
Issue Type: Improvement
Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Minor

For very large source trees on S3, distcp takes a long time to build the file listing (client code, before starting mappers). For a dataset I used (1.5M files, 50K dirs) it was taking 65 minutes before my fix in HADOOP-11785 and 36 minutes after the fix.
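The idea named in the summary, listing directories concurrently with a thread pool, might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual patch: the class `ParallelListing` is hypothetical, an in-memory `Map` stands in for the remote file system (a path counts as a directory if it is a key), and `tree.get(dir)` stands in for a `listStatus` call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical level-by-level parallel listing; not the DistCp implementation. */
final class ParallelListing {

    /** tree maps each directory path to its children; tree.get stands in for a listStatus RPC. */
    static List<String> buildListing(Map<String, List<String>> tree, String root, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> result = new ArrayList<>();
        try {
            List<String> frontier = List.of(root);
            while (!frontier.isEmpty()) {
                // One task per directory at the current depth, listed concurrently.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String dir : frontier) {
                    tasks.add(() -> tree.get(dir));
                }
                List<String> next = new ArrayList<>();
                // invokeAll returns futures in task order, so the output order is deterministic.
                for (Future<List<String>> f : pool.invokeAll(tasks)) {
                    for (String child : f.get()) {
                        result.add(child);
                        if (tree.containsKey(child)) {
                            next.add(child); // directories are listed in the next round
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("listing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("listing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = Map.of(
                "/", List.of("/a", "/b", "/f1"),
                "/a", List.of("/a/f2"),
                "/b", List.of());
        System.out.println(buildListing(tree, "/", 4));
    }
}
```

The breadth-first rounds keep the example short; a real implementation would more likely feed a shared work queue so that threads are not idle while waiting for the slowest directory in a round.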
[jira] [Created] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
Zoran Dimitrijevic created HADOOP-11785:
---

Summary: Reduce number of listStatus operation in distcp buildListing()
Key: HADOOP-11785
URL: https://issues.apache.org/jira/browse/HADOOP-11785
Project: Hadoop Common
Issue Type: Improvement
Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Minor
Fix For: 3.0.0
Attachments: distcp-liststatus.patch

Distcp was taking a long time in copyListing.buildListing() for large source trees (I was using a source of 1.5M files in a tree of about 50K directories). For input on S3, buildListing was taking more than one hour. I've noticed a performance bug in the current code: it calls listStatus twice for each directory, which doubles the number of RPCs in some cases (if most directories do not contain 1000 files).
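As a rough illustration of the fix direction, the walk below lists each directory exactly once and reuses the returned children both for the output and for deciding what to descend into. For the tree described above, that would mean about 50K listing calls instead of about 100K. Everything here is a sketch under assumptions: `SingleListingWalk` is a hypothetical class, the in-memory `Map` stands in for the file system, and the `listCalls` counter exists only to make the number of calls visible.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Hypothetical traversal that lists every directory once; not the attached patch. */
final class SingleListingWalk {
    private final Map<String, List<String>> tree;
    int listCalls; // counts simulated listStatus RPCs

    SingleListingWalk(Map<String, List<String>> tree) {
        this.tree = tree;
    }

    /** Stand-in for a remote listStatus call on one directory. */
    private List<String> listStatus(String dir) {
        listCalls++;
        return tree.getOrDefault(dir, List.of());
    }

    List<String> walk(String root) {
        List<String> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            // The single listing of this directory; its result is reused below.
            for (String child : listStatus(dir)) {
                out.add(child);
                if (tree.containsKey(child)) {
                    pending.push(child); // child is a directory, list it later (once)
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = Map.of(
                "/", List.of("/a", "/b", "/f1"),
                "/a", List.of("/a/f2"),
                "/b", List.of());
        SingleListingWalk walk = new SingleListingWalk(tree);
        System.out.println(walk.walk("/") + " using " + walk.listCalls + " listStatus calls");
    }
}
```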