[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
[ https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-13587:
----------------------------------------
    Assignee: Zoran Dimitrijevic

> distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-13587
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13587
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>         Attachments: HADOOP-13587-01.patch, HADOOP-13587-02.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> distcp.map.bandwidth.mb exists in the distcp-defaults.xml config file, but it
> is not honored even when it is set. The current code always overwrites it with
> either the default value (Java constant) or the -bandwidth command line option.
> The expected behavior (at least as I would expect it) is to honor the value
> set in distcp-defaults.xml unless the user explicitly specifies the -bandwidth
> command line flag. If no value is set in the .xml file or as a command line
> flag, then the constant from the Java code should be used.
> Additionally, I would expect that we also try to get values from
> distcp-site.xml, similar to other Hadoop systems.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
[ https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-13587:
----------------------------------------
    Attachment: HADOOP-13587-02.patch
[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
[ https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-13587:
----------------------------------------
    Target Version/s: 3.0.0-alpha1
              Status: Patch Available  (was: Open)
[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
[ https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-13587:
----------------------------------------
    Attachment: HADOOP-13587-01.patch
[jira] [Created] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
Zoran Dimitrijevic created HADOOP-13587:
-------------------------------------------

             Summary: distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
                 Key: HADOOP-13587
                 URL: https://issues.apache.org/jira/browse/HADOOP-13587
             Project: Hadoop Common
          Issue Type: Bug
          Components: tools/distcp
    Affects Versions: 3.0.0-alpha1
            Reporter: Zoran Dimitrijevic
            Priority: Minor

distcp.map.bandwidth.mb exists in the distcp-defaults.xml config file, but it is
not honored even when it is set. The current code always overwrites it with
either the default value (Java constant) or the -bandwidth command line option.

The expected behavior (at least as I would expect it) is to honor the value set
in distcp-defaults.xml unless the user explicitly specifies the -bandwidth
command line flag. If no value is set in the .xml file or as a command line
flag, then the constant from the Java code should be used.

Additionally, I would expect that we also try to get values from
distcp-site.xml, similar to other Hadoop systems.
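The precedence requested in this report (command line flag first, then the XML configuration, then the Java constant) could be sketched roughly as below. This is only an illustration and not the DistCp code or the attached patch: the class name, the use of a plain map in place of the loaded configuration, and the default of 100 are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the requested precedence; not the actual DistCp code. */
final class BandwidthResolverSketch {
  // Property name taken from the issue text.
  static final String CONF_KEY = "distcp.map.bandwidth.mb";
  // Assumed placeholder for "the constant from the Java code".
  static final int ASSUMED_DEFAULT_MB = 100;

  /**
   * @param cliBandwidth value of -bandwidth, or null when the flag was not given
   * @param conf         properties as loaded from the XML files (a plain map stands in here)
   */
  static int resolve(Integer cliBandwidth, Map<String, String> conf) {
    if (cliBandwidth != null) {
      return cliBandwidth;            // 1. explicit command line flag wins
    }
    String fromConf = conf.get(CONF_KEY);
    if (fromConf != null) {
      return Integer.parseInt(fromConf.trim()); // 2. value from the XML configuration
    }
    return ASSUMED_DEFAULT_MB;        // 3. fall back to the built-in constant
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(resolve(null, conf)); // nothing set anywhere
    conf.put(CONF_KEY, "50");
    System.out.println(resolve(null, conf)); // only the config value is set
    System.out.println(resolve(20, conf));   // flag given as well
  }
}
```

Using null for "flag not given" keeps the absent case distinct from any numeric value; a numeric sentinel would work too as long as it can never be a legal bandwidth.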
[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996139#comment-14996139 ]

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

I did this a long time ago... I have no preference between - and --. I think I followed the way all the other command line arguments in distcp were done, so if a fix is needed, it would probably apply to all options?

> Speed-up distcp buildListing() using threadpool
> -----------------------------------------------
>
>                 Key: HADOOP-11827
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11827
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.7.0, 2.7.1
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>             Fix For: 2.8.0
>
>         Attachments: HADOOP-11827-02.patch, HADOOP-11827-03.patch,
> HADOOP-11827-04.patch, HADOOP-11827.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For very large source trees on s3, distcp takes a long time to build the file
> listing (client code, before starting mappers). For a dataset I used (1.5M
> files, 50K dirs) it was taking 65 minutes before my fix in HADOOP-11785 and
> 36 minutes after the fix.
[jira] [Commented] (HADOOP-1540) distcp should support an exclude list
[ https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538520#comment-14538520 ]

Zoran Dimitrijevic commented on HADOOP-1540:
--------------------------------------------

LGTM++

> distcp should support an exclude list
> -------------------------------------
>
>                 Key: HADOOP-1540
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1540
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 2.6.0
>            Reporter: Senthil Subramanian
>            Assignee: Rich Haase
>            Priority: Minor
>              Labels: BB2015-05-TBR, patch
>         Attachments: HADOOP-1540.008.patch
>
>
> There should be a way to ignore specific paths (eg: those that have already
> been copied over under the current srcPath).
[jira] [Commented] (HADOOP-1540) distcp should support an exclude list
[ https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535968#comment-14535968 ]

Zoran Dimitrijevic commented on HADOOP-1540:
--------------------------------------------

Please fix: "The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix."

The RegexCopyFilter constructor currently reads from a file, which is not ideal. It would be nicer to have an init method and keep the constructor only reading the filename from the config. But, again, this might not be Hadoop style, and reading a file in the constructor might be ok.

Other than that, LGTM.
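The "init method" suggestion in this comment could look roughly like the sketch below: the constructor only remembers the file name, and a separate initialize() does the I/O and compiles the patterns. The class and method names are made up for the illustration and are not taken from the patch; the file format (one regex per line) is also an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical filter that defers file reading to an explicit initialize() call. */
final class LazyRegexFilterSketch {
  private final String filtersFile;
  private final List<Pattern> patterns = new ArrayList<>();

  /** The constructor only stores the file name; it performs no I/O. */
  LazyRegexFilterSketch(String filtersFile) {
    this.filtersFile = filtersFile;
  }

  /** Reads the file (assumed: one regex per line) and compiles every non-empty line. */
  void initialize() {
    patterns.clear();
    try {
      for (String line : Files.readAllLines(Paths.get(filtersFile), StandardCharsets.UTF_8)) {
        String trimmed = line.trim();
        if (!trimmed.isEmpty()) {
          patterns.add(Pattern.compile(trimmed));
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException("Cannot read filters file " + filtersFile, e);
    }
  }

  /** A path is copied unless it fully matches at least one of the loaded patterns. */
  boolean shouldCopy(String path) {
    for (Pattern p : patterns) {
      if (p.matcher(path).matches()) {
        return false;
      }
    }
    return true;
  }
}
```

One trade-off of this shape is that the object is usable but excludes nothing until initialize() has been called, so callers have to remember to call it.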
[jira] [Commented] (HADOOP-1540) distcp should support an exclude list
[ https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535403#comment-14535403 ]

Zoran Dimitrijevic commented on HADOOP-1540:
--------------------------------------------

#5: we were experiencing performance issues for a large number of files only because of RPCs to either the namenode or to s3. Filtering each file name locally using a small number of compiled regex or glob rules should not be a big deal, especially since it's optional. For example, sorting a big file list, which we do now, is much more expensive.

Thank you for your patch!
[jira] [Commented] (HADOOP-1540) distcp should support an exclude list
[ https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535345#comment-14535345 ]

Zoran Dimitrijevic commented on HADOOP-1540:
--------------------------------------------

And one minor comment related to hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestSimpleCopyFilter.java

Adding a test for the case with multiple rules, making sure that all rules are applied when filtering, seems like a good idea here. Two simple additional tests would be sufficient.
[jira] [Commented] (HADOOP-1540) distcp should support an exclude list
[ https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535309#comment-14535309 ]

Zoran Dimitrijevic commented on HADOOP-1540:
--------------------------------------------

1. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java:564
Minor: extra space in the comment.

2. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
Refactoring of the parsing logic should have been a separate patch. This will be harder to cherry-pick to older branches. But since this is a good refactoring change, and I am new to the Hadoop community, it's fine with me.

3. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java:329
Minor: space missing between - and 1 (-1 => - 1)

4. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/package-info.java
Is this really part of this patch? Again, I am new to the Hadoop community, so if it's ok to combine logically different changes, it's definitely good.

5. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyFilter.java
It would be more useful if it matched glob expressions. Matching substrings is a very unusual way to filter a file list, and many users will be puzzled about what to do. I would suggest that we extend this right now instead of submitting this patch as is. For example, *tmp would match filenames ending with tmp, and not any file that happens to contain tmp in it. Or, in the unittest, a "test" filter matching /user/testing is not what I would expect.

Otherwise, looks good to me.
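To make the difference raised in point 5 concrete, here is a small sketch contrasting substring matching with glob matching on the file name. It uses the JDK's PathMatcher glob support purely for illustration; the class and method names are hypothetical, and this is not how the patch or Hadoop implements its filters.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Hypothetical comparison of substring rules and glob rules for an exclude list. */
final class ExcludeRuleSketch {

  /** Substring semantics: the rule matches if it occurs anywhere in the path. */
  static boolean substringMatch(String rule, String path) {
    return path.contains(rule);
  }

  /** Glob semantics, applied to the last path component only. */
  static boolean globMatch(String glob, String path) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    Path name = Paths.get(path).getFileName();
    return name != null && matcher.matches(name);
  }

  public static void main(String[] args) {
    // The "test" rule from the unit test mentioned in the review.
    System.out.println(substringMatch("test", "/user/testing")); // substring rule hits
    System.out.println(globMatch("test", "/user/testing"));      // glob rule does not
    // "*tmp" only matches names that end with tmp.
    System.out.println(globMatch("*tmp", "/data/part-0.tmp"));
    System.out.println(globMatch("*tmp", "/data/tmpfile"));
  }
}
```

Matching on the file name rather than the whole path is a choice made for this example; a real filter would have to decide whether rules apply to names or to full paths.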
[jira] [Updated] (HADOOP-11876) Refactor code to make it more readable, minor maybePrintStats bug
[ https://issues.apache.org/jira/browse/HADOOP-11876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11876:
----------------------------------------
    Attachment: HADOOP-11876.patch

> Refactor code to make it more readable, minor maybePrintStats bug
> -----------------------------------------------------------------
>
>                 Key: HADOOP-11876
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11876
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Trivial
>         Attachments: HADOOP-11876.patch
>
>
> This is related to the HADOOP-11827 patch from a few days ago. I've noticed a
> minor bug in the maybePrintStats logic, which is called only when a new
> directory is processed, and prints every 100K objects (effectively, there is
> a very low probability it'll ever print stats). The reason for this bug is
> that I was previously printing stats for every new directory, and later
> decided it's nicer to print stats for a large number of new "objects" (files
> or directories) instead.
> This is a minor issue, and since I'm refactoring this I've also changed the
> minor retry logic to make the code nicer and more readable.
[jira] [Created] (HADOOP-11876) Refactor code to make it more readable, minor maybePrintStats bug
Zoran Dimitrijevic created HADOOP-11876:
-------------------------------------------

             Summary: Refactor code to make it more readable, minor maybePrintStats bug
                 Key: HADOOP-11876
                 URL: https://issues.apache.org/jira/browse/HADOOP-11876
             Project: Hadoop Common
          Issue Type: Bug
          Components: tools/distcp
    Affects Versions: 3.0.0
            Reporter: Zoran Dimitrijevic
            Assignee: Zoran Dimitrijevic
            Priority: Trivial

This is related to the HADOOP-11827 patch from a few days ago. I've noticed a minor bug in the maybePrintStats logic, which is called only when a new directory is processed, and prints every 100K objects (effectively, there is a very low probability it'll ever print stats). The reason for this bug is that I was previously printing stats for every new directory, and later decided it's nicer to print stats for a large number of new "objects" (files or directories) instead.

This is a minor issue, and since I'm refactoring this I've also changed the minor retry logic to make the code nicer and more readable.
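One way to read the bug described above: if the check only runs when a directory is processed and only fires when the running object count is an exact multiple of 100K, it will almost never fire. The sketch below is an assumed reconstruction of that situation together with one possible threshold-based fix. It is not the code from the patch; all names are hypothetical.

```java
/** Hypothetical model of the stats-printing check; not the code from the patch. */
final class StatsPrinterSketch {
  static final long PRINT_INTERVAL = 100_000;

  /**
   * Assumed buggy shape: fires only if the total happens to be an exact
   * multiple of the interval at the moment a directory is processed.
   */
  static boolean buggyShouldPrint(long totalObjects) {
    return totalObjects % PRINT_INTERVAL == 0;
  }

  private long nextThreshold = PRINT_INTERVAL;
  private int printed = 0;

  /** Possible fix: print whenever the total has crossed the next threshold. */
  void maybePrintStats(long totalObjects) {
    if (totalObjects >= nextThreshold) {
      printed++;
      System.out.println("Processed " + totalObjects + " objects so far");
      // Advance to the first interval boundary above the current total.
      nextThreshold = (totalObjects / PRINT_INTERVAL + 1) * PRINT_INTERVAL;
    }
  }

  int timesPrinted() {
    return printed;
  }

  public static void main(String[] args) {
    StatsPrinterSketch fixed = new StatsPrinterSketch();
    int buggyPrints = 0;
    // Simulate directories of 30 files each: the check runs once per directory,
    // so the total it sees grows in steps of 31 objects.
    for (long total = 31; total <= 250_000; total += 31) {
      if (buggyShouldPrint(total)) {
        buggyPrints++;
      }
      fixed.maybePrintStats(total);
    }
    System.out.println("buggy prints: " + buggyPrints + ", fixed prints: " + fixed.timesPrinted());
  }
}
```

Comparing against a moving threshold makes the check independent of how often it is called, which is the property the modulo test lacks.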
[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505741#comment-14505741 ]

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

LGTM++
[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504391#comment-14504391 ]

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

Sorry Ravi. Thanks for the comments Ravi!
[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504359#comment-14504359 ]

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

Thanks for the comments Allen. I've addressed most of them:

1, 2, 3: done.
4: in order to prefer flags over properties, I needed a value that tells whether the flag was set or not. 0 seemed easier than yet another bool.
5: I added it so that I can have minimal changes in the unittest (rerun tests for various numbers of threads using org.junit.runners.Parameterized).
6: done.
7: agree. I wanted to keep the multithreaded logic outside of SimpleCopyListing.java, but if you think it's overkill, I can refactor. It'll be uglier though, and if we need this again, we won't have the wrapper.
8: considering how much code is invoked for each of these simple MaybePrintStats calls, I don't think it's worth doing. But I don't have strong opinions; I just think we should print some progress, since this stage can take on the order of tens of minutes.
9: removed.
10: in the current code we use the same file system instance, so I don't think it's a problem. I use one per thread since we have a small number of threads and these run listStatus in parallel.
11: changed the docs. Both are blocking, but one can be interrupted by exceptions, and then the user must handle it. Please suggest better names and I'll refactor it. Or maybe just keep one.
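For readers unfamiliar with the approach discussed in this thread, the general idea of listing a directory tree with a small thread pool can be sketched as follows. This is a simplified stand-in that walks the local file system with java.io.File; it does not use the Hadoop FileSystem API and is not the wrapper class from the patch. All names are hypothetical, and error propagation and symlink loops are deliberately ignored.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: list a local directory tree with a fixed-size thread pool. */
final class ParallelListingSketch {

  /** Returns the sorted paths of everything below root (root itself excluded). */
  static List<String> listTree(File root, int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
    AtomicInteger pending = new AtomicInteger(1); // one outstanding task: the root
    CountDownLatch done = new CountDownLatch(1);
    try {
      submit(pool, root, results, pending, done);
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while listing " + root, e);
    } finally {
      pool.shutdown();
    }
    List<String> sorted = new ArrayList<>(results);
    Collections.sort(sorted);
    return sorted;
  }

  /** Schedules one listing task; each task schedules further tasks for its subdirectories. */
  private static void submit(ExecutorService pool, File dir, ConcurrentLinkedQueue<String> results,
                             AtomicInteger pending, CountDownLatch done) {
    pool.execute(() -> {
      try {
        File[] children = dir.listFiles(); // the per-directory call that runs in parallel
        if (children != null) {
          for (File child : children) {
            results.add(child.getPath());
            if (child.isDirectory()) {
              pending.incrementAndGet(); // count the child before this task finishes
              submit(pool, child, results, pending, done);
            }
          }
        }
      } finally {
        if (pending.decrementAndGet() == 0) {
          done.countDown(); // last outstanding directory has been listed
        }
      }
    });
  }

  public static void main(String[] args) {
    File root = new File(args.length > 0 ? args[0] : ".");
    System.out.println(listTree(root, 4).size() + " entries under " + root);
  }
}
```

The pending counter is what lets the caller know when the whole tree has been visited: a task registers its subdirectories before it decrements its own entry, so the count can only reach zero once no work is left.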
[jira] [Updated] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11827:
----------------------------------------
    Attachment: HADOOP-11827-03.patch
[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491205#comment-14491205 ]

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

Speedup for an s3n source tree of 1.5M files / 50K dirs:
    current distcp: 36 min
    2 threads:      17 min
    5 threads:      7 min
    10 threads:     3.5 min
    20 threads:     2.3 min

For the same source dataset on hdfs:
    current distcp: 206 sec
    1 thread:       204 sec
    2 threads:      257 sec (not yet sure why, will repeat the experiment)
    3 threads:      154 sec
    5 threads:      94 sec
    10 threads:     51 sec
    20 threads:     45 sec
[jira] [Commented] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
[ https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491188#comment-14491188 ]

Zoran Dimitrijevic commented on HADOOP-11785:
---------------------------------------------

Sorry, updated the wrong jira ticket. Please ignore the previous comment.

> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
>                 Key: HADOOP-11785
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>             Fix For: 2.8.0
>
>         Attachments: distcp-liststatus.patch, distcp-liststatus2.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source
> trees (I was using a source of 1.5M files in a tree of about 50K directories).
> For input at s3, buildListing was taking more than one hour. I've noticed a
> performance bug in the current code which does listStatus twice for each
> directory, which doubles the number of RPCs in some cases (if most directories
> do not contain >1000 files).
[jira] [Updated] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11827:
----------------------------------------
    Attachment: HADOOP-11827-02.patch

small change to handle all exceptions in worker.

+1 overall.
    +1 @author.  The patch does not contain any @author tags.
    +1 tests included.  The patch appears to include 4 new or modified test files.
    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc.  There were no new javadoc warning messages.
    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.
    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.
    +1 release audit.  The applied patch does not increase the total number of release audit warnings.
[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
[ https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11785:
----------------------------------------
    Attachment: HADOOP-11827-02.patch

Slight change to catch all exceptions in the worker, so that occasional unexpected non-IO exceptions are handled too.

+1 overall.
    +1 @author.  The patch does not contain any @author tags.
    +1 tests included.  The patch appears to include 4 new or modified test files.
    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc.  There were no new javadoc warning messages.
    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.
    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.
    +1 release audit.  The applied patch does not increase the total number of release audit warnings.
[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
[ https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11785:
----------------------------------------
    Attachment:     (was: HADOOP-11827-02.patch)
[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491170#comment-14491170 ]

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

Performance results and charts for the dataset I used (1.5M files and approx 50K dirs):
https://docs.google.com/spreadsheets/d/1qJfO9ZhPXuGCpHyfX1NLE0Zm_NB39gn-cELECShd_zk/edit#gid=0

Please note that there are two sheets (s3n -> hdfs and hdfs -> hdfs). The main improvement is when the source is in s3. The improvement when the source is hdfs is good as well, but since current distcp has to sort the input file, the total improvement is not as important.

TODO: We can sort only directories, which would further improve startup time.
[jira] [Updated] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
[ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11827:
----------------------------------------
    Attachment: HADOOP-11827.patch

Test-patch report from my laptop:

{color:green}+1 overall{color}.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

> Speed-up distcp buildListing() using threadpool
> -----------------------------------------------
>
>                Key: HADOOP-11827
>                URL: https://issues.apache.org/jira/browse/HADOOP-11827
>            Project: Hadoop Common
>         Issue Type: Improvement
>         Components: tools/distcp
>   Affects Versions: 3.0.0
>           Reporter: Zoran Dimitrijevic
>           Assignee: Zoran Dimitrijevic
>           Priority: Minor
>        Attachments: HADOOP-11827.patch
>
>  Original Estimate: 24h
> Remaining Estimate: 24h
>
> For very large source trees on s3, distcp takes a long time to build the file
> listing (client code, before starting mappers). For a dataset I used (1.5M
> files, 50K dirs), it was taking 65 minutes before my fix in HADOOP-11785 and
> 36 minutes after the fix.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
Zoran Dimitrijevic created HADOOP-11827:
-------------------------------------------

             Summary: Speed-up distcp buildListing() using threadpool
                 Key: HADOOP-11827
                 URL: https://issues.apache.org/jira/browse/HADOOP-11827
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
    Affects Versions: 3.0.0
            Reporter: Zoran Dimitrijevic
            Assignee: Zoran Dimitrijevic
            Priority: Minor

For very large source trees on s3, distcp takes a long time to build the file listing (client code, before starting mappers). For a dataset I used (1.5M files, 50K dirs), it was taking 65 minutes before my fix in HADOOP-11785 and 36 minutes after the fix.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
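For readers unfamiliar with the idea, here is a minimal sketch of listing a directory tree with a thread pool. It is not the code in HADOOP-11827.patch, which is not reproduced in this message. The class name ParallelListingSketch, the buildListing signature, and the in-memory Map standing in for remote listStatus calls are all assumptions made for illustration. Each level of the tree is listed concurrently, so slow per-directory calls overlap instead of running one after another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelListingSketch {

    /**
     * Lists every path under root. The map is a hypothetical stand-in for a
     * remote file system: keys are directories, values are their child paths.
     */
    static List<String> buildListing(Map<String, List<String>> tree,
                                     String root, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<String> listing = new ArrayList<>();
            List<String> frontier = List.of(root);
            while (!frontier.isEmpty()) {
                // One listing task per directory at the current depth.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String dir : frontier) {
                    tasks.add(() -> tree.get(dir)); // models one listStatus call
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until all tasks finish and returns the
                // futures in the same order as the submitted tasks.
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String child : result.get()) {
                        listing.add(child);
                        if (tree.containsKey(child)) {
                            next.add(child); // directory: list it next round
                        }
                    }
                }
                frontier = next;
            }
            return listing;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> tree = Map.of(
                "/", List.of("/a", "/b", "/f"),
                "/a", List.of("/a/x"),
                "/b", List.of());
        System.out.println(buildListing(tree, "/", 4));
    }
}
```

The level-by-level structure keeps the sketch short and its output order deterministic. A real implementation would more likely feed directories to workers as soon as they are discovered rather than waiting for a whole level to finish.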
[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
[ https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11785:
----------------------------------------
    Attachment: distcp-liststatus2.patch

Removed whitespace diffs. Reused FileSystem sourceFS in traverse as suggested.

> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
>                Key: HADOOP-11785
>                URL: https://issues.apache.org/jira/browse/HADOOP-11785
>            Project: Hadoop Common
>         Issue Type: Improvement
>         Components: tools/distcp
>   Affects Versions: 3.0.0
>           Reporter: Zoran Dimitrijevic
>           Assignee: Zoran Dimitrijevic
>           Priority: Minor
>        Attachments: distcp-liststatus.patch, distcp-liststatus2.patch
>
>  Original Estimate: 1h
> Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source
> trees (I was using a source of 1.5M files in a tree of about 50K directories).
> For input on s3, buildListing was taking more than one hour. I've noticed a
> performance bug in the current code, which calls listStatus twice for each
> directory; this doubles the number of RPCs in some cases (if most directories
> do not contain >1000 files).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
[ https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391205#comment-14391205 ]

Zoran Dimitrijevic commented on HADOOP-11785:
---------------------------------------------

I did not change any tests, since this is a performance-bug fix and all existing tests pass. Should I mark this as a bug fix instead of an improvement?

> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
>                Key: HADOOP-11785
>                URL: https://issues.apache.org/jira/browse/HADOOP-11785
>            Project: Hadoop Common
>         Issue Type: Improvement
>         Components: tools/distcp
>   Affects Versions: 3.0.0
>           Reporter: Zoran Dimitrijevic
>           Assignee: Zoran Dimitrijevic
>           Priority: Minor
>        Attachments: distcp-liststatus.patch
>
>  Original Estimate: 1h
> Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source
> trees (I was using a source of 1.5M files in a tree of about 50K directories).
> For input on s3, buildListing was taking more than one hour. I've noticed a
> performance bug in the current code, which calls listStatus twice for each
> directory; this doubles the number of RPCs in some cases (if most directories
> do not contain >1000 files).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
Zoran Dimitrijevic created HADOOP-11785:
-------------------------------------------

             Summary: Reduce number of listStatus operation in distcp buildListing()
                 Key: HADOOP-11785
                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
    Affects Versions: 3.0.0
            Reporter: Zoran Dimitrijevic
            Assignee: Zoran Dimitrijevic
            Priority: Minor
             Fix For: 3.0.0
         Attachments: distcp-liststatus.patch

Distcp was taking a long time in copyListing.buildListing() for large source trees (I was using a source of 1.5M files in a tree of about 50K directories). For input on s3, buildListing was taking more than one hour. I've noticed a performance bug in the current code, which calls listStatus twice for each directory; this doubles the number of RPCs in some cases (if most directories do not contain >1000 files).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
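The double-listing pattern described above can be illustrated with a small, self-contained sketch. This is not the distcp source or the attached patch. The CountingFs class, its in-memory map, and the listTwice/listOnce methods are hypothetical names invented here, and each listStatus call on CountingFs is assumed to model one remote RPC. The first traversal lists a subdirectory once to check whether it is empty and again when descending into it. The second keeps the first result and reuses it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class SingleListingSketch {

    /** Hypothetical in-memory stand-in for a remote file system. */
    static final class CountingFs {
        private final Map<String, List<String>> children; // directory -> child paths
        int listCalls = 0;

        CountingFs(Map<String, List<String>> children) {
            this.children = children;
        }

        boolean isDirectory(String path) {
            return children.containsKey(path);
        }

        List<String> listStatus(String dir) {
            listCalls++; // each call models one remote RPC
            return children.get(dir);
        }
    }

    /** Wasteful: lists a subdirectory to test emptiness, then again to descend. */
    static List<String> listTwice(CountingFs fs, String root) {
        List<String> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            for (String child : fs.listStatus(dir)) {
                out.add(child);
                if (fs.isDirectory(child) && !fs.listStatus(child).isEmpty()) {
                    pending.push(child); // will be listed a second time when popped
                }
            }
        }
        return out;
    }

    /** Lists each directory exactly once and reuses the returned children. */
    static List<String> listOnce(CountingFs fs, String root) {
        List<String> out = new ArrayList<>();
        Deque<List<String>> pending = new ArrayDeque<>();
        pending.push(fs.listStatus(root));
        while (!pending.isEmpty()) {
            for (String child : pending.pop()) {
                out.add(child);
                if (fs.isDirectory(child)) {
                    List<String> grandChildren = fs.listStatus(child);
                    if (!grandChildren.isEmpty()) {
                        pending.push(grandChildren); // keep the result, no re-list
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = Map.of(
                "/", List.of("/a", "/b", "/f"),
                "/a", List.of("/a/x"),
                "/b", List.of());
        CountingFs twice = new CountingFs(tree);
        CountingFs once = new CountingFs(tree);
        listTwice(twice, "/");
        listOnce(once, "/");
        System.out.println("listTwice calls: " + twice.listCalls
                + ", listOnce calls: " + once.listCalls);
    }
}
```

Both traversals return the same paths; only the number of listing calls differs, and the gap grows with the number of non-empty directories.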
[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
[ https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoran Dimitrijevic updated HADOOP-11785:
----------------------------------------
    Attachment: distcp-liststatus.patch

> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
>                Key: HADOOP-11785
>                URL: https://issues.apache.org/jira/browse/HADOOP-11785
>            Project: Hadoop Common
>         Issue Type: Improvement
>         Components: tools/distcp
>   Affects Versions: 3.0.0
>           Reporter: Zoran Dimitrijevic
>           Assignee: Zoran Dimitrijevic
>           Priority: Minor
>            Fix For: 3.0.0
>
>        Attachments: distcp-liststatus.patch
>
>  Original Estimate: 1h
> Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source
> trees (I was using a source of 1.5M files in a tree of about 50K directories).
> For input on s3, buildListing was taking more than one hour. I've noticed a
> performance bug in the current code, which calls listStatus twice for each
> directory; this doubles the number of RPCs in some cases (if most directories
> do not contain >1000 files).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)