[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set

2016-09-08 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-13587:

Assignee: Zoran Dimitrijevic

> distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-13587
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13587
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>         Attachments: HADOOP-13587-01.patch, HADOOP-13587-02.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> distcp.map.bandwidth.mb exists in the distcp-defaults.xml config file, but it
> is not honored even when it is set. The current code always overwrites it with
> either the default value (a Java constant) or the -bandwidth command-line
> option.
> The expected behavior (at least how I would expect it) is to honor the value
> set in distcp-defaults.xml unless the user explicitly specifies the -bandwidth
> command-line flag. If no value is set in the .xml file or as a command-line
> flag, then the constant from the Java code should be used.
> Additionally, I would expect that we also try to get values from
> distcp-site.xml, similar to other Hadoop systems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set

2016-09-08 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-13587:

Attachment: HADOOP-13587-02.patch




[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set

2016-09-07 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-13587:

Target Version/s: 3.0.0-alpha1
  Status: Patch Available  (was: Open)




[jira] [Updated] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set

2016-09-07 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-13587:

Attachment: HADOOP-13587-01.patch




[jira] [Created] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set

2016-09-07 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-13587:
---

             Summary: distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set
                 Key: HADOOP-13587
                 URL: https://issues.apache.org/jira/browse/HADOOP-13587
             Project: Hadoop Common
          Issue Type: Bug
          Components: tools/distcp
    Affects Versions: 3.0.0-alpha1
            Reporter: Zoran Dimitrijevic
            Priority: Minor


distcp.map.bandwidth.mb exists in the distcp-defaults.xml config file, but it
is not honored even when it is set. The current code always overwrites it with
either the default value (a Java constant) or the -bandwidth command-line
option.

The expected behavior (at least how I would expect it) is to honor the value
set in distcp-defaults.xml unless the user explicitly specifies the -bandwidth
command-line flag. If no value is set in the .xml file or as a command-line
flag, then the constant from the Java code should be used.

Additionally, I would expect that we also try to get values from
distcp-site.xml, similar to other Hadoop systems.
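
The precedence requested in this report (command-line flag first, then the
config file, then the Java constant) can be sketched as follows. This is a
minimal illustration, not the attached patch: the class BandwidthResolver, its
resolve method, and the fallback value of 100 are assumptions made for the
example, and the configuration is modeled as a plain Map instead of Hadoop's
own configuration classes. Only the key distcp.map.bandwidth.mb and the
-bandwidth flag come from the report.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: resolves the per-map bandwidth in the order the report asks for. */
class BandwidthResolver {
    /** Property name taken from the report. */
    static final String CONF_KEY = "distcp.map.bandwidth.mb";
    /** Hypothetical stand-in for the Java constant used as the last resort. */
    static final float DEFAULT_BANDWIDTH_MB = 100f;

    /**
     * @param conf      values loaded from the XML config files (modeled as a map)
     * @param flagValue value of -bandwidth, or null if the flag was not given
     */
    static float resolve(Map<String, String> conf, Float flagValue) {
        if (flagValue != null) {
            return flagValue;                            // 1. explicit flag wins
        }
        String configured = conf.get(CONF_KEY);
        if (configured != null) {
            return Float.parseFloat(configured.trim());  // 2. value from the config
        }
        return DEFAULT_BANDWIDTH_MB;                     // 3. constant as fallback
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CONF_KEY, "50");
        System.out.println(resolve(conf, null));   // config value is honored
        System.out.println(resolve(conf, 20f));    // flag overrides the config
        System.out.println(resolve(new HashMap<String, String>(), null)); // constant
    }
}
```

Using null for "flag not given" keeps the three cases distinct; a sentinel
value would work too, as long as it can never be a legitimate bandwidth.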






[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-11-08 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996139#comment-14996139
 ] 

Zoran Dimitrijevic commented on HADOOP-11827:
-

I did this a long time ago... I have no preference between - and --. I think I
did it the way all the other command-line arguments in distcp were done, so if
we need a fix, it will probably have to be for all options?

> Speed-up distcp buildListing() using threadpool
> -----------------------------------------------
>
>                 Key: HADOOP-11827
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11827
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.7.0, 2.7.1
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>             Fix For: 2.8.0
>
>         Attachments: HADOOP-11827-02.patch, HADOOP-11827-03.patch,
> HADOOP-11827-04.patch, HADOOP-11827.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For very large source trees on s3, distcp takes a long time to build the file
> listing (client code, before starting mappers). For a dataset I used (1.5M
> files, 50K dirs), it was taking 65 minutes before my fix in HADOOP-11785 and
> 36 minutes after the fix.





[jira] [Commented] (HADOOP-1540) distcp should support an exclude list

2015-05-11 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538520#comment-14538520
 ] 

Zoran Dimitrijevic commented on HADOOP-1540:


LGTM++

> distcp should support an exclude list
> -------------------------------------
>
>                 Key: HADOOP-1540
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1540
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 2.6.0
>            Reporter: Senthil Subramanian
>            Assignee: Rich Haase
>            Priority: Minor
>              Labels: BB2015-05-TBR, patch
>         Attachments: HADOOP-1540.008.patch
>
>
> There should be a way to ignore specific paths (e.g., those that have already
> been copied over under the current srcPath).





[jira] [Commented] (HADOOP-1540) distcp should support an exclude list

2015-05-08 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535968#comment-14535968
 ] 

Zoran Dimitrijevic commented on HADOOP-1540:


Please fix: "The patch has 3 line(s) that end in whitespace. Use git apply
--whitespace=fix."

The RegexCopyFilter constructor currently reads from a file, which is not
ideal. It would be nicer to have an init method and keep the constructor only
reading the filename from the config. But, again, this might not be Hadoop
style, and reading a file in the constructor might be OK.

Other than that, LGTM.
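
The constructor/init split suggested above could look roughly like this. It is
a sketch under assumptions, not the code from the patch: LazyRegexFilter,
initialize, shouldCopy, and writeTempRules are hypothetical names chosen for
the example, and the rules file is assumed to hold one regular expression per
line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: a filter whose constructor does no I/O. */
class LazyRegexFilter {
    private final Path rulesFile;
    private final List<Pattern> excludes = new ArrayList<>();

    /** Only remembers where the rules live; nothing is read here. */
    LazyRegexFilter(Path rulesFile) {
        this.rulesFile = rulesFile;
    }

    /** Reads one regex per line and compiles it once; blank lines are skipped. */
    void initialize() {
        excludes.clear();
        try {
            for (String line : Files.readAllLines(rulesFile)) {
                if (!line.trim().isEmpty()) {
                    excludes.add(Pattern.compile(line.trim()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + rulesFile, e);
        }
    }

    /** A path is copied unless some exclude pattern matches the whole path. */
    boolean shouldCopy(String path) {
        for (Pattern p : excludes) {
            if (p.matcher(path).matches()) {
                return false;
            }
        }
        return true;
    }

    /** Helper for the example only: writes rules to a temporary file. */
    static Path writeTempRules(List<String> rules) {
        try {
            Path file = Files.createTempFile("exclude-rules", ".txt");
            file.toFile().deleteOnExit();
            return Files.write(file, rules);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LazyRegexFilter filter = new LazyRegexFilter(
                writeTempRules(Arrays.asList(".*\\.tmp", ".*/_logs/.*")));
        filter.initialize();
        System.out.println(filter.shouldCopy("/data/part-00000"));
        System.out.println(filter.shouldCopy("/data/part-00000.tmp"));
    }
}
```

With this split, a missing or unreadable rules file surfaces in initialize()
rather than during construction, which can make the failure easier to test and
report.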



[jira] [Commented] (HADOOP-1540) distcp should support an exclude list

2015-05-08 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535403#comment-14535403
 ] 

Zoran Dimitrijevic commented on HADOOP-1540:


#5: We were experiencing performance issues with large numbers of files only
because of RPCs to either the namenode or s3. Filtering each file name locally
using a small number of compiled regex or glob rules should not be a big deal,
especially since it's optional. For example, the sorting of a big file list
that we do now is much more expensive.

Thank you for your patch!



[jira] [Commented] (HADOOP-1540) distcp should support an exclude list

2015-05-08 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535345#comment-14535345
 ] 

Zoran Dimitrijevic commented on HADOOP-1540:


And one minor comment related to
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestSimpleCopyFilter.java:

Adding a test for the case with multiple rules, making sure that all rules are
applied when filtering, seems like a good idea here. Two simple additional
tests would be sufficient.



[jira] [Commented] (HADOOP-1540) distcp should support an exclude list

2015-05-08 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535309#comment-14535309
 ] 

Zoran Dimitrijevic commented on HADOOP-1540:


1. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java:564
   Minor: extra space in the comment.

2. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
   Refactoring of the parsing logic should have been a separate patch, since
   this will be harder to cherry-pick to older branches. But this is a good
   refactoring change, and I am new to the Hadoop community, so it's fine with
   me.

3. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java:329
   Minor: space missing between - and 1 (-1 => - 1).

4. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/package-info.java
   Is this really part of this patch? Again, I am new to the Hadoop community,
   so if it's OK to combine logically different changes, then it's definitely
   fine.

5. hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyFilter.java
   It would be more useful if it matched glob expressions. Matching substrings
   is a very unusual way to filter a file list, and many users will be puzzled
   about what to do. I would suggest extending this right now instead of
   submitting the patch as is. For example, *tmp would match filenames ending
   with tmp, and not any file that happens to contain tmp. Or, in the unit
   test, the "test" filter matching /user/testing is not what I would expect.

Otherwise, looks good to me.
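
The difference between substring and glob filtering raised in point 5 can be
shown with a small sketch. The class GlobVersusSubstring and its methods are
hypothetical names for this example and are not SimpleCopyFilter; the glob
support is deliberately minimal (only * and ?), and * here also matches '/',
which differs from shell globbing.

```java
import java.util.regex.Pattern;

/** Illustrative only: contrasts substring filtering with glob filtering. */
class GlobVersusSubstring {
    /** Substring rule: matches any path that merely contains the rule text. */
    static boolean substringMatches(String rule, String path) {
        return path.contains(rule);
    }

    /** Translates a simple glob (* and ? only) into a regex for the whole path. */
    static Pattern globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");            // any run of characters
            } else if (c == '?') {
                regex.append('.');             // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Glob rule: the whole path has to match the pattern. */
    static boolean globMatches(String glob, String path) {
        return globToRegex(glob).matcher(path).matches();
    }

    public static void main(String[] args) {
        // Substring rule "test" also hits /user/testing.
        System.out.println(substringMatches("test", "/user/testing"));
        // Glob "*tmp" hits only paths that end with tmp.
        System.out.println(globMatches("*tmp", "/data/part.tmp"));
        System.out.println(globMatches("*tmp", "/data/tmp/part-0"));
    }
}
```

Compiling the pattern once per rule, rather than once per path, would keep the
per-file cost low when the file list is large.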




[jira] [Updated] (HADOOP-11876) Refactor code to make it more readable, minor maybePrintStats bug

2015-04-24 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11876:

Attachment: HADOOP-11876.patch

> Refactor code to make it more readable, minor maybePrintStats bug
> -----------------------------------------------------------------
>
>                 Key: HADOOP-11876
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11876
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Trivial
>         Attachments: HADOOP-11876.patch
>
>
> This is related to the HADOOP-11827 patch from a few days ago. I've noticed a
> minor bug in the maybePrintStats logic: it is called only when a new
> directory is processed, and it prints every 100K objects (effectively, there
> is a very low probability it'll ever print stats). The reason for this bug is
> that I was previously printing stats for every new directory, and later
> decided it's nicer to print stats for a large number of new "objects" (files
> or directories) instead.
> This is a minor issue, and since I'm refactoring this I've also changed the
> minor retry logic to make the code nicer and more readable.





[jira] [Created] (HADOOP-11876) Refactor code to make it more readable, minor maybePrintStats bug

2015-04-24 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-11876:
---

             Summary: Refactor code to make it more readable, minor maybePrintStats bug
                 Key: HADOOP-11876
                 URL: https://issues.apache.org/jira/browse/HADOOP-11876
             Project: Hadoop Common
          Issue Type: Bug
          Components: tools/distcp
    Affects Versions: 3.0.0
            Reporter: Zoran Dimitrijevic
            Assignee: Zoran Dimitrijevic
            Priority: Trivial


This is related to the HADOOP-11827 patch from a few days ago. I've noticed a
minor bug in the maybePrintStats logic: it is called only when a new directory
is processed, and it prints every 100K objects (effectively, there is a very
low probability it'll ever print stats). The reason for this bug is that I was
previously printing stats for every new directory, and later decided it's
nicer to print stats for a large number of new "objects" (files or
directories) instead.

This is a minor issue, and since I'm refactoring this I've also changed the
minor retry logic to make the code nicer and more readable.
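
Why the stats almost never print can be seen in a small simulation. This is a
sketch under assumptions, not the distcp code: the class ProgressStats, its
fields, and the exact form of the buggy check (a modulo test evaluated only at
directory boundaries) are guesses made for illustration, and the figure of 31
objects per directory is made up to give roughly 1.5M objects over 50K
directories.

```java
/** Illustrative only: why a modulo check at directory boundaries rarely fires. */
class ProgressStats {
    static final long INTERVAL = 100_000;  // print roughly every 100K objects

    private long totalObjects = 0;
    private long nextPrintAt = INTERVAL;
    int buggyPrints = 0;
    int fixedPrints = 0;

    /** Called once per directory, after its entries have been counted. */
    void onDirectoryProcessed(long entriesInDirectory) {
        totalObjects += entriesInDirectory;

        // Assumed shape of the bug: prints only if the running total happens
        // to be an exact multiple of the interval at this moment.
        if (totalObjects % INTERVAL == 0) {
            buggyPrints++;
        }

        // Threshold check: prints once each time the total crosses a boundary.
        if (totalObjects >= nextPrintAt) {
            fixedPrints++;
            nextPrintAt = (totalObjects / INTERVAL + 1) * INTERVAL;
        }
    }

    public static void main(String[] args) {
        ProgressStats stats = new ProgressStats();
        for (int dir = 0; dir < 50_000; dir++) {
            stats.onDirectoryProcessed(31);   // 1.55M objects in total
        }
        System.out.println("buggy prints: " + stats.buggyPrints);
        System.out.println("fixed prints: " + stats.fixedPrints);
    }
}
```

In this run the modulo check never fires, because the running total is always
a multiple of 31 and never lands exactly on a multiple of 100,000, while the
threshold check fires 15 times, once per 100K boundary crossed.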





[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-21 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505741#comment-14505741
 ] 

Zoran Dimitrijevic commented on HADOOP-11827:
-

LGTM++



[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-20 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504391#comment-14504391
 ] 

Zoran Dimitrijevic commented on HADOOP-11827:
-

Sorry Ravi. Thanks for the comments Ravi!



[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-20 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504359#comment-14504359
 ] 

Zoran Dimitrijevic commented on HADOOP-11827:
-

Thanks for the comments, Allen. I've addressed most of them:

1, 2, 3: Done.
4: In order to prefer flags over properties, I needed a value that tells
whether the flag was set or not. 0 seemed easier than yet another bool.
5: I added it so that I can have minimal changes in the unit test (rerunning
tests for various numbers of threads using org.junit.runners.Parameterized).
6: Done.
7: Agree. I wanted to keep the multithreaded logic outside of
SimpleCopyListing.java, but if you think it's overkill, I can refactor. But
it'll be uglier, and if we need this again, we won't have the wrapper.
8: Considering how much code is invoked for each of these simple
MaybePrintStats calls, I don't think it's worth doing. But I don't have strong
opinions; I just think we should print some progress, since this stage can
take on the order of tens of minutes.
9: Removed.
10: In the current code, we use the same file system instance, so I don't
think it's a problem. I use one per thread since we have a small number of
threads and these run listStatus in parallel.
11: Changed the docs. Both are blocking, but one can be interrupted by
exceptions, and then the user must handle it. Please suggest better names and
I'll refactor it. Or maybe just keep one.
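
The general idea of listing directories in parallel can be sketched with a
plain thread pool. This is not the design in the patch (which, per point 7,
uses a separate wrapper class); it is a simplified level-by-level variant under
stated assumptions: ParallelListing and buildListing are hypothetical names,
and the file system is modeled as an in-memory map so that each map lookup
stands in for one listStatus call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: lists a directory tree level by level with a thread pool. */
class ParallelListing {
    /**
     * @param tree       directory path -> names of its children; a child whose
     *                   full path is a key of the map is a directory
     * @param root       directory to start from
     * @param numThreads number of listing threads
     * @return every path found under root, sorted
     */
    static List<String> buildListing(Map<String, List<String>> tree,
                                     String root, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<String> result = new ArrayList<>();
        List<String> currentLevel = Collections.singletonList(root);
        try {
            while (!currentLevel.isEmpty()) {
                // One task per directory; each stands in for a listStatus call.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String dir : currentLevel) {
                    tasks.add(() -> {
                        List<String> children = new ArrayList<>();
                        for (String name
                                : tree.getOrDefault(dir, Collections.emptyList())) {
                            children.add(dir + "/" + name);
                        }
                        return children;
                    });
                }
                List<String> nextLevel = new ArrayList<>();
                for (Future<List<String>> future : pool.invokeAll(tasks)) {
                    for (String child : future.get()) {
                        result.add(child);
                        if (tree.containsKey(child)) {
                            nextLevel.add(child);  // directory: list it next round
                        }
                    }
                }
                currentLevel = nextLevel;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Listing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Listing failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("/src", Arrays.asList("a", "f1"));
        tree.put("/src/a", Arrays.asList("f2"));
        System.out.println(buildListing(tree, "/src", 4));
    }
}
```

Waiting for a whole level before starting the next keeps the sketch short; a
shared work queue would keep the threads busier on unbalanced trees.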



[jira] [Updated] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-20 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11827:

Attachment: HADOOP-11827-03.patch



[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-11 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491205#comment-14491205
 ] 

Zoran Dimitrijevic commented on HADOOP-11827:
-

Speedup for an s3n source tree of 1.5M files/50K dirs:
current distcp: 36 min
2 threads:      17 min
5 threads:      7 min
10 threads:     3.5 min
20 threads:     2.3 min

For the same source dataset on hdfs:
current distcp: 206 sec
1 thread:       204 sec
2 threads:      257 sec (not yet sure why, will repeat the experiment)
3 threads:      154 sec
5 threads:      94 sec
10 threads:     51 sec
20 threads:     45 sec
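
To read these timings as speedup factors, each one can be divided into the
"current distcp" baseline. The helper below is only a convenience for that
arithmetic; the class name ListingSpeedup is made up, and the numbers are the
ones quoted in this comment.

```java
/** Illustrative only: speedup factors derived from the timings quoted above. */
class ListingSpeedup {
    /** Speedup = baseline time / new time (both in the same unit). */
    static double speedup(double baseline, double improved) {
        return baseline / improved;
    }

    public static void main(String[] args) {
        // s3n source, minutes; baseline is 36.
        int[] s3nThreads = {2, 5, 10, 20};
        double[] s3nMinutes = {17, 7, 3.5, 2.3};
        for (int i = 0; i < s3nThreads.length; i++) {
            System.out.printf("s3n  %2d threads: %.1fx%n",
                    s3nThreads[i], speedup(36, s3nMinutes[i]));
        }
        // hdfs source, seconds; baseline is 206.
        int[] hdfsThreads = {1, 2, 3, 5, 10, 20};
        double[] hdfsSeconds = {204, 257, 154, 94, 51, 45};
        for (int i = 0; i < hdfsThreads.length; i++) {
            System.out.printf("hdfs %2d threads: %.1fx%n",
                    hdfsThreads[i], speedup(206, hdfsSeconds[i]));
        }
    }
}
```

For example, 10 threads on s3n gives 36 / 3.5, a little over 10x, while 20
threads on hdfs gives 206 / 45, roughly 4.6x.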




[jira] [Commented] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-11 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491188#comment-14491188
 ] 

Zoran Dimitrijevic commented on HADOOP-11785:
-

Sorry, updated wrong jira ticket. Please ignore previous comment.

> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
>                 Key: HADOOP-11785
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>             Fix For: 2.8.0
>
>         Attachments: distcp-liststatus.patch, distcp-liststatus2.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source
> trees (I was using a source of 1.5M files in a tree of about 50K
> directories). For input on s3, buildListing was taking more than one hour.
> I've noticed a performance bug in the current code: it does listStatus twice
> for each directory, which doubles the number of RPCs in some cases (if most
> directories do not contain >1000 files).
>
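
The cost of the duplicate call described in the quoted issue can be
illustrated with a small counter. This is a sketch under assumptions:
ListStatusCount, CountingFs, listTwice, and listOnce are hypothetical names,
and the "check, then list again" shape of the slow path is a guess at the
pattern, not a copy of the distcp code.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: counts listing calls when a result is reused vs fetched twice. */
class ListStatusCount {
    /** Stand-in for a remote file system; every list() call models one RPC. */
    static class CountingFs {
        int rpcCalls = 0;

        List<String> list(String dir) {
            rpcCalls++;
            return Arrays.asList(dir + "/part-0", dir + "/part-1");
        }
    }

    /** Assumed slow path: one call to test for children, one to read them. */
    static int listTwice(CountingFs fs, List<String> dirs) {
        int entries = 0;
        for (String dir : dirs) {
            if (!fs.list(dir).isEmpty()) {        // first RPC: any children?
                entries += fs.list(dir).size();   // second RPC: same listing again
            }
        }
        return entries;
    }

    /** Same result with a single call per directory. */
    static int listOnce(CountingFs fs, List<String> dirs) {
        int entries = 0;
        for (String dir : dirs) {
            List<String> children = fs.list(dir); // one RPC, result reused
            entries += children.size();
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("/a", "/b", "/c");
        CountingFs slow = new CountingFs();
        listTwice(slow, dirs);
        CountingFs fast = new CountingFs();
        listOnce(fast, dirs);
        System.out.println(slow.rpcCalls + " calls vs " + fast.rpcCalls + " calls");
    }
}
```

When each call is a remote round trip, as with s3, halving the call count
roughly halves the time spent listing.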





[jira] [Updated] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-11 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11827:

Attachment: HADOOP-11827-02.patch

Small change to handle all exceptions in the worker.
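
What "handle all exceptions in the worker" can mean in practice is sketched
below. It is an illustration under assumptions, not the change in the attached
patch: SafeWorker and runAndCollect are hypothetical names, and the failure is
simply recorded so that the submitting thread can inspect it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: a worker that records any exception instead of dying silently. */
class SafeWorker implements Runnable {
    private final Runnable task;
    private final AtomicReference<Throwable> failure;

    SafeWorker(Runnable task, AtomicReference<Throwable> failure) {
        this.task = task;
        this.failure = failure;
    }

    @Override
    public void run() {
        try {
            task.run();
        } catch (Exception e) {
            // Catches unchecked exceptions too, not only I/O errors.
            failure.compareAndSet(null, e);
        }
    }

    /** Runs the task on a pool thread and returns the failure, or null if none. */
    static Throwable runAndCollect(Runnable task) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.execute(new SafeWorker(task, failure));
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Throwable t = runAndCollect(() -> {
            throw new IllegalStateException("unexpected non-IO failure");
        });
        System.out.println("recorded: " + t);
    }
}
```

Recording the first failure lets the caller stop waiting for results and
report the error, rather than blocking on a worker that has already exited.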

+1 overall.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1)
warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.





[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-11 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11785:

Attachment: HADOOP-11827-02.patch

Slight change: the worker now handles all exceptions, to cover occasional
unexpected non-IO exceptions.


+1 overall.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1)
warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.






[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-11 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11785:

Attachment: (was: HADOOP-11827-02.patch)

> Reduce number of listStatus operation in distcp buildListing()
> --
>
> Key: HADOOP-11785
> URL: https://issues.apache.org/jira/browse/HADOOP-11785
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Zoran Dimitrijevic
>Assignee: Zoran Dimitrijevic
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: distcp-liststatus.patch, distcp-liststatus2.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source 
> trees (I was using a source of 1.5M files in a tree of about 50K directories). 
> For input on s3, buildListing was taking more than one hour. I've noticed a 
> performance bug in the current code, which calls listStatus twice for each 
> directory; this doubles the number of RPCs in some cases (if most directories 
> do not contain >1000 files).
>  





[jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-11 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491170#comment-14491170
 ] 

Zoran Dimitrijevic commented on HADOOP-11827:
-

Performance results and charts for the dataset I used (1.5M files and approx. 
50K dirs): 

https://docs.google.com/spreadsheets/d/1qJfO9ZhPXuGCpHyfX1NLE0Zm_NB39gn-cELECShd_zk/edit#gid=0

Please note that there are two sheets (s3n -> hdfs and hdfs -> hdfs). The main 
improvement is when the source is in s3. The improvement when the source is 
hdfs is good as well, but since the current distcp still has to sort the input 
file, the total improvement is less significant. 

TODO: We could sort only the directories, which would further improve startup 
time.

> Speed-up distcp buildListing() using threadpool
> ---
>
> Key: HADOOP-11827
> URL: https://issues.apache.org/jira/browse/HADOOP-11827
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Zoran Dimitrijevic
>Assignee: Zoran Dimitrijevic
>Priority: Minor
> Attachments: HADOOP-11827.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For very large source trees on s3, distcp takes a long time to build the file 
> listing (client code, before starting mappers). For a dataset I used (1.5M 
> files, 50K dirs) it was taking 65 minutes before my fix in HADOOP-11785 and 
> 36 minutes after the fix.





[jira] [Updated] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-11 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11827:

Attachment: HADOOP-11827.patch

Test-patch report from my laptop:

{color:green}+1 overall{color}.  

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.



> Speed-up distcp buildListing() using threadpool
> ---
>
> Key: HADOOP-11827
> URL: https://issues.apache.org/jira/browse/HADOOP-11827
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Zoran Dimitrijevic
>Assignee: Zoran Dimitrijevic
>Priority: Minor
> Attachments: HADOOP-11827.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For very large source trees on s3, distcp takes a long time to build the file 
> listing (client code, before starting mappers). For a dataset I used (1.5M 
> files, 50K dirs) it was taking 65 minutes before my fix in HADOOP-11785 and 
> 36 minutes after the fix.





[jira] [Created] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-11 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-11827:
---

 Summary: Speed-up distcp buildListing() using threadpool
 Key: HADOOP-11827
 URL: https://issues.apache.org/jira/browse/HADOOP-11827
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Minor


For very large source trees on s3, distcp takes a long time to build the file 
listing (client code, before starting mappers). For a dataset I used (1.5M 
files, 50K dirs) it was taking 65 minutes before my fix in HADOOP-11785 and 36 
minutes after the fix.
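
As a rough illustration of the thread-pool idea in the summary, here is a 
minimal, self-contained sketch. It does not use the Hadoop API and it is not 
the code from HADOOP-11827.patch: ParallelListingSketch, sampleTree and 
buildListing are hypothetical names, and the in-memory map lookup stands in for 
a slow remote listStatus call. The sketch lists all directories of one tree 
level concurrently, then moves on to the next level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelListingSketch {
    // Hypothetical in-memory tree: directory path -> child paths.
    // Paths that are not keys of the map are treated as plain files.
    static Map<String, List<String>> sampleTree() {
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("/a", Arrays.asList("/a/b", "/a/c", "/a/f1"));
        tree.put("/a/b", Arrays.asList("/a/b/f2"));
        tree.put("/a/c", new ArrayList<String>());
        return tree;
    }

    // Lists the tree level by level; all directories of one level are
    // listed concurrently on a fixed-size pool. The map is only read here.
    static List<String> buildListing(Map<String, List<String>> tree,
                                     String root, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> result = new ArrayList<>();
        try {
            List<String> level = Collections.singletonList(root);
            while (!level.isEmpty()) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String dir : level) {
                    // Stand-in for one remote listing call per directory.
                    tasks.add(() -> tree.get(dir));
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until every task of this level is done.
                for (Future<List<String>> f : pool.invokeAll(tasks)) {
                    for (String child : f.get()) {
                        result.add(child);
                        if (tree.containsKey(child)) {
                            next.add(child);
                        }
                    }
                }
                level = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        Collections.sort(result);  // deterministic order for the caller
        return result;
    }

    public static void main(String[] args) {
        System.out.println(buildListing(sampleTree(), "/a", 4));
    }
}
```

With a real remote file system the benefit would come from overlapping the 
latency of many listing calls; the pool size of 4 used in main is an arbitrary 
choice for the example.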







[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-01 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11785:

Attachment: distcp-liststatus2.patch

Removed whitespace diffs.
Reused FileSystem sourceFS in traverse, as suggested.

> Reduce number of listStatus operation in distcp buildListing()
> --
>
> Key: HADOOP-11785
> URL: https://issues.apache.org/jira/browse/HADOOP-11785
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Zoran Dimitrijevic
>Assignee: Zoran Dimitrijevic
>Priority: Minor
> Attachments: distcp-liststatus.patch, distcp-liststatus2.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source 
> trees (I was using a source of 1.5M files in a tree of about 50K directories). 
> For input on s3, buildListing was taking more than one hour. I've noticed a 
> performance bug in the current code, which calls listStatus twice for each 
> directory; this doubles the number of RPCs in some cases (if most directories 
> do not contain >1000 files).
>  





[jira] [Commented] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-01 Thread Zoran Dimitrijevic (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391205#comment-14391205
 ] 

Zoran Dimitrijevic commented on HADOOP-11785:
-

I did not change any tests, since this is a performance-bug fix and all 
existing tests pass.

Should I mark this as a bug fix instead of an improvement?



> Reduce number of listStatus operation in distcp buildListing()
> --
>
> Key: HADOOP-11785
> URL: https://issues.apache.org/jira/browse/HADOOP-11785
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Zoran Dimitrijevic
>Assignee: Zoran Dimitrijevic
>Priority: Minor
> Attachments: distcp-liststatus.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source 
> trees (I was using a source of 1.5M files in a tree of about 50K directories). 
> For input on s3, buildListing was taking more than one hour. I've noticed a 
> performance bug in the current code, which calls listStatus twice for each 
> directory; this doubles the number of RPCs in some cases (if most directories 
> do not contain >1000 files).
>  





[jira] [Created] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-01 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-11785:
---

 Summary: Reduce number of listStatus operation in distcp buildListing()
 Key: HADOOP-11785
 URL: https://issues.apache.org/jira/browse/HADOOP-11785
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Minor
 Fix For: 3.0.0
 Attachments: distcp-liststatus.patch

Distcp was taking a long time in copyListing.buildListing() for large source 
trees (I was using a source of 1.5M files in a tree of about 50K directories). 
For input on s3, buildListing was taking more than one hour. I've noticed a 
performance bug in the current code, which calls listStatus twice for each 
directory; this doubles the number of RPCs in some cases (if most directories 
do not contain >1000 files).
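
To illustrate the double-listing pattern described above, here is a minimal, 
self-contained sketch. It does not use the Hadoop API and it is not the code 
from the attached patch: CountingFs, ListingSketch, traverseTwice and 
traverseOnce are hypothetical names, and each list() call stands in for one 
listStatus RPC. The assumption is that the slow path lists a directory once to 
check whether it is empty and then again to walk it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a remote file system: every list() call models one RPC.
class CountingFs {
    private final Map<String, List<String>> children = new HashMap<>();
    int listCalls = 0;

    void addDir(String dir, String... entries) {
        List<String> list = new ArrayList<>();
        for (String e : entries) {
            list.add(e);
        }
        children.put(dir, list);
    }

    boolean isDir(String path) {
        return children.containsKey(path);
    }

    List<String> list(String dir) {
        listCalls++;
        return children.get(dir);
    }
}

class ListingSketch {
    // Slow pattern: one listing for the emptiness check, a second to traverse.
    static void traverseTwice(CountingFs fs, String dir, List<String> out) {
        if (fs.list(dir).isEmpty()) {
            return;
        }
        for (String child : fs.list(dir)) {
            out.add(child);
            if (fs.isDir(child)) {
                traverseTwice(fs, child, out);
            }
        }
    }

    // Fixed pattern: list once and reuse the result for both purposes.
    static void traverseOnce(CountingFs fs, String dir, List<String> out) {
        List<String> entries = fs.list(dir);
        for (String child : entries) {
            out.add(child);
            if (fs.isDir(child)) {
                traverseOnce(fs, child, out);
            }
        }
    }

    // Small sample tree: /a holds dir /a/b, empty dir /a/c and file /a/f1.
    static CountingFs sampleTree() {
        CountingFs fs = new CountingFs();
        fs.addDir("/a", "/a/b", "/a/c", "/a/f1");
        fs.addDir("/a/b", "/a/b/f2");
        fs.addDir("/a/c");
        return fs;
    }

    public static void main(String[] args) {
        CountingFs slow = sampleTree();
        CountingFs fast = sampleTree();
        traverseTwice(slow, "/a", new ArrayList<String>());
        traverseOnce(fast, "/a", new ArrayList<String>());
        System.out.println("twice=" + slow.listCalls + " once=" + fast.listCalls);
    }
}
```

On the sample tree both traversals produce the same listing, but the slow 
variant issues 5 list() calls (two per non-empty directory, one for the empty 
one) while the fixed variant issues 3, one per directory.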

 





[jira] [Updated] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-01 Thread Zoran Dimitrijevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoran Dimitrijevic updated HADOOP-11785:

Attachment: distcp-liststatus.patch

> Reduce number of listStatus operation in distcp buildListing()
> --
>
> Key: HADOOP-11785
> URL: https://issues.apache.org/jira/browse/HADOOP-11785
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Zoran Dimitrijevic
>Assignee: Zoran Dimitrijevic
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: distcp-liststatus.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source 
> trees (I was using a source of 1.5M files in a tree of about 50K directories). 
> For input on s3, buildListing was taking more than one hour. I've noticed a 
> performance bug in the current code, which calls listStatus twice for each 
> directory; this doubles the number of RPCs in some cases (if most directories 
> do not contain >1000 files).
>  


