[jira] [Created] (HADOOP-13587) distcp.map.bandwidth.mb is overwritten even when -bandwidth flag isn't set

2016-09-07 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-13587:
---

 Summary: distcp.map.bandwidth.mb is overwritten even when 
-bandwidth flag isn't set
 Key: HADOOP-13587
 URL: https://issues.apache.org/jira/browse/HADOOP-13587
 Project: Hadoop Common
  Issue Type: Bug
  Components: tools/distcp
Affects Versions: 3.0.0-alpha1
Reporter: Zoran Dimitrijevic
Priority: Minor


distcp.map.bandwidth.mb exists in the distcp-defaults.xml config file, but it is 
not honored even when it is set. The current code always overwrites it with either 
the default value (a Java constant) or the -bandwidth command line option.

The expected behavior (at least how I would expect it) is to honor the value 
set in distcp-defaults.xml unless the user explicitly specifies the -bandwidth 
command line flag. If no value is set in the .xml file or as a command line flag, 
then the constant from the Java code should be used.
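
The precedence described above can be sketched roughly as follows. This is only an illustration, not the DistCp code: the class BandwidthResolver, the method resolve, and the fallback value of 100 are assumptions made for the example, and the config is modeled as a plain Map instead of a Hadoop Configuration object.

```java
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper illustrating the precedence proposed in this report. */
final class BandwidthResolver {
    // Key name is taken from the report; the fallback value is an assumed placeholder.
    static final String CONF_KEY = "distcp.map.bandwidth.mb";
    static final int DEFAULT_BANDWIDTH_MB = 100;

    /**
     * @param cliBandwidth value of -bandwidth, empty if the flag was not given
     * @param conf         key/value pairs loaded from the XML config files
     */
    static int resolve(OptionalInt cliBandwidth, Map<String, String> conf) {
        if (cliBandwidth.isPresent()) {
            return cliBandwidth.getAsInt();             // 1. explicit flag wins
        }
        String configured = conf.get(CONF_KEY);
        if (configured != null) {
            return Integer.parseInt(configured.trim()); // 2. value from the XML config
        }
        return DEFAULT_BANDWIDTH_MB;                    // 3. Java constant as last resort
    }
}
```

With this ordering, a value set in the config file is only replaced when the user actually passes the flag.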

Additionally, I would expect that we also try to get values from 
distcp-site.xml, similar to other Hadoop systems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



[jira] [Created] (HADOOP-11876) Refactor code to make it more readable, minor maybePrintStats bug

2015-04-24 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-11876:
---

 Summary: Refactor code to make it more readable, minor 
maybePrintStats bug
 Key: HADOOP-11876
 URL: https://issues.apache.org/jira/browse/HADOOP-11876
 Project: Hadoop Common
  Issue Type: Bug
  Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Trivial


This is related to the HADOOP-11827 patch from a few days ago. I've noticed a minor 
bug in the maybePrintStats logic: it is called only when a new directory is 
processed, but prints only every 100K objects, so there is effectively a very low 
probability it will ever print stats. The reason for this bug is that I was 
previously printing stats for every new directory, and later decided it's nicer 
to print stats after a large number of new objects (files or directories) 
instead. 
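
A simplified model of the problem, written for illustration only: the class StatsPrinting and the method countPrints are hypothetical names, and the real maybePrintStats may be structured differently. The sketch counts how often stats would be printed when the interval check runs only for directories versus for every object.

```java
/** Hypothetical model of the stats-printing check described in this report. */
final class StatsPrinting {
    /**
     * Counts how often stats would be printed over a stream of objects.
     *
     * @param isDirectory            one entry per processed object, true for directories
     * @param interval               print when the running object count is a multiple of this
     * @param checkOnlyOnDirectories true models the bug: the check runs only for directories
     */
    static int countPrints(boolean[] isDirectory, long interval, boolean checkOnlyOnDirectories) {
        long seen = 0;
        int prints = 0;
        for (boolean dir : isDirectory) {
            seen++;
            boolean check = !checkOnlyOnDirectories || dir;
            if (check && seen % interval == 0) {
                prints++;
            }
        }
        return prints;
    }
}
```

When the check runs only for directories, a print happens only if a directory lands exactly on a multiple of the interval, which is rare for an interval as large as 100K.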

This is a minor issue, and since I'm refactoring this I've also changed the 
minor retry logic to make the code nicer and more readable.





[jira] [Created] (HADOOP-11827) Speed-up distcp buildListing() using threadpool

2015-04-11 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-11827:
---

 Summary: Speed-up distcp buildListing() using threadpool
 Key: HADOOP-11827
 URL: https://issues.apache.org/jira/browse/HADOOP-11827
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Minor


For very large source trees on S3, distcp takes a long time to build the file 
listing (client code, before starting mappers). For a dataset I used (1.5M 
files, 50K dirs) it was taking 65 minutes before my fix in HADOOP-11785 and 36 
minutes after the fix.
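
As a rough sketch of the idea in the summary (listing directories concurrently with a thread pool), and not the actual patch: ParallelListing, Lister, and buildListing below are hypothetical names, directories are assumed to be marked by a trailing "/", and the Lister interface stands in for a remote listStatus call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: list a directory tree level by level using a thread pool. */
final class ParallelListing {
    /** Stand-in for a remote listStatus call: returns the children of a directory. */
    interface Lister {
        List<String> list(String dir) throws Exception;
    }

    /** Lists all directories of one level concurrently, then moves to the next level. */
    static List<String> buildListing(String root, Lister lister, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> result = new ArrayList<>();
            List<String> level = List.of(root);
            while (!level.isEmpty()) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String dir : level) {
                    tasks.add(() -> lister.list(dir));
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until every task is done and keeps the task order.
                for (Future<List<String>> future : pool.invokeAll(tasks)) {
                    for (String child : future.get()) {
                        result.add(child);
                        if (child.endsWith("/")) {   // assumed convention: directories end in "/"
                            next.add(child);
                        }
                    }
                }
                level = next;
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

The benefit comes from overlapping the latency of many slow listing calls; the level-by-level structure is chosen here only to keep the example short and deterministic.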







[jira] [Created] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()

2015-04-01 Thread Zoran Dimitrijevic (JIRA)
Zoran Dimitrijevic created HADOOP-11785:
---

 Summary: Reduce number of listStatus operation in distcp 
buildListing()
 Key: HADOOP-11785
 URL: https://issues.apache.org/jira/browse/HADOOP-11785
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Minor
 Fix For: 3.0.0
 Attachments: distcp-liststatus.patch

Distcp was taking a long time in copyListing.buildListing() for large source 
trees (I was using a source of 1.5M files in a tree of about 50K directories). 
For input on S3, buildListing was taking more than one hour. I've noticed a 
performance bug in the current code, which calls listStatus twice for each 
directory; this doubles the number of RPCs in some cases (if most directories do 
not contain 1000 files).
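
To illustrate why a second listStatus per directory matters, here is a simplified model, not the DistCp code: ListingRpcCount, traverseOnce, and traverseTwice are hypothetical names, the tree is an in-memory Map, and directories are assumed to be marked by a trailing "/". Each listStatus call counts as one RPC.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Hypothetical model that counts listStatus calls during a tree traversal. */
final class ListingRpcCount {
    private final Map<String, List<String>> tree;
    int rpcCalls;

    ListingRpcCount(Map<String, List<String>> tree) {
        this.tree = tree;
    }

    /** Stand-in for listStatus; every call counts as one RPC. */
    List<String> listStatus(String dir) {
        rpcCalls++;
        return tree.getOrDefault(dir, List.of());
    }

    /** Lists each directory exactly once, when it is taken off the stack. */
    int traverseOnce(String root) {
        int objects = 0;
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            for (String child : listStatus(pending.pop())) {
                objects++;
                if (child.endsWith("/")) {
                    pending.push(child);
                }
            }
        }
        return objects;
    }

    /** Wasteful variant: peeks into each subdirectory first, then lists it again later. */
    int traverseTwice(String root) {
        int objects = 0;
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            for (String child : listStatus(pending.pop())) {
                objects++;
                // The result of this extra call is thrown away except for the emptiness check.
                if (child.endsWith("/") && !listStatus(child).isEmpty()) {
                    pending.push(child);
                }
            }
        }
        return objects;
    }
}
```

Both traversals visit the same objects, but the second issues an extra call for every subdirectory, which adds up quickly when each call is a remote request.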

 


