[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated MAPREDUCE-7309:
------------------------------------
    Description: 
This issue could affect all releases that include YARN-6927. 

Basically, we run a regex match repeatedly when we read the mapper/reducer 
resource request from the config files. With a large config file and a large 
number of splits, this can take a long time.  

We saw the AM take hours to parse the config for a job with 200k+ splits and a 
large config file (hundreds of KB). 

The problematic part is this:
{noformat}
  private void populateResourceCapability(TaskType taskType) {
    String resourceTypePrefix =
        getResourceTypePrefix(taskType);
    boolean memorySet = false;
    boolean cpuVcoresSet = false;

    if (resourceTypePrefix != null) {
      List<ResourceInformation> resourceRequests =
          ResourceUtils.getRequestedResourcesFromConfig(conf,
              resourceTypePrefix);
{noformat}

Inside {{ResourceUtils.getRequestedResourcesFromConfig()}}, we call 
{{Configuration.getValByRegex()}}, which goes through all property keys that 
come from the MapReduce job configuration (jobconf.xml). If the job config is 
large (e.g. because the job is part of an MR pipeline and the config was 
populated by an earlier job), this runs a regexp match against every property, 
over and over again. This is not necessary, because all mappers share the same 
config, and so do all reducers.

We should cache the pre-configured resource requests properly.
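
A minimal sketch of what such caching could look like, not the actual patch. 
It assumes the config does not change after the first read, and it uses a plain 
{{Map}} as a stand-in for the job configuration. The class 
{{ResourceRequestCache}} and its methods are hypothetical names, and the 
property keys are only example values. The idea is to run the full regexp scan 
once per prefix and serve every later lookup from memory:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Hadoop class): one regexp scan per prefix. */
final class ResourceRequestCache {
  // Stand-in for the job configuration; assumed not to change after creation.
  private final Map<String, String> conf;
  // Scan results keyed by resource type prefix.
  private final Map<String, Map<String, String>> cache =
      new ConcurrentHashMap<>();
  // Number of full scans performed, kept for illustration only.
  private final AtomicInteger scans = new AtomicInteger();

  ResourceRequestCache(Map<String, String> conf) {
    this.conf = new HashMap<>(conf);
  }

  /** The expensive step: one regexp match per property key. */
  private Map<String, String> scan(String prefix) {
    scans.incrementAndGet();
    Pattern pattern = Pattern.compile(Pattern.quote(prefix) + "[^.]+");
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> entry : conf.entrySet()) {
      if (pattern.matcher(entry.getKey()).matches()) {
        result.put(entry.getKey(), entry.getValue());
      }
    }
    return Collections.unmodifiableMap(result);
  }

  /** Returns the cached result, scanning only on the first call per prefix. */
  Map<String, String> getRequestedResources(String prefix) {
    return cache.computeIfAbsent(prefix, this::scan);
  }

  int scanCount() {
    return scans.get();
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("mapreduce.map.resource.memory-mb", "2048"); // example key
    conf.put("mapreduce.map.resource.vcores", "2");       // example key
    conf.put("some.unrelated.property", "x");
    ResourceRequestCache requests = new ResourceRequestCache(conf);
    // Simulate many tasks asking for the same mapper resource request.
    for (int task = 0; task < 200_000; task++) {
      requests.getRequestedResources("mapreduce.map.resource.");
    }
    System.out.println("full scans: " + requests.scanCount());
  }
}
```

With this shape, the cost of the scan no longer grows with the number of 
splits: it is paid once for the mapper prefix and once for the reducer prefix.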

  was:
This is an issue could affect all the releases which includes YARN-6927. 

Basically, we use regex match repeatedly when we read mapper/reducer resource 
request from config files. When we have large config file, and large number of 
splits, it could take a long time.  

We saw AM could take hours to parse config when we have 200k+ splits, with a 
large config file (hundreds of kbs). 

The problamtic part is this:
{noformat}
  private void populateResourceCapability(TaskType taskType) {
    String resourceTypePrefix =
        getResourceTypePrefix(taskType);
    boolean memorySet = false;
    boolean cpuVcoresSet = false;

    if (resourceTypePrefix != null) {
      List<ResourceInformation> resourceRequests =
          ResourceUtils.getRequestedResourcesFromConfig(conf,
              resourceTypePrefix);
{noformat}

Inside {{ResourceUtils.getRequestedResourcesFromConfig()}}, we call 
{{Configuration.getValByRegex()}} which goes through all property keys that 
come from the MapReduce job configuration (jobconf.xml). If the job config is 
large (eg. due to being part of an MR pipeline and it was populated by an 
earlier job), then this results in running a regexp match unnecessarily for all 
properties over and over again. This is not necessary, because all mappers and 
reducers will have the same config, respectively.

We should do proper caching for pre-configured resource requests.


> Improve performance of reading resource request for mapper/reducers from 
> config
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7309
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7309
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>            Reporter: Wangda Tan
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.2.2, 3.4.0, 3.1.5, 3.3.1
>
>         Attachments: MAPREDUCE-7309-003.patch, MAPREDUCE-7309-004.patch, 
> MAPREDUCE-7309-005.patch, MAPREDUCE-7309-branch-3.1-001.patch, 
> MAPREDUCE-7309-branch-3.2-001.patch, MAPREDUCE-7309-branch-3.3-001.patch, 
> MAPREDUCE-7309.001.patch, MAPREDUCE-7309.002.patch
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
