[jira] Commented: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex

2009-08-17 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744331#action_12744331
 ] 

Koji Noguchi commented on MAPREDUCE-865:


I ran a simple test.
I created a har archive called myarchive.har containing
/a/b/2000files/xa to xaadnj
and /a/b/2000files/2000files/xa to xaadnj

That is about 4500 files in total.

Without the patch, 
/usr/bin/time hadoop dfs -lsr har:///user/knoguchi/myarchive.har > /dev/null
  
31.72user 5.23system *1:13.19* elapsed 50%CPU (0avgtext+0avgdata 0maxresident)

This run made 9000 open calls to the namenode (for _masterindex and _index), 
plus, I think, 4500 filestatus calls for _index.

With the patch, 
23.59user 0.58system *0:22.97* elapsed 105%CPU (0avgtext+0avgdata 0maxresident)

This run made one _masterindex open call and five _index open calls.
Setting -Dfs.har.indexcache.num=1 changed the number of _index open calls to 
10, but the elapsed time didn't change much.
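
To put the two elapsed times quoted above side by side, here is a small sketch that converts the /usr/bin/time "elapsed" fields to seconds and takes their ratio. The helper name parse_elapsed is made up for this illustration; only the two timing strings come from the test runs.

```python
def parse_elapsed(field):
    """Convert an elapsed field such as '1:13.19' or '0:22.97' to seconds.

    Hypothetical helper: folds colon-separated parts, so 'h:mm:ss'
    style values are handled the same way.
    """
    total = 0.0
    for part in field.split(":"):
        total = total * 60 + float(part)
    return total


# Elapsed times copied from the two runs in this comment.
without_patch = parse_elapsed("1:13.19")
with_patch = parse_elapsed("0:22.97")

# Ratio of wall-clock times (without patch / with patch).
speedup = without_patch / with_patch
print("without: %.2fs  with: %.2fs  ratio: %.2f" % (without_patch, with_patch, speedup))
```

The ratio comes out at a bit over 3, from a single run of each case, so it should be read as a rough indication only.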


The goal of the patch is more to reduce the load (calls) on the namenode than 
to speed up the 'ls' commands.

Note that since the client caches the entire _masterindex and also caches each 
STORE (cache range) it reads, the initial call would be slower.
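
A toy model of this kind of range caching, to show why the open count drops when a listing walks the entries in order. This is only a sketch and not the code in the patch: the class name, the 1000-entries-per-range figure, and the LRU eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class IndexRangeCache:
    """Toy cache: one simulated 'open' of the index per uncached range."""

    def __init__(self, entries_per_range, max_cached_ranges):
        self.entries_per_range = entries_per_range    # assumed range size
        self.max_cached_ranges = max_cached_ranges    # assumed cache capacity
        self.cached = OrderedDict()                   # range id -> entries
        self.open_calls = 0

    def _load_range(self, range_id):
        # Stands in for open -> read one range -> close on the index file.
        self.open_calls += 1
        start = range_id * self.entries_per_range
        return range(start, start + self.entries_per_range)

    def lookup(self, entry_no):
        range_id = entry_no // self.entries_per_range
        if range_id in self.cached:
            self.cached.move_to_end(range_id)         # mark as recently used
        else:
            self.cached[range_id] = self._load_range(range_id)
            if len(self.cached) > self.max_cached_ranges:
                self.cached.popitem(last=False)       # evict least recently used
        return entry_no in self.cached[range_id]


def count_opens(num_files, entries_per_range, max_cached_ranges):
    """Look up every entry in order and report the simulated open calls."""
    cache = IndexRangeCache(entries_per_range, max_cached_ranges)
    for n in range(num_files):
        cache.lookup(n)
    return cache.open_calls
```

In this model the first lookup in each range pays for reading the whole range, which is the "initial call would be slower" cost, and the following lookups in that range cost no further opens.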



> harchive: Reduce the number of open calls  to _index and _masterindex 
> --
>
> Key: MAPREDUCE-865
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: harchive
>Reporter: Koji Noguchi
>Priority: Minor
> Attachments: mapreduce-865-0.patch
>
>
> When I have a har file with 1000 files in it, 
>% hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Are there any ways to reduce this number?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex

2009-08-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743041#action_12743041
 ] 

Koji Noguchi commented on MAPREDUCE-865:


I believe _masterindex is probably small enough to fit in memory (cache).
For the _index file, 1 million files can correspond to an _index size of 
100 MBytes (it depends on the path length).
Creating a local copy could be costly.
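
The arithmetic behind that estimate, as a quick sketch. The roughly 100 bytes per entry is simply what the two figures above imply (taking 1 MByte as 10^6 bytes); it is not a measured value, and the function names are made up here.

```python
def implied_entry_bytes(index_bytes, num_files):
    """Average bytes per _index entry implied by a total size and a file count."""
    return index_bytes / num_files


def estimated_index_bytes(num_files, avg_entry_bytes):
    """Rough _index size, assuming one entry of avg_entry_bytes per file."""
    return num_files * avg_entry_bytes


# 100 MBytes for 1 million files works out to about 100 bytes per entry.
per_entry = implied_entry_bytes(100 * 10**6, 10**6)

# Longer paths mean larger entries; 200 bytes per entry is a made-up example.
longer_paths = estimated_index_bytes(10**6, 200)
print(per_entry, longer_paths)
```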

In our clusters, most of the files are mapreduce output files. 
/a/b/part-0
/a/b/part-1
/a/b/part-2
...
These show up as a set in the _index file in this order, since 
HarFileSystem.getHarHash is written that way.
So instead of doing open->read->close on _index for each part file, I am 
thinking of keeping the index file open when possible.
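
A minimal sketch of the difference between the two access patterns, using a local temporary file as a stand-in for _index. Nothing here is HarFileSystem code: the class and function names are hypothetical, and a local open is of course much cheaper than an open that goes through the namenode.

```python
import os
import tempfile


class PerLookupReader:
    """Opens, reads, and closes the file on every lookup."""

    def __init__(self, path):
        self.path = path
        self.open_calls = 0

    def read_at(self, offset, length):
        self.open_calls += 1
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def close(self):
        pass  # nothing stays open between lookups


class KeepOpenReader:
    """Opens the file on first use and reuses the handle afterwards."""

    def __init__(self, path):
        self.path = path
        self.open_calls = 0
        self._handle = None

    def read_at(self, offset, length):
        if self._handle is None:
            self._handle = open(self.path, "rb")
            self.open_calls += 1
        self._handle.seek(offset)
        return self._handle.read(length)

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None


def count_open_calls(reader_cls, num_parts=1000):
    """Write consecutive part-file entries, read each back, count opens."""
    lines = [("/a/b/part-%d\n" % i).encode() for i in range(num_parts)]
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.writelines(lines)
        reader = reader_cls(path)
        offset = 0
        for line in lines:
            assert reader.read_at(offset, len(line)) == line
            offset += len(line)
        reader.close()
        return reader.open_calls
    finally:
        os.remove(path)
```

Because the part files sit next to each other in the index, the kept-open reader only ever seeks forward in this example, which is the case the idea above is aimed at.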


