[ 
https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743041#action_12743041
 ] 

Koji Noguchi commented on MAPREDUCE-865:
----------------------------------------

I believe _masterindex is probably small enough to fit in memory (cache).
For the _index file, 1 million files can correspond to an _index size of about 100 MBytes.
(It depends on the path length.)
Creating a local copy could be costly.
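
As a rough back-of-the-envelope check of the figure above, here is a small Python sketch. The function names are hypothetical, and it assumes 1 MByte = 10^6 bytes and that every file contributes about the same amount to _index; the real per-entry size varies with the path length.

```python
def bytes_per_entry(index_size_bytes, num_files):
    """Average number of _index bytes used by one archived file."""
    return index_size_bytes / num_files


def estimated_index_size(num_files, per_entry_bytes):
    """Estimated _index size in bytes, assuming a uniform entry size."""
    return num_files * per_entry_bytes
```

With the numbers quoted above, 100 MBytes for 1 million files works out to about 100 bytes per entry.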

In our clusters, most of the files are mapreduce output files. 
/a/b/part-00000
/a/b/part-00001
/a/b/part-00002
...
These show up as a set in the _index file, in this order, because
HarFileSystem.getHarHash is written that way.
So instead of doing open->read->close on _index for each part file, I am thinking of
keeping the index file open when possible.
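
To illustrate the idea, here is a minimal Python sketch, not the actual HarFileSystem code (which is Java). It assumes a simplified, hypothetical index format of one "path offset" line per file, with no spaces in paths, and uses an in-memory stand-in that counts open calls. CountingIndex, lookup_reopen and OpenIndexReader are names made up for this example.

```python
import io


class CountingIndex:
    """Stand-in for the _index file; counts how many times it is opened."""

    def __init__(self, entries):
        # One "path offset" line per archived file, kept in index order.
        self.text = "".join(f"{path} {offset}\n" for path, offset in entries)
        self.opens = 0

    def open(self):
        self.opens += 1
        return io.StringIO(self.text)


def lookup_reopen(index, path):
    """Baseline behaviour: open -> read -> close for every lookup."""
    with index.open() as handle:
        for line in handle:
            name, offset = line.split()
            if name == path:
                return int(offset)
    return None


class OpenIndexReader:
    """Keeps one handle open and resumes scanning where the last lookup ended."""

    def __init__(self, index):
        self.index = index
        self.handle = None

    def lookup(self, path):
        if self.handle is None:
            self.handle = self.index.open()  # the only open call
        start = self.handle.tell()
        # Entries that are adjacent in the index are found by reading forward.
        offset = self._scan(path, stop_at=None)
        if offset is None and start > 0:
            # Not found ahead: rewind the same handle and scan the skipped part.
            self.handle.seek(0)
            offset = self._scan(path, stop_at=start)
        return offset

    def _scan(self, path, stop_at):
        while stop_at is None or self.handle.tell() < stop_at:
            line = self.handle.readline()
            if not line:  # end of index
                return None
            name, offset = line.split()
            if name == path:
                return int(offset)
        return None

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None
```

In this sketch, looking up 1000 part files in index order costs 1000 opens with lookup_reopen and a single open with OpenIndexReader. Each lookup after the first also reads only one more line, because the next part file is the next entry. A lookup that is out of order still works, at the cost of a rewind and a rescan on the same handle.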


> harchive: Reduce the number of open calls  to _index and _masterindex 
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-865
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: harchive
>            Reporter: Koji Noguchi
>            Priority: Minor
>
> When I have a har file with 1000 files in it, 
>    % hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Are there any ways to reduce this number?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
