[ https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743041#action_12743041 ]
Koji Noguchi commented on MAPREDUCE-865:
----------------------------------------

I believe _masterindex is probably small enough to fit in memory (cache).

For the _index file, 1 million files can correspond to an _index size of about 100 MB (it depends on the path length). Creating a local copy could be costly.

In our clusters, most of the files are mapreduce output files:

/a/b/part-00000
/a/b/part-00001
/a/b/part-00002
...

These show up as a set in the _index file, in this order, because HarFileSystem.getHarHash is written that way.
So instead of doing open->read->close on _index for each part file, I am thinking of keeping the index file open when possible.

> harchive: Reduce the number of open calls to _index and _masterindex
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-865
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: harchive
>            Reporter: Koji Noguchi
>            Priority: Minor
>
> When I have a har file with 1000 files in it,
> % hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Any ways to reduce this number?
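
The idea in the comment, keeping _index open across lookups instead of opening it once per part file, can be illustrated with a rough sketch. This is not Hadoop code: it is written in Python purely for illustration, the one-entry-per-line "path offset length" layout is an assumed stand-in for the real _index format, and CountingOpener, lookup_naive and OpenIndexReader are hypothetical names. It relies only on the observation above that lookups tend to arrive in the same order as the index entries.

```python
import io


class CountingOpener:
    """Hypothetical stand-in for the filesystem: counts how often the index is opened."""

    def __init__(self, text):
        self.text = text
        self.opens = 0

    def open(self):
        self.opens += 1
        return io.StringIO(self.text)


def lookup_naive(opener, path):
    """Open, scan, and close the index for every single lookup."""
    with opener.open() as f:
        for line in f:
            name, offset, length = line.split()
            if name == path:
                return int(offset), int(length)
    return None


class OpenIndexReader:
    """Keeps one index handle open and resumes scanning where the last lookup stopped."""

    def __init__(self, opener):
        self.opener = opener
        self.handle = None

    def lookup(self, path):
        if self.handle is None:
            self.handle = self.opener.open()
        # Pass 1 scans forward from the current position. If the entry is not
        # found before end of file, rewind and scan once more from the start.
        for _ in range(2):
            for line in iter(self.handle.readline, ""):
                name, offset, length = line.split()
                if name == path:
                    return int(offset), int(length)
            self.handle.seek(0)
        return None

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None


# 1000 part files listed in index order, as in the issue description.
paths = [f"/a/b/part-{i:05d}" for i in range(1000)]
index_text = "".join(f"{p} {i * 10} 10\n" for i, p in enumerate(paths))

naive = CountingOpener(index_text)
naive_results = [lookup_naive(naive, p) for p in paths]

cached = CountingOpener(index_text)
reader = OpenIndexReader(cached)
cached_results = [reader.lookup(p) for p in paths]
reader.close()

print("naive opens:", naive.opens, "open-handle opens:", cached.opens)
```

When lookups follow index order, each one is found on the next line read, so the whole listing costs a single open and a single sequential pass. An out-of-order lookup still works but pays for a rewind, which is why this only helps "when possible".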