[
https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744331#action_12744331
]
Koji Noguchi commented on MAPREDUCE-865:
Simple testing.
Created a har archive called myarchive.har containing
/a/b/2000files/xa to xaadnj
and /a/b/2000files/2000files/xa to xaadnj,
about 4500 files in total.
Without the patch,
/usr/bin/time hadoop dfs -lsr har:///user/knoguchi/myarchive.har > /dev/null
31.72user 5.23system *1:13.19* elapsed 50%CPU (0avgtext+0avgdata 0maxresident)
with 9000 open calls to the Namenode (_masterindex and _index), plus 4500
filestatus calls on _index (I think).
With the patch,
23.59user 0.58system *0:22.97* elapsed 105%CPU (0avgtext+0avgdata 0maxresident)
with one _masterindex open call and five _index open calls.
Setting -Dfs.har.indexcache.num=1 changed the number of _index open calls to
10, but the elapsed time didn't change much.
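For reference, the two measurements above can be compared with a few lines of Python. This is only arithmetic on the numbers quoted in this comment; the helper name elapsed_seconds is made up for the illustration.

```python
def elapsed_seconds(text):
    """Convert a /usr/bin/time elapsed field such as '1:13.19' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


# Elapsed times reported by /usr/bin/time in the two runs.
before = elapsed_seconds("1:13.19")   # without the patch
after = elapsed_seconds("0:22.97")    # with the patch
speedup = before / after

# Open calls seen by the namenode in the two runs.
opens_before = 9000                   # _masterindex and _index opens
opens_after = 1 + 5                   # one _masterindex open, five _index opens
open_reduction = opens_before / opens_after

print(f"elapsed: {before:.2f}s -> {after:.2f}s ({speedup:.1f}x faster)")
print(f"open calls: {opens_before} -> {opens_after} ({open_reduction:.0f}x fewer)")
```

The reduction in open calls is far larger than the reduction in elapsed time.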
The goal of the patch is more to reduce the load/calls on the namenode than
to speed up the 'ls' commands.
Note that since the client caches the entire _masterindex and also caches each
STORE (cache range) it reads, the initial call would be slower.
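To illustrate the caching idea, here is a minimal Python sketch of an LRU cache over index ranges. It is not the code in the patch: the class IndexRangeCache, the helper count_opens, and the figure of 1000 files per range are all assumptions made up for the example. The capacity argument only plays the role that fs.har.indexcache.num plays above, and each cache miss stands in for one open of _index.

```python
from collections import OrderedDict


class IndexRangeCache:
    """LRU cache of index ranges (hypothetical stand-in, not the patch's code)."""

    def __init__(self, capacity, load_range):
        self.capacity = capacity
        self.load_range = load_range  # called on a miss; simulates an _index open
        self.entries = OrderedDict()
        self.opens = 0                # number of simulated _index opens

    def get(self, range_id):
        if range_id in self.entries:
            # Hit: mark the range as most recently used, no open needed.
            self.entries.move_to_end(range_id)
            return self.entries[range_id]
        # Miss: load the range and evict the least recently used one if full.
        self.opens += 1
        data = self.load_range(range_id)
        self.entries[range_id] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return data


def count_opens(capacity, lookups, files_per_range=1000):
    """Count simulated opens for a sequence of file numbers.

    files_per_range is an assumed value; the real range boundaries
    come from _masterindex.
    """
    cache = IndexRangeCache(capacity, load_range=lambda rid: f"range-{rid}")
    for file_no in lookups:
        cache.get(file_no // files_per_range)
    return cache.opens


# A sequential scan touches each range once, whatever the cache size.
sequential = list(range(4500))
# An access pattern that alternates between two distant parts of the index.
interleaved = [n for pair in zip(range(0, 2000), range(2000, 4000)) for n in pair]

for capacity in (1, 10):
    print(capacity, count_opens(capacity, sequential), count_opens(capacity, interleaved))
```

In this toy model a small cache only hurts when lookups jump back and forth between ranges; a mostly sequential listing needs few opens either way.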
> harchive: Reduce the number of open calls to _index and _masterindex
> --
>
> Key: MAPREDUCE-865
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: harchive
> Reporter: Koji Noguchi
> Priority: Minor
> Attachments: mapreduce-865-0.patch
>
>
> When I have a har file with 1000 files in it,
> % hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Any ways to reduce this number?