Hello,

I'm writing a program that runs a Lucene search over about 12 index directories, all of which are stored in HDFS. It works like this:

1. We build about 12 index directories with Lucene's indexing functionality, each about 100 MB in size.
2. We store these 12 index directories in Hadoop HDFS. The Hadoop cluster is made up of one namenode and five datanodes, 6 computers in total.
3. I then run the Lucene search over these 12 index directories with MapReduce, as follows:

Map procedure: the 12 index directories are split across numOfMapTasks map tasks. For example, if numOfMapTasks=3, each map task gets 4 index directories and stores them in an intermediate result.
Combine procedure: for each intermediate result, locally, we run the actual Lucene search on the index directories it contains and store the hits in that intermediate result.
Reduce procedure: merge the hits from all intermediate results to get the final search result.
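
To make the split and merge steps concrete, here is a rough sketch in plain Java, without the Hadoop or Lucene APIs. It is not the actual job code: the class and method names (IndexSplitSketch, splitIndexDirs, mergeHits), the round-robin assignment, and the directory paths are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexSplitSketch {

    // Map step (sketch): distribute the index directory paths round-robin
    // over numOfMapTasks groups. Round-robin is an assumption; the real
    // job may assign directories differently.
    static List<List<String>> splitIndexDirs(List<String> indexDirs, int numOfMapTasks) {
        if (numOfMapTasks <= 0) {
            throw new IllegalArgumentException("numOfMapTasks must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < numOfMapTasks; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < indexDirs.size(); i++) {
            groups.get(i % numOfMapTasks).add(indexDirs.get(i));
        }
        return groups;
    }

    // Reduce step (sketch): concatenate the hit lists produced for each
    // intermediate result into one final result list.
    static List<String> mergeHits(List<List<String>> hitsPerGroup) {
        List<String> merged = new ArrayList<>();
        for (List<String> hits : hitsPerGroup) {
            merged.addAll(hits);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical HDFS paths for the 12 index directories.
        List<String> indexDirs = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            indexDirs.add("/indexes/index" + i);
        }
        List<List<String>> groups = splitIndexDirs(indexDirs, 3);
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("map task " + i + " -> " + groups.get(i));
        }
    }
}
```

With 12 directories and numOfMapTasks=3, each group ends up with 4 directories, which matches the example above. The Lucene search itself (the combine step) is left out of the sketch.
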
But when I implement it this way, I have a performance problem. Whatever values I set for numOfMapTasks and numOfReduceTasks (for example numOfMapTasks=12, numOfReduceTasks=5), a simple search takes about 28 seconds, which is obviously unacceptable. So I'm not sure whether my map-reduce procedure is wrong or whether I set the wrong number of map or reduce tasks. More generally, where does the overhead of a MapReduce job usually come from? Any suggestions would be appreciated. Thanks.