Dear Hailong,

You see I/O activity because the segments don't fit into memory.
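One quick way to confirm this is to add up the on-disk size of the two directories the serving node reads and compare the total with the available RAM. The following is only a minimal sketch: the directory names `indexes` and `segments` come from this thread, but the relative paths, the helper names `dir_size_bytes` and `fits_in_memory`, and the 16 GiB budget are assumptions to adapt to your setup. The budget should be treated as an upper bound, since the OS and the JVM also need memory.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def fits_in_memory(paths, budget_bytes):
    """Return (total_bytes, fits) for the combined size of the given directories."""
    total = sum(dir_size_bytes(p) for p in paths)
    return total, total <= budget_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical locations; point these at your own crawl output.
    total, fits = fits_in_memory(["indexes", "segments"], 16 * GIB)
    print("indexes+segments: %.1f GiB, fits in budget: %s" % (total / GIB, fits))
```

If the total is well above the budget, the page cache cannot hold everything and some queries will go to disk.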
I would recommend reducing the size of your index so that indexes+segments occupy roughly 16GB. This is relatively easy to do if you used multiple reducer tasks (during the crawling phase) to create multiple partitions. (See the Notes at http://parsa.epfl.ch/cloudsuite/search.html: The mapred.reduce.tasks property determines how many index and segment partitions will be created.)

Regards,
-Stavros.

________________________________________
From: Hailong Yang [[email protected]]
Sent: Friday, October 19, 2012 8:03 PM
To: Volos Stavros
Cc: [email protected]; Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark

Dear Stavros,

Thank you for your reply. I understand the data structures required during the search. The 6GB is only the size of the actual index (the indexes directory). The whole data set, including the segments, accounts for 30GB.

Best
Hailong

On Fri, Oct 19, 2012 at 9:03 AM, Volos Stavros <[email protected]> wrote:

Dear Hailong,

There are two components that are used when performing a query against the index serving node:
(a) the actual index (under indexes)
(b) segments (under segments)

What exactly is 6GB? Are you including the segments as well?

Regards,
-Stavros.

________________________________________
From: Hailong Yang [[email protected]]
Sent: Wednesday, October 17, 2012 4:51 AM
To: [email protected]
Cc: Lingjia Tang; Jason Mars
Subject: How to fit the index into the memory for the web search benchmark

Hi CloudSuite,

I am experimenting with the web search benchmark. However, I am wondering how to fit the index into memory in order to avoid unnecessary disk access. I have a 6GB index crawled from Wikipedia and the RAM is 16GB. During the workload execution, I noticed periodic 2% increases in I/O utilization, and the memory used by the Nutch server was always less than 500MB.
So I guess the whole index is not brought into memory by default before serving the search queries, right? Could you tell me how to do that, exactly as you did in the Clearing the Clouds paper?

Thanks!

Best
Hailong
