RE: How to fit the index into the memory for the web search benchmark

Volos Stavros Mon, 22 Oct 2012 04:07:57 -0700

Dear Hailong,

The frontend asks for the summary of each top document. A backend node will 
receive a getSummary request for every top document it owns. You can go through 
the logs of the backend node to verify that it does receive getSummary 
requests.

Regards,
-Stavros.
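
A minimal sketch of the log check described above, in Python. It assumes the
backend log is a plain-text file and that each summary request leaves a line
containing the literal string getSummary. Neither the log format nor the helper
names come from this thread, so adjust the search string to whatever your Nutch
server actually logs.

```python
from pathlib import Path


def count_summary_requests(log_text: str, needle: str = "getSummary") -> int:
    """Count the log lines that mention the summary call.

    `needle` is an assumption about how the request shows up in the log.
    """
    return sum(1 for line in log_text.splitlines() if needle in line)


def count_summary_requests_in_file(log_path: str, needle: str = "getSummary") -> int:
    """Read a log file (hypothetical path supplied by the caller) and count matches."""
    text = Path(log_path).read_text(errors="replace")
    return count_summary_requests(text, needle)
```

A count of zero on a backend node that served queries would support the
assumption that summaries are never requested; a nonzero count means the
segments are being read.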
________________________________________
From: Hailong Yang [[email protected]]
Sent: Monday, October 22, 2012 10:38 AM
To: Volos Stavros
Cc: [email protected]; Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark

Dear Stavros,

I am confused about why we need to bring the segments into memory. I examined 
the log file from the frontend server, which records the queries sent to and 
the responses received from the Nutch server. It shows that the Nutch server 
only replied with the number of hits found in the crawled dataset, without 
being asked for the page contents. So when orchestrating the search, the 
NutchBean object never needs to call getSummary, the method that accesses the 
segments to retrieve the page contents. That would mean we don't need to care 
whether the segments fit into memory for this specific web search workload in 
CloudSuite, right? Please correct me if I am wrong.

Best

Hailong

On Sun, Oct 21, 2012 at 9:04 AM, Volos Stavros <[email protected]> wrote:
Dear Hailong,

You get I/O activity because the segments don't fit into memory.

I would recommend reducing the size of your index so that indexes+segments 
occupy roughly 16GB.

This is relatively easy to do if you used multiple reducer tasks (during the 
crawling phase) to create multiple partitions.

(See the Notes at http://parsa.epfl.ch/cloudsuite/search.html: The 
mapred.reduce.tasks property determines how many index and segment partitions 
will be created.)

Regards,
-Stavros.
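
One way to act on this advice is to keep only as many partitions as fit in a
memory budget. The Python sketch below illustrates that reading and is not the
CloudSuite procedure: the helper names, the mapping from partition name to
combined index+segment size, and the choice to select whole partitions in name
order are all assumptions.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Total size in bytes of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def pick_partitions(sizes: dict, budget_bytes: int):
    """Select partitions, in name order, whose combined size fits the budget.

    `sizes` maps a partition name (hypothetical, e.g. one per reducer task) to
    the combined size of its index and segment data.
    Returns (chosen partition names, bytes used).
    """
    chosen, used = [], 0
    for name in sorted(sizes):
        if used + sizes[name] <= budget_bytes:
            chosen.append(name)
            used += sizes[name]
    return chosen, used
```

To use it, build `sizes` by calling `dir_size_bytes` on each partition's index
and segment directories (the paths depend on your setup) and pass the budget in
bytes, for example 16 * 1024**3 for the 16GB suggested above.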
________________________________________
From: Hailong Yang [[email protected]]
Sent: Friday, October 19, 2012 8:03 PM
To: Volos Stavros
Cc: [email protected]; Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark

Dear Stavros,

Thank you for your reply. I understand the data structures required during the 
search. The 6GB is only the size of the actual index (the indexes directory). 
The whole dataset, including the segments, is 30GB.

Best

Hailong

On Fri, Oct 19, 2012 at 9:03 AM, Volos Stavros <[email protected]> wrote:
Dear Hailong,

There are two components that are used when performing a query against the 
index serving node:
(a) the actual index (under indexes)
(b) segments (under segments)

What exactly is 6GB? Are you including the segments as well?

Regards,
-Stavros.

________________________________________
From: Hailong Yang [[email protected]]
Sent: Wednesday, October 17, 2012 4:51 AM
To: [email protected]
Cc: Lingjia Tang; Jason Mars
Subject: How to fit the index into the memory for the web search benchmark

Hi CloudSuite,

I am experimenting with the web search benchmark, and I am wondering how to 
fit the index into memory in order to avoid unnecessary disk accesses. I have a 
6GB index crawled from Wikipedia and 16GB of RAM. During the workload 
execution, I noticed periodic 2% increases in I/O utilization, and the memory 
used by the Nutch server was always less than 500MB. So I guess the whole index 
is not brought into memory by default before serving the search queries, right? 
Could you tell me how to do that exactly as you did in the Clearing the Clouds 
paper? Thanks!

Best

Hailong
